Phase stats sync assertion error

Wed May 10 00:02:57 UTC 2017

Hi Alex,

> > Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball).  While
> > the simple tests, etc seem to run ok, a more gung-ho configuration that
> > ran ok on an earlier version doesn't want to play ball with the new
> > configuration.
> >
> > I'm getting random assertion fails on both polygraph-client and
> > polygraph-server:
> >
> > StatPhaseSync.cc:97: assertion failed: 'false'
>
> Is it possible that you have multiple tests running concurrently?  For
> example, perhaps your test setup does not always kill servers (or
> clients) from the old test and they prevent some servers from the new
> test starting without you noticing? You can work around this problem by
> disabling phase synchronization, but that is not a proper fix, of
> course.

I was always careful to make sure that any previous clients and servers
had been nuked before I started the next run.  In the end, though, most of
the instances would fail with an assertion anyway.
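
For anyone wanting to script that pre-run check, here is a minimal sketch
of the idea, not our actual launcher.  It assumes a Linux procps `ps`, and
the helper names (find_stale, running_polygraph) are made up for
illustration.  It matches on the full command line rather than the short
command name, because Linux truncates the latter to 15 characters and
"polygraph-client" is 16.

```python
import os
import subprocess

# Binary names as they appear in this thread; adjust for your install.
POLYGRAPH_NAMES = ("polygraph-client", "polygraph-server")


def find_stale(ps_lines, names=POLYGRAPH_NAMES):
    """Return (pid, binary) pairs for Polygraph processes.

    ps_lines is expected to look like the output of `ps -eo pid=,args=`:
    a PID followed by the full command line.
    """
    stale = []
    for line in ps_lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, args = parts
        # Compare only the basename of the executable (first token).
        binary = os.path.basename(args.split()[0])
        if binary in names:
            stale.append((int(pid), binary))
    return stale


def running_polygraph():
    """Query the live process table (hypothetical helper, Linux procps)."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_stale(out.splitlines())
```

A launcher could call running_polygraph() before each run and refuse to
start if the list is non-empty.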

> Is it possible that you have more than ~30 Polygraph processes
> participating in a test? There is a hard-coded limit (that we should
> remove) in phase synchronization code. You can work around this problem
> by increasing the limit (search for 37 in src/runtime/StatPhaseSync.cc)
> and recompiling Polygraph. Please let us know if that helps!

I was running more than that and ended up reducing the count to try and
stabilise it.  I think it finally steadied at about 20 server processes.
We are looking to *really* stress proxy/firewall devices (e.g. >10Gbit/s,
hopefully), so we are really cranking this up on both the client and
server side.

> Why are you using --worker if you are not running SMP tests?

That was worker xx out of 70 :D  We have a script that spawns a heap of
servers and clients with a defined test config, then cleans up afterwards.
This is called from a web page to make launching, monitoring and browsing
the results a lot easier.

We're running a pair of Dell R830s with 80 logical cores and multiple
10GbE interfaces as the client/server machines.  We then shove the poor
unsuspecting DUT in the middle.  Currently we just have a VM doing routing,
but it definitely isn't the bottleneck (9.6Gbit/s throughput without
tuning via iperf).

Once we work out what headroom we have and the server/client instance
requirements, we are probably going to look at using Linux namespaces, etc
to segment networks and run multiple tests in parallel with multiple
DUTs.  Have resources, must eat them!

Kind Regards,

William Law