From william.law at tesserent.com Tue May 9 05:30:13 2017
From: william.law at tesserent.com (William Law)
Date: Tue, 9 May 2017 15:30:13 +1000
Subject: Phase stats sync assertion error
In-Reply-To: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
Message-ID: <7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>

Hi all,

Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball). The simple
tests seem to run OK, but a more gung-ho configuration that ran fine on an
earlier version doesn't want to play ball with the new release.

I'm getting random assertion failures on both polygraph-client and
polygraph-server:

StatPhaseSync.cc:97: assertion failed: 'false'

Sometimes it happens after a single robot request to one of the servers,
and sometimes at random on the servers.

Server IP ranges: 2.2.2-3.10-250/22 (40 nodes occupying that range)
Client IP ranges: 1.1.1-2.10-250/22 (70 nodes here)

I get fewer errors when using GET than POST.

The test environment is pretty basic currently:

Client <-> router <-> Server

Please let me know if you want any further information and I'll see what I
can do.

TIA

Kind Regards,

William Law

Polygraph was built with the following versions:

gcc 4.4.7-18.el6
gcc-c++ 4.4.7-18.el6
gnuplot 4.2.6-2.el6
gnuplot-common 4.2.6-2.el6
ldns-devel 1.6.16-7.el6.2
libgcc 4.4.7-18.el6
make 1:3.81-23.el6
ncurses-devel 5.7-4.20090207.el6
openssl-devel 1.0.1e-57.el6
patch 2.6-6.el6
zlib-devel 1.2.3-29.el6

Example client command:

polygraph-client --worker 61 --ports 3000:65535 --fake_hosts 1.1.2.185-191
--cfg_dirs /usr/local/share/polygraph/polytests --config LargeUploads.pg
--log /var/log/polygraph/clt.61.log --verb_lvl 10

Server command:

polygraph-server --worker 40 --fake_hosts 2.2.3.238-249
--cfg_dirs /usr/local/share/polygraph/polytests --config "LargeUploads.pg"
--log /var/log/polygraph/srv.40.log --idle_tout 300sec
>>/var/log/polygraph/pserver.log 2>&1

Test config:

Phase ph2 = {
    name = "MultiUser-Download";
    goal.duration = 5min;
    primary = true;
};

Content DownloadContent = {
    size = const(20MB);
    mime = {
        type = undef();
        prefixes = [ "page" ];
        extensions = [ ".dat" ];
    };
    cachable = 0%;
};

Server S = {
    kind = "S101";
    contents = [ DownloadContent ];
    direct_access = contents;
    addresses = [ '2.2.2-3.10-250:80/22' ];
};

Content cntSimple = {
    size = unif(50KB, 100KB);
};

Robot R = {
    kind = "R101";
    pop_model = { pop_distr = popUnif(); };
    //req_methods = [ "POST": 100% ];
    req_methods = [ "GET": 100% ];
    origins = S.addresses;
    addresses = [ '1.1.1-2.10-250/22' ];
    interests = [ "public" ];
};

schedule(ph2);
use(S, R);
From Nagaraja_Gundurao at symantec.com Tue May 9 16:19:23 2017
From: Nagaraja_Gundurao at symantec.com (Nagaraja Gundurao)
Date: Tue, 9 May 2017 16:19:23 +0000
Subject: Questions regarding WPG
Message-ID: <7196CEA3-C5C1-4C82-8E21-6CD316AE9E07@symantec.com>

Hi WPG team,

We are using WPG as our traffic generator and have run into a number of
hurdles getting it to work for our requirements. I am listing some
questions below; it would be great if you could answer them, and even
better if there are solutions for all of them.

* Is there a way to have only some number of the configured robots be
active at any point in time? And, have this active set change over time,
so we can cycle through all configured robots?

* Is there a way to stitch together WPG log files from a number of
different WPG clients?

* In our lab setup, we found the maximum number of robots that could run
from a single .pg file to be 3000 per VM. Do you agree? Is that limit a
property of the PC/VM or a WPG limitation?

* Can reports be generated if we have only server-side logs and the WPG
client is not run (the client will be another tool)?

Here are some error messages we saw. We need some explanation of when they
occur and what their effect is on overall performance/reporting.

OLog.cc:118: soft assertion failed: theZStream->write(buf, size)

000.72| Xaction.cc:112: error: 4/4 (c14) premature end of msg body
000.83| Xaction.cc:112: error: 1/6 (c15) premature end of msg header
1482456915.764782# size: 0/-1 xact: 21fb326e.07e125e9:00000580 start:
1482456915.463216 [no data to dump]

OLog.cc:118: (s11) Resource temporarily unavailable

013.34| Connection.cc:701: error: 3/225 (s104) Connection reset by peer
013.34| error: raw write on SSL connection failed on connection with
10.0.26.170:21521 at 3 reads, 159 writes, and 1 transactions

AddrParsers.cc:31: soft assertion failed: defaultPort >= 0

From rousskov at measurement-factory.com Tue May 9 19:22:53 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Tue, 9 May 2017 13:22:53 -0600
Subject: Phase stats sync assertion error
In-Reply-To: <7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
	<7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
Message-ID: <6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>

On 05/08/2017 11:30 PM, William Law wrote:
> Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball). The simple
> tests seem to run OK, but a more gung-ho configuration that ran fine on
> an earlier version doesn't want to play ball with the new release.
>
> I'm getting random assertion failures on both polygraph-client and
> polygraph-server:
>
> StatPhaseSync.cc:97: assertion failed: 'false'

Is it possible that you have multiple tests running concurrently? For
example, perhaps your test setup does not always kill servers (or
clients) from the old test, and they prevent some servers from the new
test from starting without you noticing? You can work around this problem
by disabling phase synchronization, but that is not a proper fix, of
course.

Is it possible that you have more than ~30 Polygraph processes
participating in a test? There is a hard-coded limit (that we should
remove) in the phase synchronization code. You can work around this
problem by increasing the limit (search for 37 in
src/runtime/StatPhaseSync.cc) and recompiling Polygraph. Please let us
know if that helps!

Overlapping tests and more than ~30 processes are the known cases that
may cause those assertions in your Polygraph version.

> Example client command:
> polygraph-client --worker 61 ...
>
> Server command:
> polygraph-server --worker 40 ...

Why are you using --worker if you are not running SMP tests?

Thank you,

Alex.

From rousskov at measurement-factory.com Tue May 9 20:53:13 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Tue, 9 May 2017 14:53:13 -0600
Subject: Questions regarding WPG
In-Reply-To: <7196CEA3-C5C1-4C82-8E21-6CD316AE9E07@symantec.com>
References: <7196CEA3-C5C1-4C82-8E21-6CD316AE9E07@symantec.com>
Message-ID:

On 05/09/2017 10:19 AM, Nagaraja Gundurao wrote:
> * Is there a way to have only some number of the configured robots be
> active at any point in time? And, have this active set change over time,

Yes, see populus_factor_beg and populus_factor_end in the PGL Phase
object[1]. You can build arbitrary population growth/decline patterns by
scheduling[2] appropriately configured test Phases. For example, the
PolyMix-4 workload uses[3] that feature.

[1] http://www.web-polygraph.org/docs/reference/pgl/types.html#type:docs/reference/pgl/types/Phase
[2] http://www.web-polygraph.org/docs/reference/pgl/calls.html#call:docs/reference/pgl/calls/schedule
[3] http://www.web-polygraph.org/docs/workloads/polymix-4/#Sect:3.1
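A minimal sketch of such a schedule, assuming PolyMix-4-style Phase fields
(the phase names, durations, and percentages here are illustrative, not
taken from any shipped workload):

// grow the active robot population during a ramp phase,
// then hold it steady during the measurement phase
Phase phRamp = {
    name = "ramp";
    goal.duration = 10min;
    populus_factor_beg = 10%;   // start with 10% of the configured robots
    populus_factor_end = 100%;  // grow toward the full population
};

Phase phSteady = {
    name = "steady";
    goal.duration = 30min;
    populus_factor_beg = 100%;  // hold the full population
    populus_factor_end = 100%;
    primary = true;
};

schedule(phRamp, phSteady);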
> so we can cycle through all configured robots?

This aspect is different from the two aspects you mentioned above. The
Phase-driven test schedule does not focus on which configured robots are
used; it focuses on the number of used robots. For example, there is
currently no way to use 10% of the maximum robot population at any given
time but eventually use all configured robots. If you describe the details
of your use case, and the desirable parameters, we would consider adding
support for it.

> * Is there a way to stitch together WPG log files from a number of
> different WPG clients?

Yes, both the Polygraph reporter and lx tools merge all given logs
automatically by default. This merging was designed for concurrent client
runs within one test, but it may work for sequential runs as well,
especially if there are no gaps between tests. If something does not work
well when you merge, please discuss.
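A minimal sketch of such a merge, assuming the reporter's --label option
behaves as in recent versions and using illustrative log file names:

# generate a single report from all client and server logs of one test;
# the reporter merges every log given on the command line
polygraph-reporter --label "run-2017-05-09" \
    /var/log/polygraph/clt.*.log \
    /var/log/polygraph/srv.*.log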
> * In our lab setup, we found the maximum number of robots that could
> run from a single .pg file to be 3000 per VM. Do you agree? Is that
> limit a property of the PC/VM or a WPG limitation?

There is no artificial limit on the number of robots. The practical limit
heavily depends on your hardware, OS, their configuration, and the
Polygraph workload.

> * Can reports be generated if we have only server-side logs and the WPG
> client is not run (the client will be another tool)?

I believe so. Naturally, client-dependent stats will not be available.

> Here are some error messages we saw. We need some explanation of when
> they occur and what their effect is on overall performance/reporting.
>
> OLog.cc:118: soft assertion failed: theZStream->write(buf, size)
> OLog.cc:118: (s11) Resource temporarily unavailable

Polygraph cannot write its test log file. Perhaps there is something
wrong with the log storage device? It is supposed to be reliable. Are you
using some kind of network-dependent storage that may get overwhelmed
with log and/or test traffic and fail? Test log corruption is likely
until you fix this unusual problem.

> 000.72| Xaction.cc:112: error: 4/4 (c14) premature end of msg body
> 000.83| Xaction.cc:112: error: 1/6 (c15) premature end of msg header

These custom Polygraph errors are documented[4].

[4] http://www.web-polygraph.org/docs/reference/output/messages.html

> 013.34| Connection.cc:701: error: 3/225 (s104) Connection reset by peer
> 013.34| error: raw write on SSL connection failed on connection with
> 10.0.26.170:21521 at 3 reads, 159 writes, and 1 transactions

This is a standard system call error -- Polygraph could not write(2) to a
TCP socket when talking to an SSL peer (because the peer closed the
connection or disappeared). The affected transaction will be counted as
failed, of course.

Such errors are possible if HTTP agents have mismatching persistent
connection settings/defaults, leading to HTTP race conditions. If you do
not use persistent HTTP connections, or carefully configure them to avoid
race conditions, then the peer that closed the connection prematurely
should know why it did that.
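One way to keep the two sides' persistent connection settings aligned is
to configure them explicitly on both agents. A sketch, assuming the
PolyMix-style pconn_use_lmt and idle_pconn_tout agent fields (the
particular distributions and timeout below are illustrative):

Robot R = {
    // ... other Robot fields ...
    pconn_use_lmt = zipf(64);   // cap transactions per persistent connection
};

Server S = {
    // ... other Server fields ...
    pconn_use_lmt = zipf(16);
    idle_pconn_tout = 15sec;    // server-side idle timeout; keep client
                                // reuse behavior consistent with this
};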
> AddrParsers.cc:31: soft assertion failed: defaultPort >= 0

This is either a Polygraph bug or some kind of PGL misconfiguration. We
need more info to classify it: Does this happen at startup or during a
test? Just once or many times? Is this problem easy to reproduce?

HTH,

Alex.

From william.law at tesserent.com Wed May 10 00:02:57 2017
From: william.law at tesserent.com (William Law)
Date: Wed, 10 May 2017 10:02:57 +1000
Subject: Phase stats sync assertion error
In-Reply-To: <6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
	<7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
	<6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>
Message-ID: <2ed9afc46e1bdf7fdd1bf90463cd28b6@mail.gmail.com>

Hi Alex,

> > Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball). The
> > simple tests seem to run OK, but a more gung-ho configuration that
> > ran fine on an earlier version doesn't want to play ball with the new
> > release.
> >
> > I'm getting random assertion failures on both polygraph-client and
> > polygraph-server:
> >
> > StatPhaseSync.cc:97: assertion failed: 'false'
>
> Is it possible that you have multiple tests running concurrently? For
> example, perhaps your test setup does not always kill servers (or
> clients) from the old test, and they prevent some servers from the new
> test from starting without you noticing? You can work around this
> problem by disabling phase synchronization, but that is not a proper
> fix, of course.

I was always careful to make sure that any previous clients and servers
had been nuked before I started the next run. In the end, though, most of
the instances would fail with an assertion anyway.

> Is it possible that you have more than ~30 Polygraph processes
> participating in a test? There is a hard-coded limit (that we should
> remove) in the phase synchronization code. You can work around this
> problem by increasing the limit (search for 37 in
> src/runtime/StatPhaseSync.cc) and recompiling Polygraph. Please let us
> know if that helps!

I was running more than that and ended up reducing the count to try to
stabilise it. I think it finally steadied at about 20 server processes.
We are looking to *really* stress proxy/firewall devices (e.g. >10Gbit
hopefully), so we are really cranking this up on both the client and
server side.

> Why are you using --worker if you are not running SMP tests?

That was worker xx out of 70 :D  We have a script that spawns a heap of
servers and clients with a defined test config, then cleans up
afterwards. This is called from a web page to make launching, monitoring,
and browsing the results a lot easier.

We're running a pair of Dell R830s with 80 logical cores and multiple
10GbE interfaces as the client/server machines. We then shove the poor
unsuspecting DUT in the middle. Currently we just have a VM doing
routing, but it definitely isn't the bottleneck (9.6Gbit/s throughput
without tuning, via iperf).

Once we work out what headroom we have and server/client instance
requirements, we are probably going to look at using Linux namespaces,
etc. to segment networks and run multiple tests in parallel with multiple
DUTs. Have resources, must eat them!

Kind Regards,

William Law

From rousskov at measurement-factory.com Wed May 10 03:51:12 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Tue, 9 May 2017 21:51:12 -0600
Subject: Phase stats sync assertion error
In-Reply-To: <2ed9afc46e1bdf7fdd1bf90463cd28b6@mail.gmail.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
	<7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
	<6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>
	<2ed9afc46e1bdf7fdd1bf90463cd28b6@mail.gmail.com>
Message-ID:

On 05/09/2017 06:02 PM, William Law wrote:
> On 05/09/2017, Alex Rousskov wrote:
>> Is it possible that you have more than ~30 Polygraph processes
>> participating in a test?
>
> I was running more than that and ended up reducing the count to try to
> stabilise it. I think it finally steadied at about 20 server processes.
> We are looking to *really* stress proxy/firewall devices (e.g. >10Gbit
> hopefully), so we are really cranking this up on both the client and
> server side.

That probably explains it then. IIRC, it is not just the server processes
that count, because client processes exchange phase information among
them as well (via servers).

Please try to work around this problem by increasing the hard-coded limit
(search for 37 in src/runtime/StatPhaseSync.cc) and recompiling
Polygraph. Changing 37 to some prime number like 97 or even 199 may work
well. Please let us know if that helps, and we will work on removing the
hard-coded limit.
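A minimal sketch of that workaround, assuming the limit appears as the
bare literal 37 in that file (inspect the grep output first so the
substitution does not touch unrelated numbers):

# locate the hard-coded phase synchronization limit
grep -n '37' src/runtime/StatPhaseSync.cc

# bump it to a larger prime, then rebuild and reinstall
sed -i 's/\b37\b/199/g' src/runtime/StatPhaseSync.cc
make && sudo make install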
>> Why are you using --worker if you are not running SMP tests?
>
> That was worker xx out of 70 :D  We have a script that spawns a heap of
> servers and clients with a defined test config, then cleans up
> afterwards.

Please note that starting individual workers on your own is not
officially supported -- the interface between the master process and
workers may change without notice, and those changes may affect your
setup. When you get a chance, consider adding a proper SMP Bench
configuration to your workload so that Polygraph starts workers based on
your test configuration.

Thank you,

Alex.

From pmix at hendrie.id.au Fri May 19 03:07:56 2017
From: pmix at hendrie.id.au (Michael Hendrie)
Date: Fri, 19 May 2017 12:37:56 +0930
Subject: SMP Workloads
Message-ID: <79B4283F-8E04-4BA6-82BC-EB0D0E00A558@hendrie.id.au>

Hi All,

I'm experimenting with SMP workloads in v4.9.0 to increase the capacity
of my test rig. Without SMP, I see multiple idle CPU cores on the host
machines, so I'm expecting a significant increase in performance.

I have tried multiple configurations but consistently see console output
on the client reporting connection errors, and when I check the servers,
one of the worker processes has silently terminated. This doesn't appear
to happen with any particular pattern, but it does occur more often than
not, making it impossible to start a test with any confidence in SMP
mode.

I'm seeing the same behaviour with both SSL and HTTP workloads. SSL was
the main driver to move to SMP, as those workloads are more taxing on the
host, but I also tried HTTP to rule it out.

Here's the bench config that is intended to start multiple workers, as I
understand from the changelog:

Bench sslBench = {
    client_side = {
        max_host_load = 300/sec;
        max_agent_load = 0.4/sec;
        addr_space = [ 'lo::172.17.60-123.1-250/22' ];
        hosts = [ '172.16.0.60-62' ];
    };
    server_side = {
        max_host_load = client_side.max_host_load;
        max_agent_load = 0.4/sec;
        addr_space = [ 'lo::172.17.188-251.0-250:443/22' ];
        hosts = [ '172.16.0.64-66' ] ** 2;
    };
};

The entire workload is also attached.

Any suggestions on how I can get some stability in the SMP tests? Are
other users facing the same issue? I'm running on a RHEL 6.x clone OS.

Thanks,

Michael

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smp_ssl.pg
Type: application/octet-stream
Size: 2510 bytes
Desc: not available
URL:

From rousskov at measurement-factory.com Fri May 19 16:47:29 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Fri, 19 May 2017 10:47:29 -0600
Subject: SMP Workloads
In-Reply-To: <79B4283F-8E04-4BA6-82BC-EB0D0E00A558@hendrie.id.au>
References: <79B4283F-8E04-4BA6-82BC-EB0D0E00A558@hendrie.id.au>
Message-ID:

On 05/18/2017 09:07 PM, Michael Hendrie wrote:
> I have tried multiple configurations but consistently see console
> output on the client reporting connection errors, and when I check the
> servers, one of the worker processes has silently terminated.

I recommend focusing on solving that silent termination problem. There
are two general reasons for silent deaths: Polygraph bugs and running out
of system resources. It is usually possible to figure out what exactly is
going on. Enable coredumps. Test that they work by sending a running
Polygraph server process SIGABRT. Check system logs. Etc.

Once that server death becomes less "silent", either report a Polygraph
bug (with that information at hand) or adjust system resources/workload.

Good luck,

Alex.
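A minimal sketch of those debugging steps, assuming a typical Linux host;
the install path and the PID (12345) below are illustrative:

# allow coredumps for processes started from this shell
ulimit -c unlimited

# verify that dumps work: abort one running server worker...
kill -ABRT 12345

# ...then inspect the core file (its location and name depend on the
# kernel.core_pattern sysctl)
gdb /usr/local/bin/polygraph-server core.12345

# also check whether the kernel ended the process (e.g., the OOM killer)
dmesg | grep -i -E 'oom|killed process'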