From william.law at tesserent.com Tue May 9 05:30:13 2017
From: william.law at tesserent.com (William Law)
Date: Tue, 9 May 2017 15:30:13 +1000
Subject: Phase stats sync assertion error
In-Reply-To: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
Message-ID: <7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>

Hi all,

Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball). The simple
tests seem to run OK, but a more gung-ho configuration that ran fine on an
earlier version doesn't want to play ball with the new release.

I'm getting random assertion failures on both polygraph-client and
polygraph-server:

StatPhaseSync.cc:97: assertion failed: 'false'

Sometimes it happens after a single robot request to one of the servers,
and sometimes at random on the servers.

Server IP ranges: 2.2.2-3.10-250/22 (40 nodes occupying that range)
Client IP ranges: 1.1.1-2.10-250/22 (70 nodes here)

I get fewer errors when using GET than POST.

The test environment is pretty basic currently:

Client <-> router <-> Server

Please let me know if you want any further information and I'll see what I
can do.

TIA

Kind Regards,

William Law

Polygraph was built with the following versions:

gcc 4.4.7-18.el6
gcc-c++ 4.4.7-18.el6
gnuplot 4.2.6-2.el6
gnuplot-common 4.2.6-2.el6
ldns-devel 1.6.16-7.el6.2
libgcc 4.4.7-18.el6
make 1:3.81-23.el6
ncurses-devel 5.7-4.20090207.el6
openssl-devel 1.0.1e-57.el6
patch 2.6-6.el6
zlib-devel 1.2.3-29.el6

Example client command:

polygraph-client --worker 61 --ports 3000:65535 --fake_hosts 1.1.2.185-191
--cfg_dirs /usr/local/share/polygraph/polytests --config LargeUploads.pg
--log /var/log/polygraph/clt.61.log --verb_lvl 10

Server command:

polygraph-server --worker 40 --fake_hosts 2.2.3.238-249
--cfg_dirs /usr/local/share/polygraph/polytests --config "LargeUploads.pg"
--log /var/log/polygraph/srv.40.log --idle_tout 300sec
>>/var/log/polygraph/pserver.log 2>&1

Test config:

Phase ph2 = {
    name = "MultiUser-Download";
    goal.duration = 5min;
    primary = true;
};

Content DownloadContent = {
    size = const(20MB);
    mime = {
        type = undef();
        prefixes = [ "page" ];
        extensions = [ ".dat" ];
    };
    cachable = 0%;
};

Server S = {
    kind = "S101";
    contents = [ DownloadContent ];
    direct_access = contents;
    addresses = [ '2.2.2-3.10-250:80/22' ];
};

Content cntSimple = {
    size = unif(50KB, 100KB);
};

Robot R = {
    kind = "R101";
    pop_model = { pop_distr = popUnif(); };
    //req_methods = [ "POST": 100% ];
    req_methods = [ "GET": 100% ];
    origins = S.addresses;
    addresses = [ '1.1.1-2.10-250/22' ];
    interests = [ "public" ];
};

schedule(ph2);
use(S, R);
From Nagaraja_Gundurao at symantec.com Tue May 9 16:19:23 2017
From: Nagaraja_Gundurao at symantec.com (Nagaraja Gundurao)
Date: Tue, 9 May 2017 16:19:23 +0000
Subject: Questions regarding WPG
Message-ID: <7196CEA3-C5C1-4C82-8E21-6CD316AE9E07@symantec.com>

Hi WPG team,

We are using WPG as our traffic generator and have run into a number of
hurdles getting it to work for our requirements. I am listing some
questions below; it would be great if you could answer them, and even
better if there are solutions for all of them.

* Is there a way to have only some number of the configured robots be
active at any point in time? And, have this active set change over time,
so we can cycle through all configured robots?

* Is there a way to stitch together WPG log files from a number of
different WPG clients?

* In our lab setup, we found the maximum number of robots that could run
from a single .pg file to be 3000 per VM. Do you agree? Is that limit a
property of the PC/VM or a WPG limitation?

* Can reports be generated if we have only server-side logs and the WPG
client is not run (the client will be another tool)?

Here are some error messages we saw. We need some explanation of when they
occur and what their effect is on overall performance/reporting.

OLog.cc:118: soft assertion failed: theZStream->write(buf, size)

000.72| Xaction.cc:112: error: 4/4 (c14) premature end of msg body
000.83| Xaction.cc:112: error: 1/6 (c15) premature end of msg header
1482456915.764782# size: 0/-1 xact: 21fb326e.07e125e9:00000580 start:
1482456915.463216 [no data to dump]

OLog.cc:118: (s11) Resource temporarily unavailable

013.34| Connection.cc:701: error: 3/225 (s104) Connection reset by peer
013.34| error: raw write on SSL connection failed on connection with
10.0.26.170:21521 at 3 reads, 159 writes, and 1 transactions

AddrParsers.cc:31: soft assertion failed: defaultPort >= 0

From rousskov at measurement-factory.com Tue May 9 19:22:53 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Tue, 9 May 2017 13:22:53 -0600
Subject: Phase stats sync assertion error
In-Reply-To: <7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
	<7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
Message-ID: <6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>

On 05/08/2017 11:30 PM, William Law wrote:
> Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball). The simple
> tests seem to run OK, but a more gung-ho configuration that ran fine on
> an earlier version doesn't want to play ball with the new release.
>
> I'm getting random assertion failures on both polygraph-client and
> polygraph-server:
>
> StatPhaseSync.cc:97: assertion failed: 'false'

Is it possible that you have multiple tests running concurrently? For
example, perhaps your test setup does not always kill servers (or
clients) from the old test, and they prevent some servers from the new
test from starting without you noticing? You can work around this problem
by disabling phase synchronization, but that is not a proper fix, of
course.

Is it possible that you have more than ~30 Polygraph processes
participating in a test? There is a hard-coded limit (that we should
remove) in the phase synchronization code. You can work around this
problem by increasing the limit (search for 37 in
src/runtime/StatPhaseSync.cc) and recompiling Polygraph. Please let us
know if that helps!

Overlapping tests and more than ~30 processes are the known cases that
may cause those assertions in your Polygraph version.

> Example client command:
> polygraph-client --worker 61 ...
>
> Server command:
> polygraph-server --worker 40 ...

Why are you using --worker if you are not running SMP tests?

Thank you,

Alex.

From rousskov at measurement-factory.com Tue May 9 20:53:13 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Tue, 9 May 2017 14:53:13 -0600
Subject: Questions regarding WPG
In-Reply-To: <7196CEA3-C5C1-4C82-8E21-6CD316AE9E07@symantec.com>
References: <7196CEA3-C5C1-4C82-8E21-6CD316AE9E07@symantec.com>
Message-ID:

On 05/09/2017 10:19 AM, Nagaraja Gundurao wrote:
> * Is there a way to have only some number of the configured robots be
> active at any point in time? And, have this active set change over time,

Yes, see populus_factor_beg and populus_factor_end in the PGL Phase
object[1]. You can build arbitrary population growth/decline patterns by
scheduling[2] appropriately configured test Phases. For example, the
PolyMix-4 workload uses[3] that feature.

[1] http://www.web-polygraph.org/docs/reference/pgl/types.html#type:docs/reference/pgl/types/Phase
[2] http://www.web-polygraph.org/docs/reference/pgl/calls.html#call:docs/reference/pgl/calls/schedule
[3] http://www.web-polygraph.org/docs/workloads/polymix-4/#Sect:3.1
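A minimal sketch of such a schedule, assuming PolyMix-4-style Phase fields
(the phase names, durations, and percentages here are illustrative, not
taken from any shipped workload):

// grow the active robot population during a ramp phase,
// then hold it steady during the measurement phase
Phase phRamp = {
    name = "ramp";
    goal.duration = 10min;
    populus_factor_beg = 10%;   // start with 10% of the configured robots
    populus_factor_end = 100%;  // grow toward the full population
};

Phase phSteady = {
    name = "steady";
    goal.duration = 30min;
    populus_factor_beg = 100%;  // hold the full population
    populus_factor_end = 100%;
    primary = true;
};

schedule(phRamp, phSteady);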
> so we can cycle through all configured robots?

This aspect is different from the two aspects you mentioned above. The
Phase-driven test schedule does not focus on which configured robots are
used; it focuses on the number of used robots. For example, there is
currently no way to use 10% of the maximum robot population at any given
time but eventually use all configured robots. If you describe the details
of your use case, and the desirable parameters, we would consider adding
support for it.

> * Is there a way to stitch together WPG log files from a number of
> different WPG clients?

Yes, both the Polygraph reporter and lx tools merge all given logs
automatically by default. This merging was designed for concurrent client
runs within one test, but it may work for sequential runs as well,
especially if there are no gaps between tests. If something does not work
well when you merge, please discuss.
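A minimal sketch of such a merge, assuming the reporter's --label option
behaves as in recent versions and using illustrative log file names:

# generate a single report from all client and server logs of one test;
# the reporter merges every log given on the command line
polygraph-reporter --label "run-2017-05-09" \
    /var/log/polygraph/clt.*.log \
    /var/log/polygraph/srv.*.log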
> * In our lab setup, we found the maximum number of robots that could
> run from a single .pg file to be 3000 per VM. Do you agree? Is that
> limit a property of the PC/VM or a WPG limitation?

There is no artificial limit on the number of robots. The practical limit
heavily depends on your hardware, OS, their configuration, and the
Polygraph workload.

> * Can reports be generated if we have only server-side logs and the WPG
> client is not run (the client will be another tool)?

I believe so. Naturally, client-dependent stats will not be available.

> Here are some error messages we saw. We need some explanation of when
> they occur and what their effect is on overall performance/reporting.
>
> OLog.cc:118: soft assertion failed: theZStream->write(buf, size)
> OLog.cc:118: (s11) Resource temporarily unavailable

Polygraph cannot write its test log file. Perhaps there is something
wrong with the log storage device? It is supposed to be reliable. Are you
using some kind of network-dependent storage that may get overwhelmed
with log and/or test traffic and fail? Test log corruption is likely
until you fix this unusual problem.

> 000.72| Xaction.cc:112: error: 4/4 (c14) premature end of msg body
> 000.83| Xaction.cc:112: error: 1/6 (c15) premature end of msg header

These custom Polygraph errors are documented[4].

[4] http://www.web-polygraph.org/docs/reference/output/messages.html

> 013.34| Connection.cc:701: error: 3/225 (s104) Connection reset by peer
> 013.34| error: raw write on SSL connection failed on connection with
> 10.0.26.170:21521 at 3 reads, 159 writes, and 1 transactions

This is a standard system call error -- Polygraph could not write(2) to a
TCP socket when talking to an SSL peer (because the peer closed the
connection or disappeared). The affected transaction will be counted as
failed, of course.

Such errors are possible if HTTP agents have mismatching persistent
connection settings/defaults, leading to HTTP race conditions. If you do
not use persistent HTTP connections, or carefully configure them to avoid
race conditions, then the peer that closed the connection prematurely
should know why it did that.
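One way to keep the two sides' persistent connection settings aligned is
to configure them explicitly on both agents. A sketch, assuming the
PolyMix-style pconn_use_lmt and idle_pconn_tout agent fields (the
particular distributions and timeout below are illustrative):

Robot R = {
    // ... other Robot fields ...
    pconn_use_lmt = zipf(64);   // cap transactions per persistent connection
};

Server S = {
    // ... other Server fields ...
    pconn_use_lmt = zipf(16);
    idle_pconn_tout = 15sec;    // server-side idle timeout; keep client
                                // reuse behavior consistent with this
};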
> AddrParsers.cc:31: soft assertion failed: defaultPort >= 0

This is either a Polygraph bug or some kind of PGL misconfiguration. We
need more info to classify it: Does this happen at startup or during a
test? Just once or many times? Is this problem easy to reproduce?

HTH,

Alex.

From william.law at tesserent.com Wed May 10 00:02:57 2017
From: william.law at tesserent.com (William Law)
Date: Wed, 10 May 2017 10:02:57 +1000
Subject: Phase stats sync assertion error
In-Reply-To: <6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
	<7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
	<6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>
Message-ID: <2ed9afc46e1bdf7fdd1bf90463cd28b6@mail.gmail.com>

Hi Alex,

> > Running polygraph 4.9.0 on CentOS 6.8 (compiled from tarball). The
> > simple tests seem to run OK, but a more gung-ho configuration that
> > ran fine on an earlier version doesn't want to play ball with the new
> > release.
> >
> > I'm getting random assertion failures on both polygraph-client and
> > polygraph-server:
> >
> > StatPhaseSync.cc:97: assertion failed: 'false'
>
> Is it possible that you have multiple tests running concurrently? For
> example, perhaps your test setup does not always kill servers (or
> clients) from the old test, and they prevent some servers from the new
> test from starting without you noticing? You can work around this
> problem by disabling phase synchronization, but that is not a proper
> fix, of course.

I was always careful to make sure that any previous clients and servers
had been nuked before I started the next run. In the end, though, most of
the instances would fail with an assertion anyway.

> Is it possible that you have more than ~30 Polygraph processes
> participating in a test? There is a hard-coded limit (that we should
> remove) in the phase synchronization code. You can work around this
> problem by increasing the limit (search for 37 in
> src/runtime/StatPhaseSync.cc) and recompiling Polygraph. Please let us
> know if that helps!

I was running more than that and ended up reducing the count to try to
stabilise it. I think it finally steadied at about 20 server processes.
We are looking to *really* stress proxy/firewall devices (e.g. >10Gbit
hopefully), so we are really cranking this up on both the client and
server side.

> Why are you using --worker if you are not running SMP tests?

That was worker xx out of 70 :D  We have a script that spawns a heap of
servers and clients with a defined test config, then cleans up
afterwards. This is called from a web page to make launching, monitoring,
and browsing the results a lot easier.

We're running a pair of Dell R830s with 80 logical cores and multiple
10GbE interfaces as the client/server machines. We then shove the poor
unsuspecting DUT in the middle. Currently we just have a VM doing
routing, but it definitely isn't the bottleneck (9.6Gbit/s throughput
without tuning, via iperf).

Once we work out what headroom we have and server/client instance
requirements, we are probably going to look at using Linux namespaces,
etc. to segment networks and run multiple tests in parallel with multiple
DUTs. Have resources, must eat them!

Kind Regards,

William Law

From rousskov at measurement-factory.com Wed May 10 03:51:12 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Tue, 9 May 2017 21:51:12 -0600
Subject: Phase stats sync assertion error
In-Reply-To: <2ed9afc46e1bdf7fdd1bf90463cd28b6@mail.gmail.com>
References: <586bd7eb216ad9c8bee756231bfee67d@mail.gmail.com>
	<7673b931f426b84e2993aaf9c54e92f4@mail.gmail.com>
	<6e96cb30-e878-c3df-397c-4d958d30a0c1@measurement-factory.com>
	<2ed9afc46e1bdf7fdd1bf90463cd28b6@mail.gmail.com>
Message-ID:

On 05/09/2017 06:02 PM, William Law wrote:
> On 05/09/2017, Alex Rousskov wrote:
>> Is it possible that you have more than ~30 Polygraph processes
>> participating in a test?
>
> I was running more than that and ended up reducing the count to try to
> stabilise it. I think it finally steadied at about 20 server processes.
> We are looking to *really* stress proxy/firewall devices (e.g. >10Gbit
> hopefully), so we are really cranking this up on both the client and
> server side.

That probably explains it then. IIRC, it is not just the server processes
that count, because client processes exchange phase information among
them as well (via servers).

Please try to work around this problem by increasing the hard-coded limit
(search for 37 in src/runtime/StatPhaseSync.cc) and recompiling
Polygraph. Changing 37 to some prime number like 97 or even 199 may work
well. Please let us know if that helps, and we will work on removing the
hard-coded limit.
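A minimal sketch of that workaround, assuming the limit appears as the
bare literal 37 in that file (inspect the grep output first so the
substitution does not touch unrelated numbers):

# locate the hard-coded phase synchronization limit
grep -n '37' src/runtime/StatPhaseSync.cc

# bump it to a larger prime, then rebuild and reinstall
sed -i 's/\b37\b/199/g' src/runtime/StatPhaseSync.cc
make && sudo make install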
>> Why are you using --worker if you are not running SMP tests?
>
> That was worker xx out of 70 :D  We have a script that spawns a heap of
> servers and clients with a defined test config, then cleans up
> afterwards.

Please note that starting individual workers on your own is not
officially supported -- the interface between the master process and
workers may change without notice, and those changes may affect your
setup. When you get a chance, consider adding a proper SMP Bench
configuration to your workload so that Polygraph starts workers based on
your test configuration.

Thank you,

Alex.

From pmix at hendrie.id.au Fri May 19 03:07:56 2017
From: pmix at hendrie.id.au (Michael Hendrie)
Date: Fri, 19 May 2017 12:37:56 +0930
Subject: SMP Workloads
Message-ID: <79B4283F-8E04-4BA6-82BC-EB0D0E00A558@hendrie.id.au>

Hi All,

I'm experimenting with SMP workloads in v4.9.0 to increase the capacity
of my test rig. Without SMP, I see multiple idle CPU cores on the host
machines, so I'm expecting a significant increase in performance.

I have tried multiple configurations but consistently see console output
on the client reporting connection errors, and when I check the servers,
one of the worker processes has silently terminated. This doesn't appear
to happen with any particular pattern, but it does occur more often than
not, making it impossible to start a test with any confidence in SMP
mode.

I'm seeing the same behaviour with both SSL and HTTP workloads. SSL was
the main driver to move to SMP, as those workloads are more taxing on the
host, but I also tried HTTP to rule it out.

Here's the bench config that is intended to start multiple workers, as I
understand from the changelog:

Bench sslBench = {
    client_side = {
        max_host_load = 300/sec;
        max_agent_load = 0.4/sec;
        addr_space = [ 'lo::172.17.60-123.1-250/22' ];
        hosts = [ '172.16.0.60-62' ];
    };
    server_side = {
        max_host_load = client_side.max_host_load;
        max_agent_load = 0.4/sec;
        addr_space = [ 'lo::172.17.188-251.0-250:443/22' ];
        hosts = [ '172.16.0.64-66' ] ** 2;
    };
};

The entire workload is also attached.

Any suggestions on how I can get some stability in the SMP tests? Are
other users facing the same issue? I'm running on a RHEL 6.x clone OS.

Thanks,

Michael

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smp_ssl.pg
Type: application/octet-stream
Size: 2510 bytes
Desc: not available
URL:

From rousskov at measurement-factory.com Fri May 19 16:47:29 2017
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Fri, 19 May 2017 10:47:29 -0600
Subject: SMP Workloads
In-Reply-To: <79B4283F-8E04-4BA6-82BC-EB0D0E00A558@hendrie.id.au>
References: <79B4283F-8E04-4BA6-82BC-EB0D0E00A558@hendrie.id.au>
Message-ID:

On 05/18/2017 09:07 PM, Michael Hendrie wrote:
> I have tried multiple configurations but consistently see console
> output on the client reporting connection errors, and when I check the
> servers, one of the worker processes has silently terminated.

I recommend focusing on solving that silent termination problem. There
are two general reasons for silent deaths: Polygraph bugs and running out
of system resources. It is usually possible to figure out what exactly is
going on. Enable coredumps. Test that they work by sending a running
Polygraph server process SIGABRT. Check system logs. Etc.

Once that server death becomes less "silent", either report a Polygraph
bug (with that information at hand) or adjust system resources/workload.

Good luck,

Alex.
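A minimal sketch of those debugging steps, assuming a typical Linux host;
the install path and the PID (12345) below are illustrative:

# allow coredumps for processes started from this shell
ulimit -c unlimited

# verify that dumps work: abort one running server worker...
kill -ABRT 12345

# ...then inspect the core file (its location and name depend on the
# kernel.core_pattern sysctl)
gdb /usr/local/bin/polygraph-server core.12345

# also check whether the kernel ended the process (e.g., the OOM killer)
dmesg | grep -i -E 'oom|killed process'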