From pmix at hendrie.id.au  Wed Jun  5 07:31:08 2013
From: pmix at hendrie.id.au (Michael Hendrie)
Date: Wed, 5 Jun 2013 17:01:08 +0930
Subject: Client side stats SSL workload 4.3.2
In-Reply-To: <2D22BE02-6BDE-4B0F-B26B-C4842A86DE65@hendrie.id.au>
References: <87obcb8eki.fsf@gmail.com>
	<51951074.6080104@measurement-factory.com>
	<8274538C-DCB9-4F3E-8A9C-8A33D127A5D7@hendrie.id.au>
	<87ip2h8t91.fsf@gmail.com>
	<2D22BE02-6BDE-4B0F-B26B-C4842A86DE65@hendrie.id.au>
Message-ID: <813FB77A-2546-493D-93B3-06136B3B76D5@hendrie.id.au>

On 19/05/2013, at 8:50 AM, Michael Hendrie wrote:

> 
> On 18/05/2013, at 1:59 AM, Dmitry Kurochkin wrote:
> 
>> Hi Michael.
>> 
>> Michael Hendrie writes:
>> 
>>> On 17/05/2013, at 2:29 AM, Alex Rousskov wrote:
>>> 
>>>> On 05/16/2013 03:22 AM, Dmitry Kurochkin wrote:
>>>>> Hi Michael.
>>>>> 
>>>>> Michael Hendrie writes:
>>>>> 
>>>>>> Hi List,
>>>>>> 
>>>>>> I recently updated from Web-Polygraph 4.0.10 running on FreeBSD to
>>>>>> Web-Polygraph 4.3.2 running on Linux (RHEL 6.1 clone) and now have some
>>>>>> inconsistency being reported in the client-side runtime stats.
>>>>>> 
>>>>>> The issue only seems to occur when polygraph-client is configured to send
>>>>>> requests via a proxy. In the client-side stats I'm seeing a significantly
>>>>>> higher request rate than what was configured; the server-side stats
>>>>>> report as expected, e.g. on a test configured for 250 req/sec:
>>>>>> 
>>>>>> Client Side Stats:
>>>>>> 030.41| i-top1  786670  421.83  1357  0.00  0  1671
>>>>>>         ssl     448598  421.83  1360  0.00  0  1671
>>>>>> 030.50| i-top1  788934  452.78  1379  0.00  0  1668
>>>>>>         ssl     449857  452.78  1379  0.00  0  1668
>>>>>> 030.58| i-top1  791229  458.98  1465  0.00  0  1660
>>>>>>         ssl     451193  458.98  1465  0.00  0  1660
>>>>>> 030.66| i-top1  793401  434.40  1461  0.00  0  1687
>>>>>>         ssl     452449  434.40  1463  0.00  0  1687
>>>>>> 030.75| i-top1  795476  415.02  1401  0.00  0  1723
>>>>>>         ssl     453636  415.02  1402  0.00  0  1721
>>>>>> 
>>>>>> Server Side Stats:
>>>>>> 030.45| i-top1  426331  227.03  2518  0.00  0  2200
>>>>>>         ssl     426331  227.03  2518  0.00  0  1637
>>>>>> 030.54| i-top1  427519  237.59  2539  0.00  0  2201
>>>>>>         ssl     427519  237.59  2539  0.00  0  1638
>>>>>> 030.62| i-top1  428710  238.19  2467  0.00  0  2171
>>>>>>         ssl     428710  238.19  2468  0.00  0  1608
>>>>>> 030.70| i-top1  429962  250.40  2503  0.00  0  2300
>>>>>>         ssl     429962  250.40  2502  0.00  0  1737
>>>>>> 
>>>>>> I didn't see this inconsistency in 4.0.10, and if I run a direct
>>>>>> client->server test (no proxy) then the stats are also reported correctly.
>>>>>> 
>>>>>> The proxy itself (squid) is reporting around 190 client req/sec:
>>>>>> client_http.requests = 188.674032/sec
>>>>>> 
>>>>>> Having just read the changelog (after typing the above), I'm now wondering
>>>>>> if this fix in the non-public 4.6.0 would resolve the issue:
>>>>>> 
>>>>>> 	- Make retried transactions obey request inter-arrival
>>>>>> 	  distribution and, hence, Robot::req_rate setting.
>>>>>> 
>>>>>> Any thoughts/ideas?
>>>>>> 
>>>>> 
>>>>> Since v4.2.0 CONNECT tunnel establishment is treated as a separate
>>>>> transaction:
>>>>> 
>>>>> 	- Treat an HTTP CONNECT response as the end of the transaction.
>>>>> 	  Collect and report stats dedicated to these new CONNECT-only
>>>>> 	  transactions. These CONNECT transactions are often rather
>>>>> 	  different from the in-tunnel transactions that follow.
>>>>> 	  Isolating them also prevents CONNECT overheads from polluting
>>>>> 	  the first in-tunnel transaction stats.
>>>>> 
>>>>> So an HTTPS GET transaction over proxy is counted as two transactions on
>>>>> the client side: CONNECT and GET.
>>>>> 
>>>>> Detailed stats are available in the HTML report or can be extracted from
>>>>> the binary log with the lx tool.
>>>>> 
>>>>> The behavior is the same on all newer Polygraph versions, so upgrading
>>>>> would not help.
>>>> 
>>>> ... but there are plans to make baseline reporting match configuration
>>>> better when dealing with what we call "compound" transactions. The
>>>> overall principle behind the planned changes is to have the reported
>>>> baseline request rate (i-lines in the console output) match the
>>>> configured Bench::peak_req_rate (or equivalent), regardless of whether
>>>> the workload involves CONNECT requests and/or authentication.
>>>> 
>>>> We will also report CONNECT (and authentication) activity separately,
>>>> just like we report ssl activity separately today.
>>>> 
>>>> We hope the above changes will reduce surprises caused by compound
>>>> transactions involving multiple HTTP requests.
>>>> 
>>> 
>>> OK, so the numbers stack up then: squid is reporting ~190 req/sec (it will only see the CONNECTs, not how many GETs passed through the tunnel) and peak_req_rate is configured for 250, so ~190 CONNECT/sec + ~250 GET/sec ≈ 440, which matches the ~430 being reported client side.
>>> 
>>> Unfortunately I have another problem then; I was hoping the above could be related to what I'm seeing.
>>> 
>>> If I run a direct test (no proxy involved), it will go for hours and never miss a beat (the last one ran for 15 hours). If I use the same workload but introduce a proxy with the --proxy switch, the test always bombs out somewhere between the 2 and 3 hour mark (I have tried at least 5 different times), even though none of the devices (including the proxy) are reporting any high load or resource exhaustion.
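>>> 
>>> For reference, the only difference between the direct and proxied runs is the proxy option on the client command line, along these lines (the workload file name here is illustrative, not my real one):
>>> 
>>>   # direct test
>>>   polygraph-client --config workload.pg
>>> 
>>>   # same workload, sent via the proxy
>>>   polygraph-client --config workload.pg --proxy 172.16.0.13:8080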
>>> 
>>> Before it bombs, here's what the client console says:
>>> 
>>> 136.74| i-top1  3588097  406.69   1583  0.00  0  1681
>>>         ssl     2048966  406.69   1584  0.00  0  1679
>>> 136.83| i-top1  3590106  400.26   1600  0.00  0  1659
>>>         ssl     2050132  400.26   1602  0.00  0  1633
>>> 136.91| i-top1  3592036  364.13   1695  0.00  0  1719
>>>         ssl     2051152  364.13   1696  0.00  0  1653
>>> 137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind
>>> 137.04| record level of timing drift: 2.48sec; last check was 6.45sec ago
>>> 137.27| i-top1  3592779   34.36  11075  0.00  0  1609
>>>         ssl     2051667   34.36  11075  0.00  0  1102
>>> 137.30| PolyApp.cc:189: error: 2/2 (c58) internal timers may be getting behind
>>> 137.30| record level of timing drift: 15.40sec; last check was 15.37sec ago
>>> 137.40| Connection.cc:581: error: 1/3 (s32) Broken pipe
>>> 137.40| error: raw close on SSL connection failed
>>> 137.40| connection between 172.17.20.35:31263 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.40| Connection.cc:581: error: 2/4 (s32) Broken pipe
>>> 137.40| error: raw close on SSL connection failed
>>> 137.40| connection between 172.17.20.173:43974 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
>>> 137.41| Connection.cc:581: error: 4/6 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.20.5:42461 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.41| Connection.cc:581: error: 8/10 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.20.245:35371 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.41| Connection.cc:581: error: 16/18 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.21.32:13088 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
>>> 137.43| i-top1  3593243   53.18  25644  0.00  0  1533
>>>         ssl     2051843   53.18  25644  0.00  0   769
>>> 137.44| i-top1  3593243    0.00     -1 -1.00  0  1895
>>>         ssl     2051843    0.00     -1 -1.00  0   748
>>> 
>>> Any ideas on this one?
>>> 
>> 
>> Are there any errors on the client or proxy?
> 
> There are no errors on the proxy, and the problem begins on the client straight after the "137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind" message.
> 
>> 
>> Some resource must be exhausted on the client, proxy or server. I guess you
>> have checked CPU and memory usage. Did you look at the network? Number
>> of sockets in CLOSE_WAIT state, dropped packets, errors in the system log,
>> etc.?
> 
> I should have said no apparent resource exhaustion; I haven't caught a snapshot at the exact failure time, but certainly in the 2.x hours leading up everything looks fine.
> 
>> 
>> Is the request rate stable in the phase where the test fails? In the
>> log you provided, response time is increasing before the failure (though
>> not by much). If response time is growing during a stable phase (with
>> constant request rate), this suggests that proxy performance is degrading
>> over time, and sooner or later this would result in a test failure.
> 
> Yes, the test is configured to get into the "top" phase very quickly, about 3 minutes after start, and is fine for 2.x hours before failing.
> 
>> 
>> Since the failure does not happen in a direct test, the proxy would be the
>> primary suspect. A good way to triage the issue may be to simplify the
>> workload and/or proxy configuration (e.g. disable caching if it is
>> enabled).
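>> 
>> For the network checks above, something along these lines on the client
>> box, run periodically, would show whether sockets pile up or packets are
>> dropped in the lead-up to the failure (a sketch using standard Linux
>> tools, nothing Polygraph-specific):
>> 
>>   # count sockets per TCP state (watch for a growing CLOSE_WAIT count)
>>   netstat -tan | awk '{print $6}' | sort | uniq -c
>> 
>>   # look for drops, overflows and retransmits
>>   netstat -s | grep -iE 'drop|overflow|retrans'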
> 
> Logically it sounds like the proxy, but I have moved it to an environment where I still have Polygraph 4.0.10, and it has been under test at the same request rate for 36 hours now with no issue. I haven't compared the exact workload details between the environments yet but would expect very minimal difference, if any at all.
> 
>> 
>> I hope these ideas would point you in the right direction.
>> 
> 
> It will probably be a few days before I can continue investigating; if you have any other suggestions of where to look based on the above info, I'd be glad to hear them.
> 

Finally back to looking at this, and some analysis seems to indicate a slow memory leak for SSL tests when sent to a forward proxy in 4.3.2. In direct HTTP(S) tests (no forward proxy) and HTTP -> forward proxy tests, the memory usage of polygraph-client remains fairly stable once the test settles into its top phase. In HTTPS -> forward proxy tests, however, memory usage of polygraph-client continues to grow and eventually the test fails.

Any ideas on where to look given the above analysis? My C++ skills are fairly weak, so I'm not sure what assistance I can be in tracking this down, but I'm happy to look.

Michael

From rousskov at measurement-factory.com  Thu Jun  6 22:50:32 2013
From: rousskov at measurement-factory.com (Alex Rousskov)
Date: Thu, 06 Jun 2013 16:50:32 -0600
Subject: Client side stats SSL workload 4.3.2
In-Reply-To: <813FB77A-2546-493D-93B3-06136B3B76D5@hendrie.id.au>
References: <87obcb8eki.fsf@gmail.com>
	<51951074.6080104@measurement-factory.com>
	<8274538C-DCB9-4F3E-8A9C-8A33D127A5D7@hendrie.id.au>
	<87ip2h8t91.fsf@gmail.com>
	<2D22BE02-6BDE-4B0F-B26B-C4842A86DE65@hendrie.id.au>
	<813FB77A-2546-493D-93B3-06136B3B76D5@hendrie.id.au>
Message-ID: <51B11238.40706@measurement-factory.com>

On 06/05/2013 01:31 AM, Michael Hendrie wrote:

> Finally back to looking at this, and some analysis seems to indicate a slow memory leak for SSL tests when sent to a forward proxy in 4.3.2. In direct HTTP(S) tests (no forward proxy) and HTTP -> forward proxy tests, the memory usage of polygraph-client remains fairly stable once the test settles into its top phase. In HTTPS -> forward proxy tests, however, memory usage of polygraph-client continues to grow and eventually the test fails.
> 
> Any ideas on where to look given the above analysis? My C++ skills are fairly weak, so I'm not sure what assistance I can be in tracking this down, but I'm happy to look.

You can try running Polygraph under valgrind control and analyzing
valgrind reports, but that requires some C++ skills and Polygraph
knowledge to weed out false positives. I suggest filing a bug report for
this problem: https://bugs.launchpad.net/polygraph

If you can, please include the workload you are using and some kind of
log showing memory usage growth.

We will try to reproduce and, if needed, fix the leak.

Thank you,

Alex.
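
P.S. If you do try valgrind, a minimal invocation sketch (file names are
illustrative; adjust paths and options for your setup):

    valgrind --leak-check=full --log-file=valgrind-client.log \
        polygraph-client --config workload.pg

Expect OpenSSL-related false positives in the report; a genuine leak
would show up under "definitely lost" with stacks pointing at Polygraph
code.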
From michael.hendrie at gmail.com  Tue Jun 11 08:03:07 2013
From: michael.hendrie at gmail.com (Michael Hendrie)
Date: Tue, 11 Jun 2013 17:33:07 +0930
Subject: Client side stats SSL workload 4.3.2
In-Reply-To: <51B11238.40706@measurement-factory.com>
References: <87obcb8eki.fsf@gmail.com>
	<51951074.6080104@measurement-factory.com>
	<8274538C-DCB9-4F3E-8A9C-8A33D127A5D7@hendrie.id.au>
	<87ip2h8t91.fsf@gmail.com>
	<2D22BE02-6BDE-4B0F-B26B-C4842A86DE65@hendrie.id.au>
	<813FB77A-2546-493D-93B3-06136B3B76D5@hendrie.id.au>
	<51B11238.40706@measurement-factory.com>
Message-ID: 

On 07/06/2013, at 8:20 AM, Alex Rousskov wrote:

> On 06/05/2013 01:31 AM, Michael Hendrie wrote:
> 
>> Finally back to looking at this, and some analysis seems to indicate a slow memory leak for SSL tests when sent to a forward proxy in 4.3.2. In direct HTTP(S) tests (no forward proxy) and HTTP -> forward proxy tests, the memory usage of polygraph-client remains fairly stable once the test settles into its top phase. In HTTPS -> forward proxy tests, however, memory usage of polygraph-client continues to grow and eventually the test fails.
> 
>> Any ideas on where to look given the above analysis? My C++ skills are fairly weak, so I'm not sure what assistance I can be in tracking this down, but I'm happy to look.
> 
> You can try running Polygraph under valgrind control and analyzing
> valgrind reports, but that requires some C++ skills and Polygraph
> knowledge to weed out false positives. I suggest filing a bug report for
> this problem: https://bugs.launchpad.net/polygraph
> 
> If you can, please include the workload you are using and some kind of
> log showing memory usage growth.
> 
> We will try to reproduce and, if needed, fix the leak.
> 

Done - https://bugs.launchpad.net/polygraph/+bug/1189763

Thanks,

Michael