Client side stats SSL workload 4.3.2
Michael Hendrie
pmix at hendrie.id.au
Wed Jun 5 07:31:08 UTC 2013
On 19/05/2013, at 8:50 AM, Michael Hendrie <pmix at hendrie.id.au> wrote:
>
> On 18/05/2013, at 1:59 AM, Dmitry Kurochkin <dmitry.kurochkin at measurement-factory.com> wrote:
>
>> Hi Michael.
>>
>> Michael Hendrie <pmix at hendrie.id.au> writes:
>>
>>> On 17/05/2013, at 2:29 AM, Alex Rousskov <rousskov at measurement-factory.com> wrote:
>>>
>>>> On 05/16/2013 03:22 AM, Dmitry Kurochkin wrote:
>>>>> Hi Michael.
>>>>>
>>>>> Michael Hendrie <pmix at hendrie.id.au> writes:
>>>>>
>>>>>> Hi List,
>>>>>>
>>>>>> I recently updated from Web-Polygraph 4.0.10 running on FreeBSD to
>>>>>> Web-Polygraph 4.3.2 running on Linux (RHEL 6.1 clone) and now have some
>>>>>> inconsistencies being reported in the client-side runtime stats.
>>>>>>
>>>>>> The issue only seems to occur when polygraph-client is configured to send
>>>>>> requests via a proxy. In the client-side stats, I'm seeing a significantly
>>>>>> higher request rate than what was configured, while the server-side stats
>>>>>> report as expected, e.g. on a test configured for 250 req/sec:
>>>>>>
>>>>>> Client Side Stats:
>>>>>> 030.41| i-top1 786670 421.83 1357 0.00 0 1671
>>>>>> ssl 448598 421.83 1360 0.00 0 1671
>>>>>> 030.50| i-top1 788934 452.78 1379 0.00 0 1668
>>>>>> ssl 449857 452.78 1379 0.00 0 1668
>>>>>> 030.58| i-top1 791229 458.98 1465 0.00 0 1660
>>>>>> ssl 451193 458.98 1465 0.00 0 1660
>>>>>> 030.66| i-top1 793401 434.40 1461 0.00 0 1687
>>>>>> ssl 452449 434.40 1463 0.00 0 1687
>>>>>> 030.75| i-top1 795476 415.02 1401 0.00 0 1723
>>>>>> ssl 453636 415.02 1402 0.00 0 1721
>>>>>>
>>>>>> Server Side Stats:
>>>>>> 030.45| i-top1 426331 227.03 2518 0.00 0 2200
>>>>>> ssl 426331 227.03 2518 0.00 0 1637
>>>>>> 030.54| i-top1 427519 237.59 2539 0.00 0 2201
>>>>>> ssl 427519 237.59 2539 0.00 0 1638
>>>>>> 030.62| i-top1 428710 238.19 2467 0.00 0 2171
>>>>>> ssl 428710 238.19 2468 0.00 0 1608
>>>>>> 030.70| i-top1 429962 250.40 2503 0.00 0 2300
>>>>>> ssl 429962 250.40 2502 0.00 0 1737
>>>>>>
>>>>>>
>>>>>> I didn't see this inconsistency in 4.0.10, and in a direct
>>>>>> client->server test (no proxy) the stats are also reported correctly.
>>>>>>
>>>>>> The proxy itself (squid) is reporting around 190 client req/sec:
>>>>>> client_http.requests = 188.674032/sec
>>>>>>
>>>>>> Having just read the changelog (after typing the above), I'm now wondering
>>>>>> if the fix in the non-public 4.6.0 would resolve this issue:
>>>>>>
>>>>>> - Make retried transactions obey request inter-arrival
>>>>>> distribution and, hence, Robot::req_rate setting.
>>>>>>
>>>>>> Any thoughts/ideas?
>>>>>>
>>>>>
>>>>> Since v4.2.0 CONNECT tunnel establishment is treated as a separate
>>>>> transaction:
>>>>>
>>>>> - Treat an HTTP CONNECT response as the end of the transaction.
>>>>> Collect and report stats dedicated to these new CONNECT-only
>>>>> transactions. These CONNECT transactions are often rather
>>>>> different from the in-tunnel transactions that follow.
>>>>> Isolating them also prevents CONNECT overheads from polluting
>>>>> the first in-tunnel transaction stats.
>>>>>
>>>>> So an HTTPS GET transaction over a proxy is counted as two transactions on
>>>>> the client side: CONNECT and GET.
>>>>>
>>>>> Detailed stats are available in the HTML report or can be extracted from
>>>>> the binary log with the lx tool.
>>>>>
>>>>> The behavior is the same on all newer Polygraph versions, so upgrading
>>>>> would not help.
>>>>
>>>> .... but there are plans to make baseline reporting match configuration
>>>> better when dealing with what we call "compound" transactions. The
>>>> overall principle behind the planned changes is to have reported
>>>> baseline request rate (i-lines in the console output) match the
>>>> configured Bench::peak_req_rate (or equivalent), regardless of whether
>>>> the workload involves CONNECT requests and/or authentication.
>>>>
>>>> We will also report CONNECT (and authentication) activity separately,
>>>> just like we report ssl activity separately today.
>>>>
>>>>
>>>> We hope the above changes will reduce surprises caused by compound
>>>> transactions involving multiple HTTP requests.
>>>>
>>>>
>>>
>>> OK, so the numbers stack up then: squid is reporting ~190 req/sec (it will only see the CONNECTs, not how many GETs passed through the tunnel), and with peak_req_rate configured for 250, the ~430 being reported client side sounds about right.
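>>>
>>> As a rough sanity check (a sketch only, reusing the figures quoted in this
>>> thread rather than new measurements), the client-side i-line rate should be
>>> roughly the configured in-tunnel rate plus the CONNECT rate seen at the
>>> proxy:
>>>
>>>     # Python: reconcile the client-side i-line rate with the configured
>>>     # rate, assuming each new tunnel adds one CONNECT transaction on the
>>>     # client side (per the 4.2.0 change described above)
>>>     configured_get_rate = 250.0   # Bench::peak_req_rate (in-tunnel GETs)
>>>     proxy_connect_rate = 188.67   # squid's client_http.requests figure
>>>     print(configured_get_rate + proxy_connect_rate)
>>>     # ~438.7/sec, close to the ~430 shown on the i-lines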
>>>
>>> Unfortunately, I have another problem then; I was hoping the above could be related to what I'm seeing.
>>>
>>> If I run a direct test (no proxy involved), it will go for hours and never miss a beat (the last one ran for 15 hours). If I use the same workload but introduce a proxy with the --proxy switch, the test always bombs out somewhere between the 2 and 3 hour mark (I have tried at least 5 times), even though none of the devices (including the proxy) are reporting high load or resource exhaustion. Before it bombs, here's what the client console says:
>>>
>>> 136.74| i-top1 3588097 406.69 1583 0.00 0 1681
>>> ssl 2048966 406.69 1584 0.00 0 1679
>>> 136.83| i-top1 3590106 400.26 1600 0.00 0 1659
>>> ssl 2050132 400.26 1602 0.00 0 1633
>>> 136.91| i-top1 3592036 364.13 1695 0.00 0 1719
>>> ssl 2051152 364.13 1696 0.00 0 1653
>>> 137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind
>>> 137.04| record level of timing drift: 2.48sec; last check was 6.45sec ago
>>> 137.27| i-top1 3592779 34.36 11075 0.00 0 1609
>>> ssl 2051667 34.36 11075 0.00 0 1102
>>> 137.30| PolyApp.cc:189: error: 2/2 (c58) internal timers may be getting behind
>>> 137.30| record level of timing drift: 15.40sec; last check was 15.37sec ago
>>> 137.40| Connection.cc:581: error: 1/3 (s32) Broken pipe
>>> 137.40| error: raw close on SSL connection failed
>>> 137.40| connection between 172.17.20.35:31263 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.40| Connection.cc:581: error: 2/4 (s32) Broken pipe
>>> 137.40| error: raw close on SSL connection failed
>>> 137.40| connection between 172.17.20.173:43974 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
>>> 137.41| Connection.cc:581: error: 4/6 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.20.5:42461 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.41| Connection.cc:581: error: 8/10 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.20.245:35371 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.41| Connection.cc:581: error: 16/18 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.21.32:13088 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
>>> 137.43| i-top1 3593243 53.18 25644 0.00 0 1533
>>> ssl 2051843 53.18 25644 0.00 0 769
>>> 137.44| i-top1 3593243 0.00 -1 -1.00 0 1895
>>> ssl 2051843 0.00 -1 -1.00 0 748
>>>
>>>
>>> Any ideas on this one?
>>>
>>
>> Are there any errors on the client or the proxy?
>
> There are no errors on the proxy, and the problem begins on the client straight after the "137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind" message.
>
>>
>> Some resource must be exhausted on the client, proxy, or server. I guess
>> you have checked CPU and memory usage. Did you look at the network? The
>> number of sockets in CLOSE_WAIT state, dropped packets, errors in the
>> system log, etc.?
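>>
>> For example, on the Linux client something like this (a minimal sketch;
>> it reads /proc/net/tcp, so it only covers IPv4 sockets) could be run
>> periodically during the test to see whether CLOSE_WAIT sockets pile up:
>>
>>     # Python: count sockets per TCP state from /proc/net/tcp
>>     from collections import Counter
>>
>>     STATES = {'01': 'ESTABLISHED', '06': 'TIME_WAIT', '08': 'CLOSE_WAIT'}
>>
>>     def socket_states(path='/proc/net/tcp'):
>>         counts = Counter()
>>         with open(path) as f:
>>             next(f)                     # skip the header line
>>             for line in f:
>>                 code = line.split()[3]  # 4th column is the hex state code
>>                 counts[STATES.get(code, code)] += 1
>>         return counts
>>
>>     print(socket_states())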
>
> I should have said no apparent resource exhaustion; I haven't caught a snapshot at the exact failure time, but certainly everything looks fine in the 2.x hours leading up to it.
>
>>
>> Is the request rate stable in the phase where the test fails? In the
>> log you provided, response time is increasing before the failure (though
>> not by much). If response time is growing during a stable phase (with a
>> constant request rate), this suggests that proxy performance is degrading
>> over time and that sooner or later this would result in a test failure.
>
> Yes, the test is configured to get into the "top" phase very quickly, about 3 minutes after the start, and it is fine for 2.x hours before failing.
>
>>
>> Since the failure does not happen in a direct test, the proxy would be the
>> primary suspect. A good way to triage the issue may be to simplify the
>> workload and/or proxy configuration (e.g. disable caching if it is
>> enabled).
>
> Logically it sounds like the proxy, but I have moved it to an environment where I still have Polygraph 4.0.10, and it has been under test at the same request rate for 36 hours now with no issue. I haven't compared the exact workload details between the environments yet, but I would expect very minimal difference, if any at all.
>
>>
>> I hope these ideas will point you in the right direction.
>>
>
> It will probably be a few days before I can continue investigating; if you have any other suggestions on where to look based on the above info, I'd be glad to hear them.
>
I'm finally back to looking at this, and some analysis seems to indicate a slow memory leak for SSL tests sent through a forward proxy in 4.3.2. In direct HTTP(S) tests (no forward proxy) and in HTTP -> forward proxy tests, the memory usage of polygraph-client remains fairly stable once the test settles into its top phase. In HTTPS -> forward proxy tests, however, the memory usage of polygraph-client continues to grow and eventually the test fails.
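The growth is easy to see by sampling the resident set size of the
polygraph-client process over the life of the test, e.g. with a small
script along these lines (a sketch only; it assumes a Linux client and a
single polygraph-client process, and the 60-second interval is arbitrary):

    # Python: log the RSS of polygraph-client once a minute
    import subprocess, time

    def client_rss_kb():
        # find the newest polygraph-client PID, then read VmRSS from /proc
        pid = subprocess.check_output(
            ['pgrep', '-n', '-f', 'polygraph-client']).decode().strip()
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])   # value is in kB

    while True:
        print(time.strftime('%H:%M:%S'), client_rss_kb(), 'kB')
        time.sleep(60)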
Any ideas on where to look given the above analysis? My C++ skills are fairly weak, so I'm not sure how much assistance I can be in tracking it down, but I'm happy to look.
Michael