Client side stats SSL workload 4.3.2
Michael Hendrie
pmix at hendrie.id.au
Wed Jun 5 07:31:08 UTC 2013
On 19/05/2013, at 8:50 AM, Michael Hendrie <pmix at hendrie.id.au> wrote:
>
> On 18/05/2013, at 1:59 AM, Dmitry Kurochkin <dmitry.kurochkin at measurement-factory.com> wrote:
>
>> Hi Michael.
>>
>> Michael Hendrie <pmix at hendrie.id.au> writes:
>>
>>> On 17/05/2013, at 2:29 AM, Alex Rousskov <rousskov at measurement-factory.com> wrote:
>>>
>>>> On 05/16/2013 03:22 AM, Dmitry Kurochkin wrote:
>>>>> Hi Michael.
>>>>>
>>>>> Michael Hendrie <pmix at hendrie.id.au> writes:
>>>>>
>>>>>> Hi List,
>>>>>>
>>>>>> I recently updated from Web-Polygraph 4.0.10 running on FreeBSD to
>>>>>> Web-Polygraph 4.3.2 running on Linux (RHEL 6.1 clone) and now have some
>>>>>> inconsistencies being reported in the client-side runtime stats.
>>>>>>
>>>>>> The issue only seems to occur when polygraph-client is configured to send
>>>>>> requests via a proxy. In the client-side stats, I'm seeing a significantly
>>>>>> higher request rate than what was configured, while the server-side stats
>>>>>> report as expected, e.g. on a test configured for 250 req/sec:
>>>>>>
>>>>>> Client Side Stats:
>>>>>> 030.41| i-top1 786670 421.83 1357 0.00 0 1671
>>>>>> ssl 448598 421.83 1360 0.00 0 1671
>>>>>> 030.50| i-top1 788934 452.78 1379 0.00 0 1668
>>>>>> ssl 449857 452.78 1379 0.00 0 1668
>>>>>> 030.58| i-top1 791229 458.98 1465 0.00 0 1660
>>>>>> ssl 451193 458.98 1465 0.00 0 1660
>>>>>> 030.66| i-top1 793401 434.40 1461 0.00 0 1687
>>>>>> ssl 452449 434.40 1463 0.00 0 1687
>>>>>> 030.75| i-top1 795476 415.02 1401 0.00 0 1723
>>>>>> ssl 453636 415.02 1402 0.00 0 1721
>>>>>>
>>>>>> Server Side Stats:
>>>>>> 030.45| i-top1 426331 227.03 2518 0.00 0 2200
>>>>>> ssl 426331 227.03 2518 0.00 0 1637
>>>>>> 030.54| i-top1 427519 237.59 2539 0.00 0 2201
>>>>>> ssl 427519 237.59 2539 0.00 0 1638
>>>>>> 030.62| i-top1 428710 238.19 2467 0.00 0 2171
>>>>>> ssl 428710 238.19 2468 0.00 0 1608
>>>>>> 030.70| i-top1 429962 250.40 2503 0.00 0 2300
>>>>>> ssl 429962 250.40 2502 0.00 0 1737
>>>>>>
>>>>>>
>>>>>> I didn't see this inconsistency in 4.0.10, and in a direct
>>>>>> client->server test (no proxy) the stats are also reported correctly.
>>>>>>
>>>>>> The proxy itself (squid) is reporting around 190 client req/sec:
>>>>>> client_http.requests = 188.674032/sec
>>>>>>
>>>>>> Having just read the changelog (after typing the above), I'm now wondering
>>>>>> if the fix in the non-public 4.6.0 would resolve this issue:
>>>>>>
>>>>>> - Make retried transactions obey request inter-arrival
>>>>>> distribution and, hence, Robot::req_rate setting.
>>>>>>
>>>>>> Any thoughts/ideas?
>>>>>>
>>>>>
>>>>> Since v4.2.0 CONNECT tunnel establishment is treated as a separate
>>>>> transaction:
>>>>>
>>>>> - Treat an HTTP CONNECT response as the end of the transaction.
>>>>> Collect and report stats dedicated to these new CONNECT-only
>>>>> transactions. These CONNECT transactions are often rather
>>>>> different from the in-tunnel transactions that follow.
>>>>> Isolating them also prevents CONNECT overheads from polluting
>>>>> the first in-tunnel transaction stats.
>>>>>
>>>>> So an HTTPS GET transaction over a proxy is counted as two transactions on
>>>>> the client side: CONNECT and GET.
>>>>>
>>>>> Detailed stats are available in the HTML report or can be extracted from
>>>>> the binary log with the lx tool.
>>>>>
>>>>> The behavior is the same on all newer Polygraph versions, so upgrading
>>>>> would not help.
>>>>
>>>> .... but there are plans to make baseline reporting match configuration
>>>> better when dealing with what we call "compound" transactions. The
>>>> overall principle behind the planned changes is to have reported
>>>> baseline request rate (i-lines in the console output) match the
>>>> configured Bench::peak_req_rate (or equivalent), regardless of whether
>>>> the workload involves CONNECT requests and/or authentication.
>>>>
>>>> We will also report CONNECT (and authentication) activity separately,
>>>> just like we report ssl activity separately today.
>>>>
>>>>
>>>> We hope the above changes will reduce surprises caused by compound
>>>> transactions involving multiple HTTP requests.
>>>>
>>>>
>>>
>>> OK, so the numbers stack up then: squid is reporting ~190 req/sec (it will only see the CONNECTs, not how many GETs passed through the tunnel), and with peak_req_rate configured for 250, the ~430 being reported client side sounds about right.
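>>>
>>> As a rough sanity check (a sketch only, reusing the figures quoted in this
>>> thread rather than new measurements), the client-side i-line rate should be
>>> roughly the configured in-tunnel rate plus the CONNECT rate seen at the
>>> proxy:
>>>
>>>     # Python: reconcile the client-side i-line rate with the configured
>>>     # rate, assuming each new tunnel adds one CONNECT transaction on the
>>>     # client side (per the 4.2.0 change described above)
>>>     configured_get_rate = 250.0   # Bench::peak_req_rate (in-tunnel GETs)
>>>     proxy_connect_rate = 188.67   # squid's client_http.requests figure
>>>     print(configured_get_rate + proxy_connect_rate)
>>>     # ~438.7/sec, close to the ~430 shown on the i-lines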
>>>
>>> Unfortunately, I have another problem then; I was hoping the above could be related to what I'm seeing.
>>>
>>> If I run a direct test (no proxy involved), it will go for hours and never miss a beat (the last one ran for 15 hours). If I use the same workload but introduce a proxy with the --proxy switch, the test always bombs out somewhere between the 2 and 3 hour mark (I have tried at least 5 times), even though none of the devices (including the proxy) are reporting high load or resource exhaustion. Before it bombs, here's what the client console says:
>>>
>>> 136.74| i-top1 3588097 406.69 1583 0.00 0 1681
>>> ssl 2048966 406.69 1584 0.00 0 1679
>>> 136.83| i-top1 3590106 400.26 1600 0.00 0 1659
>>> ssl 2050132 400.26 1602 0.00 0 1633
>>> 136.91| i-top1 3592036 364.13 1695 0.00 0 1719
>>> ssl 2051152 364.13 1696 0.00 0 1653
>>> 137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind
>>> 137.04| record level of timing drift: 2.48sec; last check was 6.45sec ago
>>> 137.27| i-top1 3592779 34.36 11075 0.00 0 1609
>>> ssl 2051667 34.36 11075 0.00 0 1102
>>> 137.30| PolyApp.cc:189: error: 2/2 (c58) internal timers may be getting behind
>>> 137.30| record level of timing drift: 15.40sec; last check was 15.37sec ago
>>> 137.40| Connection.cc:581: error: 1/3 (s32) Broken pipe
>>> 137.40| error: raw close on SSL connection failed
>>> 137.40| connection between 172.17.20.35:31263 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.40| Connection.cc:581: error: 2/4 (s32) Broken pipe
>>> 137.40| error: raw close on SSL connection failed
>>> 137.40| connection between 172.17.20.173:43974 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
>>> 137.41| Connection.cc:581: error: 4/6 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.20.5:42461 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.41| Connection.cc:581: error: 8/10 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.20.245:35371 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
>>> 137.41| Connection.cc:581: error: 16/18 (s32) Broken pipe
>>> 137.41| error: raw close on SSL connection failed
>>> 137.41| connection between 172.17.21.32:13088 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
>>> 137.43| i-top1 3593243 53.18 25644 0.00 0 1533
>>> ssl 2051843 53.18 25644 0.00 0 769
>>> 137.44| i-top1 3593243 0.00 -1 -1.00 0 1895
>>> ssl 2051843 0.00 -1 -1.00 0 748
>>>
>>>
>>> Any ideas on this one?
>>>
>>
>> Are there any errors on the client or the proxy?
>
> There are no errors on the proxy, and the problem begins on the client straight after the "137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind" message.
>
>>
>> Some resource must be exhausted on the client, proxy, or server. I guess
>> you have checked CPU and memory usage. Did you look at the network? The
>> number of sockets in CLOSE_WAIT state, dropped packets, errors in the
>> system log, etc.?
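>>
>> For example, on the Linux client something like this (a minimal sketch;
>> it reads /proc/net/tcp, so it only covers IPv4 sockets) could be run
>> periodically during the test to see whether CLOSE_WAIT sockets pile up:
>>
>>     # Python: count sockets per TCP state from /proc/net/tcp
>>     from collections import Counter
>>
>>     STATES = {'01': 'ESTABLISHED', '06': 'TIME_WAIT', '08': 'CLOSE_WAIT'}
>>
>>     def socket_states(path='/proc/net/tcp'):
>>         counts = Counter()
>>         with open(path) as f:
>>             next(f)                     # skip the header line
>>             for line in f:
>>                 code = line.split()[3]  # 4th column is the hex state code
>>                 counts[STATES.get(code, code)] += 1
>>         return counts
>>
>>     print(socket_states())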
>
> I should have said no apparent resource exhaustion; I haven't caught a snapshot at the exact failure time, but certainly everything looks fine in the 2.x hours leading up to it.
>
>>
>> Is the request rate stable in the phase where the test fails? In the
>> log you provided, response time is increasing before the failure (though
>> not by much). If response time is growing during a stable phase (with a
>> constant request rate), this suggests that proxy performance is degrading
>> over time and that sooner or later this would result in a test failure.
>
> Yes, the test is configured to get into the "top" phase very quickly, about 3 minutes after the start, and it is fine for 2.x hours before failing.
>
>>
>> Since the failure does not happen in a direct test, the proxy would be the
>> primary suspect. A good way to triage the issue may be to simplify the
>> workload and/or proxy configuration (e.g. disable caching if it is
>> enabled).
>
> Logically it sounds like the proxy, but I have moved it to an environment where I still have Polygraph 4.0.10, and it has been under test at the same request rate for 36 hours now with no issue. I haven't compared the exact workload details between the environments yet, but I would expect very minimal difference, if any at all.
>
>>
>> I hope these ideas will point you in the right direction.
>>
>
> It will probably be a few days before I can continue investigating; if you have any other suggestions on where to look based on the above info, I'd be glad to hear them.
>
I'm finally back to looking at this, and some analysis seems to indicate a slow memory leak for SSL tests sent through a forward proxy in 4.3.2. In direct HTTP(S) tests (no forward proxy) and in HTTP -> forward proxy tests, the memory usage of polygraph-client remains fairly stable once the test settles into its top phase. In HTTPS -> forward proxy tests, however, the memory usage of polygraph-client continues to grow and eventually the test fails.
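The growth is easy to see by sampling the resident set size of the
polygraph-client process over the life of the test, e.g. with a small
script along these lines (a sketch only; it assumes a Linux client and a
single polygraph-client process, and the 60-second interval is arbitrary):

    # Python: log the RSS of polygraph-client once a minute
    import subprocess, time

    def client_rss_kb():
        # find the newest polygraph-client PID, then read VmRSS from /proc
        pid = subprocess.check_output(
            ['pgrep', '-n', '-f', 'polygraph-client']).decode().strip()
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])   # value is in kB

    while True:
        print(time.strftime('%H:%M:%S'), client_rss_kb(), 'kB')
        time.sleep(60)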
Any ideas on where to look given the above analysis? My C++ skills are fairly weak, so I'm not sure how much assistance I can be in tracking it down, but I'm happy to look.
Michael