Client side stats SSL workload 4.3.2
Dmitry Kurochkin
dmitry.kurochkin at measurement-factory.com
Fri May 17 16:29:30 UTC 2013
Hi Michael.
Michael Hendrie <pmix at hendrie.id.au> writes:
> On 17/05/2013, at 2:29 AM, Alex Rousskov <rousskov at measurement-factory.com> wrote:
>
>> On 05/16/2013 03:22 AM, Dmitry Kurochkin wrote:
>>> Hi Michael.
>>>
>>> Michael Hendrie <pmix at hendrie.id.au> writes:
>>>
>>>> Hi List,
>>>>
>>>> I recently updated from Web-Polygraph 4.0.10 running on FreeBSD to
>>>> Web-Polygraph 4.3.2 running on Linux (a RHEL 6.1 clone) and am now seeing
>>>> some inconsistency in the client-side runtime stats.
>>>>
>>>> The issue only seems to occur when polygraph-client is configured to send
>>>> requests via a proxy. In the client-side stats, I'm seeing a significantly
>>>> higher request rate than was configured, while the server-side stats report
>>>> as expected, e.g. on a test configured for 250 req/sec:
>>>>
>>>> Client Side Stats:
>>>> 030.41| i-top1 786670 421.83 1357 0.00 0 1671
>>>> ssl 448598 421.83 1360 0.00 0 1671
>>>> 030.50| i-top1 788934 452.78 1379 0.00 0 1668
>>>> ssl 449857 452.78 1379 0.00 0 1668
>>>> 030.58| i-top1 791229 458.98 1465 0.00 0 1660
>>>> ssl 451193 458.98 1465 0.00 0 1660
>>>> 030.66| i-top1 793401 434.40 1461 0.00 0 1687
>>>> ssl 452449 434.40 1463 0.00 0 1687
>>>> 030.75| i-top1 795476 415.02 1401 0.00 0 1723
>>>> ssl 453636 415.02 1402 0.00 0 1721
>>>>
>>>> Server Side Stats:
>>>> 030.45| i-top1 426331 227.03 2518 0.00 0 2200
>>>> ssl 426331 227.03 2518 0.00 0 1637
>>>> 030.54| i-top1 427519 237.59 2539 0.00 0 2201
>>>> ssl 427519 237.59 2539 0.00 0 1638
>>>> 030.62| i-top1 428710 238.19 2467 0.00 0 2171
>>>> ssl 428710 238.19 2468 0.00 0 1608
>>>> 030.70| i-top1 429962 250.40 2503 0.00 0 2300
>>>> ssl 429962 250.40 2502 0.00 0 1737
>>>>
>>>>
>>>> I didn't see this inconsistency in 4.0.10, and in a direct
>>>> client->server test (no proxy) the stats are also reported correctly.
>>>>
>>>> The proxy itself (squid) is reporting around 190 client req/sec:
>>>> client_http.requests = 188.674032/sec
>>>>
>>>> Having just read the changelog (after typing the above), I'm now wondering
>>>> whether the fix in the non-public 4.6.0 would resolve this issue:
>>>>
>>>> - Make retried transactions obey request inter-arrival
>>>> distribution and, hence, Robot::req_rate setting.
>>>>
>>>> Any thoughts/ideas?
>>>>
>>>
>>> Since v4.2.0 CONNECT tunnel establishment is treated as a separate
>>> transaction:
>>>
>>> - Treat an HTTP CONNECT response as the end of the transaction.
>>> Collect and report stats dedicated to these new CONNECT-only
>>> transactions. These CONNECT transactions are often rather
>>> different from the in-tunnel transactions that follow.
>>> Isolating them also prevents CONNECT overheads from polluting
>>> the first in-tunnel transaction stats.
>>>
>>> So an HTTPS GET transaction over proxy is counted as two transactions on
>>> the client side: CONNECT and GET.
>>>
>>> Detailed stats are available in the HTML report or can be extracted from
>>> the binary log with the lx tool.
>>>
>>> The behavior is the same on all newer Polygraph versions, so upgrading
>>> would not help.
>>
>> .... but there are plans to make baseline reporting match configuration
>> better when dealing with what we call "compound" transactions. The
>> overall principle behind the planned changes is to have reported
>> baseline request rate (i-lines in the console output) match the
>> configured Bench::peak_req_rate (or equivalent), regardless of whether
>> the workload involves CONNECT requests and/or authentication.
>>
>> We will also report CONNECT (and authentication) activity separately,
>> just like we report ssl activity separately today.
>>
>>
>> We hope the above changes will reduce surprises caused by compound
>> transactions involving multiple HTTP requests.
>>
>>
>
> OK, so the numbers stack up then: squid is reporting ~180 req/sec (it only sees the CONNECTs, not how many GETs passed through the tunnel), and peak_req_rate is configured for 250, so the ~430 being reported client side sounds OK.
>
> Unfortunately I have another problem then; I was hoping the above could be related to what I'm seeing.
>
> If I run a direct test (no proxy involved), it will go for hours and never miss a beat (the last one ran for 15 hours). If I use the same workload but introduce a proxy with the --proxy switch, the test always bombs out somewhere between the 2 and 3 hour mark (I have tried at least 5 times), even though none of the devices (including the proxy) report high load or resource exhaustion. Before it bombs, here's what the client console says:
>
> 136.74| i-top1 3588097 406.69 1583 0.00 0 1681
> ssl 2048966 406.69 1584 0.00 0 1679
> 136.83| i-top1 3590106 400.26 1600 0.00 0 1659
> ssl 2050132 400.26 1602 0.00 0 1633
> 136.91| i-top1 3592036 364.13 1695 0.00 0 1719
> ssl 2051152 364.13 1696 0.00 0 1653
> 137.04| PolyApp.cc:189: error: 1/1 (c58) internal timers may be getting behind
> 137.04| record level of timing drift: 2.48sec; last check was 6.45sec ago
> 137.27| i-top1 3592779 34.36 11075 0.00 0 1609
> ssl 2051667 34.36 11075 0.00 0 1102
> 137.30| PolyApp.cc:189: error: 2/2 (c58) internal timers may be getting behind
> 137.30| record level of timing drift: 15.40sec; last check was 15.37sec ago
> 137.40| Connection.cc:581: error: 1/3 (s32) Broken pipe
> 137.40| error: raw close on SSL connection failed
> 137.40| connection between 172.17.20.35:31263 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
> 137.40| Connection.cc:581: error: 2/4 (s32) Broken pipe
> 137.40| error: raw close on SSL connection failed
> 137.40| connection between 172.17.20.173:43974 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
> 137.41| Connection.cc:581: error: 4/6 (s32) Broken pipe
> 137.41| error: raw close on SSL connection failed
> 137.41| connection between 172.17.20.5:42461 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
> 137.41| Connection.cc:581: error: 8/10 (s32) Broken pipe
> 137.41| error: raw close on SSL connection failed
> 137.41| connection between 172.17.20.245:35371 and 172.16.0.13:8080 failed at 3 reads, 5 writes, and 2 xacts
> 137.41| Connection.cc:581: error: 16/18 (s32) Broken pipe
> 137.41| error: raw close on SSL connection failed
> 137.41| connection between 172.17.21.32:13088 and 172.16.0.13:8080 failed at 4 reads, 6 writes, and 3 xacts
> 137.43| i-top1 3593243 53.18 25644 0.00 0 1533
> ssl 2051843 53.18 25644 0.00 0 769
> 137.44| i-top1 3593243 0.00 -1 -1.00 0 1895
> ssl 2051843 0.00 -1 -1.00 0 748
>
>
> Any ideas on this one?
>
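Yes, your arithmetic matches what I would expect given the CONNECT accounting
described above. Here is a back-of-the-envelope check (a rough sketch, assuming
the client-side i-line simply counts in-tunnel requests plus CONNECT
transactions, and that tunnels are reused, so squid sees fewer CONNECTs than
in-tunnel requests):

# Back-of-the-envelope check using the numbers from this thread.
# Assumption: since v4.2.0 the client counts CONNECT and in-tunnel
# transactions separately, so the i-line rate is roughly their sum.
configured_req_rate = 250.0  # Bench::peak_req_rate (in-tunnel requests/sec)
connect_rate = 190.0         # roughly squid's client_http.requests counter

expected_client_rate = configured_req_rate + connect_rate
print('expected client-side rate: ~%.0f xact/sec' % expected_client_rate)
# ~440 xact/sec, in line with the 420-460 xact/sec in your console output
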
As for the failures: are there any errors logged on the client or the proxy?
Some resource must be getting exhausted on the client, the proxy, or the
server. I assume you have checked CPU and memory usage. Did you also look at
the network: the number of sockets stuck in CLOSE_WAIT, dropped packets,
errors in the system log, etc.? A quick way to watch the socket states during
a run is sketched below.
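This is a minimal sketch (not part of Polygraph; it assumes a Linux box such
as your RHEL 6.1 clone) that counts TCP sockets per state by reading
/proc/net/tcp, so a growing pile of CLOSE_WAIT sockets or ephemeral-port
exhaustion is easy to spot while the test is running:

#!/usr/bin/env python
# Count TCP sockets per state on Linux by parsing /proc/net/tcp and
# /proc/net/tcp6. State codes come from the kernel's tcp_states.h.

STATES = {
    '01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV',
    '04': 'FIN_WAIT1', '05': 'FIN_WAIT2', '06': 'TIME_WAIT',
    '07': 'CLOSE', '08': 'CLOSE_WAIT', '09': 'LAST_ACK',
    '0A': 'LISTEN', '0B': 'CLOSING',
}

def count_states(paths=('/proc/net/tcp', '/proc/net/tcp6')):
    counts = {}
    for path in paths:
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    state = STATES.get(fields[3], fields[3])
                    counts[state] = counts.get(state, 0) + 1
        except IOError:
            pass  # e.g. IPv6 not enabled
    return counts

if __name__ == '__main__':
    for state, n in sorted(count_states().items()):
        print('%-12s %d' % (state, n))

Running it every few seconds (e.g. under watch) on the client and the proxy
box as the test approaches the 2-3 hour mark should show whether some state
keeps growing.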
Is the request rate stable in the phase where the test fails? In the log you
provided, the response time is increasing before the failure (though not by
much). If the response time grows during a stable phase (with a constant
request rate), that suggests the proxy's performance is degrading over time,
and sooner or later this will result in a test failure.
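One way to eyeball this is to pull the rate and response-time columns out of
a saved client console log. A rough sketch, assuming the column layout matches
the snippets you posted (time, label, total transactions, offered rate, mean
response time in msec):

#!/usr/bin/env python
# Extract the rate and response-time columns from polygraph-client console
# output so the trend over a long run is easier to see. Column positions are
# assumed from the i-lines quoted in this thread.
import re
import sys

I_LINE = re.compile(r'^\s*(\d+\.\d+)\|\s+i-\S+\s+(\d+)\s+([\d.]+)\s+(-?\d+)')

def scan(path):
    rows = []
    with open(path) as f:
        for line in f:
            m = I_LINE.match(line)
            if m:
                minute, xacts, rate, rptm = m.groups()
                rows.append((float(minute), float(rate), int(rptm)))
    return rows

if __name__ == '__main__':
    for minute, rate, rptm in scan(sys.argv[1]):
        print('%8.2f min %9.2f xact/sec %7d msec' % (minute, rate, rptm))

If the msec column keeps creeping up while the xact/sec column stays near the
configured rate, the proxy is probably falling behind well before the errors
start.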
Since the failure does not happen in a direct test, the proxy would be the
primary suspect. A good way to triage the issue may be to simplify the
workload and/or the proxy configuration (e.g. disable caching if it is
enabled).
I hope these ideas will point you in the right direction.
Regards,
Dmitry
> Michael