Bench SMP mode

Wed Aug 8 01:19:42 UTC 2018

On 08/07/2018 06:28 PM, William Law wrote:
> This is what the cpu information looks like on the boxes:
> ~]# lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                80
> On-line CPU(s) list:   0-79
> Thread(s) per core:    2
> Core(s) per socket:    10
> Socket(s):             4
> NUMA node(s):          4
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 79
> Model name:            Intel(R) Xeon(R) CPU E5-4620 v4 @ 2.10GHz
> Stepping:              1
> CPU MHz:               2095.127
> BogoMIPS:              4190.02
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
> NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77
> NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78
> NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79
> ~]#

> I'm working on the theory that every second CPU on a node (aka socket) is
> the "logical" second hyperthread CPU.  Looking at the CPU load when pushing
> traffic, I see the same behaviour, with software interrupts loading up
> particular cores.

I hope somebody more knowledgeable about CPU architectures can validate
your theory. I wish I could!
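
One way to check the theory on the box itself, rather than by reasoning
about numbering: Linux reports the hyperthread siblings of each CPU in
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list. Below is a
minimal Python sketch that groups CPUs by physical core. The function
names are made up for illustration, and I have not run this on the
machines discussed here, so treat it as a starting point.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def group_siblings(siblings_by_cpu):
    """Turn {cpu: "sibling list"} into a sorted list of unique sibling tuples.

    Each tuple holds the logical CPUs that share one physical core.
    """
    groups = {tuple(parse_cpu_list(s)) for s in siblings_by_cpu.values()}
    return sorted(groups)


def read_siblings(sysfs_root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU directory found under sysfs."""
    result = {}
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        cpu = int(path.parent.parent.name[3:])  # "cpu17" -> 17
        result[cpu] = path.read_text()
    return result
```

Printing group_siblings(read_siblings()) on a drone should list one tuple
per physical core. If the theory holds, each tuple will pair CPUs that sit
next to each other in a NUMA node's CPU list; if not, the tuples show the
actual pairing.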

As for the network interrupts, YMMV, but they tend to migrate towards
busy workers/CPUs in the tests I have seen, which is not necessarily a
good thing when the worker is close to maxing out a CPU core. Confining
interrupts to dedicated cores may improve overall performance.
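
To make "dedicated cores" concrete, here is a hedged Python sketch that
splits one NUMA node's CPUs into an interrupt set and a worker set, and
formats the interrupt set as the comma-separated hex mask that
/proc/irq/<irq>/smp_affinity expects. The function names and the choice
of two interrupt CPUs are assumptions for illustration only. The sketch
does not write anything to /proc, and it does not know which CPUs are
hyperthread siblings.

```python
def split_cores(node_cpus, irq_count):
    """Reserve the lowest irq_count CPUs for interrupts; the rest go to workers."""
    ordered = sorted(node_cpus)
    if not 0 < irq_count < len(ordered):
        raise ValueError("irq_count must leave at least one CPU on each side")
    return ordered[:irq_count], ordered[irq_count:]


def affinity_mask(cpus):
    """Format CPUs as a hex bitmask in comma-separated 32-bit groups.

    Bit N of the mask is set when CPU N is allowed to service the IRQ.
    """
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    digits = f"{bits:x}"
    digits = "0" * (-len(digits) % 8) + digits  # pad to whole 32-bit groups
    return ",".join(digits[i:i + 8] for i in range(0, len(digits), 8))


# NUMA node0 from the lscpu output above: CPUs 0, 4, 8, ..., 76.
irq_cpus, worker_cpus = split_cores(range(0, 80, 4), 2)
print("interrupts:", irq_cpus, "mask", affinity_mask(irq_cpus))
print("workers:   ", worker_cpus)
```

The mask would then be written to the smp_affinity file of each NIC
queue's IRQ. If irqbalance is running, it may move interrupts back, so it
probably needs to be stopped or told to leave those IRQs alone.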

> See attached! (will email you direct if the list whinges).  Setup for 20
> cores, 1 dump with 2 cores per worker, the other with 1 core per worker.

Thank you for sharing these helpful backtraces.

>> You have one worker process per physical core. One process cannot
>> consume more than 100% of anything. Workers have no threads (for this
>> discussion, you can view each worker as a thread if you wish). And two
>> virtual cores are a red herring -- in the context of a single busy
>> process, they only add overhead.

> Shame, thought the robots might have run as individual threads under a
> worker, making more use of SMP.

I am not sure I share your disappointment in terms of performance: A
robot=thread model would only scale well for very busy robots, which is
both unrealistic (in most cases) and already supported (by configuring
one robot per worker).

Polygraph was born before SMP became a thing on the regular machines we
used for drones. If we were to write it from scratch today, we would use
threads for ease of worker management/synchronization, but we would
still not dedicate a thread to each robot because such a rigid and
expensive architecture would not scale in many realistic simulations
that use thousands of robots.

Cheers,

Alex.