Description
Recently, I've been doing some performance profiling of the NServiceBus RabbitMQ transport and I've come across some interesting findings.
When I started my performance investigation, the main machine I was using was an older Sandy Bridge i7-2600K. Partway through, I put together a new development machine with a Skylake i7-6700K. On the new machine, the throughput numbers I was seeing in my tests were much lower than on my older machine!
I put together a repro project here and was able to run the scenario on a number of different machines:
| CPU | 1 consumer (msg/sec) | 32 consumers (msg/sec) |
| --- | ---: | ---: |
| Core 2 Duo E6550 | 6908 | 7667 |
| i7-2600K | 7619 | 11124 |
| i7-4870HQ | 6103 | 5237 |
| i5-5200U | 7178 | 1928 |
| i7-6500U | 2243 | 1253 |
| i7-6700K | 6476 | 2213 |
While the exact numbers aren't that important, the general trend is problematic: on every CPU I tested that is newer than 2nd-gen Core (Sandy Bridge), throughput gets worse as more consumers are added.
After spending some time analyzing profiler results, my colleague @Scooletz and I came to the conclusion that the current design of the ConsumerWorkService / BatchingWorkPool is the culprit here. There is a lot of lock contention going on, and newer CPUs seem to handle that contention much worse.
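To make the contention concrete, here is a minimal sketch of the pattern we mean. The names and shape are mine for illustration, not the actual BatchingWorkPool source:

```csharp
// Illustrative sketch only -- not the actual BatchingWorkPool code.
// Every enqueue and dequeue, from every channel, funnels through one
// shared lock, so adding consumers adds contention instead of throughput.
using System.Collections.Generic;

class LockBasedWorkPool<TWork>
{
    private readonly object syncRoot = new object();
    private readonly Queue<TWork> pending = new Queue<TWork>();

    public void AddWorkItem(TWork item)
    {
        lock (syncRoot) // every producer thread serializes here...
        {
            pending.Enqueue(item);
        }
    }

    public bool TryGetWorkItem(out TWork item)
    {
        lock (syncRoot) // ...and every consumer thread does too
        {
            if (pending.Count > 0)
            {
                item = pending.Dequeue();
                return true;
            }
            item = default(TWork);
            return false;
        }
    }
}
```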
We have come up with two alternative designs that eliminate this performance problem. Each involves trade-offs, so we wanted to show you both.
The first approach is largely focused on the BatchingWorkPool itself. All locks have been removed and concurrent collections are used everywhere instead. The main trade-off is that we can no longer guarantee per-channel operation order. While I don't think that guarantee is critical, I know it's a behavior change from the current design. PR #252 covers this change.
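As a rough illustration of this approach, here is a hedged sketch, again with made-up names rather than the actual PR #252 code. ConcurrentQueue is still FIFO, but with no lock, nothing stops multiple worker threads from draining the same channel's queue concurrently, which is where the per-channel ordering guarantee is lost:

```csharp
// Illustrative sketch of the lock-free direction, not the PR #252 source.
using System.Collections.Concurrent;

class LockFreeWorkPool<TChannel, TWork>
{
    // One lock-free queue per channel. The queue itself is FIFO, but two
    // worker threads can dequeue items 1 and 2 from the same channel and
    // finish them out of order -- the ordering trade-off described above.
    private readonly ConcurrentDictionary<TChannel, ConcurrentQueue<TWork>> queues =
        new ConcurrentDictionary<TChannel, ConcurrentQueue<TWork>>();

    public void AddWorkItem(TChannel channel, TWork item)
    {
        queues.GetOrAdd(channel, _ => new ConcurrentQueue<TWork>()).Enqueue(item);
    }

    public bool TryGetWorkItem(TChannel channel, out TWork item)
    {
        ConcurrentQueue<TWork> queue;
        if (queues.TryGetValue(channel, out queue))
        {
            return queue.TryDequeue(out item);
        }
        item = default(TWork);
        return false;
    }
}
```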
The second approach focuses on changing the ConsumerWorkService instead. It no longer uses the BatchingWorkPool at all; instead, it creates a dedicated thread per model that dispatches that model's work items in a loop. With this approach, per-channel operation order is maintained, but there is no longer a way to pass in a custom TaskScheduler to limit concurrency. Using a custom scheduler to start the loops could mean that a model would never get to process work at all. @Scooletz should be opening a PR with this change soon.
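Here is a similarly hedged sketch of that idea (illustrative names, not the actual upcoming PR): each model gets its own queue and a dedicated thread that drains it in a loop, so per-channel order is preserved by construction, while concurrency is fixed at one thread per model instead of being delegated to a TaskScheduler:

```csharp
// Illustrative sketch of the thread-per-model direction.
using System;
using System.Collections.Concurrent;
using System.Threading;

class DedicatedThreadWorkService
{
    private sealed class WorkPool : IDisposable
    {
        private readonly BlockingCollection<Action> work = new BlockingCollection<Action>();
        private readonly Thread thread;

        public WorkPool(string name)
        {
            thread = new Thread(Loop) { Name = name, IsBackground = true };
            thread.Start();
        }

        public void Enqueue(Action action)
        {
            work.Add(action);
        }

        private void Loop()
        {
            // One thread drains this model's queue, so its work items run
            // strictly in enqueue order. GetConsumingEnumerable blocks
            // until work arrives and exits once CompleteAdding is called.
            foreach (var action in work.GetConsumingEnumerable())
            {
                action();
            }
        }

        public void Dispose()
        {
            work.CompleteAdding();
            thread.Join();
        }
    }

    // Keyed by model; each model lazily gets its own pool and thread.
    private readonly ConcurrentDictionary<object, WorkPool> pools =
        new ConcurrentDictionary<object, WorkPool>();

    public void AddWork(object model, Action action)
    {
        pools.GetOrAdd(model, m => new WorkPool("WorkPool-" + m)).Enqueue(action);
    }
}
```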