Description
Recently, I've been doing some performance profiling of the NServiceBus RabbitMQ transport and I've come across some interesting findings.
When I started my performance investigation, the main machine I was using was an older Sandy Bridge i7-2600K. Partway through, I put together a new development machine with a Skylake i7-6700K. On the new machine, the throughput numbers I was seeing in my tests were much lower than on my older machine!
I put together a repro project here and was able to run the scenario on a number of different machines:
| CPU | 1 consumer (msg/sec) | 32 consumers (msg/sec) |
| --- | ---: | ---: |
| Core 2 Duo E6550 | 6908 | 7667 |
| i7-2600K | 7619 | 11124 |
| i7-4870HQ | 6103 | 5237 |
| i5-5200U | 7178 | 1928 |
| i7-6500U | 2243 | 1253 |
| i7-6700K | 6476 | 2213 |
While the exact numbers aren't that important, the general trend is problematic: on every CPU I tested that is newer than 2nd-gen Core (Sandy Bridge), throughput gets worse as more consumers are added.
After spending some time analyzing profiler results, my colleague @Scooletz and I came to the conclusion that the current design of the ConsumerWorkService / BatchingWorkPool is the culprit here. There is a lot of lock contention going on, and newer CPUs seem to handle that contention much worse.
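To make the contention concrete, here is a minimal sketch of the pattern we mean. The names and shape are mine for illustration, not the actual BatchingWorkPool source:

```csharp
// Illustrative sketch only -- not the actual BatchingWorkPool code.
// Every enqueue and dequeue, from every channel, funnels through one
// shared lock, so adding consumers adds contention instead of throughput.
using System.Collections.Generic;

class LockBasedWorkPool<TWork>
{
    private readonly object syncRoot = new object();
    private readonly Queue<TWork> pending = new Queue<TWork>();

    public void AddWorkItem(TWork item)
    {
        lock (syncRoot) // every producer thread serializes here...
        {
            pending.Enqueue(item);
        }
    }

    public bool TryGetWorkItem(out TWork item)
    {
        lock (syncRoot) // ...and every consumer thread does too
        {
            if (pending.Count > 0)
            {
                item = pending.Dequeue();
                return true;
            }
            item = default(TWork);
            return false;
        }
    }
}
```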
We have come up with two alternative designs that eliminate this performance problem. Each involves trade-offs, so we wanted to show you both.
The first approach is largely focused on the BatchingWorkPool itself. All locks have been removed and concurrent collections are used everywhere instead. The main trade-off is that we can no longer guarantee per-channel operation order. While I don't think that guarantee is critical, I know it's a behavior change from the current design. PR #252 covers this change.
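As a rough illustration of this approach, here is a hedged sketch, again with made-up names rather than the actual PR #252 code. ConcurrentQueue is still FIFO, but with no lock, nothing stops multiple worker threads from draining the same channel's queue concurrently, which is where the per-channel ordering guarantee is lost:

```csharp
// Illustrative sketch of the lock-free direction, not the PR #252 source.
using System.Collections.Concurrent;

class LockFreeWorkPool<TChannel, TWork>
{
    // One lock-free queue per channel. The queue itself is FIFO, but two
    // worker threads can dequeue items 1 and 2 from the same channel and
    // finish them out of order -- the ordering trade-off described above.
    private readonly ConcurrentDictionary<TChannel, ConcurrentQueue<TWork>> queues =
        new ConcurrentDictionary<TChannel, ConcurrentQueue<TWork>>();

    public void AddWorkItem(TChannel channel, TWork item)
    {
        queues.GetOrAdd(channel, _ => new ConcurrentQueue<TWork>()).Enqueue(item);
    }

    public bool TryGetWorkItem(TChannel channel, out TWork item)
    {
        ConcurrentQueue<TWork> queue;
        if (queues.TryGetValue(channel, out queue))
        {
            return queue.TryDequeue(out item);
        }
        item = default(TWork);
        return false;
    }
}
```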
The second approach focuses on changing the ConsumerWorkService instead. It no longer uses the BatchingWorkPool at all; instead, it creates a dedicated thread per model that dispatches that model's work items in a loop. With this approach, per-channel operation order is maintained, but there is no longer a way to pass in a custom TaskScheduler to limit concurrency. Using a custom scheduler to start the loops could mean that a model would never get to process work at all. @Scooletz should be opening a PR with this change soon.
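Here is a similarly hedged sketch of that idea (illustrative names, not the actual upcoming PR): each model gets its own queue and a dedicated thread that drains it in a loop, so per-channel order is preserved by construction, while concurrency is fixed at one thread per model instead of being delegated to a TaskScheduler:

```csharp
// Illustrative sketch of the thread-per-model direction.
using System;
using System.Collections.Concurrent;
using System.Threading;

class DedicatedThreadWorkService
{
    private sealed class WorkPool : IDisposable
    {
        private readonly BlockingCollection<Action> work = new BlockingCollection<Action>();
        private readonly Thread thread;

        public WorkPool(string name)
        {
            thread = new Thread(Loop) { Name = name, IsBackground = true };
            thread.Start();
        }

        public void Enqueue(Action action)
        {
            work.Add(action);
        }

        private void Loop()
        {
            // One thread drains this model's queue, so its work items run
            // strictly in enqueue order. GetConsumingEnumerable blocks
            // until work arrives and exits once CompleteAdding is called.
            foreach (var action in work.GetConsumingEnumerable())
            {
                action();
            }
        }

        public void Dispose()
        {
            work.CompleteAdding();
            thread.Join();
        }
    }

    // Keyed by model; each model lazily gets its own pool and thread.
    private readonly ConcurrentDictionary<object, WorkPool> pools =
        new ConcurrentDictionary<object, WorkPool>();

    public void AddWork(object model, Action action)
    {
        pools.GetOrAdd(model, m => new WorkPool("WorkPool-" + m)).Enqueue(action);
    }
}
```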