ConsumerWorkService / BatchingWorkPool implementation is a performance bottleneck  #251

Closed
@bording


Recently, I've been doing some performance profiling of the NServiceBus RabbitMQ transport and I've come across some interesting findings.

When I started my performance investigation, the main machine I was using was an older Sandy Bridge i7-2600K. In the middle of this, I put together a new development machine that has a brand new Skylake i7-6700K. On my new machine, I noticed the throughput numbers I was seeing in my tests were much lower than on my older machine!

I put together a repro project here and was able to run the scenario on a number of different machines:

Core 2 Duo E6550
--------------
1 consumer: 6908
32 consumers: 7667

i7-2600K
--------------
1 consumer: 7619
32 consumers: 11124

i7-4870HQ
--------------
1 consumer: 6103
32 consumers: 5237

i5-5200U
--------------
1 consumer: 7178
32 consumers: 1928

i7-6500U
--------------
1 consumer: 2243
32 consumers: 1253

i7-6700K
--------------
1 consumer: 6476
32 consumers: 2213

Those numbers are messages/sec. While the exact figures aren't that important, the general trend is what's problematic: on every CPU I tested that is newer than 2nd-gen Core (Sandy Bridge), throughput gets worse as more consumers are added.

After spending some time analyzing profiler results, my colleague @Scooletz and I came to the conclusion that the current design of the ConsumerWorkService / BatchingWorkPool is the culprit here. There is a lot of lock contention, and it seems to hit the newer CPUs especially hard.
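To make that concrete, here's a rough sketch of the kind of single-lock pattern we're talking about (illustrative type and member names only, not the actual client code): every enqueue and dequeue from every channel serializes on one shared lock, so adding consumers mostly just adds contention.

```csharp
using System.Collections.Generic;

// Simplified sketch of the contended pattern, not the real BatchingWorkPool.
class SharedLockWorkPool<TWork>
{
    private readonly object syncRoot = new object();
    private readonly Queue<TWork> workQueue = new Queue<TWork>();

    public void Enqueue(TWork item)
    {
        lock (syncRoot)            // every producing thread contends on this one lock
        {
            workQueue.Enqueue(item);
        }
    }

    public bool TryDequeue(out TWork item)
    {
        lock (syncRoot)            // and every dispatching thread contends on it again
        {
            if (workQueue.Count > 0)
            {
                item = workQueue.Dequeue();
                return true;
            }
            item = default(TWork);
            return false;
        }
    }
}
```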

We have been able to come up with some alternative designs that eliminate this performance problem, though each comes with trade-offs, so we wanted to show you both of them.

The first approach is largely focused on the BatchingWorkPool itself. All locks have been removed and concurrent collections are used everywhere instead. The main trade-off is that with this approach we can't guarantee per-channel operation order any more. While I don't think that guarantee is critical, I know it's a behavior change from the current design. PR #252 covers this change.
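As a rough illustration of that direction (hypothetical names, simplified from what the PR actually does), a pool built only on concurrent collections could look something like this. Note how multiple pool threads can drain the same model's queue at once, which is exactly where the per-channel ordering guarantee goes away:

```csharp
using System.Collections.Concurrent;

// Hedged sketch of a lock-free work pool; not the code in PR #252.
class LockFreeWorkPool<TModel, TWork>
{
    private readonly ConcurrentDictionary<TModel, ConcurrentQueue<TWork>> queues =
        new ConcurrentDictionary<TModel, ConcurrentQueue<TWork>>();
    private readonly ConcurrentQueue<TModel> readyModels = new ConcurrentQueue<TModel>();

    public void RegisterModel(TModel model)
    {
        queues.TryAdd(model, new ConcurrentQueue<TWork>());
    }

    public void AddWork(TModel model, TWork work)
    {
        ConcurrentQueue<TWork> queue;
        if (queues.TryGetValue(model, out queue))
        {
            queue.Enqueue(work);
            readyModels.Enqueue(model);   // signal that this model has pending work
        }
    }

    // Called by pool threads. Several threads may service the same model
    // concurrently, so items from one channel can complete out of order.
    public bool TryDequeueWork(out TWork work)
    {
        TModel model;
        while (readyModels.TryDequeue(out model))
        {
            ConcurrentQueue<TWork> queue;
            if (queues.TryGetValue(model, out queue) && queue.TryDequeue(out work))
            {
                return true;
            }
        }
        work = default(TWork);
        return false;
    }
}
```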

The second approach is focused on changing the ConsumerWorkService instead. It no longer uses the BatchingWorkPool at all, and instead creates a dedicated thread per model that dispatches that model's work items in a loop. With this approach, per-channel operation order is maintained, but there is no longer a way to pass in a custom TaskScheduler to limit concurrency: starting the loops on a scheduler with limited concurrency could mean that some models never get to process work at all. @Scooletz should be opening a PR with this change soon.
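Again as a hedged sketch (illustrative names, not the actual PR code), the thread-per-model idea looks roughly like this: each model gets its own queue and a single dedicated dispatch thread, so work for a given channel runs in the order it was added and no shared lock is needed across channels.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hedged sketch of a thread-per-model work service; not the actual PR code.
class DedicatedThreadWorkService
{
    private sealed class WorkLoop
    {
        public readonly BlockingCollection<Action> Work = new BlockingCollection<Action>();
        public Thread Thread;
    }

    private readonly ConcurrentDictionary<object, WorkLoop> loops =
        new ConcurrentDictionary<object, WorkLoop>();

    public void RegisterModel(object model)
    {
        var loop = new WorkLoop();
        loop.Thread = new Thread(() =>
        {
            // One thread drains this model's queue, preserving per-channel order.
            foreach (var action in loop.Work.GetConsumingEnumerable())
            {
                action();
            }
        })
        { IsBackground = true };

        if (loops.TryAdd(model, loop))
        {
            loop.Thread.Start();
        }
    }

    public void AddWork(object model, Action work)
    {
        WorkLoop loop;
        if (loops.TryGetValue(model, out loop))
        {
            loop.Work.Add(work);
        }
    }

    public void StopModel(object model)
    {
        WorkLoop loop;
        if (loops.TryRemove(model, out loop))
        {
            loop.Work.CompleteAdding();   // remaining work drains, then the thread exits
        }
    }
}
```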
