At Burt we’re heavy into RabbitMQ. We use it as a series .. of tubes .. that we dump .. all our .. data .. on.. -ish. It’s been an integral part of our system for a long time and it’s served us well. Previous load tests had us worried though. We knew that if we overloaded the inputs we would sink the system. That was old RabbitMQ though.. 2.8 has flow control. Well. Let’s see what 2.8 can do.
Our whole platform runs off of Amazon EC2, and for this test set we set up RabbitMQ clusters of 1, 2 and 3 c1.xlarge instances.
Each cluster gets 24 durable queues. We basically use RabbitMQ for transport, and to get the maximum throughput you want to have many queues, since each queue is essentially single-threaded.
We loaded the queues by parsing about 2.3 gigs worth of messages from old production logs. The messages were combined and sent in batches of ten (another good technique for getting higher throughput), and routed explicitly to one of the 24 queues based on a property of the messages. Each batch wound up weighing in at about 12 kB. The messages were marked as persistent. The queues were drained with a dummy application that connects to all the queues and reads data as fast as possible, prefetching 100 message batches and ack’ing each one individually.
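As a rough sketch, the loader’s batching and routing looks something like this. The queue names and the `partition_key` property are invented for illustration; this isn’t our production code:

```python
import zlib

# Illustrative sketch of the loader's batching and routing.
# Queue names and "partition_key" are made up; the real loader
# routes on a property of our own messages.

QUEUE_COUNT = 24
QUEUE_NAMES = ["transport.%02d" % i for i in range(QUEUE_COUNT)]

def batches_of(messages, size=10):
    """Group messages into batches of `size` (the last may be short)."""
    for i in range(0, len(messages), size):
        yield messages[i:i + size]

def queue_for(message):
    """Deterministically pick one of the 24 queues from a message
    property, using a stable hash (crc32) so routing doesn't change
    between runs or processes."""
    key = zlib.crc32(message["partition_key"].encode("utf-8")) & 0xffffffff
    return QUEUE_NAMES[key % QUEUE_COUNT]
```

Each batch is then published to its queue as a single persistent message.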
The loader process is smart and connects to the MQ instance that hosts the queue, and the drainer does the same. This avoids unnecessary network traffic between the nodes in the cluster (*).
All graphs were created by measuring the messages per second at 5 second intervals (where each message is a batch of ten actual messages).
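In other words, each data point is the delta of a cumulative message counter divided by the sampling interval. A minimal sketch (the function name is mine):

```python
SAMPLE_INTERVAL = 5.0  # seconds between samples

def rates(cumulative_counts, interval=SAMPLE_INTERVAL):
    """Convert cumulative message counts sampled every `interval`
    seconds into per-second rates, one per interval."""
    return [(b - a) / interval
            for a, b in zip(cumulative_counts, cumulative_counts[1:])]
```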
Below are the tests we performed and their outcome.
Loader/Drainer max, single node, no clustering
Running 1 loader and 5 drainers we arrived at 25k fragments per second as the average drainer speed. Pretty consistently.
Running 4 loaders and 1 drainer we arrived at just above 30k fragments per second as the average loader speed. Spiky as hell, but with consistent overall throughput. This kind of shenanigans used to kill pre-2.8 RabbitMQs. Good JOB! Flow control FTW!
1 loader and 5 drainers
4 loaders and 1 drainer
Clustering, 1, 2 and 3 instances
We spread out the queues evenly across instances. That is to say the 3-instance cluster ran 8 queues per instance.
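The even spread is just round-robin assignment of queues to nodes; a sketch (names invented):

```python
def assign_queues(queue_names, nodes):
    """Spread queues evenly across cluster nodes, round-robin,
    so e.g. a 3-node cluster hosts 8 of the 24 queues per node."""
    return dict((q, nodes[i % len(nodes)])
                for i, q in enumerate(queue_names))
```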
The result? No difference. No god damn difference whatsoever. As far as speed is concerned, the single MQ instance did juuust fine. And it didn’t ripple as much.
1MQ, 6 loaders and 5 drainers
2MQ, 6 loaders and 5 drainers
3MQ, 6 loaders and 5 drainers
This doesn’t feel like a very reasonable conclusion, and I do suppose that if we’d started more loaders/drainers we’d see the benefits of the cluster, but damn it, I can’t be starting and stopping servers all day long. With the awesome infrastructure tools by @mwq it literally takes MINUTES to do, but we ain’t got that kind of time. We’re busy people. Got code to hack and coffee to drink. That StarCraft II ladder ain’t gonna climb itself!
Ah, high availability. Each message on every queue is replicated to another instance. If one instance goes down, you can reconnect to another instance and resume operation. Which is a seriously ballsy thing to do. We don’t usually do that. We stop the system and restart the failing instance. Not that we wouldn’t like to have magical failover awesomeness, but HA in RabbitMQ is expensive, like bricks of gold-pressed latinum expensive. It hasn’t proven a huge issue for us — RabbitMQ has been rock solid since 2.6, so we keep our fingers crossed.
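For the record, on RabbitMQ 2.x mirroring is requested per queue at declare time via the `x-ha-policy` argument (later releases moved this to server-side policies). A sketch of the declare arguments, with an invented helper name:

```python
def ha_queue_arguments(policy="all", nodes=None):
    """Build the x-ha-policy declare arguments for a mirrored queue
    on RabbitMQ 2.x. "all" mirrors the queue to every cluster node;
    "nodes" mirrors it only to the nodes named in the params."""
    if policy == "nodes":
        return {"x-ha-policy": "nodes",
                "x-ha-policy-params": list(nodes or [])}
    return {"x-ha-policy": "all"}

# e.g. channel.queue_declare(queue="transport.00", durable=True,
#                            arguments=ha_queue_arguments())
```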
The performance is obviously impacted when every message has to be written to two instances.
We’re going with the cluster. Probably 2 nodes with durable queues.
Listen, it’s not all about throughput. If the real-life drainers die we like to know that we can stockpile messages for a couple of minutes until everything restarts. A single server only has so much memory, and it’ll get overloaded more easily.
We won’t be using much of the clustering capabilities though, as we create queues and connect to the instances individually. The only thing we use is the neat GUI, so that we can track and monitor the individual instances.
- Machines used were Amazon’s c1.xlarge: 7 GB RAM, 8 cores.
- The trailing off at the end of the graphs is usually one of the loaders finishing earlier than the rest.
- The dips in the 1 loader / 5 drainers graph are probably our loader’s GCs, so don’t read too much into them.
- All graphs as PDF
(*) If you’re connected to machine A and the queue you’re posting to is on machine B, RabbitMQ will forward your message for you. The problem is that this isn’t free: the message needs to go back out on the network and land on machine B, taking time and bandwidth on your network card. This used to be a big problem for us in 2.6. We solve the problem by running a routing-aware wrapper over the driver, check it out: github.com/burtcorp/autobahn
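The gist of the routing-aware idea, as a simplified sketch (not autobahn’s actual API): keep a map of which node hosts which queue, and open the publishing connection directly to that node.

```python
class RoutingAwarePublisher:
    """Open (and cache) a connection to the node that hosts each
    queue, so published messages never hop between cluster nodes."""

    def __init__(self, queue_locations, connect):
        self._locations = queue_locations  # {queue_name: node_address}
        self._connect = connect            # node_address -> connection
        self._connections = {}

    def connection_for(self, queue_name):
        node = self._locations[queue_name]
        if node not in self._connections:
            self._connections[node] = self._connect(node)
        return self._connections[node]
```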