We did some performance testing of our analytics crunching machinery during the holidays. Initially the results were very disappointing. We ran more tests, tweaked the code, ran even more tests and tried a few optimizations we’ve had up our sleeves for some time but hadn’t thought necessary yet. It helped, but still the numbers were less than what we had hoped for.
Before we dive too deep, let's briefly explain what the crunching machinery does: simply put, it takes a stream of ad exposures and increments numbers.
The numbers are metrics, keyed by dimension and segment. For example, when we see a 300x200 ad displayed on example.com at 11:47 on a Sunday to someone from Mexico, we want to increment the number of ad impressions for the segment "300x200" in the dimension "ad format", for the segment "Mexico" in the dimension "country", and for the segment "example.com" in the dimension "site", as well as for combinations of these, all aggregated by day and by hour. And of course there are many other metrics besides the number of impressions. A single ad impression can result in hundreds of incremented numbers all over the keyspace created by these dimensions, segments and metrics.
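To make the fan-out concrete, here is a minimal sketch of the idea. The function, field names and dimension list are hypothetical, not our actual code, but it shows how one exposure expands into many keyed increments:

```python
from datetime import datetime
from itertools import combinations

def increments_for_impression(impression, metrics=("impressions",)):
    """Yield (key, metric) pairs to increment for one ad exposure (illustrative only)."""
    dimensions = {
        "ad format": impression["format"],   # e.g. "300x200"
        "country": impression["country"],    # e.g. "Mexico"
        "site": impression["site"],          # e.g. "example.com"
    }
    day_bucket = impression["timestamp"].strftime("%Y-%m-%d")
    hour_bucket = impression["timestamp"].strftime("%Y-%m-%d %H:00")
    # Every non-empty combination of segments gets its own key,
    # aggregated both by day and by hour.
    for n in range(1, len(dimensions) + 1):
        for combo in combinations(sorted(dimensions.items()), n):
            for bucket in (day_bucket, hour_bucket):
                for metric in metrics:
                    yield (bucket, combo), metric

impression = {"format": "300x200", "country": "Mexico", "site": "example.com",
              "timestamp": datetime(2012, 1, 8, 11, 47)}
print(len(list(increments_for_impression(impression))))  # 14 increments for a single metric
```

Even this toy version, with three dimensions and one metric, produces 14 increments per exposure; with more dimensions and more metrics the count quickly reaches the hundreds mentioned above.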
Back to the story: to make matters even worse, we dug up an old sprint demo presentation which said that, back then, we had measured the performance at 50% more than what we were currently seeing, and about the same as what the optimizations had just given us. Something was definitely wrong here. We knew we had added features since then, but surely those features couldn't be responsible for such a big performance loss? And surely the optimizations we had just put in should more than compensate for that?
To investigate we turned off a few features, and sure enough we got better numbers, but still not enough to explain the discrepancy. There had to be another reason.
We had an idea that the problem might be caused by some difference between the old dataset and the current one. Remember that each ad impression causes lots of numbers to be incremented, and to avoid bringing our databases to their knees we write these in batches. These batches aren't very big, they represent a few tens of seconds of data, but it's enough. This way keys with many updates will not yield many more database operations than less active ones. In fact, the total number of keys is a better indicator of the load the database will have to handle than the number of ad impressions.
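A rough sketch of the batching idea looks something like the following. The class and the db.increment(key, amount) call are assumptions for illustration, not our real client, but the principle is the same: increments are accumulated in memory and flushed every few tens of seconds, so a key incremented ten thousand times in a window still costs only one database write.

```python
import time
from collections import Counter

class BatchingWriter:
    """Accumulate counter increments and flush them periodically (illustrative sketch)."""

    def __init__(self, db, flush_interval=30):
        self.db = db                      # anything exposing an increment(key, amount) operation
        self.flush_interval = flush_interval
        self.pending = Counter()
        self.last_flush = time.monotonic()

    def increment(self, key, amount=1):
        self.pending[key] += amount
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One database operation per distinct key in the window,
        # no matter how many times each key was incremented.
        for key, amount in self.pending.items():
            self.db.increment(key, amount)
        self.pending.clear()
        self.last_flush = time.monotonic()
```

The number of database operations per flush equals the number of distinct keys seen in that window, which is exactly why the key count, not the impression count, drives the database load.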
We’ve added a few new customers over the last few months, and perhaps their data looked different enough to affect the performance; perhaps the keyspace had not grown as linearly with the added load as we thought. To test this we assembled new data sets so we could run each customer’s data in isolation and see how the performance compared between them.
To our surprise the performance of each data set was more or less the same, and we were even more surprised that each was much higher than the numbers we had measured before.
We were baffled. How could it be that when we ran these data sets separately everything was much faster? Sure, the keyspace would be smaller for each run, but only by a small factor proportional to the number of customers.
Then it struck us: there was a difference we hadn’t taken into account. The first data set was randomly distributed over the entire day, whereas the new sets were snapshots of a specific time period, ordered by time. For a batch processing system this would have been an insignificant detail, but for our real-time processing system it makes all the difference. The first data set was originally made for doing statistics on the data, and the spread over a full day was intentional. We didn’t even consider this when we reused it for benchmarking.
When the data coming into the crunching machine is randomly distributed over time, the whole batching optimization is rendered useless. Since we aggregate the metrics for each hour, a batch of time-ordered data only touches the keys for one or two hourly buckets, while a batch of randomly distributed data touches keys for every hour of the day. The result is a more than 20-fold increase in the number of database operations required.
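To see roughly where a figure of that size comes from, here is a toy simulation. The key counts, batch sizes and timestamps are made up, not our real numbers; it only compares how many distinct (key, hour) pairs a batch touches when the input is time-ordered versus randomly shuffled:

```python
import random

def distinct_keys_per_batch(timestamps, base_keys, batch_size):
    """Average number of distinct (key, hourly bucket) pairs per batch."""
    batches = [timestamps[i:i + batch_size] for i in range(0, len(timestamps), batch_size)]
    counts = []
    for batch in batches:
        keys = {(key, ts // 3600) for ts in batch for key in base_keys}
        counts.append(len(keys))
    return sum(counts) / len(counts)

random.seed(1)
day = 24 * 3600
ordered = sorted(random.randrange(day) for _ in range(100_000))  # time-ordered stream
shuffled = ordered[:]
random.shuffle(shuffled)                                         # same data, random order
base_keys = [f"key-{i}" for i in range(50)]

print("ordered: ", distinct_keys_per_batch(ordered, base_keys, 1000))
print("shuffled:", distinct_keys_per_batch(shuffled, base_keys, 1000))
# ordered  -> a batch spans only a minute or two, so one or two hourly buckets per key
# shuffled -> a batch spans all 24 hours, so roughly 24x as many distinct pairs,
#             and therefore roughly 24x as many database operations per flush
```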
In conclusion: let this be a cautionary tale. When you do performance testing, make sure your data set is as close to the real thing as possible, both in what it contains and in how it will be processed by the system under test. The wrong data set can make your performance testing results completely useless.
In the end we only wasted some time, but it could have been so much worse: just think of how bad it would have been if we had been tricked into believing that our systems were much faster than they really were, just because of a data set that happened to trigger all the right optimizations.