An engineer I work with asked me for tips on what to read about database benchmarking. I told him I've learned a lot from reading Mark Callaghan's blog. Now that I think about it, articles and conference talks from Baron Scwhartz were also, or even more, fundamental early on when I was getting started. In particular his booklet or lecture on queueing theory is a great way to get you from beginner to expert in performance and benchmarking.
But... I've also come to realize that maybe I should write something on the topic myself! So here we go, this is a list of common beginner mistakes and how to fix them.
All of these mistakes were made by engineers who worked full time as database and big data experts. I've added some stories in cursive font.
Single threaded benchmarks
A very common mistake for beginners is to write a script or program that will execute operations against a database using only a single thread. I started working in the database industry in 2008, and already then this would fail to come anywhere close to modeling how a database under real production load behaves.
In 2013 when I started at MongoDB, I did work with a customer who simply used mongoimport to load some json files into a sharded MongoDB cluster. I mean they used this as their benchmark to evaluate MongoDB performance! Back then mongoimport by default used a single thread, and that's exactly what their benchmark was doing.
Funny story... This customer was under a lot of pressure to go into production. So they demanded that the consultant had to be extremely competent to solve this problem. I told them we will send our most senior MongoDB consultant. I told the consultant about this, and he resisted:
- You can't tell them that! It's not true!
- But you have worked here longer than anyone else on the team, I replied.
- Hmm... I guess.
So the consultant went onsite to the customer, added the --threads option to mongoimport, and instantly got multiple times better performance. Then they proceeded to write their own benchmark client in Java so that they could also do other optimizations. The customer was very content with our most senior consultant. Soon after this the consultant went on to found his own startup in the Kubernetes space. They recently did a nice exit by being acquired by a bigger security company. He totally deserved this!
Warmup
Most systems and especially databases exhibit various startup effects. This means it takes time for them to reach a steady state. In some cases this takes minutes, in extreme cases hours. This means your benchmark tooling needs to support a warmup phase where it discards any measurements. The benchmark result should only be measured during steady state. And you need to be aware of how long it takes to reach that steady state.
Single query
An extreme case of combining the above two mistakes is to just run a single query once and timing the execution time. This is first of all a single threaded operation (unless the database happens to support parallel query). It also is suspectible to warmup effects. And there isn't even an attempt to account for such variance by repeating the query multiple times so that one can report results as average and other aggregates.
When Postgresql added JSON support, EnterpriseDB was quick to publish benchmark results comparing PostgreSQL and MongoDB. PostgreSQL of course was faster, that's how this game works. So I went and looked at this new benchmark thinking maybe it was something we could use to benchmark MongoDB. What I found was a benchmark doing all of the above mistakes. Essentially the benchmark was a single query executed and timed with the mongo shell, pretty much: time mongo --eval "db.coll.find(...)"
. This is just to say, these things happen to the best of us and you don't need to learn much to quickly reach to an average professional level. (To reach a truly expert level, on the other hand, takes a life time.)
Small data set
Yet another beginner mistake is to test with a too small dataset. This takes two forms:
1. Very common for any developer is to develop and test the app with only a small number of records in the database. A hundred or thousand records may seem like a lot, but is actually trivial to handle for the database. When such a system is put into production, it will quickly run into performance problems once it accumulates more data. Typically the problem is that there are no indexes at all, or in any case the ones that exist are sub-optimal.
I sometimes do guest lectures at universities and vocational colleges. I try to show them how to log slow queries, use EXPLAIN to understand why they are slow and then add an index to make them fast. I then tell them if they can learn to do this, they can make 2000EUR a day as a database consultant. After this I usually have everyone's undivided attention!
2. Say you create a realistic test database with more than a trivial amount of data. Let's say 10 GB. Now it's still possible that your server has 64 GB of RAM and therefore your entire data set fits in memory. It makes a big difference whether your active data set fits in memory or not. For most production databases it does not. When benchmarking, you must be aware of the ratio of database size and memory available both to the database and operating system cache.
The second thing I tell university students is that if a problem can't be fixed by adding an index, they should tell the customer to buy more RAM. Seriously, this is all you need to do to start a career as a successful database consultant.
Image credit: prayitnophotography @ Flickr
- Log in to post comments
- 881 views
Some additional notes
I'd add on the top: before starting any sort of 'benchmarking', find out really carefully what it is you want to measure, under what circumstances, for what purpose, and for which audience.
Even standalone database server typically has tens of parameters and hardwares (hardware settings too) vary in numerous ways. Which OSes matter, and which compilers you use. What syslibs there are and does every customer have identical env. And so on (repeat infinitely). If the servers are grouped, replicate, change roles, or have any other dynamic behavior because you'd have to decide how much dynamics your test includes (and you need to be able to explain it to the audience).
When everything is clear and results can be reproduced, change the client driver version and try to reproduce.
Infinite and quite complex task.