MariaDB 5.2: Benchmarking Virtual Columns, Views and ExtractValue()
In this post I will share results on some "benchmarking" I did on the database created in the previous post: MariaDB 5.2: Using MariaDB as a document store and Virtual Columns for indexing. In addition to just playing with the new syntax, I wanted to actually benchmark using virtual columns against some other techniques. If you didn't read that previous post yet, please do so that you know the schema that is being used and the whole point of what we are doing.
The premise for this benchmark was already given last week:
Before I write the next blog, I invite you to guess the result of the benchmark. I had two conflicting rules of thumb as hypothesis:
- SELECTING via a virtual column only one word instead of a large XML document saves bandwidth and is therefore faster than the standard key-value fetch.
- SELECTING via a virtual column only one word instead of a large XML document uses more CPU in the database and is therefore slower.
I tested using both VIRTUAL and PERSISTENT virtual columns, and on a CPU bound, disk bound and network bound (ie performance dominated by network latency).
Place your bets, the results will be published later this week!
Note that these tests do not use the secondary key on 'username' for anything. Getting a secondary index is the main benefit of the schema I laid out in the previous post. But there is nothing to benchmark, since there isn't really anything to compare against. Without the secondary index you simply cannot filter on username at all (except for a full table scan, which is "not at all").
What I wanted to test rather was: Is there a difference one way or another if I SELECT the username column (which uses virtual columns) instead of the full doc column (which is just a normal key-value get)?
I've published the scripts I used on Launchpad: lp:~hingo/sysbench/mariadb-tests. There are 7 similar tests using the schema from the previous post. (...which you should have read by now :-)
- kv-xml.lua: SELECT the xml document by primary key. (Normal key-value.)
- kv-xml-vcol.lua --vcol-type=persistent: SELECT username column only, by primary key. Use PERSISTENT virtual columns.
- kv-xml-vcol.lua --vcol-type=virtual: SELECT username column only, by primary key. Use VIRTUAL virtual columns.
- kv-xml-viewsubstr.lua: SELECT username column from a VIEW. This compares the performance of using a view and a VIRTUAL virtual column. The view uses the same SUBSTR() method as the kv-xml-vcol.lua test.
- kv-xml-view.lua: SELECT username column from a VIEW. As above, but this test uses ExtractValue() instead of SUBSTR().
- kv-xml-extractvalue.lua: SELECT the username from the XML inside the doc column, using the ExtractValue() function directly in the SELECT statement. (No virtual columns or views involved, but computationally the same effort.)
- kv-xml-substr.lua: SELECT the username from the XML inside the doc column, using the SUBSTR() method that was used in kv-xml-vcol.lua directly in the SELECT statement. (No virtual columns or views involved, but computationally the same effort.)
The above in code:
kv-xml.lua: SELECT doc FROM sbtest WHERE id = ? kv-xml-vcol.lua: SELECT username FROM sbtest WHERE id= ? kv-xml-viewsubstr.lua: CREATE VIEW sbtest_v (id, doc, username) AS SELECT id, doc, SUBSTR(...) FROM sbtest ... SELECT username FROM sbtest_v WHERE id = ? kv-xml-view.lua: CREATE VIEW sbtest_v (id, doc, username) AS SELECT id, doc, ExtractValue(doc, '/user/username') FROM sbtest ... SELECT username FROM sbtest_v WHERE id = ? kv-xml-extractvalue.lua: SELECT ExtractValue(doc, '/user/username') FROM sbtest WHERE id = ? kv-xml-substr.lua: SELECT SUBSTR(...) FROM sbtest WHERE id = ?
I only ran the tests on my laptop (2 cores), using a default MariaDB 5.2 installation and my.cnf file (including the limitation of max 100 connections). Each run was short, only a few minutes. So I won't share the actual numbers in public, just summarize the behavior and conclusions. Maybe if there is interest, I can later ask Hakan to run the tests on the multicore servers he has access to, to get more reliable numbers.
The only change to the default config is that all SELECTs are done with the SQL_NO_CACHE to eliminate query cache. I used such a small dataset, that with the query cache on, all variants would have exactly the same performance because of being 100% served from the query cache.
CPU bound vs Disk bound vs Network bound
- With a small database that fits fully into the buffer pool, all of the 7 tests become CPU bound where MariaDB uses about 130% and sysbench about 65% (leaving the last 5% for the system, mostly X). In this scenario you will see differences with the above queries, which are reported below.
- With a large database the tests are disk bound. All tests have roughly the same performance. The performance is limited by disk seeks, so there is no advantage in SELECTing the virtual column username instead of the full doc BLOB. The hypothesis here was that fetching a smaller string would be faster than the whole blob. This is not true. (It is left as an excercise to the reader to find out how large the BLOB needs to be so that virtual columns provide a benefit here. A few megabytes? Gigabytes?
- I also tested between two laptops, in which case the performance is limited by network latency. Again, there is no difference in the size of data that is fetched. I was lazy and just used my WiFi, but my prior experience tells me that also in a faster network it is the round trip that is expensive, not the amount of data transferred.
Comparing the different tests
So it is mainly in the CPU bound case that the tests show any meaningful differences. While I won't share a lot of numbers, the below percentages show relative performance with each approach. (100% is 27242 tps)
kv-xml.lua 100% kv-xml-vcol.lua --vcol-type=persistent 99% kv-xml-vcol.lua --vcol-type=virtual 100% kv-xml-viewsubstr.lua 68% kv-xml-view.lua 60% kv-xml-extractvalue.lua 71% kv-xml-substr.lua 94% Any w/ Query cache on 136%
Of course, to get more reliable results, you'd need to do a longer test run on actual servers. But assuming the above numbers would be similar also on a multicore system, we could then draw the following conclusions:
Regarding virtual columns
- SELECTing the username column is exactly as efficient as selecting the whole BLOB, there is no difference one way or the other. This means, if you like the elegance of virtual columns, nothing stops you from using it, which will save you some space. (Note that we used SUBSTR() method here.)
- It doesn't matter whether you use VIRTUAL or PERSISTENT columns. There is no penalty in calculating the column value on the fly. (This result is a surprise to me, I expected PERSISTENT to be faster on SELECTs.)
- Comparing kv-xml-vcol.lua and kv-xml-substr.lua it seems that extracting the username as a virtual column is 5% faster than having the function in the SELECT and using a traditional table. 5% is large enough I believe this difference would show also on a larger system. It is not obvious to me why virtual columns are faster?
- It is probably due to random variation, but it intrigues me that VIRTUAL columns is slightly faster than PERSISTENT ones. I'm tempted to setup a more rigorous test to find out if this is true or just variation. One rule of thumb in writing high-performance code is that often using compact data results in better utilization of CPU caches, and this may often have a much higher impact on performance than many other factors. With VIRTUAL columns you move less data in the table, and then compute the column, whereas with a PERSISTENT column there is less computation but you actually move another column between RAM and CPU. So one could speculate that using a VIRTUAL column therefore would be 1-2% faster than a PERSISTENT column, even if this is counter intuitive at first. (Even the fact they are equal in a CPU bound test is counter intuitive if you ask me.)
Regarding ExtractValue() vs SUBSTR()
- Both with the function in the SELECT and using a VIEW, ExtractValue() is 25% slower than using SUBSTR(). It seems that it's implementation isn't as efficient as it could be.
- This result is funny to me! MySQL 5.1 had 5 new features, this is one of them. When I used to sell MySQL, many customers would ask in their RFQ whether MySQL has "XML support". They didn't specify it more than that, and when asked, they didn't even know what they wanted "XML support" for. So I always answered YES, to those questions, even educated almost 100 engineers at a company how to use ExtractValue() and UpdateXML(). According to this result, you should probably avoid these functions as much as possible. Use SUBSTR() as a cheat - if you know the schema of your XML beforehand it is safe (albeit ugly) to do so. Or just parse the XML in the application code instead - you should always avoid adding computation to the database if possible.
- I guess we were lucky that ExtractValue() wasn't supported for virtual columns, othewise I wouldn't have got these interesting results.
There is a rule of thumb that you should only do simple operations in the database, and move all computation to the application layer. This is because the application layer is much easier to scale out than the database. The above results suggest that a cheap operation like SUBSTR() is ok to do inside the database (if there is some benefit, of course), whereas you should watch out for functions that are a drag on performance. If you need to parse an XML document, do it in Java, PHP... not in MariaDB!
- Views are +30% slower than equivalent virtual columns. This means you should use virtual columns when possible. (Remember that Views can still do many other things a virtual column can't, such as joining two separate tables.)
- In this test, using a View was also much slower than just putting the function in the SELECT. If confirmed in a proper test run on larger systems, there is reason to suspect the implementation of Views is suboptimal.
Query cache: Since I first ran the tests with query cache on, we can see that it added 36% performance to the best case without query cache. This was of course a small dataset, but it can be used as a kind of "upper bound" for the best available performance. When query cache was on, MariaDB would use 100% of CPU and Sysbench around 80%.