sysbench

Writing a data loader for database benchmarks

A task I've done many times in my database career is loading data into a database as the first step of some benchmark. To do it efficiently you want to use multiple threads. Dividing the work across many threads requires little more than third-grade math, yet it can be surprisingly hard to get right.

The typical setup is often like this:

  1. The benchmark framework launches N independent threads. For example, in Sysbench these are completely isolated Lua environments, with no shared data structures or communication possible between the threads.
  2. Each thread gets as input its thread id i and the total number of threads launched N, and from these alone it must work out which slice of the data to load (see the sketch below).
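To make the arithmetic concrete, here is a minimal Lua sketch (Lua being what Sysbench 0.5 scripts are written in) of how thread i can derive its own slice of the rows. The function is illustrative, not part of the Sysbench API:

    -- Split total_rows rows as evenly as possible over n threads.
    -- Thread ids i run from 0 to n-1. The first (total_rows % n)
    -- threads get one extra row, so no row is lost or duplicated.
    function my_range(i, n, total_rows)
       local base  = math.floor(total_rows / n)
       local extra = total_rows % n
       local first                        -- 1-based row id this thread starts at
       if i < extra then
          first = i * (base + 1) + 1
       else
          first = extra * (base + 1) + (i - extra) * base + 1
       end
       local count = base + (i < extra and 1 or 0)
       return first, first + count - 1    -- inclusive range
    end

    -- Example: 10 rows over 4 threads -> 1-3, 4-6, 7-8, 9-10
    for i = 0, 3 do print(my_range(i, 4, 10)) end

The remainder handling is exactly the part that is easy to get wrong: forget it and the last total_rows % N rows are silently never loaded.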

Comments on the Codership Galera vs NDB cloud shootout

Alex Yurchenko finally posted results of a benchmark he had been planning for a long time: the Galera vs NDB cloud shootout.

Their blog requires registration to comment, so I'll post my comment here instead:

***

Sysbench can do the load balancing itself, so there is no need for an external load balancer. Just pass a comma-separated list of master MySQL nodes to --mysql-host. This is similar to what the JDBC and PHP drivers can do too, and it is my favorite architecture. Why introduce extra layers that you don't need and that don't bring any additional value?
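As an illustration (host names and credentials are placeholders), pointing sysbench at all three masters is just:

    sysbench --test=oltp \
             --mysql-host=galera1,galera2,galera3 \
             --mysql-user=benchmark --mysql-password=secret \
             --num-threads=16 run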

Re-doing Galera disk bound benchmark

I've been promising to revisit the disk bound sysbench tests I ran on Galera, and in December I finally had some lab time to do so. If you remember, what troubled me then was that in all my other Galera benchmarks, performance with Galera was equal to or much better than performance on a single MySQL node. (This is very unusual for high availability solutions, which usually come with a performance penalty. It is why Galera is so great.) On the tests with a disk bound workload, however, there was performance degradation, and what was even more troubling, performance seemed to decrease further when adding more write masters.

In these tests I was able to understand the performance decrease, and it had nothing to do with Galera, nor even with InnoDB. It was a defect in my lab setup: all nodes kept their data on a partition mounted from an EMC SAN device - the same device for all nodes. Hence, when directing work to more nodes while the workload is bottlenecked by disk access, performance would naturally decrease rather than increase. Unfortunately I don't currently have servers with local disks where I could re-run this same test, but I will sometime during this year.

As part of this lab session I also investigated the effect of varying the number of Galera slave applier threads, which I will report on in the remainder of this post. The results are of course somewhat clouded by the problematic SAN setup, but I'll make some observations nevertheless. And while the previous tests were run on MySQL 5.1, this test was run on MySQL 5.5, so I will make some observations on that too.
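For reference, the number of applier threads is controlled by the wsrep_slave_threads variable; a minimal my.cnf sketch (the path and the value 8 are just examples):

    [mysqld]
    wsrep_provider=/usr/lib/galera/libgalera_smm.so   # path varies by installation
    wsrep_slave_threads=8                             # number of parallel slave appliers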

Slides for Choosing a MySQL High Availability solution

Here are the slides for my first talk at Percona Live UK 2011: Choosing a MySQL High Availability solution.

Galera disk bound workload revisited

Update 2012-01-09: I have now been able to understand the poor(ish) results in this benchmark. They are very likely due to a bad hardware setup and neither Galera nor InnoDB is to blame. See https://openlife.cc/blogs/2012/january/re-doing-galera-disk-bound-bench…

People commenting on my results from benchmarking Galera on a disk bound workload seemed confused by the performance degrading when writing to more than one master, and were not convinced by my speculations on the reasons. Since sysbench 0.5 has its benchmarks in the form of Lua scripts, it was temptingly easy to tweak them a little to see if my speculations were correct. So yesterday I ran the tests again with a slightly modified sysbench workload. (Everything else is identical, so see the previous article for details on the setup.)
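To show the kind of tweak this makes easy - this is a hypothetical illustration, not necessarily the exact modification used here - confining each thread to its own disjoint slice of the table is only a few lines of Lua, assuming the usual sysbench 0.5 globals (oltp_table_size, num_threads, sb_rand) are available to the script:

    -- Stock oltp.lua draws row ids with sb_rand(1, oltp_table_size),
    -- so concurrent write masters can conflict on the same rows.
    -- Hypothetical tweak: give each thread a private slice instead.
    function private_row_id(thread_id)
       local slice = math.floor(oltp_table_size / num_threads)
       local first = thread_id * slice + 1
       return sb_rand(first, first + slice - 1)
    end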

Running sysbench tests against a Galera cluster

So, vacation is over and I was in luck: already during the first week I had ample time to finally put Galera replication to the test. It was a great experience: I learned a lot, and eventually got the great results I was hoping to see.

Again I started by just running the standard Sysbench oltp read-write test. Since this is a commonly used benchmark, it produces numbers that are comparable with others running the same benchmarks, including, as it happens, the Galera developers themselves.
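For reference, a run looks something like this (exact option names vary a little between sysbench versions; the host, credentials and sizes are placeholders):

    # load the test table, then run the read-write test
    sysbench --test=oltp --oltp-table-size=1000000 \
             --mysql-host=10.0.0.1 --mysql-user=sbtest --mysql-password=secret prepare
    sysbench --test=oltp --oltp-table-size=1000000 \
             --mysql-host=10.0.0.1 --mysql-user=sbtest --mysql-password=secret \
             --num-threads=16 --max-time=300 --max-requests=0 run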

These tests were run on an 8 core server with 32 GB of RAM, with the disks on an EMC device with a 2.5 GB write cache.

One-liner for condensing sysbench output into a csv file

An important part of benchmarking is to draw graphs. A graph can reveal results you wouldn't have spotted just by looking at raw numbers. By the way, the process of massaging the raw numbers into graphs will often reveal things too.

Sysbench output tends to be quite wordy, especially when you have a script that runs the same test with 1, 2, 4, 8... threads. Manually copy-pasting the numbers into a spreadsheet is tiresome, so I came up with this monster shell one-liner to condense the output into a csv file. I'm posting it here so I will find it the next time I need it:
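In the spirit of that one-liner, here is a sketch of the idea (not necessarily the original; the exact field positions depend on the sysbench version's output format):

    awk '/Number of threads/ { threads = $4 }
         /transactions:/     { tps = $3; gsub(/\(/, "", tps); print threads "," tps }' \
        results.txt > results.csv

Each run's "Number of threads" line is remembered and emitted together with the transactions-per-second figure from that run's summary, giving one csv row per run.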

DRBD and Semi-sync shootout on large server

DRBD and semi-sync benchmarks on a 2x8 core, 132 GB server

I recently had the opportunity to run some benchmarks against a relatively large server, to learn how it was behaving in its specific configuration. I got some interesting results that I'll share here.
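For reference, the semi-sync side of such a setup is a MySQL 5.5 plugin, enabled roughly like this (a sketch; the timeout value is just an example, and the DRBD side is OS-level configuration not shown here):

    -- on the master
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;
    SET GLOBAL rpl_semi_sync_master_timeout = 10000;  -- ms before falling back to async

    -- on the slave
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;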

MariaDB 5.2: Benchmarking Virtual Columns, Views and ExtractValue()

In this post I will share the results of some "benchmarking" I did on the database created in the previous post: MariaDB 5.2: Using MariaDB as a document store and Virtual Columns for indexing. In addition to just playing with the new syntax, I wanted to actually benchmark using virtual columns against some other techniques. If you didn't read that previous post yet, please do, so that you know the schema being used and the whole point of what we are doing.
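As a reminder of the technique (the table and column names here are simplified stand-ins, not the actual schema from the previous post): in MariaDB 5.2 a PERSISTENT virtual column can extract a value from the stored document and be indexed, so a query can use the index instead of calling ExtractValue() for every row:

    CREATE TABLE docs (
      doc  TEXT,
      name VARCHAR(64) AS (ExtractValue(doc, '/user/name')) PERSISTENT,
      KEY (name)   -- in MariaDB 5.2 only PERSISTENT virtual columns can be indexed
    );

    -- indexed lookup through the virtual column
    SELECT doc FROM docs WHERE name = 'hingo';
    -- versus evaluating ExtractValue() for every row
    SELECT doc FROM docs WHERE ExtractValue(doc, '/user/name') = 'hingo';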

The premise for this benchmark was already given last week:

Before I write the next blog post, I invite you to guess the result of the benchmark. I had two conflicting rules of thumb as hypotheses:
