Yesterday I posted results from some MySQL benchmarks I had been doing on a large server. In this post I'd like to list 4 important fixes that were done to avoid bad performance:
- Linux swapping, NUMA: If you have more than 1 CPU socket, you will be using Non-Uniform Memory Allocation. Linux has a bad default strategy of allocating memory here, and it is easy to end up in a situation where you have many GB of free memory left, yet Linux is already swapping. Yes, seeing that happen will be very confusing! This problem has been investigated and explained brilliantly by Jeremy Cole. The solution is a one liner given at the end of his article.
- Swappiness: This is not a bug, just easy to forget, or easy to assume someone else already set this. By default, Linux will think it is a good idea to swap out some processes (especially MySQL, that consumes a lot of memory) in favor of the filesystem cache, so that your disk access becomes faster. This is a good idea in many cases, but not when you're running a database. So you should set
vm.swappiness=0
in/etc/sysctl.conf
. This was covered in Yoshinori's tutorial at the MySQL conference. - Swappiness, Linux bug: On kernels prior to 2.6.28 (such as CentOS 5...) Linux will still swap out MySQL regardless of vm.swappiness, because the algorithm for balancing swapping and filesystem cache is stupid. The fix is to upgrade your kernel. A workaround is again offered by Yoshinori to those who can't upgrade: A perl script you can run from cron.
- InnoDB contention on 100+ GB buffer pool: It seems that InnoDB performance will severely degrade under some conditions if your buffer pool is 100 GB or more (exact limit unknown). This was originally observed by Vadim. The contention is possibly triggered when your system has low latencies when writing to disk, or when using ext3. A workaround is to make the buffer pool smaller - 64 GB has been verified to work well.
I'd like to thank Jeremy Cole, Yoshinori Matsunobu and Vadim Tkachenko for sharing their experiences and making it easy to fix what would otherwise have been very difficult problems.
- Log in to post comments
- 55420 views
NUMA swapping
Maybe a stupid question, why not run everything with numactl --interleave all?
No, I have exactly the same
No, I have exactly the same stupid question :-)
Reading between the lines of Jeremy's post, it seems the Linux devs think that --interleave has worse performance than the default "allocate all memory from your own NUMA node". But Jeremy's benchmarks do not show any difference. Even so, I totally agree with Jeremy's point that even if you do allocate all of the memory from one NUMA node, then swapping is the worst solution to think of, at least then Linux should continue to allocate memory from the other available NUMA nodes.
Thanks for this. Very
Thanks for this. Very helpful!
--interleave
running with --interleave might cause performance issues due to node NUMA distances
Many people say it might
Many people say it might (although none has been observed) otoh swapping MySQL to disk will certainly cause performance issues. The current Linux behavior is simply not defensible.