As part of our datacenter migration, the database server received a substantial upgrade:
2 Xeon X5470 CPUs
8 total cores @ 3.33 Ghz
2 Xeon X5680 CPUs
12 total cores @ 3.33 GHz
However, a few things didn’t go quite to plan in the migration. Much to our chagrin, the database server ended up being barely faster — and maybe even a bit slower than our old database! This was deeply troubling.
The new Nehalem CPUs (what you may know as Core i7) are sort of meh on the desktop, but they are monsters on the server. It’s not unusual to see 200% performance increases going from Core 2 class server CPUs, like the ones we have in Oregeon, to these newer Core i7 class server CPUs. Just ask AnandTech’s Johan De Gelas:
The Nehalem architecture only caused a small ripple in the desktop world, mostly due to high pricing and performance that only shines in high-end applications. However, it has created a giant tsunami in the server world. The Xeon 5570 doubles the performance of its predecessor in applications that matter to more than 80% of the server market. Pulling this off without any process technology or clock speed advantage, without any significant increase in power consumption, is nothing but a historic achievement for the ambitious and talented team of Ronak Singhal.
So … yeah. We should be seeing performance improvements, and big ones, not the break-even parity (at best!) we were actually seeing.
We began looking into it and troubleshooting. That’s why there was some downtime around 5 pm Pacific the last few days. We were messing around with our primary and backup database servers in NYC. Here’s what we tried:
- We thought maybe the combination of SQL Server 2008 R2 and Intel’s next-gen HyperThreading were not mixing well. We’re still not sure, but we opted to be on the safe side and disable HyperThreading for now; 12 real, physical cores seems like plenty for our workload without adding fake logical CPUs to the mix.
- We realized we had mixed up CPUs a bit and we didn’t have the correct CPU in the server. Close, but not quite right. This was easy enough to fix with a CPU swap, but it alone was not enough to explain the performance issues.
- After trying a few other minor things, and with a nudge from Brent “database ninja” Ozar we narrowed it down to the clock speed of the CPUs themselves. Despite having set high performance mode in Windows Server 2008 R2’s power management control panel, the CPUs weren’t clocking up at all under load — we were seeing about half the clock speed under load we should have.
Kyle asked why our CPUs weren’t clocking up on Server Fault. In the process of asking the question and researching it ourselves, we discovered the answer. These Dell servers were inexplicably shipped with BIOS settings that …
- did not allow the host operating system to control the CPU power settings
- did not set the machine to high performance mode
- did not scale CPU speed under load properly
… kind of the worst of all worlds. But Kyle quickly flipped a few BIOS settings so that the machine was set to “hyperspeed mode”, and performance suddenly got a lot better. How much better?
My benchmarks, let me show you them! This is an average of 10 SQL query runs on a copy of the Stack Overflow database, under no (or very little) real world load.
|gnarly query for Sportsmanship badge||3177 ms||2919 ms||1285 ms|
|simple full text query||555 ms||423 ms||335 ms|
Notice that database performance scales nearly linearly with CPU speed. This has always been the case in our benchmarking, but our dataset fits in memory. I don’t think that’s unusual these days. Building a 64 GB server like this one is not terribly expensive any more — and solid state drives are bridging the gap between disk and memory performance at 256 GB and beyond. Anyway, the received wisdom that “database servers need fast disks above all else” is kind of a lie in my experience. Paying extortionate rates for a crazy fast I/O subsystem is a waste; instead, spend that money on really fast CPUs and as much memory as you can afford.
Most of all, there’s the crushing 2x Nehalem Xeon performance increase we would expect to see! It’s “only” 25% faster on full text operations, but we’ll take that too!
So, our apologies for the downtime. We tried to share everything we learned in the process here and on Server Fault so the community can benefit. We hope this upgrade brings a faster and more responsive set of Stack Exchange sites to you!
(and if you’d like oodles more datacenter details, do check out the Server Fault blog. If you’d like network admin opportunities, do check out Stack Overflow Jobs.)