I was working with a client recently and they had an interesting problem. They had moved their application from a physical machine to a VM, and when they did, performance took a nose dive. We’ve all heard this story before. They’ve got a crack team at this client, and they had already checked all the usual suspects, including adding more RAM and more CPUs (even though CPU usage was already low).
When I started looking at the system I saw minimal IO, in fact basically no IO on the system. So we got a user on the phone, and I profiled based on their username. I saw a huge number of commands go streaming by, all with a runtime of 0 milliseconds. Nothing really worth noting here, except that SQL was only getting 6-8 commands per second. The client application in this case runs on Citrix within the same VMware farm, so the client and the database are physically very close to each other. Unless the application is doing a lot of work on each row, it should be throwing more commands at SQL than that.
Now in a perfect world I’d have the vendor fix their horrible code so that it wasn’t doing the RBAR (row by agonizing row) processing, but the vendor couldn’t be convinced that this was a problem because none of their other clients were having a problem. In fact, the other installations at this client weren’t having a problem either. But they were all much smaller installations.
What we ended up doing was turning a few things off, both in Windows and in VMware.
In Windows we turned off Receive Side Scaling, the TCP chimney, network auto tuning and task offloading. The thought behind all these components is that they are trying to optimize the network, but doing that takes time. Granted, that time is just nanoseconds, but when you’re trying to throw thousands of RBAR queries against a server at a time, a few hundred nanoseconds each is going to add up to seconds. When we profiled the server and let it run, we saw more than 16,000 commands being sent to the database just to open a single window in the application.
We disabled all of these features with the netsh command, as shown below.
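To get a feel for how a small fixed delay compounds across a chatty workload, here’s a rough back-of-the-envelope sketch. The 16,000 figure is the command count from the profile; the per-command delays are made-up values for illustration only, not measurements from this system, and the sketch assumes the application sends its commands one after another so the delays don’t overlap.

```python
def added_wait_seconds(commands: int, delay_per_command_s: float) -> float:
    """Total extra wait when every command pays the same fixed delay, serially."""
    return commands * delay_per_command_s


COMMANDS_PER_WINDOW = 16_000  # command count seen when opening one window

# Hypothetical per-command delays (assumptions, not measured values).
HYPOTHETICAL_DELAYS = [
    ("10 microseconds", 10e-6),
    ("100 microseconds", 100e-6),
    ("1 millisecond", 1e-3),
]

for label, delay in HYPOTHETICAL_DELAYS:
    total = added_wait_seconds(COMMANDS_PER_WINDOW, delay)
    print(f"{label} per command -> {total:.2f} s added per window")
```

The point is only that the total scales linearly with the number of round trips, so a delay that is invisible on a set-based query becomes very visible on a row-by-row one.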
netsh int tcp set global chimney=disabled
netsh interface tcp set global rss=disabled
netsh interface tcp set global autotuninglevel=disabled
netsh int ip set global taskoffload=disabled
The vendor also recommended turning off IPv6, but I wouldn’t recommend that, especially if you are in a clustered SQL environment, as modern versions of Windows clustering require IPv6.
On the VMware side we also turned off Interrupt Coalescing on the virtual NICs. The goal of having this on, to quote Jim Hannon from House of Brick, is “to benefit the entire cluster by reducing the CPU overhead of TCP traffic”. Now in this case we have CPU to burn, and a very latency sensitive application, so turning it off became an option. This is done by changing the VMware advanced setting ethernetX.coalescingScheme, where X is the number of the virtual NIC (if the setting isn’t there you have to add it). Then set it to disabled, like so.
ethernetX.coalescingScheme = "disabled"
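If you have several VMs to change, the “add it if it isn’t there” step can be scripted. The following is only a minimal sketch that works on the text of a .vmx file; it assumes you are editing the file directly with the VM powered off, that keys can be matched case-insensitively, and that ethernet0 is the NIC you care about. The helper name set_vmx_option is made up for this example and is not part of any VMware tooling.

```python
def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set.

    Replaces the line if the key already exists, otherwise appends it.
    Keys are compared case-insensitively (an assumption of this sketch).
    """
    new_line = f'{key} = "{value}"'
    lines = vmx_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        # Everything before the first "=" is treated as the setting name.
        name = line.split("=", 1)[0].strip()
        if name.lower() == key.lower():
            lines[i] = new_line
            found = True
    if not found:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Example with a made-up one-line config; ethernet0 stands in for ethernetX.
sample = 'ethernet0.present = "TRUE"\n'
print(set_vmx_option(sample, "ethernet0.coalescingScheme", "disabled"))
```

Running it a second time on its own output leaves the text unchanged, which makes it safe to apply across a batch of files.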
With all these changes in place we fired up the VM, got the application user back on the phone, and had them click the button again. And we were back in business. Response time was right where we expected it to be, and where it had been in the physical world before the migration.