Over the last few years at Coeo, I’ve seen multiple clients with performance problems caused by incorrectly configured VMWare.
It’s easy enough to understand why these problems occur – we’ve all become accustomed to installing software without really thinking about what we’re doing. “Next, Next, yeah whatever, Next, Done”. And, while these default settings work OK in many cases, SQL Server – particularly a busy SQL Server – will have problems with those default settings.
We see problems in three major areas:
- Power Management
- Memory
- Storage
I’m not going to do a deep dive on this – there’s more than enough information in VMWare’s Best Practices guide for SQL Server[1]. (At the time of writing – August 2024 – that VMWare document is dated April 2019.)
I’m going to give a TL;DR version and provide reference points in that document for the full gory details.
Power Management
Options to manage power are available in multiple places, and they all need to be configured so that SQL Server isn’t throttled for performance.
At the hardware level, the BIOS/UEFI in PCs and Servers perform power management, reducing the clock speed of the CPU when not under load, reducing the performance of the server, and resulting in delays as the processor speeds are ramped up to meet demand. See section 3.4.2 of the VMWare document for a list of things to enable and disable – it’s quite extensive.
In the middle, we have the VMWare ESXi host power management options. For details on how to configure High Performance mode, see section 3.4.3 of the VMWare document. Note that this is dependent on having configured the BIOS/UEFI settings.
At the operating system level, by default Windows sets power management to “Balanced”[2] – this is fine for general server workloads, but not for systems that require low latency (SQL Server). Section 4.1.1 of the VMWare document discusses configuring Windows to use the high-performance power plan on SQL Servers; Section 4.2.3 gives pointers to documentation for Linux-based SQL Servers.
At Coeo, one of our regular checks on servers is to make sure that Windows High Performance Power Plan has been selected, and we will work with clients and their suppliers to resolve this if necessary.
Memory
Infrastructure and virtualization people like to configure their environments to over-commit memory so they can cram more VMs into their hosts. Similar to power management, this is generally not a problem for most workloads, but it can cause significant issues for SQL Server.
SQL Server likes to use all of the memory available – mainly to cache recently-used data, resulting in fewer disk reads to retrieve data, resulting in faster query responses.
VMWare will, at times of memory stress, try to force servers to page memory out to disk to free up resources for other VMs. In VMWare’s case, there’s a “Balloon Driver” device used to simulate a request from Windows for a large pool of memory, allowing reclamation of memory by VMWare. See section 3.7.3 of the documentation, which says:
When designing SQL Server for performance, the goal is to eliminate any chance of paging from happening.
We can see the impact on the memory of this balloon driver being triggered:
This is a screenshot from SolarWinds SQL Sentry from a period when a client’s VMWare infrastructure was undergoing maintenance. We can clearly see SQL Server’s memory buffer being reduced by 75% for several hours.
“But why don’t we just use Lock Pages In Memory[3]?” I hear you ask. Section 4.3.2 recommends against using LPIM unless the SQL Server instance has had maximum memory setting properly configured, and if the VMWare host will not be overcommitted and memory is reserved (see section 3.7.2), otherwise “SQL Server instability could occur”. That sounds like a bad thing.
Similarly, disabling the balloon driver could cause issues in the event of a serious outage affecting the VMWare environment, and is not recommended.
Also, ensure that Trace Flag 834[4] (“Large Pages”) has not been enabled, as this will cause the buffer pool to be grown at startup to the configured size and won’t be released by the SQL Server process.
In summary – reserve memory for your SQL Server VMs – see section 3.7.2 – but leave the balloon driver enabled in case of emergency. And I’m not going to go into detail here of how much memory to allocate to SQL Server – the answer is, as always, “it depends”.
Storage
From section 3.8.3 of the VMWare document:
Many SQL Server performance issues can be traced back to improper storage configuration (…) a misconfigured storage subsystem can increase I/O latency and significantly degrade performance
For example, this screenshot was taken of SQL Sentry’s “Disk Activity” tab for a busy SQL Server:
Here we can see a single disk controller through which all disks are presented to the server; this is an “LSI Adapter, SAS 3000 series, 8-port with 1068”.
From section 3.8.3.4, there are three key takeaways:
- Use VMWare PVSCSI controller rather than LSI – it gives a higher IO throughput and lower CPU when compared with LSI
- Use multiple PVSCSI adapters.
- If you’re lucky enough to have all-flash SSD storage, try vNVMe rather than PVSCSI.
Multiple controllers allow for the separation of the disks according to purpose (or busyness). VMWare’s recommendation in sections 3.8.3.3 and 3.8.3.4 is based on the usage patterns for the drives and recommends one controller for SQL Server database data files, one for SQL Server transaction logs, one for TempDB, and one for (in effect) everything else. While this may not be the optimal result, it’s a general starting point that can be refined based on database access patterns.
Further Work
These are just the top-level recommendations. There are further sections in the documentation that discuss network configuration, hot-swap, NUMA, encryption, etc; however, the information above is from issues that we see having a performance impact on client’s servers in VMWare environments.
[1] https://www.vmware.com/docs/sql-server-on-vmware-best-practices-guide – dated April 2019. The author of the web version of the documentation has asked us to provide a link to that also; note, however, that the basic recommendations remain the same – https://core.vmware.com/resource/architecting-microsoft-sql-server-vmware-vsphere
[2] https://learn.microsoft.com/en-us/windows-server/administration/performance-tuning/hardware/power/power-performance-tuning#using-power-plans-in-windows-server
[3] https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/enable-the-lock-pages-in-memory-option-windows
[4] https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-traceon-trace-flags-transact-sql