Tuesday, June 19, 2012

Getting to the CORE of the problem

Recently I completed a migration of a SQL Server (2008 R2) onto some new hardware for a client, the SAN storage was retained as is, while the actual server was upgraded to a blade holding 64Gigs of memory and 4 of AMD`s Opteron 6276  CPU`s.  Upon logging on to the server I noticed 64 processors available to the OS and SQL – 64!

This staggering number is thanks to AMD`s “Bulldozer” technology, their response to Intel`s Hyper threading. This uses two strong threads to provide dedicated hardware to each thread running on a single core, thus basically multiplying an 8 core chip to 16.
Info on Bulldozer:
The SQL instance in question hosts an array of OLAP databases, the largest of which is around 600Gb. This database undergoes an ETL process every night that pulls data from various sources and stages it into a data warehouse.

Now, immediately after the migration onto the new hardware, we noticed the nightly build taking almost twice as long as it used to, overlapping other run times on the server and causing them to fail.

The first thing I normally look at when confronted with a SQL server Instance that is not performing well is Wait Statistics. This normally provides a good high-level indication of where the major bottlenecks and performance issues are on a server.
When I ran the query below, CXPACKET waits accounted for 90% of total waits on the server.

        wait_type ,
        wait_time_ms wait_time_ms ,
        signal_wait_time_ms ,
        wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms ,
        100.0 * wait_time_ms / SUM(wait_time_ms) OVER ( )
                                    AS percent_total_waits ,
        100.0 * signal_wait_time_ms / SUM(signal_wait_time_ms) OVER ( )
                                    AS percent_total_signal_waits ,
        100.0 * ( wait_time_ms - signal_wait_time_ms )
        / SUM(wait_time_ms) OVER ( ) AS percent_total_resource_waits
FROM    sys.dm_os_wait_stats
WHERE   wait_time_ms > 0
        AND wait_type NOT IN
ORDER BY wait_time_ms DESC


CXPACKET waits occur when a process is executed in parallel and the process needs to wait for one of the worker threads to complete. This wait type normally points to an underlying problem on the server that is causing threads to be stalled, be it the disk I/O subsystem, memory pressure or blocking etc.

However, under certain circumstances inappropriate parallelism can cause the server to underperform, massively parallelised queries may be putting pressure on the I/O subsystem(which in this case was already under strain) or cause SQL`s internal latching mechanism to function ineffectively. In these cases the other high wait types on the server would include LATCH_*, PAGEIOLATCH_* or IO_COMPLETION together with CXPACKET. And this was exactly the case with the above-mentioned server.

Now knowing that the disks where already under pressure, and that SQL has free reign to open any number of threads per process as the Maximum degree of parallelism (MAXDOP)  setting was set to 0, which places no restriction on parallelisation, my first inclination was to throttle the use of multiple threads slightly.
Having a look at the dynamic managed view dm_io_virtual_file_stats which shows the read and write latency on the database files, I noted that throughput to disk was all but fast.  But with the client unable to expand the SAN at that moment, I had to come up with an interim solution to get rid of some of our headaches.

Normally the advice around the SQL community regarding MAXDOP is to allow more threads for OLAP-type workloads (Less queries with longer running times) and less threads for OLTP - Type workloads(more queries but less expensive). I decided to change it to 8, the number of physical cores per processor. 
The MAXDOP setting shouldn’t be considered on its own, cost threshold for parallelism is another server-wide setting that sets a costs value at which a query will be allowed to use multiple threads; this will stop short low-cost queries causing server overhead. I stuck with the default of 5, but analysis must be done for each individual server to determine this setting.

The figures below are based on hourly snapshots of the relevant system data.
 Figure 1 - CXPACKET Waits

Figure 2 - Other Wait types

Figure 3 - I/O on database files
Above figures 1 and 2 shows the CXPACKET waits and other I/O related waits coming down after changing MAXDOP to 8. Figure 3 shows the average time taken for a read or write to complete on the database files.

Obviously SQL was generating more overhead from splitting and combining queries, than the performance boost that parallel queries where supposed to provide in the first place.

It is always nice to see concrete results like these after making a change. Although we are still experiencing Disk I/O contention and there remains lots of work to be done, this change has made it easier to determine some of the other real issues on the server.


  1. Very nice post Frikkie! Excellent to see your real world experience and not just theory..nice!

  2. How many NUMA nodes does this server have? If there are more than 1, then it would most certainly determine your max degree of parallelism setting...