Recently I completed a migration of a SQL Server 2008 R2 instance onto new hardware for a client. The SAN storage was retained as is, while the server itself was upgraded to a blade with 64 GB of memory and four of AMD's Opteron 6276 CPUs. Upon logging on to the server I noticed 64 processors available to the OS and SQL Server – 64!
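A quick way to confirm what the OS is presenting to SQL Server is sys.dm_os_sys_info; a minimal check along these lines:

SELECT cpu_count,          -- logical processors visible to SQL Server
       hyperthread_ratio,  -- logical CPUs exposed per physical processor package
       scheduler_count     -- schedulers SQL Server has created for them
FROM   sys.dm_os_sys_info;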
This staggering number is thanks to AMD's "Bulldozer" architecture, their response to Intel's Hyper-Threading. Rather than two lightweight threads sharing one core, each Bulldozer module gives largely dedicated hardware to the two threads running on it, effectively turning an 8-module chip into 16 cores.
The SQL instance in question hosts an array of OLAP databases, the largest of which is around 600 GB. This database undergoes an ETL process every night that pulls data from various sources and stages it into a data warehouse. Immediately after the migration onto the new hardware, we noticed the nightly build taking almost twice as long as it used to, overlapping other scheduled runs on the server and causing them to fail.
The first thing I normally look at when confronted with a SQL Server instance that is not performing well is wait statistics. These normally provide a good high-level indication of where the major bottlenecks and performance issues on a server lie.
When I ran the query below, CXPACKET waits accounted for 90% of total waits on the server.

SELECT  wait_type ,
        wait_time_ms ,
        signal_wait_time_ms ,
        wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms ,
        100.0 * wait_time_ms / SUM(wait_time_ms) OVER ( )
            AS percent_total_waits ,
        100.0 * signal_wait_time_ms / SUM(signal_wait_time_ms) OVER ( )
            AS percent_total_signal_waits ,
        100.0 * ( wait_time_ms - signal_wait_time_ms )
            / SUM(wait_time_ms) OVER ( ) AS percent_total_resource_waits
FROM    sys.dm_os_wait_stats
WHERE   wait_time_ms > 0
        AND wait_type NOT IN
        ( 'SLEEP_TASK', 'BROKER_TASK_STOP', 'BROKER_TO_FLUSH',
          'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
          'LAZYWRITER_SLEEP', 'SLEEP_SYSTEMTASK', 'SLEEP_BPOOL_FLUSH',
          'BROKER_EVENTHANDLER', 'XE_DISPATCHER_WAIT', 'FT_IFTSHC_MUTEX',
          'CHECKPOINT_QUEUE', 'FT_IFTS_SCHEDULER_IDLE_WAIT',
          'BROKER_TRANSMITTER', 'KSOURCE_WAKEUP',
          'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE',
          'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BAD_PAGE_PROCESS',
          'DBMIRROR_EVENTS_QUEUE', 'BROKER_RECEIVE_WAITFOR',
          'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS',
          'WAITFOR', 'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_JOIN',
          'RESOURCE_QUEUE' )
ORDER BY wait_time_ms DESC
Source:
http://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/
CXPACKET waits occur when a query executes in parallel and one of its threads has to wait for another worker thread to complete. This wait type normally points to an underlying problem on the server that is causing threads to stall, be it the disk I/O subsystem, memory pressure, blocking and so on.
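To see which sessions are actually stuck on CXPACKET at a given moment, sys.dm_os_waiting_tasks can be checked while the load is running; a minimal sketch:

SELECT wt.session_id,
       wt.exec_context_id,        -- 0 is the coordinating thread, >0 are worker threads
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.resource_description
FROM   sys.dm_os_waiting_tasks AS wt
WHERE  wt.wait_type = 'CXPACKET'
ORDER BY wt.wait_duration_ms DESC;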
However, under certain circumstances inappropriate parallelism can itself cause the server to underperform: massively parallelised queries may put pressure on the I/O subsystem (which in this case was already under strain) or cause SQL Server's internal latching mechanism to function ineffectively. In these cases the other high wait types on the server would include LATCH_*, PAGEIOLATCH_* or IO_COMPLETION together with CXPACKET. And this was exactly the case with the above-mentioned server.
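A quick companion check is to pull just those wait types out of sys.dm_os_wait_stats and see whether they rank alongside CXPACKET; something like:

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type LIKE 'LATCH[_]%'
   OR  wait_type LIKE 'PAGEIOLATCH[_]%'
   OR  wait_type IN ('IO_COMPLETION', 'CXPACKET')
ORDER BY wait_time_ms DESC;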
Knowing that the disks were already under pressure, and that SQL Server had free rein to use any number of threads per process because the maximum degree of parallelism (MAXDOP) setting was set to 0, which places no restriction on parallelisation, my first inclination was to throttle the use of multiple threads slightly.
Having a look at the dynamic management function sys.dm_io_virtual_file_stats, which shows the read and write latency on the database files, I noted that throughput to disk was anything but fast. But with the client unable to expand the SAN at that moment, I had to come up with an interim solution to get rid of some of our headaches.
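A query along these lines gives per-file read and write latency from sys.dm_io_virtual_file_stats (a sketch; the join to sys.master_files is only there for readable file names, and the figures are cumulative since the last restart):

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END   AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN   sys.master_files AS mf
       ON  mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;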
Normally the advice around the SQL community regarding MAXDOP is to allow more threads for OLAP-type workloads (fewer queries with longer running times) and fewer threads for OLTP-type workloads (more queries, but cheaper ones). I decided to change it to 8, the number of physical cores per processor.

The MAXDOP setting shouldn't be considered on its own. Cost threshold for parallelism is another server-wide setting; it sets the cost value at which a query will be allowed to use multiple threads, which stops short, low-cost queries from generating parallelism overhead. I stuck with the default of 5, but analysis should be done on each individual server to determine this setting.
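Both are instance-wide options set through sp_configure (they are advanced options, so 'show advanced options' has to be enabled first). A sketch with the values used here:

-- Parallelism options are advanced settings, so expose them first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap the number of threads a single parallel query can use
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;

-- Left at the default of 5 here; tune per workload analysis
EXEC sp_configure 'cost threshold for parallelism', 5;
RECONFIGURE;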
The figures below are based on hourly snapshots of the
relevant system data.
Figure 1 - CXPACKET Waits
Obviously SQL Server was generating more overhead from splitting and recombining the parallel streams than the performance boost that parallel queries were supposed to provide in the first place.
It is always nice to see concrete results like these after making a change. Although we are still experiencing disk I/O contention and there remains a lot of work to be done, this change has made it easier to pin down some of the other real issues on the server.
Comments:

cool

Very nice post Frikkie! Excellent to see your real world experience and not just theory.. nice!

How many NUMA nodes does this server have? If there is more than one, then it would most certainly determine your max degree of parallelism setting...
http://support.microsoft.com/kb/2023536