Wednesday, July 10, 2013

Performance on large tables


Data volumes are exploding in every business where a database or BI professional might find himself. Data warehouse and decision support systems have to keep up with this growth, and are expected to perform at the same levels while offering even deeper insights into what the data is actually saying.


When working with large tables, there are certain strategies to consider for improving query performance and reducing the total impact on the underlying server. These include correct indexing for your workloads, data compression, partitioning, columnstore indexes and so on. Most SQL Server professionals know and have used these, but they sometimes get forgotten and left by the wayside when looking at a performance issue.

What is a large table?

First off, it is necessary to define what a “large” table is. Unfortunately there is no straightforward answer: it depends on the schema of the table, how it is queried, the speed of the underlying I/O system and the amount of RAM in the server. It is not merely a question of how many rows the table has.

This is actually a whole separate discussion, but when you find a table that regularly gets scanned and whose clustered index doesn’t fit into memory, or you find that SQL Server is running out of buffer space and there is a lot of activity in TempDB, then you can start looking at that table.
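
As a rough way to check this, the sketch below totals how much of the buffer pool each object in the current database occupies (each page is 8 KB):

-- Rough sketch: buffer pool usage per object in the current database.
SELECT OBJECT_NAME(p.object_id) AS ObjectName,
       COUNT(*) * 8 / 1024 AS BufferMB
FROM sys.dm_os_buffer_descriptors AS b
INNER JOIN sys.allocation_units AS a
    ON b.allocation_unit_id = a.allocation_unit_id
INNER JOIN sys.partitions AS p
    ON (a.type IN (1, 3) AND a.container_id = p.hobt_id)
    OR (a.type = 2 AND a.container_id = p.partition_id)
WHERE b.database_id = DB_ID()
GROUP BY p.object_id
ORDER BY BufferMB DESC;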

I will be focusing on data warehouse scenarios. This is where wide tables with many rows are most common, because large amounts of de-normalized data are stored and kept over long periods in order to build historical trends.

Most of these topics have entire books written about them – I will only be touching on each of them quickly.

Table Schema and Size

If you are in the lucky situation where you have control over the data schema, try to choose the smallest data type possible for your columns. On many occasions it is possible to use SmallInt/TinyInt or SmallDateTime instead of Int and DateTime, for example.
Also, working with less data obviously renders this whole discussion unnecessary, so first see whether older data can be archived out of the table.
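
As a quick illustration of the data-type point, here is a hypothetical fact table (all names here are made up) where the narrower types shave bytes off every one of its millions of rows:

-- Hypothetical narrow fact table; every column uses the smallest type
-- that can still hold the expected range of values.
CREATE TABLE dbo.FactSaleNarrow
(
    SaleKey      INT IDENTITY(1,1) NOT NULL,  -- surrogate key
    StoreKey     SMALLINT      NOT NULL,      -- instead of INT: up to 32,767 stores
    PromotionKey TINYINT       NOT NULL,      -- instead of INT: up to 255 promotions
    SaleDate     SMALLDATETIME NOT NULL,      -- instead of DATETIME: 4 bytes, minute accuracy
    Quantity     SMALLINT      NOT NULL,
    SalesAmount  MONEY         NOT NULL
);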

Appropriate Indexing

·        Best candidate for the clustered index;

The most efficient candidate for a clustered index is normally the key that is the shortest (fewest columns and smallest data types), unique and sequential. Data warehouse surrogate keys are often the best example of this. In some cases the clustered index is better moved to another key, for instance when the surrogate key causes excessive lookups.
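
As a sketch, clustering the hypothetical table from the previous example on its surrogate key could look like this:

-- Cluster on the narrow, unique, ever-increasing surrogate key.
ALTER TABLE dbo.FactSaleNarrow
    ADD CONSTRAINT PK_FactSaleNarrow PRIMARY KEY CLUSTERED (SaleKey);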

·        Appropriate non-clustered indexes;

Creating covering indexes (and sometimes filtered ones) for the most common queries on the table avoids table scans, having to read the big table into memory, and possibly spilling into TempDB with Hash and Sort operations. Reporting and building aggregations off data warehouse tables normally don’t generate a lot of ad-hoc queries, so one or two well-placed non-clustered indexes can often have a big impact. Also keep in mind that if you would like SQL Server to use bitmap filtered hash joins, the dimension foreign key columns on the fact table must not be indexed; these types of joins normally out-perform other join types on large scans.
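
As an illustrative sketch against FactInternetSales (the index names and the filter boundary are made up, and weigh the bitmap-join caveat above before indexing dimension keys):

-- A covering index: the key supports the date predicate, and INCLUDE
-- carries the other columns so the query never touches the wide base table.
CREATE NONCLUSTERED INDEX IX_FactInternetSales_OrderDate_Covering
ON dbo.FactInternetSales (OrderDateKey)
INCLUDE (CustomerKey, ProductKey, SalesAmount);

-- A filtered variant, if most reports only touch recent data.
CREATE NONCLUSTERED INDEX IX_FactInternetSales_Recent
ON dbo.FactInternetSales (OrderDateKey)
INCLUDE (SalesAmount)
WHERE OrderDateKey >= 20080101;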

·        Avoid duplicate and unused indexes;

Every index takes up space in the data file, causing overhead on disk and during maintenance. Furthermore, every DML operation has to do extra work to keep indexes up to date. Make sure there aren’t indexes whose keys overlap, and monitor your index usage statistics (sys.dm_db_index_usage_stats) to make sure you don’t have unused indexes on the table.

The sys.dm_db_index_usage_stats Dynamic Management View is reset every time SQL Server restarts, so it is a good idea to snapshot it on a regular basis in order to build a solid usage trend over time.
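
One way to do this, as a rough sketch (the table name and scheduling are up to you):

-- Persist a regular snapshot of the usage DMV so the trend survives a restart.
IF OBJECT_ID('dbo.IndexUsageSnapshot') IS NULL
    CREATE TABLE dbo.IndexUsageSnapshot
    (
        SnapshotDate DATETIME NOT NULL DEFAULT (GETDATE()),
        database_id  SMALLINT NOT NULL,
        object_id    INT      NOT NULL,
        index_id     INT      NOT NULL,
        user_seeks   BIGINT   NOT NULL,
        user_scans   BIGINT   NOT NULL,
        user_lookups BIGINT   NOT NULL,
        user_updates BIGINT   NOT NULL
    );

INSERT INTO dbo.IndexUsageSnapshot
    (database_id, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates)
SELECT database_id, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID();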

Lastly, don’t forget the most obvious things: make sure the index isn’t fragmented, and that the distribution statistics of the data are up to date!
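
For example, something along these lines (the 10% threshold is just a common rule of thumb, and REBUILD versus REORGANIZE is a matter of policy):

-- Find fragmented indexes, then rebuild and refresh statistics.
SELECT OBJECT_NAME(ps.object_id) AS TableName,
       i.name AS IndexName,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
INNER JOIN sys.indexes AS i
    ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE ps.avg_fragmentation_in_percent > 10;

ALTER INDEX ALL ON dbo.FactInternetSales REBUILD;
UPDATE STATISTICS dbo.FactInternetSales WITH FULLSCAN;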


Data compression

Data compression reduces the size of the table, which helps improve query performance because queries on compressed data read fewer pages from disk. However, data compression requires extra CPU resources for updates, because data must be decompressed before and compressed after the update. Data compression is therefore suitable for data warehousing scenarios in which data is mostly read and only occasionally updated.
SQL Server supports three compression implementations:

·         Row compression – Stores fixed-length string and numeric data type columns as variable length.

·         Page compression – Uses prefix and dictionary compression, basically storing duplicate values more efficiently; it also includes row compression.

·         Unicode compression – Uses single-byte storage for Unicode characters where possible; this is implemented when you apply either row or page compression.

You can estimate the space savings of each compression mode by using the sp_estimate_data_compression_savings stored procedure. This will give you an idea of which one (page or row) will give the biggest benefit for the extra CPU overhead. Row compression is less CPU-intensive than page compression.

Running sp_estimate_data_compression_savings on the FactInternetSales table in the AdventureWorksDW database shows the possible savings for all the indexes and partitions.

Query:

EXEC sp_estimate_data_compression_savings @schema_name = 'dbo'
,@object_name = 'FactInternetSales'
,@index_id = NULL
,@partition_number = NULL
,@data_compression = 'page'

Results:
Clustered index size before: 82736 KB
Clustered index size after: 11768 KB

It is obvious that implementing page compression on the Clustered index will bring the biggest savings.
ALTER TABLE dbo.FactInternetSales REBUILD
WITH (DATA_COMPRESSION = PAGE);

Having a look at the I/O statistics on a query before and after compression has been enabled gives a good indication of how much less data is being read. This will ease pressure on disk and memory.

Query:

SET STATISTICS IO ON;
SELECT * FROM FactInternetSales;

Results Before Compression:
Scan count 1, logical reads 10334, physical reads 1, read-ahead reads 10337

Results After Compression:
Scan count 1, logical reads 2151, physical reads 1, read-ahead reads 2153


ColumnStore indexes

SQL Server 2012 has a new method of storing nonclustered indexes. In addition to regular row storage, SQL Server 2012 can store index data column by column, in what’s called a columnstore index. ColumnStore indexes accelerate data warehouse queries but are not suitable for OLTP workloads. Creating a ColumnStore index on a table makes it Read-Only.

ColumnStore indexes are more effective because only the columns needed by the query are read, not the whole row. The data is also heavily compressed, reducing space and I/O overhead. SQL Server 2012 also introduced batch processing, which takes advantage of the fact that data is stored in columns: data is processed in batches of roughly 1,000 rows rather than one row at a time. Only certain operators, such as Filter and Scan, support batch processing.

Note: SQL Server 2014 will feature updateable ColumnStore indexes; you will also be able to create them as clustered indexes.

ColumnStore indexes can work great when they are created over whole fact tables, and some large dimensions can benefit as well. Use them when your workload is mostly read-only and allows you to drop and re-create the index or switch out partitions regularly.

There are quite a number of restrictions and limitations when using ColumnStore indexes, covering data types, replication and so on. For more information go to: http://msdn.microsoft.com/en-us/library/gg492088.aspx

To demonstrate, a query can be run before and after a ColumnStore index is created. In this instance the index was created on all the fields of the FactInternetSales table.
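
It was built along the lines of the sketch below (the index name is made up, and only the columns the demo query touches are listed here; the actual index covered every column):

-- Nonclustered ColumnStore index; note this makes the table read-only
-- in SQL Server 2012.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactInternetSales
ON dbo.FactInternetSales (ProductKey, OrderDateKey, CustomerKey, SalesAmount);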

Query:


USE [AdventureWorksDW2012]
GO
SELECT C.FirstName, C.LastName, D.CalendarYear,
       SUM(I.SalesAmount) AS Sales
FROM dbo.FactInternetSales AS I
INNER JOIN dbo.DimCustomer AS C
    ON I.CustomerKey = C.CustomerKey
INNER JOIN dbo.DimProduct AS P
    ON I.ProductKey = P.ProductKey
INNER JOIN dbo.DimDate AS D
    ON I.OrderDateKey = D.DateKey
GROUP BY C.FirstName, C.LastName, D.CalendarYear
ORDER BY C.FirstName, C.LastName, D.CalendarYear;

Result before ColumnStore:

The Clustered index scan is the most expensive operation in the query – 41% of total cost.



Result after ColumnStore:


The ColumnStore index was used; the scan operator now accounts for only 2% of the total cost.


Indexed Views

Many of the most expensive operations on Data Warehouse and decision support systems are queries that aggregate data. It can then be beneficial to pre-calculate these aggregations and persist them on disk. Indexed views provide this ability, without having to create extra tables and having to maintain them during ETL. In SQL Server 2012 Enterprise edition, SQL will choose to use an indexed view in a query plan automatically, so there is no need to change any queries.
Indexed views are created when you define a view with the schema binding option and create a clustered index on the view.
Indexed views have their own set of restrictions and considerations, such as the requirement for schema binding, two-part table names and deterministic expressions.
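
A minimal sketch, assuming we want to pre-aggregate internet sales by order date (the view and index names are made up):

CREATE VIEW dbo.vSalesByOrderDate
WITH SCHEMABINDING
AS
SELECT OrderDateKey,
       SUM(SalesAmount) AS SalesAmount,
       COUNT_BIG(*)     AS RowCnt   -- required when the view uses GROUP BY
FROM dbo.FactInternetSales
GROUP BY OrderDateKey;
GO
-- The unique clustered index is what persists the aggregation on disk.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByOrderDate
ON dbo.vSalesByOrderDate (OrderDateKey);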



