Data volumes are exploding in every business where a database or BI professional might find themselves. Consequently, data warehouse and decision support systems need to keep up with this growth, and are expected to perform at the same levels while offering even deeper insights into what the data is actually saying.
When working with large tables, there are certain strategies to consider for improving query performance and reducing the total impact on the underlying server. These include correct indexing for your workloads, data compression, partitioning and columnstore indexes. Most SQL Server professionals know and have used these, but they sometimes get forgotten and left by the wayside when looking at a performance issue.
What is a large table?
First off, it is necessary to define what a “large” table is. Unfortunately there is no straightforward answer: it depends on the schema of the table, how it is queried, the speed of the underlying I/O subsystem and the amount of RAM on the server. It is not merely a question of how many rows the table has.
This is actually a whole separate discussion, but when you find a table that gets scanned while its clustered index doesn’t fit into memory, or you find that SQL Server is regularly running out of buffer space and there is a lot of activity in TempDB, then you can start looking at that table; the query below is one rough way to spot such tables.
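One rough way to check, sketched here with the sys.dm_os_buffer_descriptors DMV (SQL Server pages are 8 KB, hence the arithmetic), is to list how much of the buffer pool each table in the current database is occupying:

SELECT OBJECT_NAME(p.object_id) AS TableName,
       COUNT(*) * 8 / 1024      AS BufferMB -- each buffer page is 8 KB
FROM sys.dm_os_buffer_descriptors AS b
JOIN sys.allocation_units AS a
    ON b.allocation_unit_id = a.allocation_unit_id
JOIN sys.partitions AS p
    ON a.container_id = p.hobt_id
WHERE b.database_id = DB_ID()
  AND a.type IN (1, 3) -- IN_ROW_DATA and ROW_OVERFLOW_DATA
GROUP BY p.object_id
ORDER BY BufferMB DESC;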
I will be focusing on data warehouse scenarios. This is where wide tables with many rows are most common, because large amounts of de-normalized data are stored and kept over long periods in order to build historical trends.
Most of these topics have entire books
written about them – I will only be touching on each of them quickly.
Table Schema and Size
If you are in the lucky situation where you have control over the data schema, try to choose the smallest data type possible for your columns. On many occasions it is possible to use SmallInt/TinyInt or SmallDateTime instead of Int and DateTime, for example, as sketched below.
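A minimal sketch of the idea (the table and column names are invented for illustration; first confirm the narrower ranges actually fit your data, e.g. TinyInt only holds 0–255 and SmallDateTime is accurate to the minute):

CREATE TABLE dbo.FactSales_Narrow
(
    SalesKey     INT           NOT NULL, -- 4 bytes where BIGINT would use 8
    QuantitySold SMALLINT      NOT NULL, -- 2 bytes instead of INT's 4
    StatusCode   TINYINT       NOT NULL, -- 1 byte, values 0-255
    OrderDate    SMALLDATETIME NOT NULL  -- 4 bytes instead of DATETIME's 8
);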
Also, obviously, working with less data renders this whole discussion unnecessary, so first try to see if older data can be archived out of the table, along the lines of the sketch below.
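A minimal archiving sketch (the table names and cutoff date are invented; in production you would batch the DELETE or use partition switching instead of one big transaction):

-- Copy old rows to an archive table, then remove them from the active table.
INSERT INTO dbo.FactSales_Archive
SELECT *
FROM dbo.FactSales
WHERE OrderDate < '20080101';

DELETE FROM dbo.FactSales
WHERE OrderDate < '20080101';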
Appropriate Indexing
· Best candidate for the clustered index;
The most efficient candidate for a clustered index is normally the key that is the shortest (fewest columns and smallest data types), unique and sequential. Data warehouse surrogate keys are often the best example of this. In some cases the clustered index can be moved to another key because of excessive lookups.
· Appropriate non-clustered indexes;
Creating covering (and sometimes filtered) indexes for the most common queries on the table avoids table scans, having to read the big table into memory, and possibly spilling into TempDB with Hash and Sort operations. Reporting and building aggregations off data warehouse tables normally don’t generate a lot of ad-hoc queries, so one or two well-placed non-clustered indexes can have a big impact; see the covering-index sketch after this list. Also keep in mind that if you would like SQL Server to use bitmap filtered hash joins, the foreign key values of the dimensions must not be indexed on the fact table. These types of joins normally out-perform other join types on large scans.
· Avoid duplicate and unused indexes;
Every index takes up space in the data file, causing overhead on the disks and during maintenance. Furthermore, any DML operation has to do extra work to keep them up to date. Make sure there aren’t indexes whose keys overlap, and monitor your index usage statistics (sys.dm_db_index_usage_stats, queried in the sketch after this list) to make sure you don’t have unused indexes on the table.
The sys.dm_db_index_usage_stats Dynamic Management View is reset every time SQL Server restarts, so it might be a good idea to snapshot it on a regular basis in order to build a solid usage trend over time.
Lastly, don’t forget the most obvious things: make sure the index isn’t fragmented, and that the distribution statistics of the data are up to date!
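To make the covering-index and unused-index points concrete, two sketches follow. The index name and columns are invented for illustration; the usage query is the standard pattern against sys.dm_db_index_usage_stats.

-- A covering index for a common reporting query: the key supports the
-- join/predicate and INCLUDE carries the measures, so the query never
-- has to touch the base table. (Weigh this against the bitmap-filtered
-- hash join caveat above before indexing a dimension key.)
CREATE NONCLUSTERED INDEX IX_FactInternetSales_OrderDateKey
    ON dbo.FactInternetSales (OrderDateKey)
    INCLUDE (SalesAmount, OrderQuantity);

-- Indexes that have never been read since the last restart (only written):
-- candidates to investigate as unused.
SELECT OBJECT_NAME(s.object_id) AS TableName,
       i.name                   AS IndexName,
       s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON s.object_id = i.object_id
   AND s.index_id  = i.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks = 0
  AND s.user_scans = 0
  AND s.user_lookups = 0;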
Data Compression
SQL Server supports three compression implementations:
· Row compression – stores fixed-length string and numeric data type columns as variable length.
· Page compression – uses prefix and dictionary compression, basically storing duplicate values more efficiently; it also includes row compression.
· Unicode compression – uses single-byte storage for Unicode characters where possible; this is implemented automatically when you apply either row or page compression.
You can estimate the space savings from each compression mode by using the sp_estimate_data_compression_savings stored procedure. This will give you an idea of which one (page or row) will give the biggest benefit for the extra CPU overhead. Row compression is less CPU-intensive than page compression.
Running sp_estimate_data_compression_savings on the FactInternetSales table in the AdventureWorksDW database shows the possible savings for all the indexes and partitions.
Query:
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'FactInternetSales',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'page';
Results:
Clustered index size before: 82736 KB
Clustered index size after: 11768 KB
It is obvious that implementing page
compression on the Clustered index will bring the biggest savings.
ALTER TABLE dbo.FactInternetSales REBUILD
WITH (DATA_COMPRESSION = page);
Having a look at the I/O statistics on a
query before and after compression has been enabled gives a good indication of
how much less data is being read. This will ease pressure on disk and memory.
Query:
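-- Representative example (an assumption for illustration: any query that
-- scans the whole table shows the same effect). STATISTICS IO prints the
-- read counts shown below.
SET STATISTICS IO ON;

SELECT OrderDateKey, SUM(SalesAmount) AS Sales
FROM dbo.FactInternetSales
GROUP BY OrderDateKey;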
Results Before Compression:
Scan count 1, logical reads
10334, physical reads 1, read-ahead reads 10337
Results After Compression:
Scan count 1, logical reads
2151, physical reads 1, read-ahead reads 2153
ColumnStore Indexes
SQL Server 2012 has a new method of storing nonclustered indexes. In addition to regular row storage, SQL Server 2012 can store index data column by column, in what’s called a columnstore index. ColumnStore indexes accelerate data warehouse queries, but they are not suitable for OLTP workloads, and creating a ColumnStore index on a table makes the table read-only.
ColumnStore indexes are more effective because only the columns needed by the query are read, not the whole row. The data is also heavily compressed, reducing space and I/O overhead. SQL Server 2012 also introduced batch mode processing, which takes advantage of the columnar storage by processing data in batches of roughly 1,000 rows rather than one row at a time. Only certain operators, such as Filter and Scan, support batch processing.
Note: SQL Server 2014 will feature updateable ColumnStore indexes; you will also be able to create one as a clustered index.
ColumnStore indexes can work great when they are created on whole fact tables, and some large dimensions can benefit too. Use them when your workloads are mostly read-only and your process allows you to drop and re-create the index or switch out partitions regularly.
There are quite a number of restrictions and limitations when using columnstore indexes, covering data types, replication and so on. For more information go to: http://msdn.microsoft.com/en-us/library/gg492088.aspx
To demonstrate, the same query can be run before and after a ColumnStore index is created. In this instance the index was created on all the fields of the FactInternetSales table, along the lines of the sketch below.
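The creation statement would look like the following sketch (the index name is invented, and the column list is abbreviated to the columns the demo query uses; the original indexed every column of the table):

CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactInternetSales_ColumnStore
ON dbo.FactInternetSales (ProductKey, OrderDateKey, CustomerKey, SalesAmount);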
Query:
USE [AdventureWorksDW2012]
GO
SELECT C.FirstName, C.LastName, D.CalendarYear,
       SUM(I.SalesAmount) AS Sales
FROM dbo.FactInternetSales AS I
INNER JOIN dbo.DimCustomer AS C
    ON I.CustomerKey = C.CustomerKey
INNER JOIN dbo.DimProduct AS P
    ON I.ProductKey = P.ProductKey
INNER JOIN dbo.DimDate AS D
    ON I.OrderDateKey = D.DateKey
GROUP BY C.FirstName, C.LastName, D.CalendarYear
ORDER BY C.FirstName, C.LastName, D.CalendarYear;
Result before ColumnStore:
The Clustered index scan is the most
expensive operation in the query – 41% of total cost.
Result after ColumnStore:
The ColumnStore index was used, now the
operator cost is only 2% of total cost.
Indexed Views
Many of the most expensive operations on data warehouse and decision support systems are queries that aggregate data. It can therefore be beneficial to pre-calculate these aggregations and persist them on disk. Indexed views provide this ability, without having to create extra tables and maintain them during ETL. In SQL Server 2012 Enterprise edition, the optimizer will choose to use an indexed view in a query plan automatically, so there is no need to change any queries.
Indexed views are created when you define a view with the schema binding option and create a unique clustered index on the view. Indexed views have their own set of restrictions and considerations; a minimal example follows.
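A minimal sketch against FactInternetSales (the view and index names are invented; SCHEMABINDING, two-part table names and COUNT_BIG(*) are required for an indexed view that uses GROUP BY):

CREATE VIEW dbo.vSalesByOrderDate
WITH SCHEMABINDING
AS
SELECT OrderDateKey,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt -- mandatory when the view has GROUP BY
FROM dbo.FactInternetSales
GROUP BY OrderDateKey;
GO

-- The unique clustered index is what materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByOrderDate
    ON dbo.vSalesByOrderDate (OrderDateKey);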