Thursday, December 13, 2012

Database vs Data Warehouse

What is a data warehouse?

A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis. Data warehouses are often built as OLAP (On-Line Analytical Processing) systems: they hold largely read-only data that can be queried and analysed far more efficiently than the data in regular OLTP (On-Line Transaction Processing) application databases. In this sense an OLAP system is designed to be read-optimized.


A data warehouse is a database responsible for collecting and storing information for a specific organization, so that the information can be managed efficiently and analyzed. Although data warehouses vary in overall design, the majority of them are subject oriented, meaning that the stored information is connected to objects or events that occur in reality. The data a warehouse provides for analysis describes a specific subject rather than the functions of the company. It is collected from varying sources into one integrated unit and is time-variant, meaning that it keeps a history over time.

Data warehousing professionals build and maintain critical warehouse infrastructure that supports the business and helps executives make sound decisions. ETL (Extraction, Transformation and Loading of data) is an essential part of data warehousing: it populates the warehouse with information from production databases. The ETL process is changed as the needs of the business change, in order to maintain a consistent and accurate reporting system.
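The ETL step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `orders` and `fact_sales` tables, their columns, and the `run_etl` function are all hypothetical, and two in-memory SQLite databases stand in for the production database and the warehouse.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy orders from a production database into a warehouse table."""
    # Extract: read the raw rows from the (hypothetical) production table.
    rows = source.execute(
        "SELECT order_id, customer, amount_cents, ordered_at FROM orders"
    ).fetchall()

    # Transform: tidy customer names, convert cents to currency units,
    # and keep only the date part of the timestamp.
    cleaned = [
        (order_id, customer.strip().upper(), amount_cents / 100.0, ordered_at[:10])
        for order_id, customer, amount_cents, ordered_at in rows
    ]

    # Load: write the transformed rows into the (hypothetical) warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?)", cleaned
    )
    warehouse.commit()
    return len(cleaned)


# In-memory stand-ins for a production database and a warehouse.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, customer TEXT, amount_cents INTEGER, ordered_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, " acme ", 1250, "2012-12-01 09:30:00"),
        (2, "Globex", 9900, "2012-12-02 14:05:00"),
    ],
)
warehouse = sqlite3.connect(":memory:")

loaded = run_etl(source, warehouse)
print(loaded, "rows loaded")
```

A real ETL job would usually load incrementally and handle errors and schema changes. The sketch only shows the three phases in order.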

For more information on the Data Warehouse Lifecycle:

http://books.google.co.za/books/about/The_Data_Warehouse_Lifecycle_Toolkit.html?id=XaUV6r2Xy0IC&redir_esc=y

How is it different from a Database?

Application databases are OLTP (On-Line Transaction Processing) systems in which every transaction has to be recorded, and as quickly as possible at that. OLTP systems are therefore more likely to be designed to be write-optimized. There are a number of fundamental differences which separate a data warehouse from such a database.

The basic differences are:

The biggest difference between the two is that most databases place emphasis on a single application, and this application will generally be one that is based on transactions. If the data is analyzed, the analysis will usually stay within a single domain, although multiple domains are not uncommon.

Payroll and inventory are examples of the separate systems that may each have their own database. Each system places its emphasis on one subject and does not deal with other areas. In contrast, data warehouses deal with multiple domains simultaneously.

Because it deals with multiple subject areas, the data warehouse can find connections between them. This allows it to show how the company is performing as a whole, rather than in individual areas. Another powerful aspect of data warehouses is their ability to support the analysis of trends: they are non-volatile, so the information stored in them does not change as much as it would in a common database.

The two types of data that you will want to become familiar with are operational data and decision support data. The purpose, format, and structure of these two data types are quite different. In most cases, operational data is stored in a relational database.

In a relational database, data is stored in tables, and these are usually normalized. Operational data is structured to handle the transactions that are made on a daily basis. Every time a transaction takes place in the company, a record must be made of it, so this data is updated frequently. To keep the system efficient, the data is spread across a number of tables, each with its own fields. Because of this, a single transaction may be made up of five or more fields.

While this design may be highly efficient for an operational database, it is not conducive to queries. This is where decision support data is useful: it supports kinds of analysis that operational data does not readily allow.
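To make the querying cost of a normalized operational schema concrete, here is a minimal sketch using SQLite. The four tables, their columns, and the `fetch_invoice` helper are hypothetical and chosen only for illustration. Reading back one invoice touches every table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A small normalized schema (hypothetical): each fact is stored exactly once.
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY, description TEXT, unit_price REAL);
    CREATE TABLE invoice (
        invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, invoice_date TEXT);
    CREATE TABLE invoice_line (
        invoice_id INTEGER, product_id INTEGER, quantity INTEGER);

    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO product VALUES (10, 'Widget', 2.5);
    INSERT INTO product VALUES (11, 'Gadget', 10.0);
    INSERT INTO invoice VALUES (100, 1, '2012-12-13');
    INSERT INTO invoice_line VALUES (100, 10, 4);
    INSERT INTO invoice_line VALUES (100, 11, 1);
    """
)


def fetch_invoice(connection, invoice_id):
    """Rebuild one invoice by joining all four tables."""
    return connection.execute(
        """
        SELECT c.name, i.invoice_date, p.description, l.quantity,
               l.quantity * p.unit_price AS line_total
        FROM invoice AS i
        JOIN customer AS c ON c.customer_id = i.customer_id
        JOIN invoice_line AS l ON l.invoice_id = i.invoice_id
        JOIN product AS p ON p.product_id = l.product_id
        WHERE i.invoice_id = ?
        ORDER BY p.product_id
        """,
        (invoice_id,),
    ).fetchall()


invoice_rows = fetch_invoice(conn, 100)
for row in invoice_rows:
    print(row)
```

Writing a new invoice line is a single small insert, which suits transaction processing. The read side is where the joins accumulate.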

If you want to pull a single invoice out of operational data, you will often be required to join multiple tables. While operational data deals mostly with the transactions that are made daily, decision support data gives meaning to that operational data. The differences between decision support data and operational data can be split into three categories: dimensionality, time span, and granularity.

Dimensionality refers to the fact that data is connected in various ways. The data stored in a data warehouse is often multidimensional, which is quite different from the simple view usually seen with operational data. Many data analysts are interested in viewing the data along several of these dimensions.
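As a rough illustration of dimensionality, the sketch below totals the same sales figures along different combinations of dimensions. The sample facts, the dimension names, and the `slice_by` helper are all hypothetical. A real warehouse would normally do this in the database or in an OLAP tool.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
SALES = [
    ("North", "Widget", "Q1", 100.0),
    ("North", "Gadget", "Q1", 50.0),
    ("South", "Widget", "Q1", 70.0),
    ("South", "Widget", "Q2", 30.0),
]
DIMENSIONS = ("region", "product", "quarter")


def slice_by(facts, *dims):
    """Total the amount for each combination of the chosen dimensions."""
    positions = [DIMENSIONS.index(dim) for dim in dims]
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[pos] for pos in positions)
        totals[key] += fact[-1]  # the amount is the last element
    return dict(totals)


by_region = slice_by(SALES, "region")
by_product_quarter = slice_by(SALES, "product", "quarter")
print(by_region)
print(by_product_quarter)
```

The same facts answer different questions depending on which dimensions are chosen, which is the multidimensional view the paragraph above describes.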

Time span concerns how much history the data covers. Operational data generally deals with a short time frame and with transactions that are atomic and current. Decision support data, however, tends to deal with long time frames. Many company managers are interested in transactions that occurred over a certain time period. Instead of dealing with the purchase of one customer, managers are often more interested in the buying patterns of a group of customers. If a sale has just been made, it will typically not yet be found in the decision support data warehouse.

Granularity is the third concept that separates operational data from decision support data. Operational data deals with specific transactions that occurred at a given time. Decision support data, however, must be available at different levels of aggregation. While it may be highly summarized, it may also be kept close to the individual transactions. The managers within an organization will need information that is summarized to various degrees.
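Different levels of aggregation can be sketched as rolling the same transactions up by day, month, or year. The sample data, the `LEVELS` table, and the `roll_up` helper below are hypothetical, and the sketch assumes dates are stored as ISO `YYYY-MM-DD` strings so that a prefix of the string identifies each level.

```python
from collections import defaultdict

# Hypothetical transactions: (ISO date, amount).
TRANSACTIONS = [
    ("2011-11-30", 20.0),
    ("2012-12-12", 40.0),
    ("2012-12-13", 10.0),
    ("2012-12-13", 5.0),
]

# Length of the ISO date prefix that identifies each level of aggregation.
LEVELS = {"year": 4, "month": 7, "day": 10}


def roll_up(transactions, level):
    """Sum the amounts at the requested level of granularity."""
    width = LEVELS[level]
    totals = defaultdict(float)
    for date, amount in transactions:
        totals[date[:width]] += amount
    return dict(totals)


by_day = roll_up(TRANSACTIONS, "day")
by_month = roll_up(TRANSACTIONS, "month")
by_year = roll_up(TRANSACTIONS, "year")
print(by_day)
print(by_month)
print(by_year)
```

A manager might start from the yearly totals and drill down to months or days only where a figure looks unusual.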

Data warehouses have become more important in the Information Age, and they are a necessity for many large corporations, as well as some medium-sized businesses. They are much more elaborate than a mere database, and they can find connections in data that cannot be readily found within most databases.

2 comments:

  1. While I agree that an OLAP system is MUCH bigger in size than a (properly designed) OLTP system, I don't agree that an OLAP system is more elaborate. A properly normalized OLTP database, which has been developed to encapsulate the business logic with rules and user-defined data types, can become monstrously complicated very quickly. Generally speaking, you would not even have data were it not for the OLTP system in the first place. Everything has a purpose and I don't think one is more elaborate or important than the other.
