Today, reports and analytics are almost as important as core business. Reports can be built out of your live data; often this approach will do the trick for small- and medium-sized companies without lots of data. But when things get bigger – or the amount of data starts increasing dramatically – it’s time to think about separating your operational and reporting systems.
Before we tackle basic data modeling, we need some background on the systems involved. We can roughly divide systems in two categories: operational and reporting systems. Operational systems are often called Online Transaction Processing (OLTP). Reporting and analytical systems are referred to as Online Analytical Processing (OLAP). OLTP systems support business processes. They work with “live” operational data, are highly normalized, and react very quickly to user actions. On the other hand, the primary purpose of the OLAP systems is analytics. These systems use summarized data, which is usually placed in a denormalized data warehousing structure like the star schema. (What is denormalization? Simply put, it’s having redundant data records for the sake of better performance. Read more.)
Now that we know a little about the systems, let’s start examining the data warehouse and its pieces and processes.
Data Warehouses vs. Data Marts
A data warehouse (DWH) is a system used to store information for use in data analysis and reporting. Data marts are areas of a data warehouses used to store information needed by a single department or even by an individual user. (Think of the DWH as a building, and data marts as offices inside the building.)
Why are data marts needed? All relevant data is stored inside the company DWH. Most users, however, only need to access certain subsets of data, like those relating to sales, production, logistics or marketing. Data marts are important from both a security standpoint (limiting unnecessary access) and from a user standpoint (we don’t want to confuse them or force them to wade through extraneous data).
There are two different approaches to the data warehouse-data mart relationship:
- Top-Down: Data marts are created from the data warehouse. (This is something that Bill Inmon, the “father of the data warehouse”, would agree with, along with the idea that warehouses should be in 3NF.)
- Bottom-Up: Data marts are created first, then combined into a data warehouse. (This approach is closer to what Ralph Kimball, a data warehouse and dimensional modeling expert, advocates.)
The ETL process is used to add “new” data to the OLAP system on a regular basis. ETL is short for Extract, Transform and Load. As the name hints, we’ll extract data from one or more operational databases, transform it to fit our warehouse structure, and load the data into the DWH.
Dimensional modeling, which is part of data warehouse design, results in the creation of the dimensional model. There are two types of tables involved:
Dimension tables are used to describe the data we want to store. For example: a retailer might want to store the date, store, and employee involved in a specific purchase. Each dimension table is its own category (date, employee, store) and can have one or more attributes. For each store, we can save its location at the city, region, state and country level. For each date, we can store the year, month, day of the month, day of the week, etc. This is related to the hierarchy of attributes in the dimension table.
In the star schema, we’ll usually find that some attributes are a subset of other attributes in the same record. This redundancy is deliberate and done in the name of better performance. We could use date, location, and sales agent dimensions to aggregate (the transform part of the ETL process) and store data inside DWH. In dimensional modeling, it’s very important to define the right dimensions and choose proper granulation.
- Fact tables contain the data we want to include in reports, aggregated based on values within the related dimension tables. A fact table has only columns that store values and foreign keys referencing the dimension tables. Combining all the foreign keys forms the primary key of the fact table. For instance, a fact table could store a number of contacts and the number of sales resulting from these contacts.
With this info in place, we can now dig into the star schema data model.
The Star Schema
The star schema is the simplest model used in DWH. Because the fact table is in the center of the schema with dimension tables around it, it looks roughly like a star. This is especially apparent when the fact table is surrounded by five dimension tables. A variant of the star schema the centipede schema, where the fact table is surrounded by a large number of small dimension tables.
Star schemas are very commonly used in data marts. We can relate them to the top-down data model approach. We’ll analyze two star schemas (data marts) and then combine them to make a single model.
Star Schema Example: Sales
The sales report is one today’s most common reports. As we mentioned before, in most cases we could generate sales reports from the live system. But when data or business size makes this too cumbersome, we’ll have to build a data warehouse or a data mart to streamline the process. After designing our star schema, an ETL process will get the data from operational database(s), transform the data into the proper format for the DWH, and load the data into the warehouse.
The model presented above contains of one fact table (colored light red) and five dimension tables (colored light blue). The tables in the model are:
fact_sales
– This table contains references to the dimension tables plus two facts (price and quantity sold). Note that all five foreign keys together form the primary key of the table.dim_sales_type
– This is a sales-type dimension table with only one attribute, “type_name
”.dim_employee
– This is an employee dimension table that stores basic employee attributes: full name and birth year.dim_product
– This is a product dimension table with only two attributes (other than the primary key): product name and product type.dim_time
– This table handles the time dimension. It contains five attributes besides the primary key. The lowest-level data is sales by date (action_date
). Theaction_week
attribute is the number of the week in that year (i.e. the first week in January would be given the number 1; the last week in December would get the number 52, etc.) Theactual_month
andactual_year
attributes store the calendar month and year when the sale occurred. These can be extracted from theaction_date
attribute. Theaction_weekday
attribute stores the name of the day when the sale took place.dim_store
– This is a store dimension. For each store we’ll save the city, region, state and country where it is located. Here we can clearly notice that the star schema is denormalized.
Star Schema Example: Supply Orders
There are a lot of similarities between this model, shown below, and the sales model.
This model is intended to store the history of placed orders. We have one fact table and four dimension tables. The dimension tables dim_employee
, dim_product
and dim_time
are exactly the same as in the sales model. However, the following tables are different:
fact_supply_order
– contains aggregated data about the orders placed.dim_supplier
– is a dimension table that stores supplier data in the same manner asdim_store
kept store data in the sales model.
Advantages and Disadvantages to the Star Schema
There are plenty of advantages to using the star schema. The fact table is related to each dimension table by exactly one relation, and we don’t need any additional dictionaries to describe dimension tables. That simplifies queries and decreases query execution time. We could produce the same report directly from our OLTP system, but the query would be much more complex and it could impinge on the overall performance of the system. The following sample query for the sales model will return the quantity of all phone-type products type sold in Berlin stores in 2016:
SELECT dim_store.store_address, SUM(fact_sales.quantity) AS quantity_sold FROM fact_sales INNER JOIN dim_product ON fact_sales.product_id = dim_product.product_id INNER JOIN dim_time ON fact_sales.time_id = dim_time.time_id INNER JOIN dim_store ON fact_sales.store_id = dim_store.store_id WHERE dim_time.action_year = 2016 AND dim_store.city = 'Berlin' AND dim_product.product_type = 'phone' GROUP BY dim_store.store_id, dim_store.store_address
The biggest disadvantage of the star schema is redundancy. Each dimension is stored in a separate dimension table, and this causes denormalization. In our example, city belongs to a region or state, which belongs to a country; we don’t store that relation as a rule in our database, but we continually repeat it. This means we’ll spend more disk space and have a data integrity risk.
The Galaxy Schema
We can look at the two previous models as two data marts, one for the sales department and the other for the supply department. Each of them consists of only one fact table and a few dimensional tables. If we wanted, we could combine these two data marts into one model. This type of schema, containing several fact tables and sharing some dimension tables, is called a galaxy schema. Sharing dimension tables can reduce database size, especially where shared dimensions have many possible values. Ideally, in both data marts the dimensions are defined in the same manner. If that’s not the case, we’ll have to adjust the dimensions to fit both needs.
A galaxy schema, built out of our two example data marts, is shown below:
The star schema is one approach to organizing a data warehouse. It is very straightforward and is most often used in data marts. If we don’t have to worry about disk space and we take good care of data integrity, then the star schema is a viable first and best choice. If not, we should think of another approach. One is the snowflake schema, which we’ll discuss in an upcoming article.