In the previous two parts, we’ve presented the live database model for a subscription-based business and a data warehouse (DWH) we could use for reporting. While it’s obvious that they should work together, there was no connection between these two models. Today, we’ll take that next step and write the code to transfer data from the live database into our DWH.
The Data Models
Before we dive into the code, let’s remind ourselves of the two models we’ll work with. First is the transactional data model we’ll use to store our real-time data. Since we run a subscription-based business, we’ll need to store customer and subscription details, customers’ orders, and the order statuses.
There is really a lot we could add to this model, such as tracking payments and storing historical data (especially changes in customer and subscription data). To emphasize the ETL (extract, transform, and load) process, though, I want to keep this model as simple as possible.
Using a transactional data model as a reporting database might work in some instances, but it won’t work for all cases. We’ve already mentioned that, but it’s worthwhile to repeat it. If we want to separate our reporting tasks from our real-time processes, we should create some kind of reporting database. A data warehouse is one solution.
Our DWH is centered around four fact tables. The first two track the number of customers and subscriptions on a daily level. The remaining two track the number of deliveries and the products included in these deliveries.
My assumption is that we’ll run our ETL process once a day. First, we’ll populate dimension tables with new values (where needed). After that, we’ll populate fact tables.
To avoid unnecessary repetition, I’ll demonstrate only the code that populates the first two dimension tables and the first two fact tables. The remaining tables can be populated using very similar code. I encourage you to write the code yourself. There is no better way to learn something new than by trying it.
The Idea: Dimension Tables
The general idea is to create stored procedures that we could regularly use to populate the DWH -- dimension tables as well as fact tables. These procedures will transfer data between two databases on the same server. This means that some queries inside these procedures will use tables from both databases. This is expected; we need to compare the state of the DWH to the live DB and make changes to the DWH according to what’s going on in the live DB.
We have four dimension tables in our DWH: dim_time, dim_city, dim_product, and dim_delivery_status.
The time dimension is populated by adding the previous date. The main assumption is that we’ll run this procedure daily, after the close of business.
The city and product dimensions will depend on the current values stored in the city and product dictionaries in the live database. If we add something to these dictionaries, then new values will be added to the dimension tables on the next DWH update.
The last dimension table is the dim_delivery_status table. It won’t get updated because it contains only three default values: a delivery is either in transit, canceled, or delivered.
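If you ever need to (re)seed that table, a one-time insert like the one below would do. This is only a sketch: the status_name column is an assumption about the dimension’s structure, so adjust the column list to match your actual model.

-- One-time seed for dim_delivery_status (illustrative sketch only).
-- Assumes the dimension has status_name and ts columns; adjust to your model.
INSERT INTO subscription_dwh.dim_delivery_status (status_name, ts)
VALUES ('in transit', NOW()),
       ('canceled',   NOW()),
       ('delivered',  NOW());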
The Idea: Fact Tables
Populating fact tables is actually the real job. While the dictionaries in the live database don’t contain a timestamp attribute, tables with data inserted as a result of our operations do. You’ll notice two timestamp attributes, time_inserted and time_updated, in the data model.
Again, I’m assuming that we’ll successfully run the DWH import once a day. This enables us to aggregate the data on a daily level. We’ll count the numbers of active and canceled customers and subscriptions, as well as the deliveries and delivered products for that date.
Our live model works well if we run an insert procedure after the COB (close of business). Still, if we want more flexibility, we should make some changes to the model. One such change could be having a separate history table to track the exact moment when any data related to customers or subscriptions changed. With our current organization, we’ll know that the change has happened, but we won’t know if there were any changes before this one (e.g. a customer canceled yesterday, re-activated his account after midnight, and then canceled again today).
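To give a rough idea of what such a change could look like, here is a minimal sketch of a history table. The table and column names are my assumptions for illustration; they are not part of the current model.

-- Hypothetical history table (not part of the current model) that records every
-- change to a customer's status, so intraday flip-flops (cancel, re-activate,
-- cancel again) are preserved and not lost between two ETL runs.
CREATE TABLE subscription_live.customer_status_history (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    customer_id  INT NOT NULL,
    active       TINYINT(1) NOT NULL,   -- status after the change
    time_changed DATETIME NOT NULL,     -- exact moment of the change
    CONSTRAINT fk_customer_status_history_customer
        FOREIGN KEY (customer_id) REFERENCES subscription_live.customer (id)
);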
Populating Dimension Tables
As mentioned before, I’ll go with the assumption that we’ll run the DWH import exactly once a day. If that’s not the case, we’d need additional code to delete newly-inserted data from the dimension and fact tables. For the dimension tables, this would be limited to deleting the given date.
First, we’ll check if the given date exists in the dim_time table. If not, we’ll add a new row to the table; if it does, we don’t need to do anything. In most cases, all dates are inserted during the initial production deployment. But I’ll go with this example for educational purposes.
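If you prefer to pre-populate the time dimension at deployment time instead, a simple loop over a date range will do. This is a sketch that reuses the same dim_time columns as the procedure shown later; the procedure name and parameters are my own.

-- Sketch: pre-populate dim_time for a whole date range at deployment time.
DROP PROCEDURE IF EXISTS p_populate_dim_time//

CREATE PROCEDURE p_populate_dim_time (IN p_date_from DATE, IN p_date_to DATE)
BEGIN
  SET @d = p_date_from;
  WHILE @d <= p_date_to DO
    -- insert the date only if it is not already in the dimension
    SET @date_exists = (SELECT COUNT(*) FROM subscription_dwh.dim_time WHERE time_date = @d);
    IF (@date_exists = 0) THEN
      INSERT INTO subscription_dwh.dim_time (time_date, time_year, time_month, time_week, time_weekday, ts)
      VALUES (@d, YEAR(@d), MONTH(@d), WEEK(@d), WEEKDAY(@d), NOW());
    END IF;
    SET @d = DATE_ADD(@d, INTERVAL 1 DAY);
  END WHILE;
END//

-- CALL p_populate_dim_time ('2024-01-01', '2024-12-31')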
For the dim_city and dim_product dimensions, I’ll add only any new values I detect in the city and product tables. I won’t make any deletions because any previously inserted values could be referenced in some fact table. We could go with a soft deletion, e.g. having an “active” flag which we could switch on and off.
For the last table, dim_delivery_status, I won’t do anything because it will always contain the same three values.
The code below creates a procedure that will populate the dimension tables dim_time and dim_city.
For the time dimension, I’ll add yesterday’s date. I’m going with the assumption that the ETL process starts right after midnight. I’ll check if that date already exists in the dimension table and, if it doesn’t, insert the new date.
For the city dimension, I’ll use a LEFT JOIN to join data from the live database and the DWH database and determine which rows are missing. Then I’ll add only the missing data to the dimension table. It’s worth mentioning that there are a few ways to check if data has changed. This process is called change data capture, or CDC. A common method is checking for updated timestamps or versions. There are a few additional ways, but they are outside the scope of this article.
Let’s take a look at the code now, which is written using MySQL syntax.
DROP PROCEDURE IF EXISTS p_update_dimensions//

CREATE PROCEDURE p_update_dimensions ()
BEGIN
  SET @time_exists = 0;
  SET @time_date = DATE_ADD(DATE(NOW()), INTERVAL -1 DAY);

  -- procedure populates dimension tables with new values

  -- dim_time
  SET @time_exists = (SELECT COUNT(*) FROM subscription_dwh.dim_time dim_time WHERE dim_time.time_date = @time_date);
  IF (@time_exists = 0) THEN
    INSERT INTO subscription_dwh.`dim_time` (`time_date`, `time_year`, `time_month`, `time_week`, `time_weekday`, `ts`)
    SELECT
      @time_date AS time_date,
      YEAR(@time_date) AS time_year,
      MONTH(@time_date) AS time_month,
      WEEK(@time_date) AS time_week,
      WEEKDAY(@time_date) AS time_weekday,
      NOW() AS ts;
  END IF;

  -- dim_city
  INSERT INTO subscription_dwh.`dim_city` (`city_name`, `postal_code`, `country_name`, `ts`)
  SELECT
    city_live.city_name,
    city_live.postal_code,
    country_live.country_name,
    NOW()
  FROM subscription_live.city city_live
  INNER JOIN subscription_live.country country_live ON city_live.country_id = country_live.id
  LEFT JOIN subscription_dwh.dim_city city_dwh
    ON city_live.city_name = city_dwh.city_name
    AND city_live.postal_code = city_dwh.postal_code
    AND country_live.country_name = city_dwh.country_name
  WHERE city_dwh.id IS NULL;
END//

-- CALL p_update_dimensions ()
Running this procedure -- by uncommenting the CALL statement at the end -- inserts a new date and all missing cities into the dimension tables. Try adding your own code to populate the remaining two dimension tables with new values.
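As a starting point for the product dimension, here is a sketch that follows the same LEFT JOIN pattern. The product_name column is an assumption about both the product dictionary and the dim_product table, so adjust the column list to your actual model.

-- Sketch only: add any products missing from the DWH to dim_product.
-- Column names (product_name) are assumptions; adjust them to the actual model.
INSERT INTO subscription_dwh.dim_product (product_name, ts)
SELECT
  product_live.product_name,
  NOW()
FROM subscription_live.product product_live
LEFT JOIN subscription_dwh.dim_product product_dwh
  ON product_live.product_name = product_dwh.product_name
WHERE product_dwh.id IS NULL;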
The ETL Process in a Data Warehouse
The main idea behind data warehousing is to contain aggregated data in the desired format. Of course, we should know that format before we even start building the warehouse. If we’ve done everything as planned, we can get all the benefits a DWH offers us. The main benefit is improved performance when running queries. Our queries work with fewer records (because they are aggregated) and run on the reporting database (rather than the live one).
But before we can query, we need to store facts in our database. The way we’ll do that depends on what we need to do with our data later. If we don’t have a good overall picture before starting to build our DWH, we could soon find ourselves in trouble!
The name of this process is ETL: E = Extract, T = Transform, L = Load. It grabs the data, transforms it to suit the DWH structure, and loads it in the DWH. To be precise, the actual process we’ll use is ELT: Extract, Load, Transform. Since we’re using stored procedures, we’ll extract data, load it, and then transform it to meet our needs. It’s good to know that while ETL and ELT are slightly different, the terms are sometimes used interchangeably.
Populating the Fact Tables
Populating fact tables is why we’re really here. Today, I’ll populate two fact tables, the fact_customer_subscribed table and the fact_subscription_status table. The remaining two fact tables are yours to try as homework.
Before we move on to populating the fact tables, we must assume that the dimension tables have been populated with new values. Populating the two fact tables follows the same pattern. Since they have the same structure, I’ll explain both of them together.
We’re grouping data by two dimensions: time and city. The time dimension will be set to yesterday, and we’ll find the ID of the related record in the dim_time table by comparing dates (the last INNER JOIN in both queries).

The ID of dim_city is extracted by joining all the attributes that form a UNIQUE combination in the dimension table (city name, postal code, and country name).
In this query, we’ll test values with CASE and then SUM them. For the totals of active and inactive customers, I haven’t tested the date at all; I’ve simply used the current value of the active flag. For newly activated and canceled accounts, however, I’ve also tested the updated time.
DROP PROCEDURE IF EXISTS p_update_facts//

CREATE PROCEDURE p_update_facts ()
BEGIN
  SET @time_date = DATE_ADD(DATE(NOW()), INTERVAL -1 DAY);

  -- procedure populates fact tables with new values

  -- fact_customer_subscribed
  INSERT INTO `fact_customer_subscribed` (`dim_city_id`, `dim_time_id`, `total_active`, `total_inactive`, `daily_new`, `daily_canceled`, `ts`)
  SELECT
    city_dwh.id AS dim_city_id,
    time_dwh.id AS dim_time_id,
    SUM(CASE WHEN customer_live.active = 1 THEN 1 ELSE 0 END) AS total_active,
    SUM(CASE WHEN customer_live.active = 0 THEN 1 ELSE 0 END) AS total_inactive,
    SUM(CASE WHEN customer_live.active = 1 AND DATE(customer_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_new,
    SUM(CASE WHEN customer_live.active = 0 AND DATE(customer_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_canceled,
    MIN(NOW()) AS ts
  FROM subscription_live.`customer` customer_live
  INNER JOIN subscription_live.`city` city_live ON customer_live.city_id = city_live.id
  INNER JOIN subscription_live.`country` country_live ON city_live.country_id = country_live.id
  INNER JOIN subscription_dwh.dim_city city_dwh
    ON city_live.city_name = city_dwh.city_name
    AND city_live.postal_code = city_dwh.postal_code
    AND country_live.country_name = city_dwh.country_name
  INNER JOIN subscription_dwh.dim_time time_dwh ON time_dwh.time_date = @time_date
  GROUP BY city_dwh.id, time_dwh.id;

  -- fact_subscription_status
  INSERT INTO `fact_subscription_status` (`dim_city_id`, `dim_time_id`, `total_active`, `total_inactive`, `daily_new`, `daily_canceled`, `ts`)
  SELECT
    city_dwh.id AS dim_city_id,
    time_dwh.id AS dim_time_id,
    SUM(CASE WHEN subscription_live.active = 1 THEN 1 ELSE 0 END) AS total_active,
    SUM(CASE WHEN subscription_live.active = 0 THEN 1 ELSE 0 END) AS total_inactive,
    SUM(CASE WHEN subscription_live.active = 1 AND DATE(subscription_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_new,
    SUM(CASE WHEN subscription_live.active = 0 AND DATE(subscription_live.time_updated) = @time_date THEN 1 ELSE 0 END) AS daily_canceled,
    MIN(NOW()) AS ts
  FROM subscription_live.`customer` customer_live
  INNER JOIN subscription_live.`subscription` subscription_live ON subscription_live.customer_id = customer_live.id
  INNER JOIN subscription_live.`city` city_live ON customer_live.city_id = city_live.id
  INNER JOIN subscription_live.`country` country_live ON city_live.country_id = country_live.id
  INNER JOIN subscription_dwh.dim_city city_dwh
    ON city_live.city_name = city_dwh.city_name
    AND city_live.postal_code = city_dwh.postal_code
    AND country_live.country_name = city_dwh.country_name
  INNER JOIN subscription_dwh.dim_time time_dwh ON time_dwh.time_date = @time_date
  GROUP BY city_dwh.id, time_dwh.id;
END//

-- CALL p_update_facts ()
Once again, I’ve commented out the last line. Remove the comment and you can use this line to call the procedure and insert new values. Please note that I haven’t deleted any existing old values, so this procedure won’t work if we already have values for that date and city. This can be solved by performing deletions before insertions.
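A minimal sketch of that cleanup, which could run at the start of p_update_facts, is shown below. It simply removes any facts already stored for yesterday’s date before they are re-inserted; the table aliases are mine.

-- Sketch: remove any facts already loaded for yesterday's date before re-inserting them.
SET @time_date = DATE_ADD(DATE(NOW()), INTERVAL -1 DAY);

DELETE fact_cs
FROM fact_customer_subscribed fact_cs
INNER JOIN subscription_dwh.dim_time time_dwh ON fact_cs.dim_time_id = time_dwh.id
WHERE time_dwh.time_date = @time_date;

DELETE fact_ss
FROM fact_subscription_status fact_ss
INNER JOIN subscription_dwh.dim_time time_dwh ON fact_ss.dim_time_id = time_dwh.id
WHERE time_dwh.time_date = @time_date;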
Remember, we need to populate the remaining fact tables in our DWH. I encourage you to try that yourself!
Another thing I would definitely recommend is placing the whole process inside a transaction. That would ensure that either all insertions succeed or none are made. This is very important when we want to avoid having data partially inserted, e.g. if we have multiple procedures for inserting dimensions and facts and some of them do their job while others fail.
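A minimal sketch of such a wrapper, assuming the DWH tables use a transactional storage engine like InnoDB, could look like this (the wrapper procedure name is my own):

-- Sketch: run the whole DWH update in a single transaction.
-- Assumes the DWH tables use a transactional engine (e.g. InnoDB).
DROP PROCEDURE IF EXISTS p_update_dwh//

CREATE PROCEDURE p_update_dwh ()
BEGIN
  -- roll everything back (and re-raise the error) if anything fails
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    ROLLBACK;
    RESIGNAL;
  END;

  START TRANSACTION;
  CALL p_update_dimensions();
  CALL p_update_facts();
  COMMIT;
END//

-- CALL p_update_dwh ()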
What Do You Think?
Today we’ve seen how we could perform the ELT/ETL process and load data from a live database into a data warehouse. While the process we demonstrated is pretty simplified, it contains all the elements needed to E(xtract) the data, T(ransform) it into a suitable format, and finally L(oad) it into the DWH. What do you think? Please tell us your experiences in the comments below.