
A look at algorithms used in RDBMS implementations of DWH systems

When you’re using a data warehouse, some actions get repeated over and over. We will take a look at four common algorithms used to deal with these situations.

Most DWH (data warehouse) systems today are built on an RDBMS (relational database management system) platform. These databases (Oracle, DB2, or Microsoft SQL Server) are widely used, easy to work with, and mature – a very important consideration when choosing a platform. Although they include special features for DWH workloads (like bitmap indexes, partitioning, and materialized views), we still have to take care of data transformation ourselves.

In this article, we will examine four algorithms for working with data in RDBMS DWH systems. Since much of DWH development revolves around time-sensitive data, most of the algorithms deal with data that changes over time.

All of the techniques we will consider are written in pure SQL, not procedural SQL (like PL/SQL). When writing batch-oriented algorithms, it is almost always more prudent to write pure SQL. This is because SQL is processed by an execution optimizer (e.g. Oracle’s cost-based optimizer, the CBO), which chooses the optimal execution plan and the order of sub-operations for the query. How the CBO does this is a fascinating subject; perhaps we will explore it a bit more in an upcoming article. For now, let’s stick with the algorithms.
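To make the contrast concrete, here is a minimal sketch of the same batch update written row by row in PL/SQL and as one set-based SQL statement. The table and column names (stg_balances, fact_balances) are invented for illustration only:

-- Row-by-row (procedural) version: we dictate the order of operations ourselves.
begin
  for r in (select account_id, balance_amount from stg_balances) loop
    update fact_balances f
       set f.balance_amount = r.balance_amount
     where f.account_id = r.account_id;
  end loop;
end;
/

-- Set-based (pure SQL) version: the optimizer picks the access paths, join method, and order.
merge into fact_balances f
using stg_balances s
on (f.account_id = s.account_id)
when matched then
  update set f.balance_amount = s.balance_amount;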

We will start with the simple stuff and slowly increase the complexity.

1. Adding Information to a Periodic Snapshot Fact Table

First of all, let’s see how to fill a periodic snapshot fact table (one row per fixed time interval) from a table that stores the same information as Type 2 validity periods that are updated over time.

To refresh your knowledge of fact table types and organizations, see my previous article. We will use the same fact tables here.

So our first table is fact_account_snapshot, and our period fact table (with Type 2 technical columns) is fact_account_snapshot_period. They hold the same information, but they organize it differently. The fact_account_snapshot table contains a row for every snapshot date, while fact_account_snapshot_period stores a row only when the data has changed. Both hold account balance information, so in our case we will have a couple of accounts whose balances have changed over time.
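For orientation, here is a minimal, assumed DDL for the structures we will work with; the column names are taken from the queries below, while the data types and constraints are guesses:

create table dim_time (
  time_id    number primary key,  -- e.g. 20010201
  time_date  date   not null      -- the snapshot date
);

create table fact_account_snapshot (
  id             number primary key,
  time_id        number references dim_time,
  employee_id    number,
  account_id     number,
  balance_amount number
);

create table fact_account_snapshot_period (
  id             number primary key,
  employee_id    number,
  account_id     number,
  balance_amount number,
  date_from      date,  -- lower bound of the validity period (inclusive)
  date_to        date   -- upper bound of the validity period (exclusive)
);

create sequence fact_seq;  -- feeds fact_account_snapshot.id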




To transform the rows and fill the fact_account_snapshot table from the period table, all we need to do is expand the monthly period of the fact_account_snapshot_period table with every snapshot date contained in that period.

Picture 1: Monthly period to snapshot

To do this, we need to connect the fact_account_snapshot_period to a time dimensional table which contains the snapshot dates. Let’s say that this table is called dim_time. (We should mention that fact_seq represents the ID sequence of the fact_account_snapshot table.) The SQL statement for the transformation is simple:

insert into 
  fact_account_snapshot (id, time_id, employee_id, account_id, balance_amount)
select 
  fact_seq.nextval,
  dt.time_id,
  fasp.employee_id,
  fasp.account_id,
  fasp.balance_amount 
from fact_account_snapshot_period fasp 
join dim_time dt 
on (dt.time_date >= fasp.date_from
and dt.time_date < fasp.date_to);

Note: If the date_to column is left open with a NULL or a high date (such as 31.12.9999), you will need an NVL or CASE expression to cap date_to at the maximum snapshot date you require.
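For example, a variant of the insert that caps open-ended periods could look like the sketch below; :max_snapshot_date is an assumed bind variable holding the last snapshot date you want to generate, and whether you compare with < or <= depends on your period convention:

insert into 
  fact_account_snapshot (id, time_id, employee_id, account_id, balance_amount)
select 
  fact_seq.nextval,
  dt.time_id,
  fasp.employee_id,
  fasp.account_id,
  fasp.balance_amount 
from fact_account_snapshot_period fasp 
join dim_time dt 
on (dt.time_date >= fasp.date_from
and dt.time_date < least(nvl(fasp.date_to, date '9999-12-31'),  -- open-ended rows
                         :max_snapshot_date));                  -- your cut-off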

With this type of organization, we are trading space for speed: the snapshot table takes more storage, but it demands less CPU work when queried.

2. From Snapshot to Type 2 Organization

Now, if you do mind your tables taking up a lot of space and you have some snapshot fact tables, you can transform them into a Type 2 organization. This is the inverse of the use case we covered in the first algorithm.




To start, we need to transform the data from the snapshot version to the monthly Type 2 version.

Picture 2: Monthly snapshot to period

As this algorithm is more complex, we will break it into three sections. Each section corresponds to one of the three nested SELECT blocks which we will tie together later.

So let’s begin.

First, we need a simple data extract of the first ten rows from the fact_account_snapshot table. Here’s the SQL:

SELECT
  * 
FROM
  (SELECT account_id,
     balance_amount,
     time_id
   FROM fact_account_snapshot
   WHERE account_id = 1000
   ORDER BY time_id )
WHERE rownum <= 10;

The data we get is:


ACCOUNT_ID  BALANCE_AMOUNT  TIME_ID
1000        0               20010201
1000        0               20010301
1000        0               20010401
1000        9000            20010501
1000        0               20010601
1000        0               20010701
1000        9000            20010801
1000        0               20010901
1000        0               20011001
1000        9000            20011101


On this data set, we will flag every row whose balance_amount differs from the value in the previous (lagging) row. The code looks like this:

SELECT 
  account_id,
  time_id,
  balance_amount,
  lag (balance_amount, 1, balance_amount)
       OVER (PARTITION BY account_id ORDER BY time_id) lag_balance_amount,
  CASE
     WHEN lag (balance_amount, 1, balance_amount) 
             OVER (PARTITION BY account_id ORDER BY time_id)
          - balance_amount != 0
     THEN 1
     ELSE 0
   END flag
FROM fact_account_snapshot 
ORDER BY time_id ;

The data we get is:


ACCOUNT_ID  TIME_ID   BALANCE_AMOUNT  LAG_BALANCE_AMOUNT  FLAG
1000        20010201  0               0                   0
1000        20010301  0               0                   0
1000        20010401  0               0                   0
1000        20010501  9000            0                   1
1000        20010601  0               9000                1
1000        20010701  0               0                   0
1000        20010801  9000            0                   1
1000        20010901  0               9000                1
1000        20011001  0               0                   0
1000        20011101  9000            0                   1


As we see here, the flag is set to return a “1” when the previous value of balance_amount for the specific account_id is different from the current value.

Next, we construct groups of consecutive rows that share the same balance_amount value.

with first_part as (
  SELECT 
    account_id,
    time_id,
    balance_amount,
    lag (balance_amount, 1, balance_amount)
          OVER (PARTITION BY account_id ORDER BY time_id) lag_balance_amount,
    CASE
       WHEN lag (balance_amount, 1, balance_amount) 
                 OVER (PARTITION BY account_id ORDER BY time_id)
            - balance_amount != 0
       THEN 1
       ELSE 0
     END flag
   FROM fact_account_snapshot 
   ORDER BY time_id )
SELECT 
  account_id,
  time_id,
  balance_amount,
  SUM (flag) OVER (PARTITION BY account_id ORDER BY time_id) flag_group
FROM first_part;

The rows we saw earlier now carry a group number, formed as the running sum of the flag:


ACCOUNT_ID  TIME_ID   BALANCE_AMOUNT  FLAG_GROUP
1000        20010201  0               0
1000        20010301  0               0
1000        20010401  0               0
1000        20010501  9000            1
1000        20010601  0               2
1000        20010701  0               2
1000        20010801  9000            3
1000        20010901  0               4
1000        20011001  0               4
1000        20011101  9000            5


Hopefully, you can see the solution on the horizon. All that is left for us to do is to collapse the rows by group number. We can do this reduction with a simple GROUP BY clause:

with first_part as ( 
  SELECT 
    account_id,
    time_id,
    balance_amount,
    lag (balance_amount, 1, balance_amount)
          OVER (PARTITION BY account_id ORDER BY time_id) lag_balance_amount,
    CASE
      WHEN lag (balance_amount, 1, balance_amount ) 
                OVER (PARTITION BY account_id ORDER BY time_id)
           - balance_amount != 0
      THEN 1
      ELSE 0
    END flag
  FROM fact_account_snapshot 
  ORDER BY time_id ),

second_part as (
  SELECT  
    account_id,
    time_id,
    balance_amount,
    SUM (flag) OVER (PARTITION BY account_id ORDER BY time_id) flag_group
   FROM first_part)

select 
  account_id,
  balance_amount,
  to_date(min(time_id),'yyyymmdd') date_from, 
  add_months(to_date(max(time_id),'yyyymmdd'),1) date_to 
from second_part
group by account_id, flag_group, balance_amount 
order by date_from;

Note: We added one month in the calculation of the date_to column so that we get a clean lower-bound-inclusive, upper-bound-exclusive time period.


ACCOUNT_ID  BALANCE_AMOUNT  DATE_FROM  DATE_TO
1000        0               02/01/01   05/01/01
1000        9000            05/01/01   06/01/01
1000        0               06/01/01   08/01/01
1000        9000            08/01/01   09/01/01
1000        0               09/01/01   11/01/01
1000        9000            11/01/01   12/01/01
1000        10000           12/01/01   01/01/02


The data is now represented with time periods; in our particular case, the snapshot rows condense into seven period rows.
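If you want to persist this result rather than just query it, a minimal sketch is to wrap the statement in an INSERT. Here I assume the id column of fact_account_snapshot_period is filled by a trigger or an identity column, and I drop the unused lag column and the ORDER BY:

insert into fact_account_snapshot_period
  (account_id, balance_amount, date_from, date_to)
with first_part as (
  select account_id,
         time_id,
         balance_amount,
         case when lag(balance_amount, 1, balance_amount)
                     over (partition by account_id order by time_id)
                   - balance_amount != 0
              then 1 else 0 end flag
  from fact_account_snapshot),
second_part as (
  select account_id,
         time_id,
         balance_amount,
         sum(flag) over (partition by account_id order by time_id) flag_group
  from first_part)
select account_id,
       balance_amount,
       to_date(min(time_id),'yyyymmdd') date_from,
       add_months(to_date(max(time_id),'yyyymmdd'),1) date_to
from second_part
group by account_id, flag_group, balance_amount;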

3. Refreshing Batch Data in a Type 2 Snapshot Fact Table

The most common algorithm in the DWH world inserts a new delta of data into a Type 2 table. This can be a dimension or a snapshot fact table, as long as it’s of Type 2 organization.

A delta of data is any data that has changed or been added since a certain point in time. The difference is calculated in the ETL process; specifically, we’re searching for new or changed data that needs to be traced in a dimensional table or measured in a fact table.
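For instance, one simple way to detect the delta, assuming the source extract lands in a hypothetical staging table stg_customer with the same business columns, is a MINUS against the currently valid dimension rows:

select cust_natural_key, first_name, last_name, address_residence,
       city_residence, country_residence, date_of_birth, marital_status, gender
from stg_customer
minus
select cust_natural_key, first_name, last_name, address_residence,
       city_residence, country_residence, date_of_birth, marital_status, gender
from dim_customer
where valid_to = to_date('31.12.9999','dd.mm.yyyy');  -- only the open (current) rows

Every row this query returns is either a brand-new customer or a changed version of an existing one.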

There are lots of ways we can implement this change. Let’s look at one fun example using analytical functions. For this example, we will return to our old friend, the customer dimension table dim_customer.

Table dim_customer

We will take a ten-row extract from the table:


ID  CUST_NATURAL_KEY  FIRST_NAME  LAST_NAME   ADDRESS_RESIDENCE  CITY_RESIDENCE  COUNTRY_RESIDENCE  DATE_OF_BIRTH  MARITAL_STATUS  GENDER  VALID_FROM  VALID_TO
1   30983290067       Mirko       Vladovic    5 Brda             Split           Croatia            09/03/83       N               M       01/01/00    12/31/99
2   123123435455      John        Smith       57 Street          New York        USA                07/03/73       N               M       01/01/00    12/31/99
3   123123435232      Ivan        Vasiljevic  7 Plosad           Moscow          Russia             01/02/93       N               M       01/01/00    12/31/99
4   122999323237      Helen       Grant       12 Waterloo road   London          UK                 03/03/53       N               F       01/01/00    12/31/99
5   53566232322       Helen       Felder      2 Gasse            Berlin          Germany            01/03/81       Y               F       01/01/00    12/31/99
6   831232131120      Sara        Schubert    3 Gasse            Frankfurt       Germany            09/03/03       N               F       01/01/00    12/31/99
7   934230842030      Samuel      Little      27 Street          San Francisco   USA                09/30/68       Y               M       01/01/00    12/31/99
8   198763111222      Jannet      Nolte       29 Street          New York        USA                10/21/62       N               F       01/01/00    12/31/99
9   832378872030      Margareth   Teacher     2 Street           Boston          USA                11/01/98       N               F       01/01/00    12/31/99
10  921318422030      Ivan        Ivanovic    9 Plosad           Vladivostok     Russia             12/03/57       N               M       01/01/00    12/31/99


Now suppose that we get the delta of data in the current data load:


CUST_NATURAL_KEY  FIRST_NAME  LAST_NAME  ADDRESS_RESIDENCE  CITY_RESIDENCE  COUNTRY_RESIDENCE  DATE_OF_BIRTH  MARITAL_STATUS  GENDER
198763111222      Jannet      Nolte      27 Street          New York        USA                10/21/62       N               F
832378872030      Margareth   Smith      2 Street           Boston          USA                11/01/98       Y               F
921318422030      Ivan        Ivanovic   21 Cirk            Kazan           Russia             12/03/57       N               M


In the delta data, we see that Jannet moved to a new address in the same city, Margareth got married and changed her last name, and Ivan moved to a new city.

So we need our data to keep showing the historical state up to a set point in time (captured with the valid_from and valid_to fields). In our case, this point is the time of the load.
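As a quick aside, the valid_from/valid_to pair is exactly what lets us reconstruct the state of the dimension as of any date; a minimal sketch, with :as_of_date as a bind variable you supply:

select *
from dim_customer
where :as_of_date >= valid_from
and   :as_of_date <  valid_to;  -- lower bound inclusive, upper bound exclusive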

We will break the action into two parts: inserting the new and updating the old. First we insert the new data (if any table constraints hinder this, they must be disabled in the ETL process):

insert into dim_customer
 (select 11, '198763111222', 'Jannet', 'Nolte', '27 Street',
    'New York','USA',to_date('21.10.1962','dd.mm.yyyy'),'N','F',
     to_date('01.10.2016','dd.mm.yyyy'),to_date('31.12.9999','dd.mm.yyyy') 
  from dual
 union
  select 12,'832378872030','Margareth','Smith','2 Street',
    'Boston','USA',to_date('01.11.1998','dd.mm.yyyy'),'Y','F' ,
    to_date('01.10.2016','dd.mm.yyyy'),to_date('31.12.9999','dd.mm.yyyy') 
  from dual
 union
   select 13,'921318422030','Ivan','Ivanovic','21 Cirk',
     'Kazan','Russia',to_date('03.12.1957','dd.mm.yyyy'),'N','M' ,
     to_date('01.10.2016','dd.mm.yyyy'),to_date('31.12.9999','dd.mm.yyyy') 
   from dual);

The new rows are now contained in the table with the old data:


ID  CUST_NATURAL_KEY  FIRST_NAME  LAST_NAME  ADDRESS_RESIDENCE  CITY_RESIDENCE  MARITAL_STATUS  VALID_FROM  VALID_TO
8   198763111222      Jannet      Nolte      29 Street          New York        N               01/01/00    12/31/99
11  198763111222      Jannet      Nolte      27 Street          New York        N               10/01/16    12/31/99
9   832378872030      Margareth   Teacher    2 Street           Boston          N               01/01/00    12/31/99
12  832378872030      Margareth   Smith      2 Street           Boston          Y               10/01/16    12/31/99
10  921318422030      Ivan        Ivanovic   9 Plosad           Vladivostok     N               01/01/00    12/31/99
13  921318422030      Ivan        Ivanovic   21 Cirk            Kazan           N               10/01/16    12/31/99


Note that each new row (identified by its natural key) has a valid_from value of 10/01/16. This is the ETL load date from which we consider the row valid.

Now we update the old data by closing the relevant records. As you can see, the old rows with the ID values of 8, 9, and 10 need to be closed. We do this by setting a valid_to date that matches the corresponding new row’s valid_from date. This happens with help from SQL analytical functions:

UPDATE dim_customer dc
SET (valid_to) =
  (SELECT 
     l_v
   FROM
     (SELECT 
        cust_natural_key,
        valid_from,
        lead(valid_from,1,valid_to) 
            over (partition BY cust_natural_key order by valid_from) l_v
      FROM dim_customer) dc_i
   WHERE dc.cust_natural_key = dc_i.cust_natural_key
   AND dc.valid_from         = dc_i.valid_from)

Note: The above algorithm updates the entire dim_customer table. It is sometimes wise to filter the changed rows in a separate table or use a MERGE statement.
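As an alternative, here is a hedged sketch of the MERGE approach mentioned in the note; it applies the same analytical logic but only touches the rows whose valid_to actually changes:

merge into dim_customer dc
using (
  select cust_natural_key,
         valid_from,
         lead(valid_from,1,valid_to) 
             over (partition by cust_natural_key order by valid_from) l_v
  from dim_customer) dc_i
on (dc.cust_natural_key = dc_i.cust_natural_key
and dc.valid_from       = dc_i.valid_from)
when matched then
  update set dc.valid_to = dc_i.l_v
  where dc.valid_to != dc_i.l_v;  -- skip rows that are already closed correctly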

After the update statement, this data is returned:


ID  CUST_NATURAL_KEY  FIRST_NAME  LAST_NAME  ADDRESS_RESIDENCE  CITY_RESIDENCE  MARITAL_STATUS  VALID_FROM  VALID_TO
8   198763111222      Jannet      Nolte      29 Street          New York        N               01/01/00    10/01/16
11  198763111222      Jannet      Nolte      27 Street          New York        N               10/01/16    12/31/99
9   832378872030      Margareth   Teacher    2 Street           Boston          N               01/01/00    10/01/16
12  832378872030      Margareth   Smith      2 Street           Boston          Y               10/01/16    12/31/99
10  921318422030      Ivan        Ivanovic   9 Plosad           Vladivostok     N               01/01/00    10/01/16
13  921318422030      Ivan        Ivanovic   21 Cirk            Kazan           N               10/01/16    12/31/99


There are other ways to implement the closing of Type 2 dimensions in data warehousing. Why not do a little experimentation and see what you can come up with on your own?

4. Fact Table Period Merges

One of the interesting problems we encounter in data warehousing is the period merge. This is needed when an event has multiple overlapping time periods and we need to construct a single covering time period.

Let’s explain this using a simple example. Say a client has a couple of loan accounts and is late with some of the payments. If we want to find out the time period during which the client was late, we need a period merge.

Picture 3: Simple merge

So if we look at the picture above, we see that the period in which the client was late with his payments was between January 1 and October 1.

Of course, a client can have many accounts that have a complex interrelationship. The following picture shows this:

Picture 4: Complex merge

Note: The period of December 1, 2016 to February 1, 2017 is not merged because it does not intersect with any other periods.

Next, let’s look at a brief model that demonstrates the SQL needed for a period merge.
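A minimal, assumed DDL for that model is sketched below; the data types are guesses, and the column names match the data that follows:

create table f_default_periods (
  id              number primary key,
  dim_customer_id number,
  account_num     number,
  asset_level     varchar2(1),  -- 'D' marks a period in default
  valid_from      date,
  valid_to        date
);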

The data from f_default_periods is shown here:


ID  DIM_CUSTOMER_ID  ACCOUNT_NUM  ASSET_LEVEL  VALID_FROM  VALID_TO
1   9000             1000         D            01/01/16    08/01/16
2   9000             1001         D            03/01/16    10/01/16
3   9000             1002         D            02/01/16    06/01/16
4   9000             1003         D            07/01/16    11/01/16
5   9000             1004         D            12/01/16    02/01/17


To clarify, we will partition the SQL code into five steps. Step one, shown below, applies a simple LAG function to the data. We are only looking at periods in which the account is in default; we identify these periods using the clause WHERE asset_level = 'D'.

select 
  dim_customer_id,
  valid_from,
  valid_to,
  lag (valid_to,1,valid_to) 
    over (partition by dim_customer_id order by valid_to) l_valid_to
from f_default_periods
where asset_level ='D'
order by valid_to;

And we get this data set:


DIM_CUSTOMER_ID  VALID_FROM  VALID_TO  L_VALID_TO
9000             02/01/16    06/01/16  06/01/16
9000             01/01/16    08/01/16  06/01/16
9000             03/01/16    10/01/16  08/01/16
9000             07/01/16    11/01/16  10/01/16
9000             12/01/16    02/01/17  11/01/16


In part two, we compute a running maximum of the l_valid_to column:

with part_one as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    lag (valid_to,1,valid_to) 
       over (partition by dim_customer_id order by valid_to) l_valid_to
  from f_default_periods
  where asset_level ='D')

select 
  dim_customer_id,
  valid_from,
  valid_to,
  max(l_valid_to) over (partition by dim_customer_id order by valid_to) running_max
from part_one
order by valid_to;

We are laying the groundwork to identify any gaps between the default periods.


DIM_CUSTOMER_ID  VALID_FROM  VALID_TO  RUNNING_MAX
9000             02/01/16    06/01/16  06/01/16
9000             01/01/16    08/01/16  06/01/16
9000             03/01/16    10/01/16  08/01/16
9000             07/01/16    11/01/16  10/01/16
9000             12/01/16    02/01/17  11/01/16


If the running_max value is lower than the current valid_from value, we know there is a gap and a new merged group begins. From here, we can apply the same grouping trick as in the “snapshot to Type 2” algorithm.

In part three, we flag the groups:

with part_one as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    lag (valid_to, 1, valid_to) 
        over (partition by dim_customer_id order by valid_to) l_valid_to
   from f_default_periods
   where asset_level ='D'),

part_two as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    max(l_valid_to) over (partition by dim_customer_id order by valid_to) running_max
  from part_one) 
select 
  dim_customer_id, 
  valid_from,
  valid_to,
  running_max,
  case when running_max >= valid_from then 0 else 1 end flag_gap
from part_two
order by valid_to ;

The data in this step is:


DIM_CUSTOMER_ID  VALID_FROM  VALID_TO  RUNNING_MAX  FLAG_GAP
9000             02/01/16    06/01/16  06/01/16     0
9000             01/01/16    08/01/16  06/01/16     0
9000             03/01/16    10/01/16  08/01/16     0
9000             07/01/16    11/01/16  10/01/16     0
9000             12/01/16    02/01/17  11/01/16     1


In part four, we apply a running sum to the flag_gap column, just as in the previous algorithm:

with part_one as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    lag (valid_to,1,valid_to ) 
      over (partition by dim_customer_id order by valid_to) l_valid_to
  from f_default_periods
  where asset_level ='D'),
part_two as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    max(l_valid_to) over (partition by dim_customer_id order by valid_to) running_max
  from part_one),
part_three as (
  select 
    dim_customer_id, 
    valid_from,
    valid_to,
    running_max,
    case when running_max >= valid_from then 0 else 1 end flag_gap
from part_two)
select 
  dim_customer_id, 
  valid_from,
  valid_to,
  sum(flag_gap) over (partition by dim_customer_id order by valid_from) run_flag_gap
from part_three
order by valid_to;

In this step, run_flag_gap carries the same values as flag_gap did in the previous step, because we only have two merge groups:


DIM_CUSTOMER_ID  VALID_FROM  VALID_TO  RUN_FLAG_GAP
9000             02/01/16    06/01/16  0
9000             01/01/16    08/01/16  0
9000             03/01/16    10/01/16  0
9000             07/01/16    11/01/16  0
9000             12/01/16    02/01/17  1


In the fifth and final step, we simply aggregate the data and get the time periods of the merged groups:

with part_one as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    lag (valid_to,1,valid_to ) 
         over (partition by dim_customer_id order by valid_to) l_valid_to
  from f_default_periods
  where asset_level ='D'),

part_two as (
  select 
    dim_customer_id,
    valid_from,
    valid_to,
    max(l_valid_to) 
       over (partition by dim_customer_id order by valid_to) running_max
  from part_one),

part_three as (
  select 
    dim_customer_id, 
    valid_from,
    valid_to,
    running_max,
    case when running_max >= valid_from then 0 else 1 end flag_gap
  from part_two),

part_four as (
  select 
    dim_customer_id, 
    valid_from,
    valid_to,
    sum(flag_gap) over (partition by dim_customer_id order by valid_from) run_flag_gap
from part_three)

select 
  dim_customer_id, 
  min(valid_from) valid_from, 
  max(valid_to) valid_to
from part_four
group by dim_customer_id, run_flag_gap
order by valid_to ;

The final result data shows our two merged periods:


DIM_CUSTOMER_ID  VALID_FROM  VALID_TO
9000             01/01/16    11/01/16
9000             12/01/16    02/01/17


This algorithm was a little bit tricky! But I am sure that once you master it, it will give you the foundation to come up with something exciting on your own.

Build Your Own SQL Algorithms

In this post, we saw some common SQL algorithms used to work with batch data. We started with something simple and built up to more complex statements. As you observed with the snapshot to Type 2 and period merge use cases, SQL algorithms can and often do build on each other. I encourage you to create something on your own, and in pure SQL. The language has all the strength – not to mention beauty – to meet any data manipulation need. And when you do come up with something, share it in the comments section below!

