Mastering SQL analytical functions is a necessity for an aspiring BI/DWH professional. In this article, we’ll explore the history of SQL in a BI environment. We’ll also introduce some of the more common SQL analytical functions, including RANK, LEAD, LAG, SUM, and others.
SQL can be used for business intelligence (BI) applications that combine data retrieval, analysis, and computation. These computations are more complex than what we see in OLTP systems. If you’re interested in getting into data warehousing (DWH) or business intelligence, you need to go beyond plain SQL and start digging into the analytical functions (also known as window functions, windowing functions, or OVER clauses).
As stated, this article will get you started with SQL analytical functions. First, let’s have a quick look at the history of using SQL in BI. We’ll also consider the basic processes involved in using SQL to partition, order, and analyze data.
BI Meets SQL: A Brief History
BI applications are used for a variety of data-related activities. Common computations include:
- Best or worst performers in various organizational units (based on some qualitative measurement)
- Best or worst contributions by percentage threshold, such as the bottom 5% of stores by revenue
- Trends over time; this information yields interesting insights about where the organization is heading
- Contributions to a whole, such as a commission percentage
While using relational database management systems (RDBMS) as DWH/BI platforms, organizations found that the SQL SELECT statement wasn’t adequately supporting their BI applications.
Plus, analysts were required to possess a complex skill set involving data retrieval, procedural coding, external tool usage and business domain knowledge. This mix of skills is not common, and companies found it difficult to locate individuals who could do everything needed.
During this time, the use of standard SQL in BI caused poor development productivity and poor application performance. Productivity suffered because a BI application developer had to mix SQL relational statements with procedural code (such as Oracle’s PL/SQL). Performance was slow because the SQL compiler often created poor execution plans for complex statements, and even worse ones for mixed statements.
The need was clear: overcome these deficiencies and enable RDBMSs to support BI/DWH more fully. This resulted in major SQL extensions for both processing and specification. You know these extensions as the analytical functions found in all of the major DBMS vendors’ SQL syntaxes.
SQL Analytical Functions: The Basics
Executing analytical functions organizes data into partitions, computes functions over these partitions in a specified order, and returns the result.
Analytical functions are processed after GROUP BY and row filtering (WHERE), so that the calculations can be performed on the grouped results. The returned rows are ordered after the functions have been processed.
We place analytical functions in the SELECT list, just as we would with any other result columns in a SQL statement.
Let’s take a look at the basic syntax of an analytical function:
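The exact syntax varies slightly by vendor, but as a generic sketch (square brackets mark optional parts):

```sql
analytic_function(arguments) OVER (
    [PARTITION BY column_list]    -- optional grouping of rows, covered later
    [ORDER BY column_list]        -- evaluation order within the window
    [ROWS | RANGE frame_clause]   -- optional window frame, covered later
)
```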
We can see that an analytical function consists of a function followed by the OVER keyword and an optional list of columns for ordering. The ordering indicates the criteria for function evaluation, not the final ordering of the results. This will become clear later as we look at examples.
We will now consider the most common analytical functions used in DWH/BI. To do this, we will access a simple data mart and build our queries on that.
The following data mart model is a simple star schema for tracking and analyzing retail sales:
Let’s begin with the ranking functions.
RANK is a simple analytical function that does not take any column-list arguments. It returns a rank based on the ordering of the rows; the ordering is defined by a given criterion.
Let’s look at an example product list. We will rank products based on their unit price, from the cheapest to the most expensive.
SELECT id, product_desc, product_brand, product_unitprice,
       RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice
FROM dim_product;
The ORDER BY clause is mandatory in the RANK function, as we will be listing products in either ascending or descending order.
The result set in ascending order would be:
| id | product_desc | product_brand | product_unitprice | Rank_UnitPrice |
|----|--------------|---------------|-------------------|----------------|
| 1 | Plain Gibson razor | Gibson | 15 | 1 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | 2 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | 3 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25.5 | 4 |
The descending version, where we go from most to least expensive, would be delivered by this code snippet:
RANK() OVER (ORDER BY product_unitprice DESC)
DENSE_RANK works like RANK in that it needs no additional arguments and it ranks items in descending or ascending order. The only difference is that DENSE_RANK does not allow “gaps” between groups.
These “gaps” occur when the ordering argument is the same for two or more rows. To illustrate this, think of competition rankings. Let’s say some programmers join a hackathon, and two of them finish in second place with identical scores.
DENSE_RANK would return a “1” for the first-place programmer, a “2” for both programmers in second place, and a “3” for the next finisher. RANK would return a “4” for that finisher which, in this case, does not make sense.
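The hackathon scenario is easy to check directly. Below is a minimal sketch in Python using SQLite (window functions require SQLite 3.25 or newer); the table and names are illustrative, not from the article’s data mart:

```python
import sqlite3

# Illustrative hackathon scores; two competitors tie for second place.
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
con.execute("CREATE TABLE hackathon (name TEXT, score INT)")
con.executemany("INSERT INTO hackathon VALUES (?, ?)",
                [("Ana", 100), ("Ben", 90), ("Cleo", 90), ("Dan", 80)])
rows = con.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk
    FROM hackathon
    ORDER BY score DESC, name
""").fetchall()
# RANK leaves a gap after the tie (1, 2, 2, 4);
# DENSE_RANK does not (1, 2, 2, 3).
```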
So, if we expand the rowset from the previous example...
SELECT id, product_desc, product_brand, product_unitprice, product_category,
       RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
       DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';
… we get:
| id | product_desc | product_brand | product_unitprice | product_category | Rank_UnitPrice | DenseRank_UnitPrice |
|----|--------------|---------------|-------------------|------------------|----------------|---------------------|
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 1 | 1 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 2 | 2 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 2 | 2 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 4 | 3 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25.5 | Razor | 5 | 4 |
DENSE_RANK assigns the row after the identically priced razors the next ordinal number; RANK leaves a gap, so the next value displayed after an identical-value group accounts for the size of that group. In our case, RANK shows a “4” after the two identical values; DENSE_RANK shows a “3”.
NTILE divides rows into a specified number of groups (as equal in size as possible) and returns the number of the group that each row belongs to. This function is not as widely used as either RANK or DENSE_RANK.
NTILE can be used to divide rows into three groups:

SELECT id, product_brand, product_unitprice,
       RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
       DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice,
       NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';
We see five values divided into three groups:
| id | product_brand | product_unitprice | Rank_UnitPrice | DenseRank_UnitPrice | Ntile_UnitPrice |
|----|---------------|-------------------|----------------|---------------------|-----------------|
| 1 | Gibson | 15 | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | 2 | 2 | 1 |
| 4 | Schick Xtreme3 | 20 | 2 | 2 | 2 |
| 2 | Gillette Venus Swirl | 25 | 4 | 3 | 2 |
| 3 | Gillette Fusion | 25.5 | 5 | 4 | 3 |
ROW_NUMBER is different from DENSE_RANK and RANK in that it does not treat identical values in any special way. It simply numbers rows as they occur in some order.
Let’s look at the example:
SELECT id, product_brand, product_unitprice,
       RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
       DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice,
       NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice,
       ROW_NUMBER() OVER (ORDER BY product_unitprice) RowNumber_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';
And the result is:
| id | product_brand | product_unitprice | Rank_UnitPrice | DenseRank_UnitPrice | Ntile_UnitPrice | RowNumber_UnitPrice |
|----|---------------|-------------------|----------------|---------------------|-----------------|---------------------|
| 1 | Gibson | 15 | 1 | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | 2 | 2 | 1 | 2 |
| 4 | Schick Xtreme3 | 20 | 2 | 2 | 2 | 3 |
| 2 | Gillette Venus Swirl | 25 | 4 | 3 | 2 | 4 |
| 3 | Gillette Fusion | 25.5 | 5 | 4 | 3 | 5 |
You may be wondering how ROW_NUMBER numbers these identical values – which one gets the higher number. The answer is: this is totally undetermined. In one execution, a product could be assigned a “5”; in another execution, the same product could be assigned a “4”.
So beware: many mistakes are made by misusing ROW_NUMBER on datasets with non-unique ordering columns.
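One way to avoid this pitfall is to add a unique column, such as the id, as the final tie-breaker in the ORDER BY. A minimal sketch, again in Python with SQLite (3.25+) over illustrative data:

```python
import sqlite3

# Two razors share the price 20; ordering by price alone would make
# ROW_NUMBER nondeterministic for them.
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25
con.execute(
    "CREATE TABLE dim_product (id INT, product_brand TEXT, product_unitprice REAL)")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Gibson", 15), (5, "Gillette Plain", 20),
    (4, "Schick Xtreme3", 20), (2, "Gillette Venus Swirl", 25),
])
# Adding the unique id as a tie-breaker makes the numbering repeatable.
rows = con.execute("""
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY product_unitprice, id) AS rn
    FROM dim_product
    ORDER BY rn
""").fetchall()
# The tie between the two 20-priced razors is now resolved by id: 4 before 5.
```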
We commonly use ranking functions when filtering data. So if we would like to filter the top three razors by price using DENSE_RANK, it would look like:

WITH top_product AS (
    SELECT id, product_brand, product_unitprice,
           DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice
    FROM dim_product
    WHERE product_category = 'Razor'
)
SELECT * FROM top_product WHERE DenseRank_UnitPrice < 3;

And the result is:

| id | product_brand | product_unitprice | DenseRank_UnitPrice |
|----|---------------|-------------------|---------------------|
| 1 | Gibson | 15 | 1 |
| 5 | Gillette Plain | 20 | 2 |
| 4 | Schick Xtreme3 | 20 | 2 |
POSITIONAL ANALYTICAL FUNCTIONS
LEAD returns the value of an argument column from a row at a given offset ahead of the current row. The offset amount can be defined in the code; its default is “1”. The value is returned in the current row.
Here’s the syntax for LEAD:

LEAD(value, offset, default_value)
The default value is returned if the offset points to a non-existent row. The offset is counted from the current row onward, in the direction of the window’s ordering.
Let’s return to our previous example and use the LEAD function to look at price increases. We’ll define product_unitprice as the value, with an offset of 1 and a default value of NULL:

SELECT id, product_desc, product_brand, product_unitprice, product_category,
       LEAD(product_unitprice, 1, NULL) OVER (ORDER BY product_unitprice) Lead_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';
Let’s look at the result set:
| id | product_desc | product_brand | product_unitprice | product_category | Lead_UnitPrice |
|----|--------------|---------------|-------------------|------------------|----------------|
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 20 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 20 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 25 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 25.5 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25.5 | Razor | NULL |
To make this clearer, let’s visualize the shift in data values:
We see that the LEAD function “takes” the value from one offset ahead, in the direction of the analytical function’s ordering.
LAG is the opposite of LEAD. We can even implement LAG using LEAD and vice versa. The difference is in the direction we look for the offset value.
Here’s the syntax for LAG:

LAG(value, offset, default_value)
While the LEAD function takes the values in the window that are ahead of the current value, the LAG function takes the values that lag behind it – hence the name LAG. LEAD and LAG are interchangeable: by reversing the order direction, you can implement the same behavior with either one.
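This interchangeability is easy to verify. A minimal sketch in Python with SQLite (3.25+), using an illustrative single-column table:

```python
import sqlite3

# Three illustrative prices; we compare LEAD over ascending order
# with LAG over descending order.
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25
con.execute("CREATE TABLE t (price REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(15,), (20,), (25.5,)])
rows = con.execute("""
    SELECT price,
           LEAD(price, 1) OVER (ORDER BY price ASC)  AS lead_asc,
           LAG(price, 1)  OVER (ORDER BY price DESC) AS lag_desc
    FROM t
    ORDER BY price
""").fetchall()
# In every row, lead_asc equals lag_desc (NULL for the most expensive price).
```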
If we rewrite the previous example with a changed order direction, we get the same results.
SELECT id, product_desc, product_brand, product_unitprice, product_category,
       LEAD(product_unitprice, 1, NULL) OVER (ORDER BY product_unitprice) Lead_UnitPrice,
       LAG(product_unitprice, 1, NULL) OVER (ORDER BY product_unitprice DESC) Lag_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';
| id | product_desc | product_brand | product_unitprice | product_category | Lead_UnitPrice | Lag_UnitPrice |
|----|--------------|---------------|-------------------|------------------|----------------|---------------|
| 3 | Gillette Fusion for young men | Gillette Fusion | 25.5 | Razor | NULL | NULL |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 25.5 | 25.5 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 20 | 25 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 25 | 20 |
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 20 | 20 |
LEAD and LAG are important in the development of ETL using pure SQL batch statements. We will cover the implementation of various algorithms in later posts.
SUM gives us the sum of an entire set or group of values.
This function is pretty self-explanatory. If we define SUM as an analytical function, every row receives the sum of the whole set, or of the group it belongs to.
Here’s the syntax for SUM:

SUM(value)

So if we look at this example...
SELECT product_category, SUM(sales_units) AS SUM_SALES_UNITS
FROM f_sales, dim_product
WHERE f_sales.dim_product_id = dim_product.ID
GROUP BY product_category;
...we get the sum of sales units for every product category:
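The contrast between a plain GROUP BY aggregate and SUM as an analytical function can be sketched in Python with SQLite (3.25+); the table here is illustrative:

```python
import sqlite3

# Tiny illustrative sales table with two categories.
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for the OVER clause
con.execute("CREATE TABLE sales (category TEXT, units INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("Razor", 10), ("Razor", 5), ("Deo", 7)])
# GROUP BY collapses each category to a single row...
grouped = con.execute("""
    SELECT category, SUM(units) FROM sales GROUP BY category ORDER BY category
""").fetchall()
# ...while SUM(...) OVER (PARTITION BY ...) keeps every row and repeats
# the category total on each of them.
windowed = con.execute("""
    SELECT category, units, SUM(units) OVER (PARTITION BY category) AS total
    FROM sales ORDER BY category, units
""").fetchall()
```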
The same applies to AVG (the arithmetic mean), MEDIAN, or any other summary function. Their syntaxes are similar enough that we won’t go into them here, but this is what they look like:
SELECT product_category,
       SUM(sales_units) AS SUM_SALES_UNITS,
       AVG(sales_units) AS AVG_SALES_UNITS,
       MEDIAN(sales_units) AS MEDIAN_SALES_UNITS
FROM f_sales, dim_product
WHERE f_sales.dim_product_id = dim_product.ID
GROUP BY product_category;
The result is:
RUNNING SUMMARY FUNCTIONS
Summary functions can be defined with the ORDER BY clause. When we include this clause, the function is calculated over the current row and every preceding unique row.
Look at this example:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,
       SUM(sales_units) OVER (ORDER BY DIM_TIME_ID) AS SUM_SALES_UNIT
FROM f_sales
ORDER BY DIM_TIME_ID;
We know that the summary of sales_units accumulates for every unique date – the date’s uniqueness is important – as defined in the ORDER BY part of the analytical function. Below, we see the results:
If we had omitted this unique-value requirement on the ORDER BY column, the return value would have been the sum of the non-unique row and every preceding row.
This type of summarization is called a running sum. It comes in handy when we’re dealing with common algorithms, such as the ones we will cover in upcoming articles.
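A running sum is easy to try out. A minimal sketch in Python with SQLite (3.25+), using illustrative dates and values:

```python
import sqlite3

# Three illustrative days of sales; with ORDER BY in the OVER clause,
# SUM accumulates row by row.
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25
con.execute("CREATE TABLE f_sales (time_id INT, sales_units INT)")
con.executemany("INSERT INTO f_sales VALUES (?, ?)",
                [(20160101, 5), (20160102, 3), (20160103, 4)])
rows = con.execute("""
    SELECT time_id,
           SUM(sales_units) OVER (ORDER BY time_id) AS running_sum
    FROM f_sales
    ORDER BY time_id
""").fetchall()
# running_sum grows as 5, 8, 12.
```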
Partitioning is optional in an analytical function. The PARTITION BY keyword defines the partition – the group of rows on which a function is performed – and is followed by a list of columns. Do not confuse PARTITION BY with the GROUP BY part of a SQL statement.
We can redefine our previous ranking example by creating partition groups for the product_category column like this:
SELECT id, product_brand, product_unitprice, product_category,
       RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) Rank_UnitPrice,
       DENSE_RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) DenseRank_UnitPrice,
       NTILE(3) OVER (PARTITION BY product_category ORDER BY product_unitprice) Ntile_UnitPrice
FROM dim_product
ORDER BY product_category, 5;
The ranking is done within product categories. The results are:
| id | product_brand | product_unitprice | product_category | Rank_UnitPrice | DenseRank_UnitPrice | Ntile_UnitPrice |
|----|---------------|-------------------|------------------|----------------|---------------------|-----------------|
| 13 | Axe for women | 29 | Deo | 1 | 1 | 1 |
| 12 | Axe for men | 45 | Deo | 3 | 3 | 3 |
| 2 | Gillette Venus Swirl | 25 | Razor | 4 | 3 | 2 |
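The way the ranking restarts at 1 inside each partition can be sketched in Python with SQLite (3.25+); the rows below are an illustrative subset of the article’s products:

```python
import sqlite3

# Two categories, two products each; RANK is computed per category.
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25
con.execute("CREATE TABLE dim_product "
            "(product_brand TEXT, product_category TEXT, product_unitprice REAL)")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    ("Gibson", "Razor", 15), ("Gillette Venus Swirl", "Razor", 25),
    ("Axe for women", "Deo", 29), ("Axe for men", "Deo", 45),
])
rows = con.execute("""
    SELECT product_category, product_brand,
           RANK() OVER (PARTITION BY product_category
                        ORDER BY product_unitprice) AS rnk
    FROM dim_product
    ORDER BY product_category, rnk
""").fetchall()
# The ranking restarts at 1 inside each category.
```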
The Window Concept
The window concept is used extensively with analytical functions; some even call analytical functions “windowing functions”. A window is a collection of rows in which an analytical function is calculated.
We tie window concepts to changes in numeric variables (e.g., stock prices or sales). Windows are typically defined by time intervals, like years or months. Some examples of window comparisons are:
- Sales growth percentages
- Moving averages of stock prices (30-day, 90-day, 180-day)
- The performance of an advertising campaign across time
- Cumulative sales performance
The window specification is optional; when used, it comes after the ordering.
Windows can be specified as units of physical rows using the ROWS keyword, or they can be defined as logical ranges using the RANGE keyword.
Physical windows are specified by absolute values of rows we offset before and after the current row.
Logical windows are specified by values, such as a number of days. Numeric variables are calculated in moving windows so that the variables can be compared across windows.
Windows can be cumulative (as with the running sum example) or sliding, as we will see in the coming examples.
Examples of physical-row windows are:
- ROWS UNBOUNDED PRECEDING – The window is the current row and every preceding row.
- ROWS 2 PRECEDING – The window is the current row and 2 preceding physical rows.
- ROWS 2 FOLLOWING – The window is the current row and 2 following physical rows.
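These physical frames can be sketched in Python with SQLite (3.25+); the daily values are illustrative:

```python
import sqlite3

# Four illustrative days of sales; we compare a cumulative frame
# with a 3-row moving frame (current row + 2 preceding).
con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25
con.execute("CREATE TABLE f_sales (day INT, units INT)")
con.executemany("INSERT INTO f_sales VALUES (?, ?)",
                [(1, 5), (2, 3), (3, 4), (4, 2)])
rows = con.execute("""
    SELECT day,
           SUM(units) OVER (ORDER BY day ROWS UNBOUNDED PRECEDING) AS cumulative,
           SUM(units) OVER (ORDER BY day ROWS 2 PRECEDING)         AS moving3
    FROM f_sales
    ORDER BY day
""").fetchall()
# cumulative: 5, 8, 12, 14; moving3: 5, 8, 12, 9
```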
ROWS UNBOUNDED PRECEDING will yield the same results as the running sum example if the ORDER BY column in the analytic definition is unique. If it is not, the running analytical function works on whole groups of non-unique rows at once, while the ROWS UNBOUNDED PRECEDING window expands one row at a time.
Let’s look at the example:
SELECT DISTINCT DIM_TIME_ID AS TIME_ID,
       SUM(sales_units) OVER (ORDER BY DIM_TIME_ID) AS SUM_SALES_UNIT,
       SUM(sales_units) OVER (ORDER BY DIM_TIME_ID ROWS UNBOUNDED PRECEDING) AS SUM_SALES_UNIT_PRECEDING
FROM f_sales
ORDER BY DIM_TIME_ID;
And the result set:
As we see in the result set, the ROWS UNBOUNDED PRECEDING window expands row by row to include the current row and all preceding rows. The running sum, on the other hand, summarizes all the values for the 2016-01-01 date at once.
Now let’s take a look at the logical window specification.
Logical windows are indicated by the RANGE keyword coupled with an ordering variable, such as a number of years or a shipment date. These values indicate the start and end of a window.
Unlike physical windows, where we specify the number of rows, in logical windows we specify the logical bounds of the window. To clarify this, let’s look at some examples of logical windows and how they differ from their physical counterparts.
Examples of logical windows are:
- ORDER BY TIME_ID RANGE 30 PRECEDING – The window is the current row and 30 days in the past.
- ORDER BY TIME_ID RANGE BETWEEN 365 PRECEDING AND 365 FOLLOWING – The window is bound by 365 days in the past and 365 days in the future.
- ORDER BY TIME_ID RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND INTERVAL '1' YEAR FOLLOWING – The window is bound by 1 year in the past and 1 year in the future.
A window bound by a range specified in days and one specified in years will be the same most of the time. Can you guess when they won’t be the same? (Hint: think of leap years.)
This picture best describes the difference between logical and physical windows:
The left side displays the window for ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING, and the right side displays the window for RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING. The RANGE window takes into account all the rows that share the preceding date, because they form one logical unit; the ROWS window only takes one preceding physical row.
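The difference shows up as soon as the ordering column contains duplicates. A sketch in Python with SQLite (note: RANGE frames with numeric offsets require SQLite 3.28 or newer); the data is illustrative:

```python
import sqlite3

# Day 1 appears twice; RANGE treats both rows as one logical unit,
# while ROWS moves one physical row at a time.
con = sqlite3.connect(":memory:")  # RANGE offsets need SQLite >= 3.28
con.execute("CREATE TABLE f_sales (day INT, units INT)")
con.executemany("INSERT INTO f_sales VALUES (?, ?)",
                [(1, 3), (1, 5), (2, 4), (3, 2)])
rows = con.execute("""
    SELECT day, units,
           SUM(units) OVER (ORDER BY day, units
                            ROWS  BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS rows_sum,
           SUM(units) OVER (ORDER BY day
                            RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS range_sum
    FROM f_sales
    ORDER BY day, units
""").fetchall()
# rows_sum slides over physical neighbors; range_sum covers all rows
# whose day lies within +/- 1 of the current day.
```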
Let’s return to our example with the sales fact table:
SELECT DISTINCT DIM_TIME_ID AS TIME_ID, sales_units,
       SUM(sales_units) OVER (ORDER BY DIM_TIME_ID) AS SUM_SALES_UNIT,
       SUM(sales_units) OVER (ORDER BY DIM_TIME_ID ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_ROW,
       SUM(sales_units) OVER (ORDER BY TO_DATE(DIM_TIME_ID, 'YYYYMMDD') RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_RANGE
FROM f_sales
ORDER BY DIM_TIME_ID, sales_units;
Since we are defining a logical range over dates, we had to cast DIM_TIME_ID to a date value. We should stress that a single DIM_TIME_ID represents a single day and that every day has exactly one corresponding time_id; we enforce this with a 1..n relationship from the fact table to the dimension table.
We get different results for each analytical function:
We can see that SUM_SALES_UNIT summarizes the result for the date 01.01.2016, the ROWS window moves through the physical rows, and the RANGE window sums up all the sales in a three-day window.
Using SQL analytical functions can be a double-edged sword: they provide better performance and productivity, but the results and calculations can often be counterintuitive. There are many other functions that we didn’t cover in this article, but we have discussed the building blocks of most of them. I challenge you to explore analytical functions further. In the next article, we’ll learn about some DWH algorithms that use analytical functions for common ETL patterns.