Understand what a database index does and how it can improve SQL query performance.
Modern databases store terabytes or petabytes of data and are a key resource for organizational operations and decision-making. A database index is a special data structure that allows quick access to specific pieces of information without having to read all data stored in a particular table. They ensure database performance in transactional environments.
How Does a Database Index Enhance Access to Information?
Let’s forget about databases and tables and use a “real life” example to describe and solve a problem. Assume that while you are on the bus heading to university, you suddenly remember that you have an assignment to write a few lines about Fitzgerald Scott. Unfortunately, you forgot to charge your cell phone, so you do not have internet access. Fortunately, you have just bought the book the professor suggested on the subject. You have 10 minutes to find information about the author in a 300-page book; unless you are The Flash, you cannot read the entire book!
Here is when the book’s index comes to help! At the very end of the book, the Index section has an ordered list of important words where you can easily find what you are looking for (in our scenario, author names). Next to each topic, there are some numbers that tell you in which pages of the book the author is mentioned.
Now, you can quickly turn to pages 120, 121, and 197 and read them to get the information you need. You complete the assignment before the bus reaches the university!
Database Tables and Indexes
Database tables are, in a way, like books. Data that is inserted into a table is internally stored on blocks or pages (the name varies depending on the database engine, but the meaning is the same: a set of contiguous bytes on disk) on the first empty space that is available in a block assigned to a table, with no specific order. When a block is full, a new one is assigned to the table and new data is inserted on the new block. When someone deletes data, an empty space is created on the existing blocks; that space may be used by the next insertions.
If we do not have an index and we need to find some specific information, we need to read the entire “book” (table) from beginning to end. Although modern computers and storage devices are really fast, the massive volume of some database tables would require a lot of time to find specific information.
That is the reason why database indexes are created: to allow quick access to data in a table. You can create indexes on different table columns. They help in the same way as a book index does, since they store all the values existing in that column together with a reference to all the blocks (or pages) where the values are stored or mentioned.
Indexes are automatically maintained by the database engine. This means that whenever a DML (an insert, update or delete) operation is performed on the table, the same operation (insert, update or delete) is done on each of the indexes of the table.
Take a look at All About Indexes: The Very Basics to get details on how indexes help enhance database performance when searching for specific data on large datasets.
Not All that Glitters Is Gold
We have seen the pros of database indexes, but there are some cons that must be considered when creating indexes.
First, an index is an additional data structure. That means that it needs space, both in “disk” to be stored and in “memory” to be accessed. Indexes consume blocks like a table does, so creating unnecessary indexes in a table will use more storage and consume more memory when data in the table is being accessed, inserted, updated, or deleted.
Second, the fact that indexes are automatically maintained means that the database engine needs to perform additional actions whenever a DML sentence is issued in the table. For example, if the last name of a customer is changed from “Thompson” to “Johnson”, then the database engine needs to update the block where the data is stored with the new value, remove the old “Thompson” value from any indexes where the last_name column is used, and insert the new “Johnson” value on the same indexes. This requires CPU time and disk access. Thus, keeping unnecessary indexes makes DML operations slower.
Like any other tool, indexes can be of great help and at the same time can be a problem if they are created needlessly or without good planning.
Table and Index Structures
In this article, we are going to use some examples based on a very simple Vertabelo data model for a small car rental company:
Regular tables (also known as heap tables) store data in no specific or guaranteed order. The image below represents the storage of the Vehicles table. It is composed of 14 blocks (or pages) containing the data shown on the diagram (not all columns are represented in the image). It clearly shows that the data is not ordered in any particular way:
Indexes are implemented using a binary tree structure. (This is called B-Tree or B+Tree, depending on whether it stores pointers only on leaf tables or not. You can find out more in All About Indexes Part 2: MySQL Index Structure and Performance.) The values of the column or columns used as the index key are stored, ordered, and organized in a root block, branch blocks (if required), and leaf blocks:
In the example shown above, which uses a B+Tree structure, we have an index created on the Brand column (values in blue), with a root block (block #70) with ranges of values and a pointer (in red) to each leaf block (#71, #72 and #73) where values for that range are located. Leaf blocks include each indexed value (again in blue) and a pointer (also in red) to the table block or blocks where data is present in the table.
If we want to search for customers with a particular last name and we don’t have an index, we would have to read the entire table (14 blocks). Having an index reduces that search to 3 blocks:
- First read the root block to find the leaf block showing where the desired value is located in the index.
- Next, read the leaf index block to find the specific last name and the block (or blocks) where data is located.
- Finally, read the blocks with the data.
As tables get bigger, indexes may require intermediate branch blocks to store more values. In this example, we need to read 3 blocks (root, branch, and leaf) to find a specific value:
The syntax for index creation may vary slightly between different database engines, but it is usually in this form:
CREATE INDEX IndexName ON TableName (IndexedColumn [, IndexedColumnN]);
Example (available in the Vertabelo diagram):
CREATE INDEX IX_Vehicle_Brand ON Vehicle (Brand);
Note: Indexes can be created on more than one column, as shown in the syntax. They are effective as long as the filter condition includes all leading columns. If we have an index on “
ColumnB” and “
ColumnC”, then any query that filters only by “
ColumnC” will not benefit from the index, since it is not the leading column.
There are different ways to classify indexes. We’ll review some of the most used.
Clustered Indexes (Index Organized Tables)
When we mentioned that data is not stored in a table in any particular way, we were referring to how many database engines work by default. Database engines allow data to be stored in an order defined by the user; this kind of table is referred to as a clustered index (SQL Server) or an Index Organized Table (Oracle).
This kind of structure is a mix between a table and an index. Index key column data is ordered (like in an index), but instead of having a pointer to where the actual data is in the table, it includes the rest of the data (in green; only some columns are shown) after the indexed column or columns (in blue). You can see this in the image below:
Clustered indexes are a very efficient way to retrieve any information, since there is no need to use the pointer to retrieve the actual data once the value has been found on the index. However, they have some caveats; the most important one is that the column or columns chosen as the index key should not change in order to avoid “moving” data from one block / page to another.
CREATE CLUSTERED INDEX IndexName ON TableName (IndexedColumn [, IndexedColumnN]);
Note 1: Tables can have only one clustered index, since it actually replaces the table storage. Additional non-clustered indexes can be added to the table.
Note 2: By default, SQL Servers creates a clustered index whenever a primary key is defined for a table. however, you can choose other columns for the index key or create the table as a normal heap table.
Some database engines like SQL Server or PostgreSQL allow you to specify a filter condition for an index. This reduces the number of rows that are indexed, making the index smaller and quicker to access.
CREATE INDEX IndexName ON TableName (IndexedColumn [, IndexedColumnN]) WHERE Condition;
Example (available in the Vertabelo diagram):
CREATE INDEX IX_Rental_PlannedReturn_fEffectiveReturnIsNULL ON Rental (PlannedReturn) WHERE EffectiveReturn IS NULL;
This will create an index on the
Rental table. It's based on the
PlannedReturn column, but only for those rows where
EffectiveReturn is NULL (meaning the car has not been returned yet, so the rental is still active).
Note: Clustered Indexes cannot be filtered, since they must store all the data in the table.
Index with INCLUDE (Covering Index)
Some database engines like SQL Server allow users to specify an INCLUDE clause. This lets values for additional columns be included (but not used for ordering) besides the index key columns. We can consider an index with INCLUDE like a reduced version of a clustered index, since we store only some columns together with the index key, rather than the entire row.
Including frequently-used columns may reduce the need to access the actual table to find values for those columns, making queries even faster.
CREATE INDEX IndexName ON TableName (IndexedColumn [, IndexedColumnN]) INCLUDE (Column1 [,Column2]);
Example (Available on the Vertabelo diagram):
CREATE INDEX IX_Vehicle_LicensePlate_iBrandModel ON Vehicle (LicensePlate) INCLUDE (Brand, Model);
With this index, the following query can be executed without reading data from the table, as the filter condition and the columns are available in the index:
SELECT LicensePlate, Brand, Model FROM Vehicle WHERE LicensePlate = ‘6TRJ999’;
When creating indexes with the INCLUDE clause, remember:
- Such indexes require more space. They need to store the index key columns, the pointer to the actual data, AND the included columns.
- If the included column values are changed, they need to be updated both in the table and in each index where they are included. This causes overheads to DML operations.
Note: Clustered Indexes cannot use the INCLUDE clause, since they actually store all the table columns.
The BITMAP index type, available in Oracle, PostgreSQL, and Teradata, allows you to efficiently index low cardinality columns (like Boolean columns or columns having few distinct values); such columns do not store each entry.
Bitmap indexes use a block for each distinct value in a column and then set a bit for each row in the table. This bit indicates if the row matches the value of the block. In the example below, we can see that row #1 has “Red” defined as its color; the first bit (which represents the first row) in the RED block is set to true. It is set to false in the other three blocks.
Bitmaps indexes are extremely efficient for bitwise comparisons (including OR) and consume less space. Still, the cost to maintain bitmap indexes is higher than normal B-Tree indexes, so they are not recommended for columns where data is frequently updated. They are very frequently found in data warehouses, where data is relatively stable.
CREATE BITMAP INDEX IndexName ON TableName (IndexedColumn [, IndexedColumnN]);
Example (Available on another Vertabelo diagram designed for Oracle):
CREATE BITMAP INDEX IX_Vehicle_Color ON Vehicle (Color);
Function-Based (Expression) Index
Some database engines like Oracle, DB2, Informix, or PostgreSQL allow users to create indexes on deterministic expressions including system provided or user defined functions without having to create a calculated or computed column on the table. (SQL Server requires a computed column to be created.)
CREATE INDEX IndexName ON TableName (Expression);
CREATE INDEX IX_Rental_RentalYear ON Rental (EXTRACT(YEAR FROM RentalDate));
With this index, the following query will use the index to find all products where characters in positions 3 to 6 of the ProductCode are ‘TECH’:
SELECT * FROM Rental WHERE EXTRACT(Year FROM RentalDate) = 2021;
Other Database Index Types
There are several other index types available on different database engines; they are designed to solve specific issues (like index size) or to deal with specific data (like JSON, XML or spatial). The following is a short list of some of them:
Indexes the contents of CLOB or VARCHAR columns word by word. Available in Oracle, SQL Server, and MySQL.
Indexes internal contents of JSON documents. Available in Oracle, DB2, and SQL Server.
Indexes geographical and spatial data. Available in Oracle and SQL Server.
Indexes the internal contents of JSON documents. Available in Oracle.
Designed to allow a balanced distribution of leaves on indexes with ever-growing values (like creation date or identity / sequence). Available in Oracle.
When multiple columns are defined as index keys, compressed indexes do not store repeated occurrences of leading columns; this saves space. Available in Oracle.