A surrogate key is a type of primary key used in most database tables. It provides a simple, system-generated, business-agnostic column. This column is used as an identifier for each row rather than relying on pre-existing attributes. Learn more, including why surrogate keys are widely used, below.
Before learning about surrogate keys in detail, consider reading the article “On Keys.” This will help you fully understand the concept of keys in a database model and the different types that exist.
First, What Is a Primary Key?
A primary key is a “special” type of unique key in a table. Like any unique key, it is a column or set of columns that uniquely identifies a row in the table. However, there are a couple of additional conditions:
- A primary key cannot accept null in any of its columns.
- There can be only one key defined as primary, versus the possibility of having multiple unique keys in a
Although creating a primary key for each table is not mandatory when designing a database, it is best practice.
Also, the primary key definition should remain constant, since it will most likely be used as the reference for foreign keys in other tables. Changing the primary key definition would require changing all of the foreign keys pointing to it.
Should I Use Existing Columns for a Primary Key?
This question usually comes up when you start designing a data model. Let’s imagine we want to create a database for a new car rental company. To store information about our customers, we can start with a simple table like this:
In this case, we have chosen an existing attribute (DriverLicenseNumber
) as the primary key for our Customer
table. And we used it to reference customers in our Rental
table.
This primary key is known as a natural primary key since it uses data already present in the table to identify each row. This might seem great at first, but soon some problems will arise:
- Some customers will renew their license and get a new license n In such cases, we will need to update both the
Customer
table and any related tables (likeRental
) with the new value. Changing the value of a primary key row is a costly operation. It may, in some cases, even require disabling the foreign key while the update is performed.
- Customers from other states and countries will start using the system. So, the license number will not be enough to uniquely identify each We will need to add columns to store the state and country in both our
Customer
table and any related tables (likeRental
). As you can imagine, changing the definition of a primary key is a costly operation (in terms of both time and money) when our database is already in use.
So, what seemed to be a quick, simple, and logical solution has some disadvantages. Every time we want to create a table that references a customer, we will need to include the three columns that we now use as the primary key, as we show below:
Surrogate Primary Keys Save the Day!
The two types of problems described in the previous section are common occurrences. They are also the main reasons for choosing a surrogate key instead of a natural key as the primary key.
A surrogate key is an additional column with no business value. Its sole purpose is to provide a unique value that identifies each row without considering any of the data stored in the table.
By using a surrogate key, we can establish relationships between our tables using only one column. This is much better than needing three columns to define a key for our customers.
The main benefits of surrogate primary keys are as follows:
- We need only one column for our primary key, while many entities require two, three, or even more columns to uniquely identify a row when using natural keys. This can be important in many ways:
- It will be simpler to write
JOIN
conditions on complex queries. Plus, quicker execution will consume less memory. - Having fewer columns on child tables could mean significant storage savings.
- It will be simpler to write
- They do not change. A license or passport number might change when it’s renewed or if the person moves. However, a surrogate key is immutable because it does not store meaningful data. It is generated when the row is created and is never updated.
- They have a uniform format. Usually, surrogate keys are automatically generated, while natural keys, like license numbers, can have many different formats. (For example, some countries use letters in addition to numbers, some countries issue licenses at the national level while others do at the state level, etc.).
On the other hand, surrogate primary keys have some setbacks worth mentioning:
- A new column is required on each table, increasing the size. This is usually compensated by the child tables requiring fewer (For example, we would have an additional column for the surrogate key in both our
Customer
andRental
tables, but we would also replace the three columns holding license information in theRental
table with just one column pointing to theCustomer
table, thus reducing the total number of columns). - Surrogate keys do not give context to the data. If we see a column in a stock price table that says “GOOG” or “MSFT,” we can quickly identify that the price is for Google or Microsoft. However, it would not be clear if the values were 1342 or 4235. We would need to
JOIN
our table with master stock data to know that 1342 means Google and 4235 means Microsoft.
To learn more about the differences between natural and surrogate keys, you can read the article “Designing a Database: Should a Primary Key Be Natural or Surrogate?”
Data Type and Size
Numbers are the preferred data type for surrogate keys. Numeric values use less storage space than strings and are simple to generate.
Databases designed for distributed solutions may use UUID data types. Some other scenarios may require using strings. Otherwise, unless you have very specific requirements, using a numeric (integer) data type is recommended.
Besides the data type, choosing the right size is also important. Master tables, like OrderStatus
or DocumentType
, would probably require just one or two digits since they store only a few rows. But transactional tables, like Invoice
or OrderDetail
, might require bigger numbers due to the amount of data to be stored.
Remember to plan ahead because changing the definition of a primary key column is an expensive operation. If you use a UUID, the size is already defined. It requires 16 bytes, which consumes more space than most integer types (they use 1, 2, 4, or 8 bytes at most).
Extra Tip
To calculate the size savings, rather than considering the data in the parent table where the primary key is defined, consider the data in the child tables where it is referenced.
Using 8 bytes (BIGINT) rather than 1 byte (TINYINT) won’t generate too much wasted space in an OrderDetailStatus
parent table with only a few rows. However, it could mean a lot of wasted space in the OrderDetail
table, which could hold millions of rows.
How To Populate a Surrogate Primary Key
Now that we have seen the benefits of surrogate keys, let’s take a quick look at our car rental design. We have now implemented surrogate keys in both tables:
Whenever we add a new customer to our car rental company, we will ask for the “business” data that we require (name, address, date of birth, and license information). But how do we generate a value for the new column, named CustomerID
?
Most database engines provide solutions to make this operation simple. Both SQL Server and newer versions of Oracle (the two most popular RDBMSs nowadays) include the IDENTITY
option when defining a numeric column. (Other RDBMSs offer similar options, like AUTO_INCREMENT
in MySQL).
The column that is defined as “identity” (there can be only one in each table) is automatically populated using numeric values. So, you do not have to implement any logic or programming.
Most RDBMSs also include the option of defining sequences (number generators). They easily assign the values they return to a column, either setting the sequence as the default value for the column or programmatically by using a trigger.
Extra Tip
In case you are using a UUID data type as the primary key, SQL Server includes two functions, NEWID()
and NEWSEQUENTIALID()
, that can be set as default values for your UUID primary keys.
Do I Still Need Unique Keys?
When we add a surrogate key as the primary key, we simplify the way we uniquely identify a row in a table. It will therefore be simpler to establish relationships with other tables. We will also avoid changes in the relationship if the business definition of a unique combination varies.
But we still need to validate that the business rules that define uniqueness in the data apply. So, the answer to the above question is always, “Yes!”
In our example, we created a CustomerID
column and defined it as the primary key. However, we still need to have a unique key in the three columns that have license information to avoid entering a new customer with the same information as an existing one.
The diagram below shows a more complete physical model (created for SQL Server using the Vertabelo Database Modeler) for our car rental company. It includes additional tables with surrogate keys of different data types by using the IDENTITY
option.
Should I Always Use Surrogate Primary Keys?
Using surrogate keys is usually the most secure way to define a primary key. It provides a single-column relationship between tables that does not change when business data or rules change.
But there are situations, like junction tables used to implement many-to-many relationships, that do not require an additional surrogate key. This is because they do not have any business data (they are just a technical implementation to solve a complex relationship).
(To learn more about the different types of relationships, check out the articles “What Is a Many-to-Many Relationship in a Database?” and “What Is a One-to-Many Relationship in a Database?”)
An example of a junction table is shown below:
But, even if it is not required, you might consider adding a surrogate key to get consistent behavior in your model. This will also ensure that you already have a surrogate key if, in the future, you need to add additional business data to the table or relate to new child tables.
What’s Next?
So, we reviewed why primary keys are an important aspect in designing a database model. And we learned about the advantages and disadvantages of using surrogate keys over natural keys, including some practical examples using the Vertabelo Database Modeler as our database modeling tool.
Now, it is your turn to write! Use the comments section to tell us what you think about the article, ask questions, or share your database model concerns regarding surrogate keys.