Do you have a database modeling interview coming up? Make sure to prepare to give yourself the best chance of getting the job! Find out what you’ll be asked in addition to “What are your strengths and weaknesses?” and “Where do you see yourself in five years?”
So, you’re applying for a data modeling position...congratulations! It is a challenging job that involves great responsibility. Data modelers create a solid foundation for the information systems that manage the vital data of an organization.
Throughout my career, I’ve been on both sides of the interview table. I have been interviewed as an applicant, and I have interviewed applicants for jobs like the one you’re applying for. So, I know the types of questions included in interviews for database designer jobs.
First, we’ll take a look at the basic database interview questions. These questions will show how much you know about data modeling. If you can confidently answer all of the basic questions, you will be in a solid position to potentially get the job.
However, if you want to make sure that you stand out from the pack, I suggest checking out the advanced database interview questions.
There might also be some SQL questions in the interview, so you might want to check the most common advanced SQL interview questions and their answers. I also suggest checking out the common database design errors and how to avoid them.
Basic Questions
1. What are the different types of data models, and what does each represent?
Every database designer should be able to answer without hesitation that there are three types of data models: conceptual, logical, and physical.
A conceptual model represents the vision that the user has of the data. It lacks technical elements, such as tables and relationships, and instead is synthetic and shows a high-level view of the data. It is used to establish an agreement regarding which information the database will contain and what its structural outline will be.
A logical model shows the entities that make up the data model and the relationships between them without detailing technical issues such as database-specific data types, constraints, or indexes.
A physical model expresses the data model with all the implementation details for a database engine. This model extends the logical model by adding everything that describes how the information is physically stored in the database.
2. What is an ERD, and what is it for?
ERD stands for entity relationship diagram. The ERD graphically depicts a data model, showing the entities that comprise it and the relationships between them.
3. What are entities and relationships?
Entities are the objects for which information needs to be stored, and relationships indicate how these objects are associated with each other in real life.
Entities have a name that identifies them. For example, in a medical record system, there should be at least these three entities: Patient, Doctor, and Consultation. The Consultation entity is related to Patient and Doctor because, in real life, each consultation must have a patient and a doctor associated with it.
A simple ERD for a medical record system.
4. What are the attributes of an entity?
Attributes are the characteristics that describe an entity. Each attribute has a name and a data type or a defined set of valid data (a domain). Data types and domains of attributes are loosely defined in the conceptual and logical diagrams, but in the physical diagram, they must be clearly stated.
5. What are subtype and supertype entities?
Sets of entities with common attributes and characteristics can appear in a data model. When this happens, you can create a hierarchy of entities.
Supertype entities are higher in the hierarchy. They bring together the attributes that their subordinate entities (or subtype entities) have in common. Subtype entities inherit the common attributes and add the attributes that distinguish them from their siblings.
Let’s draw an example to clarify. Imagine you have to design a data model for a company that sells goods through three different channels: at its store, through domestic shipments, and through foreign shipments.
The three types of sales will have several attributes in common, such as the date of sale, products sold, and customer data.
And each type of sale will have particular attributes: the sub-entity representing in-store sales must include the name of the salesperson; the one representing domestic shipment sales must include the destination address, and the one representing foreign shipment sales must include special tax information and the destination country.
An ERD showing inheritance relations between supertype and subtype entities.
The use of inheritance relations, together with supertype and subtype entities, lets you build simpler and more efficient data models with fewer repetitive fields.
6. What’s the difference between a primary key and foreign key?
A primary key is an attribute or set of attributes of an entity that identifies each element of that entity. In a table, the values of the fields that make up the primary key cannot be repeated in different rows of the table. In relational databases, each table can only have one primary key.
A foreign key is an attribute or set of attributes of an entity that are linked to the attributes that make up the primary key of another entity. The attributes that link both tables must be of the same type, and it is convenient (although not mandatory) that they have similar names.
7. What do you know about normalization?
Normalizing a database means improving its design to better achieve the following goals:
- Eliminate unnecessary redundancy in the data
- Maintain the integrity of the information
- Avoid update anomalies
- Optimize storage space
- Simplify queries that involve multiple tables
There are six levels of normalization, called normal forms, but only the first three are commonly used. Each normal form encompasses a set of dependencies and properties that a database schema must satisfy to guarantee that certain failures will not occur.
The first normal form states that all attributes belong to a domain of atomic or indivisible values, such as dates, strings, integers, or coordinate pairs (to name a few). There should be no attributes that can be decomposed into multiple elements of different types.
The second normal form states that, in addition to complying with the first normal form, all non-key attributes of an entity depend on the primary key of that entity.
The third normal form states that, in addition to complying with the second normal form, there must be no transitive dependencies between the attributes of an entity. This means that an entity cannot have an attribute that depends on another attribute that in turn depends on the primary key.
If you would like, investigate the rest of the normal forms, like the Boyce-Codd normal form and the fifth and sixth normal forms. However, it’s unlikely that you will be asked about them. So, if you believe that they will confuse you more than they will help you, it’s sufficient to know that they exist.
If your interviewer enjoys digging into your knowledge, they will likely show you some ERDs and ask you which normal form they are in. I suggest you look at as many ERDs as possible before going to the interview. This will help you learn to see the degree of normalization.
8. What does cardinality mean in a database?
In data modeling, the concept of cardinality is applied to relationships between entities. In this sense, cardinality can be one-to-one, one-to-many, or many-to-many.
However, regarding the use and maintenance of databases, particularly when it comes to performance optimization, the term cardinality has a different use. It refers to the number of different values in a column of a table in relation to the total number of rows in the table.
To differentiate both concepts and avoid confusion, the first is called relationship cardinality, and the second is called data cardinality.
As we will see in the advanced questions below, data cardinality is important when defining dimensions in a data warehouse model.
9. What is a surrogate key?
A surrogate key is an attribute that serves as the primary key for an entity or table when it is inconvenient or impractical to use the attributes that make up the natural primary key.
The main reasons for creating a surrogate key are if the natural primary key for an entity is made up of many fields of complex types or if their values change over time, making it difficult to update related entities.
A surrogate key is artificially defined as a simple numeric value, typically an ID field of type integer. It will not vary over time, thus identifying the row to which it belongs as long as it exists.
It will score you points in the interview if you mention that the use of simple numerical values for surrogate keys improves performance in the creation and maintenance of indexes. This in turn makes queries and other operations on the data faster.
10. What is metadata, and what is its purpose?
In the field of data modeling, metadata is the information that describes the structure or schema of a database. The schema of a database is defined through various elements, such as tables, columns, keys, constraints, views, triggers, indexes, etc.
The information that describes these elements is the metadata. Relational database management systems (RDBMSs) maintain a special database where they store the metadata for all its databases.
The metadata repository can be queried to find out information about the objects that make up the schema of a database. For example, if you need to drop a table, it is convenient to query the metadata to verify that the table exists and avoid causing an error.
11. What is the difference between a database designer and a database administrator (DBA)?
Although both roles work on databases, each one does it at different times and in different situations.
Designers create new data models and/or modify existing ones. DBAs solve problems (such as performance or crashing issues) in operational databases, usually without altering the data model.
Designers usually work on development environments, that is, databases that are not yet operational. DBAs, on the other hand, work on databases in production.
In a healthy team, there should be communication and interaction between DBAs and designers. Designers share data models with DBAs, and DBAs must be able to read them and understand how the database is structured.
DBAs can suggest changes to the data model to solve operational problems. Furthermore, DBAs can help designers create and optimize scripts in forward-engineering tasks (see below).
12. What is the difference between a database designer and a data analyst?
Data analysts use the information contained in a database. Their job is to “squeeze” the data with mathematical functions to get insights.
However, databases intended for analytical processing have certain design requirements that are usually well-known to data analysts. This is why analysts and designers must work side-by-side when designing data models for analytical processing.
Data analysts must master the tools for querying and obtaining data, such as the DQL (data query language) subset of SQL commands. And, depending on their needs, they can suggest changes to the data model to achieve improvements that facilitate their tasks.
However, in small organizations, it is common for the roles of designer, analyst, and even DBA to be consolidated into a single person who is simply called “the data person.” So, make sure you get a clear idea of whether the position you are applying for is only to design data models or if you will also have to perform other database-related tasks.
Advanced Questions
If you fully understand the basic database modeling interview questions and answers, you have a great chance of succeeding in the interview. But if you want to show that you are capable of giving 110%, check out the following advanced questions and answers.
13. What is a data warehouse?
A data warehouse is a repository of information with large volumes of data.
Data warehouses are large databases used to store information for data analysis and reporting. They serve as support for OLAP (online analytical processing) tools, unlike more common databases, used for OLTP (online transaction processing).
Two types of tables are usually found in a data warehouse: fact tables and dimension tables. Read these articles to learn about how they differ and what they are used for.
14. What are dimensions and measures in a data warehouse?
In the entities that make up the model of a data warehouse, there are two main types of attributes: measures and dimensions.
Measures are numerical values that quantify information. Mathematical and statistical functions can be applied to the measures, such as sum, average, median, variance, maximum, minimum, etc.
Dimensions are qualitative attributes used to group and classify the facts recorded in the database. Dimensions are said to provide the who, what, when, where, and why of events in a business process.
Here is an example. Imagine a data warehouse that stores sales information from a supermarket. The typical dimensions would be product, item, brand, year/month, day-of-week, etc. The typical measurements would be quantity sold, amount sold, unit price, cost, etc.
15. What is denormalization?
Denormalization is a technique to add useful redundant information to the tables of a schema.
But...doesn’t normalization solve problems for us? Why would we want to denormalize tables?
Although they might seem to be opposing concepts, denormalization does not mean undoing normalization. Instead, it is an optimization technique that is applied to a data model after normalizing it to simplify query writing and reporting.
16. Which kinds of dimensions can be found in a data warehouse?
There could be many different kinds of dimensions in a data warehouse, depending on their use and characteristics. Let’s take a look at some of the most common ones.
- Conformed dimension: This dimension is related to more than one fact table or is repeated in different schemas, but it always maintains the same meaning. Conformed dimensions help maintain consistency between different reports by categorizing and describing different events with the same information. A date is a typical conformed dimension because its meaning does not change, no matter which fact table it is related to.
- Role-playing dimension: This conformed dimension can appear with different meanings in different reports. The Date dimension, which we mentioned before as a conformed dimension, can be Sales Date, Shipment Date, Reception Date, etc., all of them related to a single Dates table. In this case, the dimension becomes a role-playing dimension because it plays different roles in different situations.
- Outrigger dimension: This dimension contains a reference to another dimension table. For example, a Product dimension can have a reference to a Manufacturer dimension. Since outrigger dimensions can make it difficult to write queries and create reports, they are usually included in fact tables for convenience. In the example above, the fact table should contain both a reference to the Product dimension and the Manufacturer dimension.
- Degenerate dimension: This dimension’s only content is in its primary key. For example, the tracking number dimension in a data warehouse that stores shipment data needs no description other than the number itself. In these cases, a separate table is not required for the dimension because it can be located directly in the fact table.
- Junk dimension: This is a low-cardinality, abstract dimension, such as a flag or an indicator. It is common to gather all junk dimensions in a single table with all the possible value combinations.
There are many more types of dimensions, but surely there will not be enough time in the interview to cover them all. So, by mentioning the five above, you can show that you know what you are talking about.
If you want to research other types of dimensions, you will find plenty of reading material on the web. You can search for terms such as slowly changing dimensions, rapidly changing dimensions, shrunken rollup dimensions, swappable dimensions, or step dimensions.
17. What’s the difference between fact tables and dimension tables in a data warehouse?
Fact tables are the central tables of a data warehouse, containing measures, metrics, and facts associated with a specific business process. Dimension tables surround the fact tables, supplementing them with descriptive information for their attributes. Fact tables are related to dimension tables using foreign keys.
In the design of a data warehouse, the fact tables should be minimalist, containing the minimum set of fields necessary to reflect the facts of the business process that the data warehouse represents.
The dimension tables, on the other hand, should have all the descriptive information necessary to complement the fact tables. This information must be complete and “clean.”
For example, a dimension table shouldn’t have two rows with the same value in the description field, even with an almost identical value, such as the same text with different capitalization.
18. Which models are the most common for data warehouse schemas?
The most common database schema models for a data warehouse are star, snowflake, and galaxy.
The star schema is the simplest. It has a central fact table surrounded by several dimension tables. Queries to a data warehouse with a star schema are formed with simple joins.
A basic data warehouse schema with a star design.
The snowflake schema is somewhat more complex than the star schema. Some of its tables with complex dimensions require opening into a set of hierarchically related tables. This means that there are dimension tables surrounded by other dimension tables. So, queries to a data warehouse with a snowflake schema require several nested joins.
A snowflake-shaped data warehouse schema.
The galaxy or fact constellation schema contains multiple fact tables with dimension tables shared between them. It is called a galaxy because it is a combination of several star schemas.
19. What is the difference between reverse engineering and forward engineering?
In data modeling, reverse engineering is the process used to recreate the design of a data model (commonly, an ERD) from an existing database.
Forward engineering is the generation of the necessary scripts to create a database from the design of a data model. These scripts contain the DDL (data definition language) commands that will allow you to create or modify a schema on a database engine.
Go Get ’Em, Tiger!
Here’s one last tip before you head into the interview. If you don’t have data modeling experience, practice as much as you can by creating data models for domains you know well.
It could be a model to record all the information for a soccer tournament, a movie, or a music collection. Design as many models as you can, and practice until you feel confident.
Remember that the interview is only the first step. Knowing how to answer the above questions will help you get started. But after that, you will have to keep learning and improving.