Database modeling has its best practices and its guidelines. But even if you follow all the rules, sometimes things don’t work out. Here’s what I find challenging about database design.
Database modeling is the process through which a database developer or architect creates a data model for an application. The data model they create will describe the structure of the database, including the tables, the relationships between tables, and the data that’s stored in the database.
The goal of data modeling is to represent how database objects interact and how they solve a business problem. It seems like a straightforward practice, but designing and working with database models can be hard. Before we tackle that, let’s talk about why we make data models in the first place.
Why Is Database Modeling Important?
Database modeling enables you to translate high-level business requirements into a technical representation. When you have this technical representation, it’s easier to build the data structures needed.
This is even more important in our current world of data analytics and Big Data technologies. Most organizations want to store and then analyze all of the data they create. For this, they need databases; to create databases, it’s best to start with a plan – a data model.
As we create the data model and evolve it from the conceptual (simplest stage) to the logical and physical stages (most detailed and specific), we are creating more than a data framework. We’re creating order in what is most often a data mess.
Establishing a data model can be difficult, but it’s important to have the model in place. It provides companies with the flexibility to build new products and use data-fueled technologies like AI. Clean, organized data can simplify and speed up the insight-gathering process.
The worth of a good database model is often underestimated, but it is hugely important.
Why Can Database Modeling Be Difficult?
Although the general principles of database modeling are relatively simple, there are often unforeseen edge cases when you implement the database in the real world. These edge cases (uncommon errors or unexpected use cases) can impact the structure of the data model. They might even require tracing things back to the conceptual model and making changes from there.
Creating New Systems
When creating new systems, edge cases are usually identified only when the physical data model is not flexible enough to ingest all of the data. As businesses evolve, so does their data; this is why the data models in database systems must evolve as well. But with continuous evolution, we might hit a point where modifying the current data model brings less benefit than designing a new model from scratch.
When creating the initial conceptual model, database designers usually try to focus on the business requirements presented by the business team. Then, after the business has approved the conceptual model, the designers start transforming it into a logical data model. Even at this stage, there is little chance of finding out whether the data model is flexible enough to accommodate new data without major system modifications.
The most difficult metric to estimate and design for is performance. It’s always difficult for the business team to answer questions about the variety or amount of data they expect their system to ingest. Because this information is lacking, the general answer always tends to point to upper limits, like “millions of data points per day” and “any type of data”. There’s also very little information about the read/write ratios, which are important when designing tables, indexes, partitions, ETL processes, etc.
These unknowns cause difficulties in finding the right normalization levels for database tables, choosing the right data types, tuning for JOIN statements, etc.
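As a sketch of how these unknowns shape the physical design: if the business can only promise “millions of data points per day”, one hedge is to partition the ingest table by time from day one. The table and column names below are hypothetical, and the declarative-partitioning syntax is PostgreSQL’s; adapt both to your own environment.

```sql
-- Hypothetical high-ingest table, partitioned by day because the only
-- volume estimate was "millions of data points per day".
CREATE TABLE sensor_reading (
    sensor_id   integer       NOT NULL,
    recorded_at timestamptz   NOT NULL,
    value       numeric(12,4) NOT NULL,
    PRIMARY KEY (sensor_id, recorded_at)
) PARTITION BY RANGE (recorded_at);

-- One partition per day keeps each index small and lets old partitions
-- be detached cheaply once the real read/write ratio becomes clear.
CREATE TABLE sensor_reading_2024_06_01
    PARTITION OF sensor_reading
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');
```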
Updating Older Systems
When working on an older system that needs an upgrade, you might also face problems, sometimes even more than when starting a new model.
This is because older systems frequently have no documentation of their design. Sometimes these systems are so old that documentation wasn’t yet a recognized best practice when they were built; shared institutional knowledge was the norm. Today, documentation makes it into most discussions about software projects.
True Story: Updating an Old Data Model Is Not Easy!
I worked on a system where, due to technical limitations when it was built (the late 1980s), column names were limited to 10 characters. There was no documentation, and all of the initial developers were retired, which made things immensely difficult.
The business team was fairly young relative to the system, with about 15 years of experience; they had inherited the role and the product. When it was time to modernize, nobody could explain the data model, the columns, or the relationships between tables. The only thing left to do was debug the application, which was written in Gupta, and see how it interacted with the database.
Another issue was a complete lack of explicit relationships between tables, mostly because this was not possible in the original system. There were no naming conventions, so I couldn’t look for similar column names and use this to infer relationships between tables.
Normalization was also not fully implemented. The older system worked mainly with flat, tabular data, so it was almost completely denormalized. Due to those past technical limitations, queries always read the entire table and then selected the needed data inside the application code. So, for performance reasons and due to these limitations, the data model was never fully normalized.
In this scenario, when the software is business-critical and has been running fine for decades, it’s hard to convince the business team that a full redesign of the data model is required. If there’s no or very little documentation, the tasks can be titanic. Still, they can be overcome through teamwork.
These are only a few of the problems that can appear when updating a legacy data model. However, these issues and others can be solved by implementing some best practices for legacy databases.
The Most Common Problems in Database Modeling
As mentioned above, there are multiple possible problems with your data model that reveal themselves only with time. Still, it’s useful to know the most common problems, how to spot them, and how to avoid or fix them before they turn into a bigger issue. Below are some of the problems I’ve encountered:
- Not taking into consideration the company’s budget. Nearly all companies have a budget for their project. The data modeler needs to respect this budget and use their experience to stay within it. They need to listen to the business team’s requirements and ask the right questions to determine the scale and scalability of the model. Most companies will expect a state-of-the-art data model with infinite scalability. But when you explain to them that a cutting-edge data model is very costly, they will most likely back off and lower their expectations to a realistic level. That is where you should propose a model to solve their most critical business goals.
- Not designing for reporting, even when reporting is not a direct requirement. Even though a project does not explicitly ask for a data model that supports reporting, it’s very likely that in the future the company will want to do some form of reporting. Insist the business team think about their top 10 metrics and their top 10 read, write, and update query patterns. Explain to them that it will be worthwhile to monitor their business metrics right from the start; it will enable them to react quickly, before a problem becomes critical. This can gain your client’s trust and demonstrate that you’re a team player. Down the line, this could mean more business for you.
- Unenforced or nonexistent naming conventions. Because a model evolves with time, it is very likely that multiple people will be expanding it. Having a naming convention from the beginning ensures that new people working on the data model keep things organized when creating or updating database objects. (One possible convention is sketched after this list.)
- Not differentiating between keys and indexes. Foreign keys are used to enforce business rules, whereas indexes are used to improve database access and data read performance. However, I have seen many databases where the expectation was that a foreign key would not only enforce a business requirement but also provide a performance boost. (The keys-versus-indexes sketch after this list shows the difference.)
- Poor or missing documentation. I’ve talked about the disadvantages of having no (or very little) documentation when modeling a database. Although documentation is usually considered at the end of a project, we should document our model as it evolves. I would force myself to find the time to do this – or, if necessary, ask for time to make sure the documentation is created as the project evolves.
- Overusing GUID or UUID primary keys. Although GUIDs (Globally Unique Identifiers) or UUIDs (Universally Unique Identifiers) seem to be the perfect match for primary keys, most often they are not. In many databases, a primary key automatically creates a clustered index on that column; if this is the case, then GUIDs or UUIDs are a bad choice.
A general rule of thumb for primary keys is that their values should increase with each row added to the table. But GUIDs are generated as unique yet effectively random values; inserting them causes a lot of data shuffling (page splits) to keep the clustered index sorted.
For this reason, I prefer to use auto-incremented integer columns as my primary keys; they also decouple business data from the columns used for relationships and database integrity/performance. (See the primary-key sketch after this list.)
- Failing to plan indexing before deploying the model. As mentioned above, it’s important to ask the business what their most common read, write, and update scenarios are. This lets you add the most important indexes right at the start, which means you can give the client a fast-performing database that will require little to no intervention for some time. (The composite index in the keys-versus-indexes sketch after this list is one example.)
- Keeping redundant data in the production database. Keeping more data than needed in your production database hurts backup and restore performance. It’s generally considered fine to keep “cold” data in a production database if that data might still be modified. But old data that will never be modified – like completed purchases or sales totals from more than a few years ago – can generally be moved to an archive or reporting database. (A sketch of such an archiving move follows this list.)
- Performing soft deletes at scale. This is a tricky one, but it can be important to consider in combination with the point above. Deleting rows from a large table can put a significant load on the database, because every index on that table – including the clustered index, which may need to be reshuffled – has to be updated. This might be a scenario where keeping the data in place and merely flagging it as deleted is a well-thought-out and appropriate decision. (See the soft-delete sketch after this list.)
- Improperly using semi-structured data types like JSON. Most modern databases, even relational ones, allow you to store semi-structured data in the JSON format. Storing some information (like the contents of HTTP requests) in a JSON column can be an easy solution when you don’t want to constantly modify your schema. However, it can also cause a lot of difficulty when you’re trying to read data in the JSON fields.
Sometimes developers choose semi-structured data fields just because it’s easy; they don’t want to take the time to think things through and find the structure they actually need. There’s a very simple way to verify whether your JSON data can be converted into a table structure: check if every instance of your JSON data has the same four or five fields. If it does, you might want to think about decomposing it into columns. (The last sketch after this list shows this check.)
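To make the naming-convention point concrete, here’s a minimal sketch of one possible convention – singular table names, `<table>_id` surrogate keys, and `fk_`/`ix_` prefixes. The tables are hypothetical and the syntax is PostgreSQL; any convention works, as long as it’s written down and enforced.

```sql
-- One possible convention: singular table names, <table>_id surrogate
-- keys, and prefixed constraint names that make their purpose obvious.
CREATE TABLE customer (
    customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    full_name   text NOT NULL
);

CREATE TABLE invoice (
    invoice_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL,
    issued_at   timestamptz NOT NULL,
    CONSTRAINT fk_invoice_customer
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);
```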
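Next, the keys-versus-indexes distinction, continuing the same hypothetical tables. In PostgreSQL, for instance, a foreign key enforces the business rule but does not automatically index the referencing column – reads only speed up once you add that index yourself.

```sql
-- The foreign key (fk_invoice_customer above) enforces the rule:
-- every invoice must belong to an existing customer. It does NOT
-- make joins faster. An index on the referencing column does that.
CREATE INDEX ix_invoice_customer_id ON invoice (customer_id);

-- Planning indexes up front: if "a customer's invoices for a date
-- range" is a top read pattern, a composite index covers it (and can
-- replace the single-column index above, since it shares the prefix).
CREATE INDEX ix_invoice_customer_id_issued_at
    ON invoice (customer_id, issued_at);
```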
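Here’s the primary-key sketch: an auto-incremented integer key versus a random UUID key. The tables are hypothetical; `gen_random_uuid()` is PostgreSQL syntax, and the clustering behavior described applies to databases (like SQL Server or MySQL’s InnoDB) that physically order rows by the primary key.

```sql
-- Preferred: an auto-incremented surrogate key. New values always sort
-- last, so inserts append in order, and the key stays decoupled from
-- business data (the order number lives in its own column).
CREATE TABLE sales_order (
    sales_order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_number   text NOT NULL UNIQUE,  -- business identifier
    status         text NOT NULL,
    created_at     timestamptz NOT NULL DEFAULT now()
);

-- Risky as a clustered key: random UUIDs land at arbitrary positions,
-- forcing page splits to keep a clustered index sorted.
CREATE TABLE sales_order_uuid (
    sales_order_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    order_number   text NOT NULL UNIQUE,
    created_at     timestamptz NOT NULL DEFAULT now()
);
```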
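For the redundant-data point, a sketch of an archiving move using the hypothetical `sales_order` table above, assuming a `sales_order_archive` table with the same structure already exists. The `DELETE ... RETURNING` form is PostgreSQL-specific and moves the rows in a single atomic statement.

```sql
-- Move completed orders older than three years out of production.
-- The deleted rows are handed straight to the INSERT, so the move
-- succeeds or fails as a whole.
WITH moved AS (
    DELETE FROM sales_order
    WHERE status = 'completed'
      AND created_at < now() - interval '3 years'
    RETURNING *
)
INSERT INTO sales_order_archive
SELECT * FROM moved;
```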
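And the soft-delete sketch: instead of a physical DELETE that forces every index on a large table to be updated, flag the row and filter it out of reads. The partial-index syntax is PostgreSQL-specific; the table is the same hypothetical one as above.

```sql
-- Flag the row instead of physically removing it.
ALTER TABLE sales_order ADD COLUMN deleted_at timestamptz;

UPDATE sales_order
SET deleted_at = now()
WHERE sales_order_id = 42;

-- A partial index keeps queries over live rows fast without
-- carrying the soft-deleted ones.
CREATE INDEX ix_sales_order_live
    ON sales_order (created_at)
    WHERE deleted_at IS NULL;
```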
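Finally, the JSON check from the last item: if (almost) every stored document carries the same keys, those keys are really columns in disguise. The `jsonb` function and `->>` operator are PostgreSQL-specific, and the `http_request_log` table and its fields are hypothetical.

```sql
-- Quick check: how often does each key appear across the documents?
-- If a handful of keys appear in (nearly) every row, decompose them.
SELECT key, count(*) AS occurrences
FROM http_request_log,
     jsonb_object_keys(payload) AS key
GROUP BY key
ORDER BY occurrences DESC;

-- Extracting the common keys into typed columns:
SELECT payload->>'method'        AS method,
       payload->>'path'          AS path,
       (payload->>'status')::int AS status
FROM http_request_log;
```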
Are You Still Up for Database Modeling?
Once you get to the more complex parts, database modeling is never easy. But it sure is fun to build a model that scales without problems, provides flexibility, meets the client’s needs, and works faultlessly!
The world of databases is extremely interesting and challenging. If you are up for challenging yourself, learning, and pushing through obstacles, it’s a rewarding career choice. To learn more, please check our other articles on database modeling.