With age comes wisdom. Take it from someone who has spent a few decades dealing with databases, data models, IT guys, users, and software projects.
“The greatest teacher, failure is,” Master Yoda said. Decades of making mistakes with databases and data models – and learning from those mistakes – have proved to me that he was right. Yoda also said: “Always pass on what you have learned.” So here I am, passing on some of the lessons I’ve learned from tripping many times over the same data modeling stones.
You will learn a lot about data modeling by attending courses and reading books. But experience teaches you many things you won’t find in any course or book. Let me save you a few years of learning by summarizing some of the lessons that experience has taught me.
I hope the following lessons can help you avoid some bad moments in your career and make your life easier as a data modeler. You may find some useful tips to improve your modeling practice, your relationships with business users and teammates, and your method of dealing with particular data modeling challenges.
First, would you like something more specific about design errors? Check out these 19 online resources for learning about database design errors and these 7 common database design errors.
Think Globally, Act Locally
In a software development project, you have to interact with many different roles. On the one hand, there are the stakeholders: those who will benefit from the software once it is developed. Stakeholders are interested in software that’s developed on time and that works well. On the other hand are the developers: the workers who build the software product, one user story at a time.
In database modeling, stakeholders need to see the forest and not the trees. They need to see a conceptual data model (which doesn’t contain details) to understand how it works on a large scale. In the early stages of a project, the data modeler should provide a conceptual model – viewed from 1,000 feet high – for the stakeholders.
The conceptual model should include only the names of the relevant entities, their main attributes, and the relationships between them. To keep your model simple, ask yourself if the stakeholders really need to see an element before you add it. If the answer is no, just leave it out.
Ideally, a conceptual model should be viewable on one or two pages and should be easily understood by anyone who knows the business.
Conceptual models only show elements relevant to the stakeholders. Use detailed comments to better describe the purpose of the model.
Developers, on the other hand, don't need to see the forest; they need to see the trees. They need a physical model that shows the details of each data object that they have to work with. And they need it ASAP.
Does this mean that, as data modelers, we must create both a conceptual model and a complete physical data model up-front? In a traditional waterfall-style development process, yes. In an Agile development process, no.
In Agile methodologies, each developer takes one or more user stories to work with at the beginning of each development cycle. Our job as data modelers is to provide each developer with a physical sub-model that contains only the data objects that they need for the stories they have. It’s important that each physical sub-model respects the guidelines of the general conceptual model.
In each sub-model, you must include related objects, but only to maintain referential integrity. You don’t have to show them in the sub-model with all their attributes and details.
Detailed physical sub-models provide all that developers need to work on their user stories.
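To make this concrete, here is a rough sketch of a physical sub-model for a hypothetical “place an order” user story. All table and column names are invented for the example; the point is that the customer table appears only as a stub with its key, just enough to keep the foreign key valid, while its full definition lives elsewhere in the model.

```sql
-- Stub of the referenced table: only the key, to preserve referential integrity.
CREATE TABLE customer (
    customer_id INT PRIMARY KEY
);

-- The table the developer actually needs for this user story, in full detail.
CREATE TABLE purchase_order (
    order_id     INT PRIMARY KEY,
    customer_id  INT NOT NULL REFERENCES customer (customer_id),
    order_date   DATE NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL
);
```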
At the end of each development cycle, you merge the sub-models created during that cycle; thus, the complete physical model takes shape in parallel with the development of the application.
The speed you need to work with data models – especially in an Agile environment – can only be achieved if you use an intelligent design tool like Vertabelo to edit the models, validate them, create DDL scripts, and generate models from existing databases.
You can follow this step-by-step guide to database design to organize your daily work as a data modeler.
Always Keep Older Versions of Your Models
Don’t expect to set a data model in stone (so to speak) once you’ve designed it. You must be prepared to adapt the model to changes in requirements, business rules, or infrastructure – or just to correct design errors. But it is critical that, for every change you make, you save a version of the model as it was before the change. If possible, include in each new version a description of the changes made, together with an explanation of why they were made and who asked for them. That’s a bit more work and bureaucracy, but I’m sure you’ll thank me in time. By integrating your design tool with a version control system, the task becomes less of a burden.
Keeping versions of a model serves many purposes. If you deleted an object by mistake (or deleted an entire model – we’re only human), you can go back to the previous version to fix the error. If a developer complains that a change in the model broke some program or functionality, you can use the notes attached to the new version to explain why the change was made and show the developer the differences between successive versions.
In Agile projects, it is a good idea to save a version of the model at the end of each development cycle (i.e. at the end of a Scrum sprint) and with each release of the software. These versions will serve as safe restore points in case changes have to be undone.
Be Permissive in Staging Schemas
The data sources that feed data warehouses will not always be under your control. Most of the time, ETL (Extract, Transform, Load) processes involve data sources managed (or not) by third parties, so you cannot trust their format, consistency, and correctness.
In a typical ETL process, the information goes through staging schemas to be cleaned and purified. When doing database modeling for these schemas, you should apply more permissive criteria than you do with final schemas, where the coherence and consistency of the information must be strictly maintained.
A typical scenario in staging schemas for ETL processes is having to ingest information from comma-separated values (CSV) files. Having been in that scenario hundreds of times, I have adopted a habit: I prepare myself for the worst. Even though the source data providers swear that the CSV has a certain format – that the first line is always for headers, that it uses commas to separate fields and CR + LF to separate lines, that the alphanumeric data is quoted, and so on – I assume that none of that is necessarily true.
So, to process such files, I include a dedicated table in my staging schema. I always use “Bulk” as the suffix of the table name, and I give it a single VARCHAR field long enough to hold even the longest possible line of the CSV. Each line of the CSV file is stored in one row of that table, in that one field.
Staging schemas for ETL processes tend to be less strict (no PK, no FK, and so on), since they cannot enforce the restrictions imposed on the final schemas.
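A minimal sketch of such a “Bulk” table, with made-up names: the single wide VARCHAR column swallows each raw line of the file, no matter how malformed, and there are deliberately no constraints to reject anything at load time.

```sql
-- One row per raw CSV line; no PK, no FK, nothing rejected at load time.
CREATE TABLE customer_file_bulk (
    line_number INT,             -- optional addition for traceability: position in the file
    raw_line    VARCHAR(4000)    -- the entire CSV line, stored exactly as received
);
```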
Once the content of the CSV is dumped into the table, I apply processes that verify the format of the imported data. This filters out rows that cannot be processed because they do not meet the specifications. Then I apply a parsing process to the “good” rows, sending the parsed data to a fully structured table where each column corresponds to a field from the CSV file.
But I still have no guarantees of consistency. In this second table, I will not be able to define a primary key or relationships with other tables, because I still do not know whether the data maintains those relationships. Another process should be applied to verify consistency and filter out rows that break relationships or primary key constraints.
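As a sketch, the second table and one of its verification steps might look like this. The column names are invented, and dim_customer stands in for a hypothetical target table in the data warehouse; the idea is simply that the parsed table still has no keys, and a later query diverts inconsistent rows instead of letting them break constraints downstream.

```sql
-- Structured staging table: one column per CSV field, but still no PK or FK,
-- because consistency has not been verified yet.
CREATE TABLE customer_file_parsed (
    customer_code VARCHAR(20),
    full_name     VARCHAR(100),
    email         VARCHAR(100),
    signup_date   VARCHAR(10)    -- kept as text until its format is verified
);

-- Example verification step: find rows whose customer_code does not exist in the
-- (hypothetical) target table, so they can be sent to a reject table instead.
SELECT p.*
FROM customer_file_parsed p
LEFT JOIN dim_customer d ON d.customer_code = p.customer_code
WHERE d.customer_code IS NULL;
```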
In the design of my staging schema, I must include tables for garbage collection. All the information that does not pass the verification processes will be saved in these tables so that the user is aware of all the information that is not reaching the data warehouse.
Primary Key Do’s and Don’ts
If you make a mistake when determining the unique identifier of an entity or the primary key of a table, it will be very difficult – or downright impossible – to correct it. Although RDBMSs allow updating the values of fields that make up a primary key to correct an error, this is something you should never do on a live database; it can bring you a landslide of problems. For this reason, when determining the primary key of a table, I suggest you apply the following principles:
- Primary key values should never change. You should never include in a primary key an attribute that may change at some point. You may be tempted by the idea of using a “real world” attribute, such as a license plate ID, a social security number, or a passport number. However, although that data does not change in real life, in a database it can (and will) change – e.g. if a data entry error assigns the wrong social security number to a person.
- Try to use small fields, such as integers or short VARCHARs, for primary keys whenever possible. They are better managed by database systems and can dramatically improve query performance.
- Understand when to use surrogate keys, i.e. when a table design can’t comply with the above criteria. If your natural keys are subject to change, are too long, or are too complex, an integer surrogate primary key is the way to go. You can always create unique indexes on natural keys to avoid duplicates.
- Multi-column primary keys may be the best option, particularly for cases like junction tables in many-to-many relationships. The sketch after this list shows both a surrogate key backed by a unique index on the natural key and a junction table with a composite key.
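Here is a small illustration of those principles, with invented names: a person table keyed by a surrogate integer, a unique index guarding the natural key, and a junction table whose composite key covers the many-to-many link.

```sql
-- Surrogate integer key plus a unique index on the natural key.
CREATE TABLE person (
    person_id INT PRIMARY KEY,           -- surrogate key; its values never change
    ssn       CHAR(11) NOT NULL,         -- natural identifier, but it may need correction
    full_name VARCHAR(100) NOT NULL
);
CREATE UNIQUE INDEX ux_person_ssn ON person (ssn);

CREATE TABLE project (
    project_id   INT PRIMARY KEY,
    project_name VARCHAR(100) NOT NULL
);

-- Junction table for a many-to-many relationship, with a composite primary key.
CREATE TABLE person_project (
    person_id  INT NOT NULL REFERENCES person (person_id),
    project_id INT NOT NULL REFERENCES project (project_id),
    PRIMARY KEY (person_id, project_id)
);
```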
When and Why to Use Character Types for Numeric Data
During the construction of a data model, it is common to find numerical attributes that do not represent quantities or amounts. They are simple identifiers: phone numbers, zip codes, passport numbers, etc. Since you will not use these attributes for any mathematical operations, you don’t need them to be of a numeric type. (Is there any use in getting the average from a list of zip codes?) So it’s sometimes a good practice to define these as character strings.
I agree that if you create these attributes as numeric types, you can be sure they won’t accept stray symbols such as hyphens or spaces, and that leading zeros won’t sneak in to make 012345 different from 12345. Numeric data types – integers, more precisely – also offer huge performance advantages when used to index tables. But defining these attributes as numeric fields has some drawbacks you may want to avoid.
For starters, the size. When defining an attribute as numeric, you must determine its range of valid values. For example, if you define something as INT in an SQL database, any value greater than 2,147,483,647 (2^31 − 1) will be invalid; you’ll have to change its data type to one that accepts larger numbers. But if data becomes longer than expected in a character type, you simply increase its maximum number of characters – not a drastic change in a data model.
Besides, non-numeric characters are occasionally necessary. In a phone number, a leading zero or a “+” symbol makes a difference. If you want to avoid entering characters outside a certain set – e.g. anything other than the digits and hyphens that make up some zip codes – you can use check constraints, custom data types, or triggers to ensure that only the correct characters are allowed.
Another disadvantage of numeric data types is that visualization tools usually assign them default output formats with decimal symbols and thousands separators that distort their presentation. You don’t want a passport number to appear with a decimal point and two zeros to the right.
And what about surrogate keys? For attributes created as surrogate keys, it is advisable to use integers for reasons of performance and storage size. In general, the values of these attributes are generated automatically, so there is no risk of invalid data input. They are not displayed to the user, so you don’t have to worry about their output format. You just need to be sure not to generate values for those surrogate keys that exceed the maximum allowed by the data type. If your tables will have billions of records, a surrogate key of the INT type is probably insufficient; use a BIGINT instead.
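As a sketch of the “restrict the allowed characters” idea, here is a PostgreSQL-flavored check constraint on a character column (other engines would use LIKE patterns, custom domains, or triggers instead; the table and column names are made up):

```sql
CREATE TABLE address (
    address_id INT PRIMARY KEY,
    street     VARCHAR(100) NOT NULL,
    zip_code   VARCHAR(10)  NOT NULL,
    -- character type keeps leading zeros and hyphens, but the check
    -- constraint rejects anything other than digits and hyphens
    CONSTRAINT chk_zip_code CHECK (zip_code ~ '^[0-9-]+$')
);
```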
Adopt a Naming Convention and Stick to It
Naming criteria are not trivial. Once the model becomes a database, the names define the links and relationships between objects, so any change in a name can lead to disaster. That’s why a naming convention should be adopted from the beginning. It should leave nothing to chance or to the discretion of each developer. And it should avoid some bad conventions that hinder more than they help.
Additionally, a naming convention is a memory aid so that both designers and programmers can quickly come up with the name they are looking for without having to review object lists or data dictionaries.
The following are just some of the criteria that a naming convention should clearly state (a brief example of one convention in action follows the list):
- Allowed/forbidden characters: The use of any character that is not a letter, a number, or an underscore is commonly prohibited for object names. However, if for some reason it is necessary to use spaces or other characters in the names, the convention must establish that names are always written between delimiters, such as double quotes or square brackets (depending on the peculiarities of the target database engine).
- Reserved words: A naming convention should forbid the use of reserved words, such as commands, data types, or functions – unless the convention expressly indicates that names are always written between delimiters.
- Use of lowercase/uppercase/camel case: There must be a single criterion for the use of upper and lower case in names, as well as whether words in compound names must be pasted together or separated by a character such as an underscore.
- Use of prefixes: If the convention allows the use of prefixes for naming tables, fields, or other objects, it must be clear what function the prefixes serve and what information they provide.
- Plural/singular forms and abbreviations: It must be clear whether the plural or the singular form is used for naming objects. It should also state whether or not users can abbreviate words used in names.
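For illustration only, here is a table written under one possible convention – lowercase names, singular nouns, underscores between words, no prefixes, primary key named after the table. The names are invented; the point is that every choice above is decided once and applied everywhere.

```sql
-- Convention applied: lowercase, singular nouns, underscores between words,
-- no prefixes, primary key named <table_name>_id.
CREATE TABLE sales_order (
    sales_order_id INT PRIMARY KEY,
    customer_id    INT NOT NULL,
    order_date     DATE NOT NULL,
    total_amount   DECIMAL(12,2) NOT NULL
);
```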
Use Reverse Engineering to Audit Changes to a Database
It’s important to maintain synchronization between the data models and their respective databases. Otherwise, the models fail to be useful and anyone who needs to access the data will simply ignore them.
To keep models from getting out of sync, you can apply a security policy that prevents modifying objects directly in the database. But this is a double-edged sword, as it becomes an obstacle when you need to make a quick change directly on the database to solve a critical issue.
A compromise solution is to carry out periodic audits to detect changes in the database that the models don’t reflect. You can do this by generating models of the database through reverse engineering and then comparing them to the models in use. If differences appear, you can either adjust the model to reflect the database structure or reverse the changes that broke the synchronization.
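If your engine exposes the standard information_schema views, one rough way to support such an audit is to snapshot the column catalog when the model is synchronized and diff it against the live catalog later. This is only a sketch: the schema name, the snapshot table name, and the CREATE TABLE ... AS syntax all vary by environment and engine.

```sql
-- Snapshot of the current physical structure, taken when model and database match.
CREATE TABLE audit_schema_snapshot AS
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public';   -- adjust to your schema/database name

-- Later: columns that exist in the database but were not in the last snapshot.
SELECT c.table_name, c.column_name
FROM information_schema.columns c
LEFT JOIN audit_schema_snapshot s
       ON s.table_name  = c.table_name
      AND s.column_name = c.column_name
WHERE c.table_schema = 'public'
  AND s.column_name IS NULL;
```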
Apply Best Practices to Store Authentication Data
Any application that authenticates users must store login information in its database. This information is highly sensitive and should not be exposed to cyberattacks. As a data modeler, it is your responsibility to create a design that protects users' private information and avoids compromising their credentials.
Best practices for storing authentication data establish criteria such as not storing passwords (not even encrypted passwords), as this exposes them to various threats, such as brute force attacks. Instead, the database should store password hashes, which are equally useful for authentication purposes.
How does this affect you as a data modeler? You have the responsibility to define the proper objects to store user data and password hashes.
First, you must define a table to store the users’ basic information: username, full name, email address, user group, user role, and any additional information that the application could require. This table must have a surrogate primary key that respects the aforementioned good practices for primary keys. The username must be unique – which can be enforced with a unique index – but it should not be set as the primary key of the table so that it can be changed if necessary.
A basic set of tables to manage user authentication.
The password hash field can be included in the same user table. But if you want to enforce a policy against reusing old passwords, a password history should be stored for each user, with the date and time of each change. This also provides a record of every password change that can be used to keep the user informed about their account activity.
Password hashes are usually lengthy, even for passwords that are only a few letters long, so the field that stores them must support a sufficient number of characters. To make the authentication system more robust, in addition to the password hash you should include a field to store a “salt” value: a random piece of data that is combined with the password before hashing to ensure that the hash is different for different users, even if they choose the same password. It is also useful to add a field that records the hashing algorithm used: if at some point you need to switch to a more robust algorithm, this field will tell you whether a password was hashed with the previous one.
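Putting the pieces together, a minimal sketch of the two tables might look like this. Column names, lengths, and the composite key on the history table are illustrative choices, not prescriptions.

```sql
-- Users: surrogate primary key; the username is unique but not the PK,
-- so it can be changed without touching foreign keys.
CREATE TABLE app_user (
    user_id   INT PRIMARY KEY,
    username  VARCHAR(50)  NOT NULL,
    full_name VARCHAR(100) NOT NULL,
    email     VARCHAR(100) NOT NULL
);
CREATE UNIQUE INDEX ux_app_user_username ON app_user (username);

-- Password history: one row per password change, storing the hash, the salt,
-- and the algorithm used -- never the password itself.
CREATE TABLE user_password (
    user_id        INT NOT NULL REFERENCES app_user (user_id),
    changed_at     TIMESTAMP NOT NULL,     -- when this password was set
    password_hash  VARCHAR(255) NOT NULL,
    salt           VARCHAR(64)  NOT NULL,  -- random value mixed into the hash
    hash_algorithm VARCHAR(20)  NOT NULL,  -- lets you migrate to a stronger algorithm later
    PRIMARY KEY (user_id, changed_at)      -- keeps the full history per user
);
```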
Add Your Own Data Modeling Lesson
Don’t think that the lessons I shared here are all you need to know to keep yourself from making mistakes. There are plenty of other stones in the designer’s path that you may stumble over. Don’t be too hard on yourself; treat every mistake as a lesson and keep improving your skills and wisdom.
Do you have other lessons in data modeling you’ve learned from messing it up? Do as I did and share them with all of us. The comment section is yours.