When you read about normalization you usually get the set of conditions that a database in the nᵗʰ normal form should satisfy and the set of rules, a sort of a cook-book, for obtaining that normal form. But why these rules are safe to apply to your denormalized relations is a skip material. Here, I would like to present some elementary concepts on how we decompose relations and what can go wrong.
The concepts I want to present may seem a bit complicated but, on the other hand, they are so basic that you rarely see it written. Still, they are essential for understanding how the normalization process works. One may prepare a perfect meal just by following a recipe but no one can master the art of cooking without understanding what's going on behind-the-scenes. The same holds true with databases.
I will talk about relations with functional dependencies only. Some time ago, Agnieszka did an excellent exposition on functional dependencies. Here, I only briefly review a basic fact. In a relation R, an attribute B is functionally dependent on attributes A1, …, An.(denoted as A1, …, An → B), if for every possible scenario there cannot be two tuples in R that are equal on all A's but differ on B.
I will also need a concept of a natural join. Well, I bet you know what a natural join is but let me say something, just in case. If you have two relations, say Dogs (dog_id, dog_name, owner_id) and Owners(owner_id, owner_name, address) then a natural join of Dogs and Owners consists of all tuples from Dogs merged with tuples from Owners that have the same value on the columns with the same name. In our case this column is owner_id and we get a relation with all information about a dog and its owner in one row. A notation for a natural join is Dogs⋈Owners and it has columns dog_id, dog_name, owner_id, owner_name, address (a common column is not repeated). But be careful, if instead of dog_name and owner_name we had in both tables a column name, then a natural join would give us only dogs and their owners that have the same name! Usually this is not what we want.
How Do You Decompose a Relation?
Now, let’s go back to the subject at hand. Our working example will be the relation LazyAuthors.
Author | Title | Year |
---|---|---|
John Doe | The story of my life | 2011 |
John Doe | The story of my uncle's life | 2012 |
Mary Major | Quantum gravity for dummies | 2012 |
Let us assume that no author publishes two books in the same year (they are indeed lazy authors) and that no two authors publish books under the same title. We allow that there are two editions of the same book in different years. We can write these dependencies as:
author, year → title and title → author.
What happens when you decompose a relation? Let’s decompose LazyAuthors into two relations, AuthorYear and TitleYear. We choose for the first relation the attributes author and year and perform a projection of LazyAuthors onto these columns. Thus, AuthorYear may be understood as the result of the following query: select author, year from LazyAuthors
. Similarly, TitleYear may be understood as the result of: select title, year from LazyAuthors
. And here we have our new relations!
AuthorYear | TitleYear | |||
Author | Year | Title | Year | |
---|---|---|---|---|
John Doe | 2011 | The story of my life | 2011 | |
John Doe | 2012 | The story of my uncle's life | 2012 | |
Mary Major | 2012 | Quantum gravity for dummies | 2012 |
You may often meet a notation for projection with π and the set of attributes on which we project. Then, we could write AuthorYear as π{author, year}(LazyAuthors) and TitleYear could be written as
π{title, year}(LazyAuthors). In what follows I will use this notation.
Why can’t I just split my relation into two?
Splitting is just one side of a normalization process. Once you split an immediate question arises: how could I retrieve my original relation from the two split ones? After all, you do not want to lose some information by decomposing, do you? A usual answer for this is: use natural join! It is a good answer but it works only if we split our original relation neatly. Do you remember our two relations AuthorYear and TitleYear? If we perform a natural join AuthorYear⋈TitleYear we get something like
Author | Title | Year |
---|---|---|
John Doe | The story of my life | 2011 |
John Doe | The story of my uncle's life | 2012 |
John Doe | Quantum gravity for dummies | 2012 |
Mary Major | Quantum gravity for dummies | 2012 |
Mary Major | The story of my uncle's life | 2012 |
Look, we got some new tuples! Oops! Since two authors wrote a book in the year 2012 they were both joined with two titles by the equality of the year column. What’s worse, if we only have relations AuthorYear and TitleYear there is no way to reconstruct the original LazyAuthors. We will never know who wrote Quantum gravity for dummies.
Thus, if I decompose a relation R as πX(R) and πY(R), how could I secure the equality R = πX(R)⋈πY(R)? It would be enough that the intersection A of X and Y, A = X ∩ Y, implies all attributes in Y (or that A implies values of all attributes in X, for that matter). If it holds then any tuple from R generates exactly one tuple in πY(R) and any tuple in πX(R) will be joined with exactly one tuple from πY(R). This suffices for the equality to hold.
What does it mean for our example? Let's decompose LazyAuthors into TitleYear and AuthorTitle (remember Title → Author!). Now, after joining we get LazyAuthors again. Hurray! So, is such a decomposition correct? Not so easy.
Lazy Authors Shall Never Be Busy!
After we decomposed a relation we work only with the relations we obtained, the original one is lost and may be reconstructed only with a join. Problems may arise when AuthorYear and TitleYear change over time. Assume that we add to TitleYear a tuple (The story of my aunt’s life, 2011). This is legal since there are no constraints on TitleYear. After the join we get something like this.
Author | Title | Year |
---|---|---|
John Doe | The story of my life | 2011 |
John Doe | The story of my aunt's life | 2011 |
John Doe | The story of my uncle's life | 2012 |
Mary Major | Quantum gravity for dummies | 2012 |
Our LazyAuthors are no longer lazy! But why did this happen? After decomposing we lost one functional dependency. The attributes of author, year → title were split into two tables and we lost this constraint in our database. A lesson to remember is that when we decompose a relation, we should think about all possible data our database may acquire. Therefore, we distribute functional dependencies of our original table as well. If we lost a functional dependency then it is possible that sometime in the future we insert data to decomposed tables that will not satisfy our original constraints after a join.
Two Conclusions
We ended with two rules. When decomposing a relation R into πx(R) and πy(R), with A= X ∩Y, we should have:
- A implies all attributes in X or in Y
- All functional dependencies of R may be reconstructed from functional dependencies of πX(R) and πY(R)
Now, if all constraints of R are functional dependencies and if you meet the above requirements then you will have R=πX(R)⋈πY(R) (point 1) and you will always get a legal relation R after a join (point 2).
Of course, there are more constraints than functional ones and, of course, there is a whole theory behind the second point but this is already a long post. At last, let me mention that the functional dependencies of our LazyAuthors are based on an example by Beeri and Bernstein from 1979[1]. With these examples they showed that you cannot always get a Boyce-Codd normal form. But this is a topic for another story.
1 Beeri, Catriel and Bernstein, Philip A. Computational problems related to the design of normal form relational schemas. ACM Transactions on Database Systems 4(1979), p. 30-59.