Back to articles list

May 24, 2017 - 9 minutes read

Example er diagrams

The Book of All Knowledge: An Online Encyclopedia Data Model

Emil Drkušić

Database designer and developer, financial analyst.

Tags:

Ever wonder about the data model behind Wikipedia and other online encyclopedias? In this article, we look at what it takes to power that kind of site.

Knowledge is what we leave behind us after we’re gone. The next generations build on our ideas and assumptions and produce new ideas. Humanity has used physical materials, like ink and paper, to store knowledge. Now, we have a new and powerful medium – the Internet – to share and analyze information. We still need a structure to manage a digital online encyclopedia system. A data model that meets this challenge is the topic of this article.

What Should We Know About Encyclopedias?

Before taking a closer look at the data model, let’s explain a few things: how collecting and categorizing information started, where we are today, and what the requirements are for an online encyclopedia.

It all started with libraries.

Cataloguing knowledge started more than 5,000 years ago in the Middle East, especially in Mesopotamia (modern day Iraq) and ancient Egypt. During this era, the most important libraries were the Library of Ashurbanipal at Nineveh and the Library of Alexandria.
Why do we need encyclopedias?

Although this might seem like a stupid question, it’s not.☺

Libraries were “on the market” for a few thousand years before the first encyclopedia emerged. Unlike libraries, which store many different kinds of items, encyclopedias cover as many different topics as possible, usually in a set of books. The information isn’t detailed, but it gives insight into many subjects. Plus, encyclopedias were (and are) far more portable than the local library. Even if libraries were the best place to learn about something, encyclopedias provided enough information for the curious reader.
Who started it all?

Approximately 2,000 years ago in ancient Rome, there lived a naval and army commander who was also a philosopher, naturalist, and author. His name was Gaius Plinius Secundus and he wrote the first encyclopedia, Naturalis Historia. This became the model for all other encyclopedias. In this masterpiece, he covered topics from botany, zoology, geography, mathematics, astronomy, and many other fields.
How are encyclopedias organized?

We’ve all seen “paper” encyclopedias from different areas. They usually cover either general topics or one specific area like medicine or technology. They are organized into an alphabetized series. This enables us to find a topic quickly. Most encyclopedias include references to related subjects as well as the book and page number where we can find that information.
How are encyclopedias changing in the digital era?

I still remember the first time I saw a digital encyclopedia. It was the CD edition of the Encyclopedia Britannica. I was a kid at the time, but when I saw it, it became clear to me that “paper” encyclopedias were on the way out. Searching the encyclopedia may give a feeling of closeness to the subject you’re researching (especially if you’re doing history research) but it’s impractical compared to digital editions. And online encyclopedias moved things even further. We can find more terms and better descriptions than in paper editions because the cost of the materials is really low.

Having almost an entire body of knowledge online, available anytime, anywhere, and at no cost may be one of the most important achievements of modern civilization.
What should our model contain?

I won’t focus on any specific subject like technology, but rather I’ll create a model that could manage any encyclopedia type. We should be able to store articles or topics, categorize and tag them, and relate them with other topics. Having a digital format provides many advantages, such as the ability to support many different languages. Of course, “paper” editions come in multiple languages, but you would need a new set of encyclopedias for each language.

We’ll also need to track the history of changes and who made these changes. This is very important, especially in cases where many different contributors write articles, as with Wikipedia.

The Data Model

The data model consists of five main subject areas:

Articles
Categories & tags
Authors
Languages
Modifications

We’ll describe each subject area in the same order they’re listed.

Section 1: Articles

The Articles subject area contains two tables that store articles related to all the terms in our encyclopedia.

article_url – The UNIQUE URL address of the article.
article_title – A title describing the article subject. If the article title is “French Revolution”, we can expect info about the overthrow of the monarchy and the establishment of the First French Republic.
article_text – The body of the article. It can include text and links to related media files (pictures, videos, text documents, etc). We can expect that these media files will be stored outside the database on a content-sharing platform (e.g. YouTube videos, Scribd documents).
time_created – When the article was created in our system.
time_updated – The most recent time an article was updated. If the article was created and not updated after its creation, this column will hold the same value as time_created.
time_published – When the article was published on our site.
default_language_id –References the language dictionary and stores the ID of the article’s default language. In most cases, we can expect this will be English.

The second table in this subject area is the related_article table. For each article, we’ll store a list of all related articles. We can expect that the author who created a new article or updated an existing one will also add links to related articles. This will enable readers to perform additional research on a topic.

This table will store only an ID and a UNIQUE pair of references to the original article (article_id) and the related article (related_article_id). Please notice that we could have the same pair but with the article_id and related_article_id values switched. This could be a desired behavior, but if we want to store only one such pair and use it in both articles (article_id points to the related_article_id but related_article_id also points to the article_id), we should use code to disallow entering the “same” pair (where these two values are switched).

Section 2: Categories and Tags

The second most important thing in encyclopedias is that users can easily find the topic they want. We have already mentioned that we’ll store relationships between articles or topics. This simplifies the search for additional information and encourages readers to perform more research on the topic. But readers must find their topic first. This is where categories and tags help.

This first branch in this subject area is related to categories and subcategories. It consists of the following three tables.

The category table contains a list of UNIQUE category names used to organize articles. Users can filter articles by category and subcategory. As its name says, the subcategory table will hold a list of UNIQUE subcategory names, which more closely describe the subject of the articles. Each subcategory can be assigned to only one category.

The associated_subcategory table contains only UNIQUE foreign key pairs. These pairs enable us to see which articles appear in one or more subcategories. Our example article on the French Revolution could be listed in the “France” category, in the “French History” subcategory, and in the “Revolution” subcategory.

The second branch in this subject area deals with tags. It has only two tables: tag and associated_tag. The tag table stores a simple list of UNIQUE tag names, while the associated_tag table holds UNIQUE article_id – tag_id pairs that describe articles more closely. The idea behind tags is to assign each article as many tags as possible. This lets users search for more information using tags.

Section 3: Authors

The Authors subject area is simple and consists of only two tables. The idea is to store a list of all authors that contributed to any article. This lets users search for other articles written by the same author.

The list of all authors (contributors) to our encyclopedia is kept in the author table. I’ve decided to store only their first and last names, but we could add many other attributes like email, username, and password.

In the related_author table, we’ll list all UNIQUE pairs of articles and the authors that contributed to them. We won’t store details about author’s exact contribution, just that they did something with the referenced article.

Section 4: Languages

We’ve mentioned this before, but let’s have it one more time: One of the greatest advantages of digital encyclopedias is that users can easily access articles in many languages. Of course, we’ll need to translate the text, but the cost of publishing another article is almost zero. Going back to Wikipedia, we can also expect that we won’t have any translation costs because it will be done for free by volunteers.

We’ll store the list of all languages available in the language dictionary. It contains only two values, the short language_code (e.g. “EN”) and the full language_name (“English”). Both values must be UNIQUE.

A list of all translated articles is stored in the article_translation table. The table has almost the same structure as the article table; we need to store the article_url, article_title and article_text in the selected language. We’ll also keep the times when that article was created, updated, and published. Besides these values, we’ll store references to the related article and language. The combination of these two foreign keys forms the UNIQUE key of this table.

We could store all articles and their translations in the article table and create parent-child relationships between them, but I prefer this method.

The last table in this subject area is the translated table. This is the place where we’ll store information about the authors who worked on article translations and the article’s new target language. For each article translation, we’ll store every author_id that contributed to it as well the time when it happened.

Section 5: Modifications

The last subject area in our model is the Modifications subject area. It is intended to store the history of all changes to all articles. To do so, we’ll use two tables: the modification_type table and the modification table.

The modification_type table is a simple dictionary with a UNIQUE list of type_name attributes. We can expect values like “writing”, “translating”, and “editing”.

All modifications as well as their details are stored in the modification table. For each change, we’ll store:

article_id – The original article that was modified. In case the value of the article_translation_id attribute IS NULL, then modifications were performed on the original article (the article in the default language).
article_translation_id – References a translated article that was modified (if it was). If modifications were performed only on the original article (in the default language), this attribute shall contain a NULL value.
modification_type_id – The ID of the related modification type.
time_modified – The timestamp when that modification happened.
author_id – The ID of the author who made that modification.
details – All details that are important to that modification – e.g. a list of changes or the whole text of the modified article.

What Do You Think?

Today, we’ve discussed a data model that we could use to build an online encyclopedia similar to Wikipedia. The model presented could be used as the backbone of such an application.

There are many ways to improve this model, especially in the way we store media files and modifications. What do you think? What would you do to improve it?

Tags:

What Should We Know About Encyclopedias?

The Data Model

Section 1: Articles

Section 2: Categories and Tags

Section 3: Authors

Section 4: Languages

Section 5: Modifications

What Do You Think?

You may also like

Five Common Dimensional Modeling Mistakes and How to Solve Them

A Data Model for a Medical Appointment Booking App

A Database Model for a MOOC Platform