If you have this guide at hand at all times, you don’t need to distract your database experts every time a data need arises during a software development process.
Software engineering jobs are a good fit for people capable of dealing with diverse concepts. These range from requirements analysis, team leadership, and project management to scripting languages, testing techniques, and continuous integration, just to name a few. Then, there are a bunch of important database concepts for a software engineer to know: normalization, denormalization, SQL, NoSQL, ERDs, query optimization, etc. The list goes on!
In short, software engineering is for those who can do a little bit of everything while paying a lot of attention and care to each task. If you are one of those people, congratulations! You have a great career full of interesting challenges ahead of you.
I am a connoisseur of the database universe with experience on several software development teams. Let me guide you through the most important database concepts for a software engineer.
If you like checklists – if you are or expect to be a software engineer, you should like them – let's start by looking at a quick checklist of the database concepts you need to know:
- Database design
- SQL language
- Relational database queries and optimization
- Transactions
- Object-relational mapping
We will analyze these in depth below.
Of course, there are more besides the ones included in this shortlist. To get a full view, read this article about the best online resources for learning SQL and database concepts.
Database Design
Part of a software engineer’s job is to make sure all deliverables passed from one member of the development team to another meet their objectives. For example, a requirements analysis should be clear, concise, and complete. A use case should contain all the information necessary for a coder to implement it. A piece of source code must fulfill the function explained in the use case. A test report should accurately describe all the defects found during a test run.
Also, a data model needs to fit a set of requirements and serve as the basis for generating a database with good performance and data integrity. Normally, software engineers do not spend time creating data models on their own. But they need to be able to sit down with a data modeler, analyze a model, and determine whether it is well designed. That is why ERD knowledge stays among the most critical database concepts for a software engineer.
How to Look at an ERD
Entity-relationship diagrams (ERDs) provide a graphical representation of the relationships between the objects that make up a data model. Data modelers use them primarily as tools to document and communicate design decisions. As a software engineer, you should, at a minimum, be able to read an ERD, understand its logic, know what it represents, and determine if it correctly reflects the requirements of the software product being developed.
Having all that information, you can ask the data modelers the reasons for their design decisions and verify if those decisions are the right ones. You also need to detect, by looking at an ERD, if the database designer has misinterpreted a requirement or introduced an error in the data model that may lead to a serious bug in the software.
Concepts related to the interpretation of an ERD you should understand are the cardinality of relationships (one-to-one, one-to-many, or many-to-many), the choice of primary keys, the meaning of certain schema structures such as parent-child relationships, and common data warehousing schema types.
As a software engineer, you should be able to read almost any ERD and understand what it means.
ERDs are presented at three different levels: conceptual, logical, and physical. It is usually sufficient for the software engineer to be able to read and understand the conceptual and logical models since the physical models are derived from them, only adding information necessary to implement the model in a particular database system.
On the other hand, with the aid of an intelligent data modeling tool such as Vertabelo, the physical diagrams can be generated automatically from the logical diagrams with complete confidence they are error-free. For this reason, a software engineer usually does not need to worry about reviewing physical diagrams.
One important thing every software engineer needs to be able to see in an ERD is whether the database schema is normalized and whether it needs to be. This brings us to the next item on our checklist.
Normalization and Denormalization
In transactional databases, normalization ensures database insert/update/delete operations do not produce anomalies or compromise the quality and integrity of the information. For identifying whether a design is normalized, important database concepts for a software engineer include primary keys, foreign keys, attribute dependencies, and surrogate keys.
An example of the problems associated with a non-normalized database is the potential anomalies that may appear in an e-commerce application. Such problems include the same product appearing twice in a sales report with two different names as if they were two different products.
In addition to avoiding these anomalies, normalization eliminates redundancies, which in turn reduces the storage space required by a database and improves query speed.
There are many normal forms, each one with its set of conditions. But you don’t need to memorize all of them. You can always read a guide on how to remember database normal forms. Most of the time, it’s sufficient to create schemas that comply with just the second and third normal forms.
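To make the anomaly concrete, here is a minimal sketch of a normalized e-commerce schema, using SQLite as a stand-in engine. The table and column names are hypothetical, not from the article's ERD; the point is that the product name lives in exactly one place, so a sales report can never show the same product under two different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: product_name is stored once, in products.
cur.executescript("""
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
""")
cur.execute("INSERT INTO products VALUES (1, 'Blue Mug')")
cur.execute("INSERT INTO order_items VALUES (100, 1, 2)")
cur.execute("INSERT INTO order_items VALUES (101, 1, 1)")

# Renaming the product once is enough: every report that joins on
# product_id picks up the new name, with no risk of duplicates.
cur.execute("UPDATE products SET product_name = 'Cobalt Mug' WHERE product_id = 1")
rows = cur.execute("""
    SELECT p.product_name, SUM(oi.quantity)
    FROM order_items oi
    JOIN products p ON p.product_id = oi.product_id
    GROUP BY p.product_name
""").fetchall()
print(rows)  # one product name, quantities aggregated correctly
```

Had `product_name` been copied into every order row, a rename would have left old rows with the old name, producing exactly the two-names-for-one-product anomaly described above.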
In databases intended for analytical processing rather than transactional processing, you may need to make concessions to normalization to improve the performance of certain queries. These concessions are known as denormalization techniques. They usually involve adding redundant attributes to avoid an excess of lookup-table joins, which add complexity and cost (in time and processing resources) when the database engine resolves a query.
Denormalization techniques are used only in schemas where data update is carried out by automatic or controlled processes. This is to avoid the risk of anomalies due to arbitrary information updates.
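As a small illustration of the trade-off, here is a hypothetical denormalized reporting table (names are my own, not from the article). The product name is copied into each fact row so that analytical queries need no lookup-table join; as the paragraph above notes, a controlled load process, not end users, must keep the redundant column consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized on purpose: product_name is a redundant copy maintained
# by a controlled load process, never edited directly by users.
cur.executescript("""
CREATE TABLE sales_fact (
    sale_id      INTEGER PRIMARY KEY,
    product_id   INTEGER NOT NULL,
    product_name TEXT NOT NULL,   -- redundant attribute
    amount       REAL NOT NULL
);
""")
cur.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, 10, 'Blue Mug', 9.5), (2, 10, 'Blue Mug', 9.5)],
)

# The report reads a single table; no join to a products lookup table.
total = cur.execute("""
    SELECT product_name, SUM(amount)
    FROM sales_fact
    GROUP BY product_name
""").fetchone()
print(total)
```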
The SQL Language
Every software engineer needs to have a basic knowledge of SQL (Structured Query Language) for querying databases or for creating or modifying tables, indexes, views, or even a stored procedure or a trigger when needed. This knowledge allows you to perform some basic database tasks without taking time away from a DBA or database programmer.
A thorough knowledge of SQL is probably not necessary for a software engineer. But you should have at least a full understanding of the SELECT command with its many clauses. It is especially important to understand the usefulness of JOIN clauses with all their variants – LEFT, RIGHT, INNER, and FULL – and the differences among them.
Set theory may be helpful to better understand SQL JOINs.
To make good use of the database engine for writing efficient queries, you need to grasp the logic behind the relationships between the tables in an ERD. This helps you write the JOINs correctly in queries that include multiple tables. As a basic rule of thumb, fields involved in foreign key relationships between two tables are usually best suited for JOINs between them in a SELECT. For example, in the following ERD, you see the tables PAINTINGS and BUYERS are linked by the fields BUYER_NAME in PAINTINGS and NAME in BUYERS.
You may want to use the fields involved in a foreign key relationship to JOIN tables in a SELECT statement.
Using the fields involved in a foreign key relationship to join both tables in a SELECT gives you reassurance that your query will perform well.
SQL commands are subdivided into groups according to their functionality. The SELECT command with all its clauses forms a group in itself called DQL (Data Query Language).
There are two other commonly used groups of commands. DML (Data Manipulation Language) is used to insert, delete, or update rows in tables. DDL (Data Definition Language) is used to alter the structure of objects in a database. Examples where DDL is used include creating new tables, creating new fields in a table, and creating a view.
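One statement from each group, side by side, may help fix the distinction. The table and column names are made up for the example; the statements themselves are standard SQL run here through SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL (Data Definition Language): alter the structure of database objects.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE customers ADD COLUMN email TEXT")  # new field

# DML (Data Manipulation Language): insert, update, or delete rows.
cur.execute("INSERT INTO customers (id, name, email) VALUES (1, 'Ann', 'ann@example.com')")
cur.execute("UPDATE customers SET name = 'Anna' WHERE id = 1")

# DQL (Data Query Language): the SELECT command and its clauses.
row = cur.execute("SELECT name, email FROM customers WHERE id = 1").fetchone()
print(row)  # ('Anna', 'ann@example.com')
```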
Query Optimization
Software engineers have too many day-to-day responsibilities to make improving the performance of a query a priority. Ideally, they should delegate this task to a SQL programmer, a DBA, a data modeler or, better yet, all of them together.
But even so, it is good to know what optimizing a query consists of and, in particular, how the creation of an index sometimes reduces the time a query takes to execute from hours to seconds. It is also good to be able to assess whether a DBA is telling the truth or just wants to avoid the task when they tell you a query cannot be optimized any further.
Query analysis tools allow you to “get inside the head” of an RDBMS to know what logic it applies when resolving a query. Query analysis breaks down each step required to get the data for that query and the estimated cost of each of those steps.
Optimizing a query often consists of finding the most time-consuming steps in the query execution plan and creating indexes to speed them up. You can read all about the very basics of index creation and solve some basic database performance problems yourself.
When you analyze a query execution strategy applied by an RDBMS, pay special attention to the steps that require the most work from the RDBMS. These include traversal of all the records in a table (called full table scan) or sequential traversal of the entries in an index (index scan).
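You can see both plan shapes with SQLite's EXPLAIN QUERY PLAN, a minimal stand-in for the query analysis tools of larger engines (the exact wording of the plan text varies between SQLite versions; table and index names here are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 100, float(i)) for i in range(1000)])

query = "SELECT total FROM orders WHERE customer_id = 42"

# Without an index on customer_id, the plan is a full table scan.
before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)  # plan text mentions a SCAN of orders

cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# With the index, the plan changes to a search using the index.
after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)   # plan text mentions the index
```

The same before-and-after check, done with your engine's own EXPLAIN facility, is often the quickest way to verify that a new index actually changed the execution plan.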
A normalized database facilitates query optimization, as it does many other aspects of regular database use. This is yet another reason to spend time ensuring the data models are normalized.
Transactions
When an application sends data to a database, it commonly sends a sequence of insert, update, and delete operations. For example, recording data for an invoice may involve inserting rows in some tables, updating rows in others, and perhaps deleting rows in others.
All of these operations must be completed in their entirety or not run at all. If an error interrupts the sequence partway through, the information in the database can become inconsistent, causing all sorts of data errors.
Transactions avoid this problem by preventing a sequence of interrelated operations from being partially executed. When a transaction is started, any error in the middle of the sequence causes the database to roll back to its state before the sequence began, leaving the data as it was.
It is the responsibility of back-end programmers to open a transaction each time they start a sequence of interrelated write operations and close that transaction when finished.
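A minimal sketch of that responsibility, using Python's sqlite3 module (which opens an implicit transaction before write statements): two updates form one money transfer, a simulated failure interrupts them, and the rollback leaves the data exactly as it was. Table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # The two updates form one unit of work: a transfer of 30 from 1 to 2.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    # Simulate a failure before the second write completes.
    raise RuntimeError("connection lost mid-transfer")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()
except Exception:
    conn.rollback()  # undo the partial work, leaving the data consistent

balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(balances)  # both balances unchanged: [(1, 100.0), (2, 50.0)]
```

Without the rollback, account 1 would be missing 30 that account 2 never received, which is exactly the inconsistency described above.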
Transactions must lock the tables they use to ensure the atomicity of a sequence of operations. One sequence should not collide with another that uses the same tables. A part of the database engine, called the transaction coordinator, acts as a traffic manager and instructs transactions to wait until another transaction using the same tables finishes.
The wait may be very long if a transaction takes too long to execute. Programmers must design their code intelligently to make transactions as fast as possible. This, of course, depends on the data model being correct and properly normalized (I cannot stress enough that the model must be normalized). It is also vital that all tables involved in a transaction have indexes that minimize transaction time, thus making the job easier for the database engine.
Another strategy to help avoid concurrent transaction problems applies particularly to data maintenance of master tables. If it is left to the users’ discretion to update, for example, the product or customer tables at any time, this may lead to deadlocks on these tables that prevent normal order entry. The correct strategy for these cases is to have a staging scheme for master data, from which the information is periodically and massively dumped to the production database on scheduled days or times.
Object-Relational Mapping
Object-relational mapping is a common area of work for backend programmers and database designers. As a software engineer, you need to know how work is done there and decide what part is for the database designer and what is for the back-end programmer.
Some examples of popular ORMs are Entity Framework, Hibernate, the Django ORM, and SQLAlchemy. Which of these you use depends on the programming language (e.g., C#, Java, Python) or the framework used to program the backend of an application.
ORMs make the logic of an application independent of the particularities of a database engine. This is particularly useful when it is known in advance an application must work (without adaptations) with different database engines.
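To make the idea concrete without depending on any particular ORM library, here is a toy hand-rolled mapper: the application code works with objects, and a small mapping layer translates to SQL. Real ORMs such as SQLAlchemy or Hibernate generate this layer for you, with a backend per database engine; everything below (class names, schema) is invented for illustration.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Customer:
    id: int
    name: str

class CustomerMapper:
    """Toy mapping layer: translates Customer objects to/from SQL rows."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")

    def save(self, c: Customer) -> None:
        self.conn.execute("INSERT OR REPLACE INTO customers VALUES (?, ?)", (c.id, c.name))

    def find(self, id: int) -> Customer:
        row = self.conn.execute(
            "SELECT id, name FROM customers WHERE id = ?", (id,)
        ).fetchone()
        return Customer(*row)

conn = sqlite3.connect(":memory:")
mapper = CustomerMapper(conn)
mapper.save(Customer(1, "Ann"))
print(mapper.find(1))  # the application code above never writes SQL itself
```

Swapping the mapper's SQL for another engine's dialect would leave the application code untouched, which is the decoupling the paragraph above describes.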
This feature is very attractive for the backend programmer since it avoids the need to know the implementation details of a database. For the database designer, ORMs are not as appealing since they take away visibility into how the application interacts with the database.
As a software engineer, it is often your responsibility to arbitrate between these two opposing forces. Database designers want to resolve all the interactions between the application and the database with views, stored procedures, triggers, SQL scripts, jobs, and other tools of their choice. Backend programmers, on the other hand, want database designers to limit themselves to defining tables and indexes. They want everything else to be solved without leaving the realm of the source code in which they live.
It’s hard for me as a database designer to be impartial. But as a software engineer, you must be impartial and apply the right criteria to determine when ORMs are a solution and when they are a problem.
As I have said, ORMs seek to have all the logic of an application defined in the source code. In addition to decoupling the application from the database engine, the ORM ensures that a change in the application logic affects only the source code. If the application is installed in many places, upgrading it is simply a matter of distributing the new version to all these places, which can even be done automatically and at a minimal cost.
If, on the other hand, the application logic is implemented in database objects, upgrading the application to a new version may be very risky if it is installed in many places. No matter how careful you are in generating the upgrade scripts, you never know for sure what effect they will have on the databases. Some data may cause an upgrade script to fail, and then you have to spend time analyzing and resolving the failure, upgrading the application manually, and making sure the failed upgrade has not damaged the data.
If the application and the database are both installed in a single location and the version upgrade is done by deploying to that single location, the situation changes radically. In that case, it does not matter how many tables, views, stored procedures, or functions you have to update in the database. You can generate scripts and test them thoroughly until you are absolutely sure they work in the production database without causing any failure.
Not using ORMs has another advantage: it allows you to use all kinds of database engine tools to optimize performance and to debug database operations. This greatly reduces the costs and efforts required for application and database maintenance.
These Are Important Database Concepts for a Software Engineer, but Avoid Micromanagement
We’ve now seen all the important database concepts for a software engineer to master. Ideally, if the development team is to perform at its best and operate in a healthy climate, the engineer never needs to use this knowledge on a day-to-day basis. There may be exceptions, such as when the SQL programmer, the DBA, or the data modeler is on vacation or calls in sick.
If the development team includes people suited for working with databases, it is good for them to know the software engineer has mastered the essential database concepts. That way, they know there is a safety net to prevent errors from reaching the final product and the user’s hands if they miss something. Everyone works with less pressure and performs better. This is especially important to keep in mind while working with remote teams, which may include developers in Latin America, Eastern Europe, and other international tech hubs.
Good team leaders trust their team members and let them do what they do best. So, limit your oversight of the database people to a general review of the finished deliverables to catch any unintentional errors. And by all means, avoid looking over their shoulders to see if they are defining a primary key correctly, writing a SELECT correctly, or employing transaction handling properly to avoid deadlocks. Remember, this knowledge is to help the team work better together, not to create friction!