Relational Database Normalization 101

Database Normalization is something that needs to be taken into consideration whether the project is a simple personal blog or a fully-featured e-commerce site; the more complicated the project, the more important it is to follow normalization best practices. Although this concept sounds a little bit hard to grasp for the juniors, it can easily be understood through a couple of real-world examples and this is what I want to cover in this post.  

What Is Database Normalization?

The process of designing a relational database such as MySQL in a way that can be managed easily by either a database administrator or an application is Database Normalization which was introduced by Edgar Codd a few years ago. This process eliminates redundancy by dividing tables into smaller ones and linking them using relationships. If this process is achieved properly, it leads to a couple of advantages such as:

- Absence of any anomaly (In paragraphs ahead I'll go through them.)
- More efficiency 
- Data accuracy
-  Deduction in data storage space

No need to say that a non-normalized database is inefficient, incorrect, and messed-up!

Impacts of A Bad Database Design

A non-perfect database design almost results in a couple of anomalies as follows:

- Insert anomaly: If a database isn't normalized, a specific piece of data will be stored in more than one place.

- Update anomaly: If data is stored in an improper manner, it could lead to an inconsistent state. For example, when we want to update something, it probably results in updating just a portion of the intended data and a few instances won't get updated.

- Delete anomaly: When we try to delete some data in a database which is not well-structured, we could possibly leave some data undeleted because of unawareness. This anomaly could also happen when we need to delete data but at the end of the day we end up deleting more than when we wanted to.

For a database to be tagged as normalized, we need to apply a couple of rules and any of these rules transform our data to a certain state called Normal Form. All in all, there are six normal forms (at the time of writing this blog post) but the first three of them are the most common ones which will be covered in this tutorial.

First Normal Form

The First Normal Form (aka 1NF) is the first and foremost step in making a database design normalized and must be achieved before any other forms. In short, 1NF defines that all columns in a table must have atomic values; in other words, the values must be indivisible units. Furthermore, each row must be unique. To get our hands dirty, let's dive into an example:

+-------------+------------------+------------+----------------------------------+--------------+
| name        | languages        | dob        | address                          | company_name |
+-------------+------------------+------------+----------------------------------+--------------+
| John Fisher | PHP, JS          | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
| Mary Moore  | Java, Kotlin, Go | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
+-------------+------------------+------------+----------------------------------+--------------+

In the above schema, a table called developers is created which includes a couple of fields (columns) such as name, languages, and so on. This table isn't defined as normalized based on 1NF. Let's make an example to elaborate my claim more. What if another developer named "John Fisher" is added to this table? This way, we cannot say that every single row in this table is unique and for this reason, we need to add another column to play the role of a unique field which is called primary key:

+----+-------------+------------------+------------+----------------------------------+--------------+
| id | name        | languages        | dob        | address                          | company_name |
+----+-------------+------------------+------------+----------------------------------+--------------+
|  1 | John Fisher | PHP, JS          | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
|  2 | Mary Moore  | Java, Kotlin, Go | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
+----+-------------+------------------+------------+----------------------------------+--------------+

From now on, each and every row in this table is unique because another field called id is added which plays the role of a primary key. In fact, by id we can easily identify records no matter how many developers under the title of "John Fisher" we have due to the fact that their ids are unique. 

The other point that must be applied base on 1NF is that all fields in our table need to have atomic values; in other words, the values in an atomic field are indivisible units. For example, the values of the languages field aren't atomic at all because the programming languages can be separated. To solve this problem, let's make a change in our records as follows:

+----+-------------+-----------+------------+----------------------------------+--------------+
| id | name        | languages | dob        | address                          | company_name |
+----+-------------+-----------+------------+----------------------------------+--------------+
|  1 | John Fisher | PHP       | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
|  2 | Mary Moore  | Java      | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
|  3 | John Fisher | JS        | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
|  4 | Mary Moore  | Kotlin    | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
|  5 | Mary Moore  | Go        | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
+----+-------------+-----------+------------+----------------------------------+--------------+

Although the values stored in languages fields are now atomic, we'll have lots of duplicate rows and this is what can be tackled by 2NF (When it comes to the name column, there are arguments whether combining both first name and last name together is atomic or not but for the sake of simplicity, I'd assume it atomic.)

Second Normal Form

Before moving on to the Second Normal Form or 2NF, we need to make sure that 1NF is fully-applied. In a nutshell, 2NF states that each non-key attribute must be functionally dependent on the primary key and no partial dependency should exist. As this definition sound sort of too hard to grasp, let's ask a couple of questions to make it easier to understand:

- Is the name field dependent on the primary key? Yes.
- Is the languages field dependent on the primary key? No; because more than one developer can code with a certain language.
- Is the dob field dependent on the primary key? Yes.
- Is the address field dependent on the primary key? Yes.
- Is the company_name field dependent on the primary key? No; because a developer may work in more than just one company.

To apply the concepts of 2NF, first of all, we need to create another table called languages:

+----+--------+
| id | name   |
+----+--------+
|  1 | PHP    |
|  2 | JS     |
|  3 | Java   |
|  4 | Kotlin |
|  5 | Go     |
+----+--------+

This table consists of just two fields and guarantees that the names of all programming languages are store in just one place which can easily be updated without impacting other tables. Now, let's make a change in the schema of the developers table to link these two tables together:

+----+-------------+-------------+------------+----------------------------------+--------------+
| id | name        | language_id | dob        | address                          | company_name |
+----+-------------+-------------+------------+----------------------------------+--------------+
|  1 | John Fisher | 1           | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
|  2 | Mary Moore  | 3           | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
|  3 | John Fisher | 2           | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
|  4 | Mary Moore  | 4           | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
|  5 | Mary Moore  | 5           | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
+----+-------------+-------------+------------+----------------------------------+--------------+

The languages field name is changed to language_id to be more semantic and this field, so to speak, is a foreign key. In fact, a foreign key is a field in one table that refers to the primary key in another table; in other words, the id field in languages table is a primary key which is referred to by the language_id field in developers table as a foreign key. In short, a foreign key is used to link a row in one table to another row in another table based on its unique identifier which in this case is the primary key.

Let's recall the No replies to our beforehand questions. Although we have already made some changes in our design, we still don't comply with 2NF because there are two fields (language_id and company_name) that aren't specific to every single developer; in other words, a programming language can be assigned to more than just one developer and a developer can work in more than just one company or vice versa.

Relationships like this are called Many to Many because one developer can code with many programming languages, and at the same time,  a programming language can belong to more than one developer (It's true for the relationship between developers and companies as well).

In situations like this, we need to know a new concept called a joining table (aka pivot table or linking table) which is responsible for storing the relationships between the two tables. A pivot table such as developers_languages usually has two fields as follows:

+--------------+-------------+
| developer_id | language_id |
+--------------+-------------+
|            1 |           1 |
|            1 |           2 |
|            2 |           3 |
|            2 |           4 |
|            2 |           5 |
+--------------+-------------+

The developer_id is a foreign key to the developers table and language_id is a foreign key to the languages table. To interpret the data stored in this table, we can infer that:

- Developer with an id of 1 (John Fisher) codes with a language with an id of 1 (PHP)
- Developer with an id of 1 (John Fisher) codes with a language with an id of 2 (JS)
- Developer with an id of 2 (Mary Moore) codes with a language with an id of 3 (Java)
- Developer with an id of 2 (Mary Moore) codes with a language with an id of 4 (Kotlin)
- Developer with an id of 2 (Mary Moore) codes with a language with an id of 5 (Go)

Now, we can safely delete redundant data in developers table which is down-sized to:

+----+-------------+------------+----------------------------------+--------------+
| id | name        | dob        | address                          | company_name |
+----+-------------+------------+----------------------------------+--------------+
|  1 | John Fisher | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 | Amazon       |
|  2 | Mary Moore  | 1990-10-21 | 16, Eagle St., Newmarket, 22369  | Google       |
+----+-------------+------------+----------------------------------+--------------+

As you can see, the language_id field is removed plus duplicate rows are deleted as well because in our linking table called developers_languages we have stored all related data.

This process must be exactly followed for the company_name field in developers table for a couple of reasons (I've already said these reasons but for the sake of better learning, I'll elaborate once more). The company that a developer works in has nothing to do with the primary key of a row in the developers table; that's why we need to extract these data to another table and as the relationship is Many to Many, we also need a linking table. To begin, let's create a table called companies as follows:

+----+--------------+
| id | company_name |
+----+--------------+
|  1 | Amazon       |
|  2 | Google       |
+----+--------------+

As our next step, we need a linking table with whatever name we want such as developers_companies table:

+--------------+------------+
| developer_id | company_id |
+--------------+------------+
|            1 |          1 |
|            2 |          2 |
+--------------+------------+

The above resultset from developers_companies table means that a developer with id of 1 (John Fisher) works in a company with id of 1 (Amazon) and a developer with id of 2 (Mary Moore) works in a company with id of 2 (Google). The next step is to remove the company_name field from developers table:

+----+-------------+------------+----------------------------------+
| id | name        | dob        | address                          |
+----+-------------+------------+----------------------------------+
|  1 | John Fisher | 1984-03-16 | 184, Avenue Rd., Brampton, 11258 |
|  2 | Mary Moore  | 1990-10-21 | 16, Eagle St., Newmarket, 22369  |
+----+-------------+------------+----------------------------------+

In fact, if we want to get to know which company a specific developer works in, we just need to join developers_companies then join companies table to get the name of the company.

Third Normal Form

The Third Normal Form, which is abbreviated as 3NF, basically is the final phase in the normalization process which works just fine for most applications. 3NF deals with the fact that there shouldn't be any transitive dependency. Non-technically speaking, the cities which are stored in the address field are identified by both the primary key of each developer and the zip code as well. This is called transitive dependency. So to comply with 3NF, we need to create another table called zip_codes as follows:

+----+-------+-----------+
| id | code  | city      |
+----+-------+-----------+
|  1 | 11258 | Brampton  |
|  2 | 22369 | Newmarket |
+----+-------+-----------+

This way, we can easily handle more than just one district which has to do with a specific zip code. Now, let's change the developers table as follows:

+----+-------------+------------+-----------------+--------+
| id | name        | dob        | address         | zip_id |
+----+-------------+------------+-----------------+--------+
|  1 | John Fisher | 1984-03-16 | 184, Avenue Rd. |      1 |
|  2 | Mary Moore  | 1990-10-21 | 16, Eagle St.   |      2 |
+----+-------------+------------+-----------------+--------+

A new field called zip_id is added which relates each row in the developers table to a zip code in zip_codes table and the zip code in address field is removed, Before making these changes, the city name in the address field could be identified by both a developer primary key (id) and zip code which violated 3NF but after applying the above changes, the city name can be identified just by the zip_id foreign key.

Most DBAs stop at this phase because for most projects this level of normalization suffices.

by Behzad Moradi on 2019-10-27