Database Normalization Process

Why are tables designed this way?

Yat Man, Wong
14 min readJun 2, 2022
Tables… tables everywhere

Introduction to database concepts

  • Candidate keys are a minimum sets of attributes that can uniquely identify a row in a table
  • Primary key is one of the Candidate key chosen to uniquely identify a row in the table
  • Super key is a Candidate key + other attributes
  • All Candidate keys are Super keys. But the vice-versa is not true
  • There can be multiple Super keys and Candidate keys in a table, but there can be only one Primary key in a table
  • Alternate keys are those Candidate keys that were not chosen to be the Primary key of the table
  • Composite key is a Candidate key that consists of more than one attribute
  • Foreign key is an attribute which is a Primary key in its parent table but is included as an attribute in the host table
  • Prime Attributes are attributes that make up of a candidate key. Non prime Attributes are not included in any candidate key

Normalization Process

What is Normalization?

The process of removing redundant data by taking complex entities and extract simpler entities from them.

Why is it important?

  • eliminate duplicate data
  • avoid unnecessary coding to maintain and keep the data in sync
  • avoid having complex tables that is hard to query and modify
  • smaller tables lower the number of indexes per table

Topics:

  • First Normal Form
  • Second Normal Form
  • Third Normal Form
  • Boyce-Codd Normal Form
  • Fourth Normal Form
  • Fifth Normal Form

First Normal Form

All attribute must be atomic

Consider whether you would ever need to deal with part of a column without the other parts of the data in the same column

  • If we have a column like an address that can logically be broken down into parts but we never need a part individually then it is acceptable to store the data in one column
  • examples: email, phone numbers, names, mailing address
Person table

Programming Anomalies

  • Modifying Lists in a single column (store multiple emails/permission into the same column then update/delete will be difficult)
  • Modifying Multipart Values (update area code in the address)

The fix is to break into simpler entities

All instances in an Entity Must Contain the Same Number of attribute

  • no repeating columns
  • If there are optional fields like payment1, payment2; this is an attempt to allow multiple values into a single attribute
Personal Payment table

Programming Anomalies

Dealing with a variable number of facts in an instance (payment1, payment2)

  • the query to insert need to check which column is null
  • add payment1Status and payment2Status column?
  • update/ delete had to determine which payment slot
  • what if there is a payment3?

Strong cue that you need a new table for this repeating group

All occurrences of an entity Type in an Entity Must Be Different

(don’t store 2 rows of the same thing)

  • The purpose is no two instances(row) represent the same thing
  • Adding a different artificial key into the same entity would comply with 1NF but defeat the purpose
  • Except using time as part of the logical identification for logging event is ideal

Programming Anomalies

Don’t store duplicate data! You don’t want to update multiple rows or have inconsistent data!

Clues that an existing design is not in first normal form

  • string data that contains separator-type characters (data is likely a multivalued attribute)
  • Attribute names with numbers at the end
  • tables with no or poorly defined keys (not doing a good job keeping the row values unique)

Second Normal Form

source: wiki
  • entity must be in First Normal form
  • cannot have a non-prime attribute that depends on part of any candidate key (no partial dependency)

A non-prime attribute is an attribute that is not a part of any candidate key of the relation

Second normal form is only relevant in tables with composite candidate keys

Example that a table is in 1NF but not 2NF:

describe author’s earning from writing a book

BookTitle is not unique because there could be co-author for the same book. {AuthorSSN, BookTitle} doesn’t necessarily be unique if we allow books with the same names.

Therefore a candidate key is {AuthorSSN, BookIsbn} pair to uniquely identify a row. ({AuthorSSN, BookTitle} could also be unique unless we allow the same author to write 2 books with the same name)

The RoyaltyPercentage attribute defines the earning author is receiving for the book, so this refers to the entire key.

However, AuthorFirstName/LastName/country only describe AuthorSSN, and BookTitle only describe BookIsbn. So this is a violation.

Programming Anomalies Avoided by Second Normal Form

  • cannot represent a author without identifying a book
  • if one author writes more than one books, this author’s info is duplicated in many rows
  • if we need to update an author info, we need to find all books this author write and update all rows
  • If we have a book co-authored by many authors, to update this book’s info, we need to find all authors of this books and update all rows
  • if we want to delete a book and keeps the author around, set the book to null? but what if the author write 2 different books and deleted both books now you can have identical rows for authors
  • all the above problem require all operation carefully coded to ensure data are consistent

Solution

Break into 3 tables

Understand functional dependency

Most table will have an artificial id column as primary key. Now we have {AuthorSSN, BookIsbn}, {id} as candidate key. Would that make any table with id into 2NF or break the table to not be in 2NF anymore?

Same table above with id

To understand this, we need to write out the functional dependency:

id is just a unique representation of the candidate key {AuthorSSN, BookIsbn}

{AuthorSSN, BookIsbn} → id ✅

id → {AuthorSSN, BookIsbn} ✅

{AuthorSSN, BookIsbn} → RoyaltyPercentage ✅

id → RoyaltyPercentage ❌ ← this is the one we care for 2NF

RoyaltyPercentage → id ❌

If we created a new pair of {AuthorSSN, BookIsbn}, we need to assign it a new id. Given {AuthorSSN, BookIsbn} we can tell the id, because the id is something we manually make up (system auto assign).

If we are only given an id can we find the matching {AuthorSSN, BookIsbn}? Maybe, there is no logical relation.

If we don’t like the id 13 in for that entry and changed to 14, do we need to update the RoyaltyPercentage? no. If we updated RoyaltyPercentage do we need to change the id? No. We don’t have a direct dependency between RoyaltyPercentage and id at all.

This shows we should think in term of candidate key (behind every primary id there is a candidate key it is representing). This also explain why we can leave id out when going through examples.

Relationship on multiple candidate keys

Notice the official definition for 2NF is cannot partially depend on any candidate key. Next we look at an example of why 2NF requires a non-prime attribute to satisfy all candidate key.

Relation={A,B} {C,D} E

{A,B} and {C,D} are both candidate keys. E is fully dependent on {A,B} but not {C,D}, is it possible? No. The goal of 2NF is to prevent E from getting duplicated, and that cannot happen because there will not be a different row with A,B (candidate key)

This is in 2NF

the second row can’t happen because AB pair is duplicated

But if we have

Relation={A,B} {C,D} E

E is dependent on A but not B, doesn’t even need to check about {C,D} it is already a 2NF violation

because you can have A with different combination of B to create duplicate of E

Real example of the above ^

Table with multiple candidate keys are hard to come up with, so we do need to manually add in some constraint.

See how 2000 is duplicated for every 30 Warwick St at 2020?

We do not allow the same action to apply on the same building with overlapped years. If we are already installing solar on 40 bank street from 2020–2025, it does not make sense to have another action to install solar again from 2020–2023. It is the same action. Similarly, we cannot install solar again on the same building from 2021–2025.

Then the followings are unique:

{actionName, buildingName, actionStartYear}, {actionName, buildingName, actionEndYear}

functional dependency

{buildingName, actionStartYear} → businessAsUsualEmission

businessAsUsualEmission has no partial dependency on {actionName, buildingName, actionEndYear}

The businessAsUsualEmission does not depend on actionName. It is the same value for all types of action. Therefore this is a partial dependency on {actionName, buildingName, actionStartYear}.

Doesn’t matter how many candidates there are, as long as there is 1 partial dependency. It violated 2NF and we can start getting duplicate data on this attributes.

To see the consequence when violating, see the last row in the table

This is why the definition is if we have multiple candidate keys, for all non-prime attribute, for all candidate key, either fully dependent on the candidate key or does not depend on any part of the candidate key.

Clues that an entity is not in second normal form

  • repeating key attribute prefixes
  • repeating groups of data
  • composite keys without a foreign key reference
  • fun fact: some old database choose to keep the problem data, rather than remodeling the table into proper structure because of hardware limitation and performance reasons, but it is rare now

Summery:

When you have a table with composite primary keys {a,b}, avoid having a non-prime attribute {c} that depends on {b}, because {b} can combine with different values of {a} can create many duplicate rows with {c}

Most tables don’t have to think about this because they don’t have composite primary key

Third Normal Form

source: wiki
  • must be in second normal form
  • Every non-prime attribute of R is non-transitively dependent on every key of R (non-key attributes cannot be dependent on other non-key attributes)

Example in 2NF but not 3NF

isbn can uniquely identify a row {isbn}, it is in 2NF because there is no composite candidate key. Not in 3NF because PublisherCity, PublisherContact are dependent on PublisherName

There is a transitive dependency, isbn → publisherName → publisherCity

Programming Anomalies Avoided by Third Normal Form

  • cannot represent a publisher without first identifying a book
  • if one publisher publish more than one books, this publisher’ info is duplicated in many rows
  • if we need to update the publisher info, we need to find all books this publisher publish and update all rows
  • if you want to delete a book and keeps the publisher around, set the book to null? but what if the publisher publish 2 different books and deleted both books now you can have identical rows for publisher
  • all the above problem require all operation carefully coded to ensure data are consistent

Different from Second Normal Form

2NF prevents non-prime attributes from partially dependent on any candidate keys, but a non-prime attribute can still dependent on another non-prime attribute. 3NF focus on eliminating this dependency altogether.

Solution

Boyce-Codd normal form (BCNF)

source: wiki

With the exception of trivial dependencies, every functional dependency in a table must be a dependency on a candidate key (or a superset of a candidate key)

left side of any FD in the table must be a superkey (or a superset of a candidate key)

Different from Third Normal Form

2NF/3NF only apply to non-prime attributes, there is a loophole that a prime-attribute can be dependent on other non-prime attribute(s)

Only applies when a table have overlapping candidate keys. It is not called a fourth-normal form because it is a stronger replacement for both second and third normal forms

An example of a 3NF table that does not meet BCNF:

when you have a primary attribute depend on a non-candidate key

When authors want to publish books to different countries, they usually have multiple publishers publish the same book. Lets add in some constraint, a publisher only operate in a country and there is no need to have multiple publishers’ service in the same country.

candidate keys are:

{isbn, publishCountry}

{isbn, publisherName}

There is now a functional dependency PublisherName→PublisherCountry, given the publisherName you can tell the country.

This comply with 3NF because PublisherCountry is a prime attribute. It doesn’t comply with BCNF because in fucntion dependency PublisherName→PublisherCountry, PublisherName by itself is not a candidate key.

Programming Anomalies Avoided by BCNF

Prevent inconsistent data, look at last row, PublisherName Apress cannot be from both US and England at the same time!

Solution

Move publisherName & publisherCountry to a different table

Fourth Normal Form

4NF deals with attribute stored as a single value in a column but in reality the attributes contains multiple values.

source: wiki

A table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies, X ↠ Y, X is a superkey

A multivalued dependency exists when there are at least three attributes (like X,Y and Z) in a relation and for a value of X there is a well defined set of values of Y and a well defined set of values of Z. However, the set of values of Y is independent of set Z and vice versa.

The only kinds of multivalued dependencies allowed in a table are multivalued dependencies on the key

Example1 Obvious case

An organization can have many buildings and many members, they both exist regardless of the others

organization↠building

organization↠member

It is in 2NF & 3NF because there is no non-prime attribute

It is in BCNF because there is no prime-attribute depends on a non-prime attribute

It violated the definition of 4NF because organization is not a superkey

But what is the problem?

we can now store redundant data

If we add another member to organization 1, we add new rows = number of buildings

If we add another column to the table (userGroup[admin, standard]), we add new rows = number of buildings * member * userGroup

Example 2 Ternary Relationship

What permission does a user have in a role

Double arrow means it is a multivalue dependency. A member can have many roles, and a role can have many permissions.

role↠member

role↠permission

Therefore we need all 3 columns to be candidate key. It is BCNF since it doesn’t have any non-prime attributes. But it breaks 4NF because role alone is not a candidate key (or super key).

You can reason through this relationship by saying: given a role, we can find a list of members belong to that role regardless of what permissions are included in that role.

Similarly, given a role, we can find a list of permissions that are defined for that role, regardless of what members are assigned to that role.

Programming Anomalies Avoided by 4NF

The problem with this table is we have represented an admin can invite user twice. In fact duplicated for every new member with admin role.

Solution

We recognized both columns depends on role, so we break into role and 2 relation tables

Distinguish from 4NF

Not all tables with 3 columns are violating 4NF!

if we have a table to store what building belong to a scenario

candidate key is {organization, scenario, buildings}, it is in BCNF

The different here is {organization, scenario} → building, not organization↠building

we don’t have to worry about duplicate data, because it represents a new relation of a building now belong in a different scenario.

Fifth Normal Form

Normal forms beyond 4NF are mainly of academic interest, as the problems they exist to solve rarely appear in practice! 90% of the time in real life, you will find that if you’ve reached BCNF, you are also in 5NF

A table is said to be in the 5NF if and only if every non-trivial join dependency in that table is implied by the candidate keys

All relationships are broken down to binary relationships when the decomposition is lossless.

5NF is saying if you can break the table into smaller tables without losing information. You should. (It must not be possible to describe the table as being the logical result of joining some other tables together)

Recognize when a table is already in 5NF

User role in organization table

The permission of a role is dependent on the organization, in organization 1, a standard role can delete while in organization 2 it cannot.

candidate key: {organization, role, permission}

Automatically BCNF because no non-prime attribute

It is in 4NF because {organization, role} → permission

organization↠role ✅

organization↠permission ❌ need role

role↠organization✅ You might find all the organization that have “admin”

role ↠permission ❌ need organization

It is not a multivalued dependency

If we do break down the previous table into smaller tables

We can infer that:

organization 1 has admin role

organization 1 can delete

admin role can delete

Organization 1’s admin can delete ❌

Incorrect, organization 2’s admin can delete

Since we cannot break down this table, the original table is already in 5NF.

5NF ensure that any ternary relationships that still exist in 4NF that can be decomposed into entities are decomposed. If it is not possible to break down the table without losing information, it is in 5NF.

Programming Anomalies Avoided by 5NF

Finally, why we need a 5NF? What problems does it address that 4NF doesn’t?

5NF is trying to eliminate any use case that if we want to add a new role “developer” that would require us to insert multiple rows into the table. But for some reason we missed some rows that would be a huge problem.
If we only have small tables, we just need to add 1 row to the correct table.

I am unable to come up with an example without directly copying others. But this video have an example.

https://www.youtube.com/watch?v=GFQaEYEc8_8&t=1537s

Reference

Pro SQL Server 2008, Louis Davidson (helpfulness 4/10)

https://www.youtube.com/watch?v=GFQaEYEc8_8&t=1537s (helpfulness 8/10)

https://www.youtube.com/watch?v=mbj3HSK28Kk (helpfulness 6/10)

--

--

Yat Man, Wong

Android developer, problem solver, real man in training