Database Normalization Process
Why are tables designed this way?
Introduction to database concepts
- Candidate keys are a minimum sets of attributes that can uniquely identify a row in a table
- Primary key is one of the Candidate key chosen to uniquely identify a row in the table
- Super key is a Candidate key + other attributes
- All Candidate keys are Super keys. But the vice-versa is not true
- There can be multiple Super keys and Candidate keys in a table, but there can be only one Primary key in a table
- Alternate keys are those Candidate keys that were not chosen to be the Primary key of the table
- Composite key is a Candidate key that consists of more than one attribute
- Foreign key is an attribute which is a Primary key in its parent table but is included as an attribute in the host table
- Prime Attributes are attributes that make up of a candidate key. Non prime Attributes are not included in any candidate key
Normalization Process
What is Normalization?
The process of removing redundant data by taking complex entities and extract simpler entities from them.
Why is it important?
- eliminate duplicate data
- avoid unnecessary coding to maintain and keep the data in sync
- avoid having complex tables that is hard to query and modify
- smaller tables lower the number of indexes per table
Topics:
- First Normal Form
- Second Normal Form
- Third Normal Form
- Boyce-Codd Normal Form
- Fourth Normal Form
- Fifth Normal Form
First Normal Form
All attribute must be atomic
Consider whether you would ever need to deal with part of a column without the other parts of the data in the same column
- If we have a column like an address that can logically be broken down into parts but we never need a part individually then it is acceptable to store the data in one column
- examples: email, phone numbers, names, mailing address
Programming Anomalies
- Modifying Lists in a single column (store multiple emails/permission into the same column then update/delete will be difficult)
- Modifying Multipart Values (update area code in the address)
The fix is to break into simpler entities
All instances in an Entity Must Contain the Same Number of attribute
- no repeating columns
- If there are optional fields like payment1, payment2; this is an attempt to allow multiple values into a single attribute
Programming Anomalies
Dealing with a variable number of facts in an instance (payment1, payment2)
- the query to insert need to check which column is null
- add payment1Status and payment2Status column?
- update/ delete had to determine which payment slot
- what if there is a payment3?
Strong cue that you need a new table for this repeating group
All occurrences of an entity Type in an Entity Must Be Different
(don’t store 2 rows of the same thing)
- The purpose is no two instances(row) represent the same thing
- Adding a different artificial key into the same entity would comply with 1NF but defeat the purpose
- Except using time as part of the logical identification for logging event is ideal
Programming Anomalies
Don’t store duplicate data! You don’t want to update multiple rows or have inconsistent data!
Clues that an existing design is not in first normal form
- string data that contains separator-type characters (data is likely a multivalued attribute)
- Attribute names with numbers at the end
- tables with no or poorly defined keys (not doing a good job keeping the row values unique)
Second Normal Form
- entity must be in First Normal form
- cannot have a non-prime attribute that depends on part of any candidate key (no partial dependency)
A non-prime attribute is an attribute that is not a part of any candidate key of the relation
Second normal form is only relevant in tables with composite candidate keys
Example that a table is in 1NF but not 2NF:
BookTitle is not unique because there could be co-author for the same book. {AuthorSSN, BookTitle} doesn’t necessarily be unique if we allow books with the same names.
Therefore a candidate key is {AuthorSSN, BookIsbn} pair to uniquely identify a row. ({AuthorSSN, BookTitle} could also be unique unless we allow the same author to write 2 books with the same name)
The RoyaltyPercentage attribute defines the earning author is receiving for the book, so this refers to the entire key.
However, AuthorFirstName/LastName/country only describe AuthorSSN, and BookTitle only describe BookIsbn. So this is a violation.
Programming Anomalies Avoided by Second Normal Form
- cannot represent a author without identifying a book
- if one author writes more than one books, this author’s info is duplicated in many rows
- if we need to update an author info, we need to find all books this author write and update all rows
- If we have a book co-authored by many authors, to update this book’s info, we need to find all authors of this books and update all rows
- if we want to delete a book and keeps the author around, set the book to null? but what if the author write 2 different books and deleted both books now you can have identical rows for authors
- all the above problem require all operation carefully coded to ensure data are consistent
Solution
Understand functional dependency
Most table will have an artificial id column as primary key. Now we have {AuthorSSN, BookIsbn}, {id} as candidate key. Would that make any table with id into 2NF or break the table to not be in 2NF anymore?
To understand this, we need to write out the functional dependency:
id is just a unique representation of the candidate key {AuthorSSN, BookIsbn}
{AuthorSSN, BookIsbn} → id ✅
id → {AuthorSSN, BookIsbn} ✅
{AuthorSSN, BookIsbn} → RoyaltyPercentage ✅
id → RoyaltyPercentage ❌ ← this is the one we care for 2NF
RoyaltyPercentage → id ❌
If we created a new pair of {AuthorSSN, BookIsbn}, we need to assign it a new id. Given {AuthorSSN, BookIsbn} we can tell the id, because the id is something we manually make up (system auto assign).
If we are only given an id can we find the matching {AuthorSSN, BookIsbn}? Maybe, there is no logical relation.
If we don’t like the id 13 in for that entry and changed to 14, do we need to update the RoyaltyPercentage? no. If we updated RoyaltyPercentage do we need to change the id? No. We don’t have a direct dependency between RoyaltyPercentage and id at all.
This shows we should think in term of candidate key (behind every primary id there is a candidate key it is representing). This also explain why we can leave id out when going through examples.
Relationship on multiple candidate keys
Notice the official definition for 2NF is cannot partially depend on any candidate key. Next we look at an example of why 2NF requires a non-prime attribute to satisfy all candidate key.
Relation={A,B} {C,D} E
{A,B} and {C,D} are both candidate keys. E is fully dependent on {A,B} but not {C,D}, is it possible? No. The goal of 2NF is to prevent E from getting duplicated, and that cannot happen because there will not be a different row with A,B (candidate key)
This is in 2NF
But if we have
Relation={A,B} {C,D} E
E is dependent on A but not B, doesn’t even need to check about {C,D} it is already a 2NF violation
because you can have A with different combination of B to create duplicate of E
Real example of the above ^
Table with multiple candidate keys are hard to come up with, so we do need to manually add in some constraint.
We do not allow the same action to apply on the same building with overlapped years. If we are already installing solar on 40 bank street from 2020–2025, it does not make sense to have another action to install solar again from 2020–2023. It is the same action. Similarly, we cannot install solar again on the same building from 2021–2025.
Then the followings are unique:
{actionName, buildingName, actionStartYear}, {actionName, buildingName, actionEndYear}
functional dependency
{buildingName, actionStartYear} → businessAsUsualEmission
businessAsUsualEmission has no partial dependency on {actionName, buildingName, actionEndYear}
The businessAsUsualEmission does not depend on actionName. It is the same value for all types of action. Therefore this is a partial dependency on {actionName, buildingName, actionStartYear}.
Doesn’t matter how many candidates there are, as long as there is 1 partial dependency. It violated 2NF and we can start getting duplicate data on this attributes.
To see the consequence when violating, see the last row in the table
This is why the definition is if we have multiple candidate keys, for all non-prime attribute, for all candidate key, either fully dependent on the candidate key or does not depend on any part of the candidate key.
Clues that an entity is not in second normal form
- repeating key attribute prefixes
- repeating groups of data
- composite keys without a foreign key reference
- fun fact: some old database choose to keep the problem data, rather than remodeling the table into proper structure because of hardware limitation and performance reasons, but it is rare now
Summery:
When you have a table with composite primary keys {a,b}, avoid having a non-prime attribute {c} that depends on {b}, because {b} can combine with different values of {a} can create many duplicate rows with {c}
Most tables don’t have to think about this because they don’t have composite primary key
Third Normal Form
- must be in second normal form
- Every non-prime attribute of R is non-transitively dependent on every key of R (non-key attributes cannot be dependent on other non-key attributes)
Example in 2NF but not 3NF
isbn can uniquely identify a row {isbn}, it is in 2NF because there is no composite candidate key. Not in 3NF because PublisherCity, PublisherContact are dependent on PublisherName
There is a transitive dependency, isbn → publisherName → publisherCity
Programming Anomalies Avoided by Third Normal Form
- cannot represent a publisher without first identifying a book
- if one publisher publish more than one books, this publisher’ info is duplicated in many rows
- if we need to update the publisher info, we need to find all books this publisher publish and update all rows
- if you want to delete a book and keeps the publisher around, set the book to null? but what if the publisher publish 2 different books and deleted both books now you can have identical rows for publisher
- all the above problem require all operation carefully coded to ensure data are consistent
Different from Second Normal Form
2NF prevents non-prime attributes from partially dependent on any candidate keys, but a non-prime attribute can still dependent on another non-prime attribute. 3NF focus on eliminating this dependency altogether.
Solution
Boyce-Codd normal form (BCNF)
With the exception of trivial dependencies, every functional dependency in a table must be a dependency on a candidate key (or a superset of a candidate key)
left side of any FD in the table must be a superkey (or a superset of a candidate key)
Different from Third Normal Form
2NF/3NF only apply to non-prime attributes, there is a loophole that a prime-attribute can be dependent on other non-prime attribute(s)
Only applies when a table have overlapping candidate keys. It is not called a fourth-normal form because it is a stronger replacement for both second and third normal forms
An example of a 3NF table that does not meet BCNF:
when you have a primary attribute depend on a non-candidate key
When authors want to publish books to different countries, they usually have multiple publishers publish the same book. Lets add in some constraint, a publisher only operate in a country and there is no need to have multiple publishers’ service in the same country.
candidate keys are:
{isbn, publishCountry}
{isbn, publisherName}
There is now a functional dependency PublisherName→PublisherCountry, given the publisherName you can tell the country.
This comply with 3NF because PublisherCountry is a prime attribute. It doesn’t comply with BCNF because in fucntion dependency PublisherName→PublisherCountry, PublisherName by itself is not a candidate key.
Programming Anomalies Avoided by BCNF
Prevent inconsistent data, look at last row, PublisherName Apress cannot be from both US and England at the same time!
Solution
Move publisherName & publisherCountry to a different table
Fourth Normal Form
4NF deals with attribute stored as a single value in a column but in reality the attributes contains multiple values.
A table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies, X ↠ Y, X is a superkey
A multivalued dependency exists when there are at least three attributes (like X,Y and Z) in a relation and for a value of X there is a well defined set of values of Y and a well defined set of values of Z. However, the set of values of Y is independent of set Z and vice versa.
The only kinds of multivalued dependencies allowed in a table are multivalued dependencies on the key
Example1 Obvious case
An organization can have many buildings and many members, they both exist regardless of the others
organization↠building
organization↠member
It is in 2NF & 3NF because there is no non-prime attribute
It is in BCNF because there is no prime-attribute depends on a non-prime attribute
It violated the definition of 4NF because organization is not a superkey
But what is the problem?
we can now store redundant data
If we add another member to organization 1, we add new rows = number of buildings
If we add another column to the table (userGroup[admin, standard]), we add new rows = number of buildings * member * userGroup
Example 2 Ternary Relationship
Double arrow means it is a multivalue dependency. A member can have many roles, and a role can have many permissions.
role↠member
role↠permission
Therefore we need all 3 columns to be candidate key. It is BCNF since it doesn’t have any non-prime attributes. But it breaks 4NF because role alone is not a candidate key (or super key).
You can reason through this relationship by saying: given a role, we can find a list of members belong to that role regardless of what permissions are included in that role.
Similarly, given a role, we can find a list of permissions that are defined for that role, regardless of what members are assigned to that role.
Programming Anomalies Avoided by 4NF
The problem with this table is we have represented an admin can invite user twice. In fact duplicated for every new member with admin role.
Solution
We recognized both columns depends on role, so we break into role and 2 relation tables
Distinguish from 4NF
Not all tables with 3 columns are violating 4NF!
if we have a table to store what building belong to a scenario
candidate key is {organization, scenario, buildings}, it is in BCNF
The different here is {organization, scenario} → building, not organization↠building
we don’t have to worry about duplicate data, because it represents a new relation of a building now belong in a different scenario.
Fifth Normal Form
Normal forms beyond 4NF are mainly of academic interest, as the problems they exist to solve rarely appear in practice! 90% of the time in real life, you will find that if you’ve reached BCNF, you are also in 5NF
A table is said to be in the 5NF if and only if every non-trivial join dependency in that table is implied by the candidate keys
All relationships are broken down to binary relationships when the decomposition is lossless.
5NF is saying if you can break the table into smaller tables without losing information. You should. (It must not be possible to describe the table as being the logical result of joining some other tables together)
Recognize when a table is already in 5NF
User role in organization table
The permission of a role is dependent on the organization, in organization 1, a standard role can delete while in organization 2 it cannot.
candidate key: {organization, role, permission}
Automatically BCNF because no non-prime attribute
It is in 4NF because {organization, role} → permission
organization↠role ✅
organization↠permission ❌ need role
role↠organization✅ You might find all the organization that have “admin”
role ↠permission ❌ need organization
It is not a multivalued dependency
If we do break down the previous table into smaller tables
We can infer that:
organization 1 has admin role
organization 1 can delete
admin role can delete
Organization 1’s admin can delete ❌
Incorrect, organization 2’s admin can delete
Since we cannot break down this table, the original table is already in 5NF.
5NF ensure that any ternary relationships that still exist in 4NF that can be decomposed into entities are decomposed. If it is not possible to break down the table without losing information, it is in 5NF.
Programming Anomalies Avoided by 5NF
Finally, why we need a 5NF? What problems does it address that 4NF doesn’t?
5NF is trying to eliminate any use case that if we want to add a new role “developer” that would require us to insert multiple rows into the table. But for some reason we missed some rows that would be a huge problem.
If we only have small tables, we just need to add 1 row to the correct table.
I am unable to come up with an example without directly copying others. But this video have an example.
https://www.youtube.com/watch?v=GFQaEYEc8_8&t=1537s
Reference
Pro SQL Server 2008, Louis Davidson (helpfulness 4/10)
https://www.youtube.com/watch?v=GFQaEYEc8_8&t=1537s (helpfulness 8/10)
https://www.youtube.com/watch?v=mbj3HSK28Kk (helpfulness 6/10)