Relational Databases: Data Normalization
Relational databases need to be designed so that all data is in a state that is called ‘normalized.’ Normalization is the process of splitting all the data categories to be recorded in a database into groups. The categories are called attributes; the bodies they are grouped into are called entities. Normalization is a requirement for the construction of relational databases. This process dictates the design of a relational database. However, it does not always form the optimal solution.
Overview
Imagine creating a database of invoices. Where do you begin? The obvious place to start is by looking at the invoices that currently exist on paper. The headings on that invoice, like Invoice number, Date Item number, Description, etc, will become the headings of each element in the database.
Normalization Process
In the terminology of database design, the data going into a database should be at least in “third normal form.” Normalization is a series of steps and each step dictates a state of the data that the database designer should have achieved. Each step is called a “normal form”: first normal form, second normal form, third normal form, etc.
Steps
For first normal form all repeating groups are split out. A repeating group is a set of data that happens many times for one example of some other data in the group. So, attributes of an invoice should be separated into those attributes that occur only once per invoice, like invoice number or invoice date and those elements that will have many instances per invoice, which are the invoice lines. Second normal form splits out partial key dependencies. The key is a field or combination of fields that will uniquely identify each record. Third normal form separates out attributes that are more dependent on another attribute rather than key. For example, on an invoice line, part description is more dependent on the part number than the invoice line number.
Problems
Normalization requires rigid application of the rules. The designer should not stab a guess at the division of attributes between entities. However, some systems, when the rules are correctly applied, result in a large number of very small entities. The Entities resulting from normalization become the tables in the database and the attributes are the columns in those tables. IF data becomes so fragmented that no single table can provide meaningful data, then the normalization process has created an unworkable database.
Solution
Denormalization improves performance of a database and can reduce a number of small, rarely accessed tables. It involves flattening out the database structure slightly to improve response times. The process of denormalization is a judgment call on the part of database administrators and designers. Usually, problems with the normalization of data only become apparent once the database is populated with data. Denormalization is a database tuning method to overcome the disadvantages of normalization.