Imagine, if you will, that you have been given the following project:
Create a database system for a video rental store to use to keep track of its customers, cutomers' rental histories, and the status of videos (whether or not they are checked out). The rental store has to be able to add both new movie titles and new customers.
In a flat-file database system creating a good database for this situation would be difficult. In a pure flat-file database, you wouldn't have explicit relations among the tables in the database. This forces all related fields to be grouped together in a large table. A flat-file database might have one table with the following columns:
Where there are 1..M columns for the currently checked out video IDs, and 1..N columns for a video history. Using a flat-file database ignores the relations among entities.
Relational databases, on the other hand, are defined by the relations among the entities. (Relational databases include SQL, Access, Informix, Oracle, etc.) You might have a Customer table, a Movies table, and a CheckOut table, which joins the customerID to the movieID.
So, in a relational database you'd set up your relations and be done, right? Well, probably not. Many developers, when developing relational databases, do not fully relationalize their data models. Doing so can lead to data integrity issues and makes updating and modifying the data model that much more difficult.
The action of making your database fully relational is known as normalization. From Inside SQL Server 6.5, "A normalized database eliminates functional dependencies in the data so that updating the database is easy and efficient." (Soukup 662). While normalizing your database is always good, one can over-normalize their data design, making performance an issue. A fully normalized database would have many small tables, and any query would require several joins. These many relations and joins would indeed make updating the data design much easier, but would slow down most queries.
There are multiple degrees of normalization, referred to as first normal form, second normal form, and so on. Performance studies have shown that databases usually perform best when normalized to third normal form. To acheive third normal form, though, you must first acheive first and second normal form, so let's discuss those first! (Remember that these are all sorts of tests you can run against your data design. If you data design meets all of the requirements outlines in first and second normal form, but not third normal form, then your data design would be normalized to second normal form. Since third normal form is desired, you should take steps in your data modeling to ensure that your database exhibits third normal form.)
First Normal Form:
To achieve first normal form, each field in a table must convey unique information. For example, if you had a Customer table with two columns for the telephone number, your design would violate first normal form. First normal form is fairly easy to achieve, since few folks would see a need for duplicate information in a table.
Second Normal Form:
No fields in a table can be derived from another field in that table. For example, let's say you had a birthdate column in your Customer table. There would be no need for an Age column, since you could easily find their age from knowing their birthdate. Second normal form will sometimes trip up even experienced data design developers. It is important to closely look over your columns to make sure that one column cannot be derived from another.
Third Normal Form: (Mecca!)
No duplicate information is permitted throughout the entire database. In our example, you'd want a table named CheckedOut, which would be a list of the videos currently checked out and by whom. A common mistake would be to have this table store information on the Customer (like name, address, customerID, etc.) and on the movie (genre, title, rating, reviews, etc.). Of course, all this information is duplicate, since the customer's name and address are stored in the Customer table and the movie's rating, title, and genre are already stored in the Movie table. To ensure third normal form, you'd need to have your CheckedOut table have just the CustomerID and MovieID (and perhaps a date/time field).
There are, of course, more than just three normal forms. Few developers bother ensuring more forms of normalization, though, because once past third normal form, performance becomes an issue. (Plus ensuring your design meets fourth or fifth normal form can be a daunting task!) In fact, sometimes third normal form can be too normal. The best approach is to normalize your database to third normal form, then tinker with it to determine how to achieve the best performance. Using this approach is superior to starting with an unnormalized database and normalizing it to a certain performance level.
To surmize, normalization is good; normalization is important. Normalizing your databases allows for you to easily update and maintain your data design. Too much normalization, though, can be a performance killer; the best approach is to normalize your database to third normal form, then, if performance is still an issue, tinker with your data design until you are satisfied with the performance.
Good luck, and happy programming!