To read the article online, visit http://www.4GuysFromRolla.com/webtech/110806-1.shtml

Techniques for Uniquely Identifying Database Records

By Scott Mitchell


Introduction
When designing and creating a data model for a data-driven application, the records in tables must each be uniquely identifiable. By having a unique value associated with each record, individual records can be selected, updated, or deleted. Being able to uniquely identify records is so important and standard in a database system that databases allow those designing a database table to specify what column (or columns) make up the primary key. Once a primary key has been specified, the database system ensures that for each record in the table, the primary key value is unique.

In some circumstances, the data being modeled as a column or some set of columns that will inherently have unique values across each record. The canonical example when modeling people (as employees, patients, students, and so on) is to use the social security number, which is a unique identification record used in the United States. Web applications often use the email address as a primary key. Primary keys that have a logical relationship to the data being modeled are referred to as natural keys.

Sometimes, the data being modeled doesn't have a natural key, or the natural key may be a less desirous choice because it can only be obtained by using many columns together as the primary key. In such a case, a surrogate key can be used. A surrogate key is an additional column added to the table whose sole purpose is to serve as a primary key. This might be a Globally Unique Identifier (GUID) or an auto-incremented numeric value (an IDENTITY column, in Microsoft SQL Server parlance, and an AutoNumber in Access parlance).

In this article we will compare and contrast natural keys and surrogate keys. We'll also look at the pros and cons of using GUIDs versus IDENTITY columns when deciding to use a surrogate key. Read on to learn more!

First, My Personal Bias...
Before we delve into the differences between natural keys and surrogate keys, and discussing using GUIDs versus IDENTITIY columns, let me first state my personal habits and preferences. When designing a data model, I nearly always use a surrogate key (even if there is a suitable, single-column natural key) and virtually always use IDENTITY columns rather than GUIDs. I'll detail the rationale behind my choices throughout the article, but before we started I wanted to share this inclination of mine. So if the article seems to prefer or recommend one approach over another, it's because I'm the one writing this article and so my views and experience are the ones I'm putting forth.

Like many areas in computer science, the debate between natural and surrogate keys can border on zealotry. Back in 2001-2003 I went back to school and got a Masters degree in Computer Science with an emphasis on Information Systems and Database Design. I remember one of my professors was very adamant that database entities should always use a natural key and that surrogate keys should be eschewed. Yet in my real-world experience, I've found the proposed downsides of surrogate keys to be overhyped and the potential pitfalls of natural keys to be quite real. The point, though, is that there are people who believe very strongly that natural keys are the right way to do things, and there are others who believe surrogate keys are the only way to go. Personally, I'm not passionate about either approach, but in my experience have found surrogate keys to be the better choice. I hope this article will help you see both sides of the coin and make your own informed decision.

The Pros and Cons of Natural Keys
A natural key is a primary key composed from one or more columns in the table that are directly related to the information being expressed by the entity. For example, if we needed to model an Employee entity, we might create a database table named Employees with columns like FirstName, LastName, SocialSecurityNumber, HireDate, Salary, and so on. Here, the SocialSecurityNumber field would capture the employee's social security number, which is a unique nine-digit number assigned to people in the United States. Therefore, we might decide to use the SocialSecurityNumber field as the table's primary key. This would be an example of a natural key.

Since natural keys use an existing column (or columns) in the table, they do not clutter the table with additional columns (as surrogate keys do), which can lead to improved performance due to reduced overhead. Moreover, there's a more logical relationship between a record's unique identifier and the record's underlying data. That is, if someone says, "Hey, can you write a query that pulls up information about Bob, his social is 123-45-6789," the resulting query is simple and straightforward:

SELECT columnList
FROM Employees
WHERE SocialSecurityNumber = '123-45-6789'

Having decided upon using the SocialSecurityNumber field as the primary key, we continue on in designing the database. In related tables, there are foreign keys using the SocialSecurityNumber field. Over time, the company grows and grows, and soon there are hundreds and then thousands of records in the Employees table, and thousands of related records in tables that use the SocialSecurityNumber field as a foreign key. And then it happens. The company hires someone working in another country who doesn't have a social security number! What now?

For this particular problem, the new employee can apply for a tax ID, which is composed of the same number of digits as a social security number (but, conceptually, problems can arise because primary keys cannot contain NULL values, so what do you do if a new user doesn't have a value for the natural key)? So our new employee now has a tax ID and has that entered into the system in the SocialSecurityNumber field. Over the years, this employee ends up deciding to move to the United States and become a US citizen. Upon doing so, he applies for a social security number and is granted one. Now what do we do? Do we leave the SocialSecurityNumber field set to their original tax ID or change it to their new social security number? If we change it, we have to update all of the related tables, which won't be a walk in the park because they all (hopefully) have foreign key constraints back to the SocialSecurityNumber field. Urg.

While natural keys can consist of a single column - SocialSecurityNumber, in this example - in many cases a natural key can only be found by making a composite primary key that consists of numerous columns. This can complicate queries to the data since all of the fields that participate in the primary key must be included in the WHERE clause to retrieve, update, or delete a particular record. Furthermore, related tables now need composite foreign keys, which can quickly lead to clutter.

As you can probably guess, I am not a big fan of natural keys, but that's not to say that they don't have their time and place. Natural keys make sense, in my opinion, if the following conditions all hold true:

  • The natural key is composed of exactly one column
  • The natural key value will never, ever, not in a million years, never ever ever change. Social security numbers can change. Email addresses can change. If there's even the slightest chance that a natural key value could change, use a surrogate key.
  • There is no way that a record might not have a natural key value. That means 100% of the records will have a natural key value (not 99.999999999999999%, but 100%). Employees might not have social security numbers. (They should have some kind of tax ID, though, unless they're being hired illegally.) Users might not have an email address.
In very rare circumstances have I been able to find situations in which those three requirements hold true. Consequently, I find myself primarily using surrogate keys.

Using Surrogate Keys with IDENTITY/AutoNumber Columns
A surrogate key is one that has been purposefully added to the table simply to uniquely identify each record and does not have some intrinsic meaning to the data being captured. Rather than use the SocialSecurityNumber field as the natural key for the Employees table, we could instead add a surrogate key, EmployeeID. Surrogate keys can be implemented as a Globally Unique Identifier (GUID) or an auto-incremented numeric value (an IDENTITY column, in Microsoft SQL Server parlance, and an AutoNumber in Access parlance). Let's look at using IDENTITY columns for now; we'll turn our attention to using GUIDs in a moment.

When designing a table in Microsoft SQL Server, you can mark a column as an IDENTITY column. When inserting a new record into the table, a value cannot be provided for the IDENTITY column; instead, the database system automatically uses an increasing numeric value. When designing the table you can specify the IDENTITY column value for the first record, as well as how much each successive record's IDENTITY value is incremented. (Typically IDENTITY columns are configured to use the default initial and increment values, which means that they start at 1 and increment by 1.) See Understanding Identity Columns for more on the technical specifics of creating and defining IDENTITY columns.

Using Surrogate Keys with Globally Unique Identifiers (GUIDs)
Alternatively, a surrogate key can be expressed using a Globally Unique Identifier, or GUID. A GUID is a 128-bit number that is guaranteed to be globally unique. That is, if you create a GUID, no one else anywhere in the world or at anytime in the future, will create a the same GUID. To use this approach with Microsoft SQL Server, add a column to the table of type uniqueidentifier. If you want the system to automatically add a GUID value into this column when inserting a new record, set the column's default value to the system function NEWID() (which creates and returns a new GUID).

Comparing and Contrasting IDENTITY Columns and Globally Unique Identifiers (GUIDs)
Whether you choose to use IDENTITY columns or GUIDs for the surrogate key is a matter of preference. Each has its own set of pros and cons. From a performance standpoint, IDENTITY columns have the edge for a couple of reasons. First, IDENTITY columns are typically an int data type, which only takes up four bytes. A uniqueidentifier, on the other hand, needs 16 bytes. The larger size of uniqueidentifiers could cause more page splits in the index.

Furthermore, IDENTITY columns lead to more readable data and queries. It's much easier to communicate with a fellow DBA and say, "Hey, check out the employee with EmployeeID 4591." This fellow DBA could simply run a query with a WHERE clause like: WHERE EmployeeID = 4591. It's much more of a mouthful to say, "Hey, check out the employee with EmployeeID 8ce30856-8ab2-4d7a-a690-a9fdf954c2bc."

GUIDs offer two advantages over IDENTITY columns: they make it easier to add master and detail records simultaneously; and, being globally unique, there's no chance of key values being reused. To see the master/detail insertion benefit, imagine that in addition to the Employees table we also had a PhoneNumbers table that had a PhoneNumberID as a surrogate key, EmployeeID and a foreign key back to Employees, and a PhoneNumber field to capture a phone number. The idea here is that for each phone number an employee has, she'll have a record in the PhoneNumbers table.

Now, imagine that we have a web page interface that allows a visitor to enter the name, hire date, salary, and so on, of a new employee, along with five textboxes for entering up to five phone numbers. Upon submitting the form, we want to add the new employee to the database and then add the specified phone records to the PhoneNumbers table. Since we need the employee's EmployeeID before we can add the records to PhoneNumbers, we must first insert a new record into Employees, then get the newly inserted EmployeeID value, and then insert a record into the PhoneNumbers table for each phone number entered on the web form.

Database systems provide functions for retrieving the just-inserted IDENTITY column value (for Microsoft SQL Server, see SCOPE_IDENTITY()), but if you are using GUIDs you can use the following pattern in your web application code:

  1. Generate a new GUID programmatically (in .NET, for example, you can create a new GUID using the System.GUID class's NewGUID() method)
  2. Insert the new Employees record, using the GUID from step 1
  3. For each phone number specified, add a new record to the PhoneNumbers table using the GUID from step 1 as the EmployeeID value
Personally, I've written stored procedures that both insert a new record and return the just-inserted IDENTITY column value so many times that this benefit from using GUIDs really is a non-issue for me. But if you are not familiar with inserting a new record and getting back the IDENTITY columns value, the GUID approach may be appealing.

The fact that GUIDs are globally unique, make them ideal for situations in which it's essential that the key values are not repeated. Michael Campbell shared this advantage of using GUIDs with me in a recent email: "GUIDs are very valuable in distributed applications that span more than one physical location - such as in enterprise environments where new records could be created at various branch offices/etc and then merged (typically via replication) into a centralized, single, repository/master copy of the data." Thanks, Mike!

And alert reader Rusty Alderson wrote in to share more on the benefit of GUIDs with regard to replication scnearios: "One downside to IDENTITY [columns] ... is that they become very problematic in replicated databases. If you use two-way replication, you will find it very hard, if not impossible, to use IDENTITY [columns] and keep them in synch. Even one-way replication requires additional effort. The comment from Michael Campbell ... alludes to this, but it [isn't] explicitly stated." Rusty also shares: "In creating very large databases, GUIDs (as generated by MS's NEWID() function) WILL have collisions, or duplicates. In a previous project where I was building a large Customer ODS (Operational Data Store), I typically found about 4 collisions per million rows. Code has to be developed to handle this. This was in SQL Server 2000. Perhaps SQL Server 2005 has remedied this." Thanks, Rusty!

A Case for GUIDs...
Shortly after publishing this article, reader Shan McArthur wrote in to share his opinions on GUIDs vs. IDENTITY columns for surrogate keys:
There are many ... benefits of using GUIDs. One argument is that systems are now being designed with a service-oriented-architecture (SOA), with business classes and web services being used to abstract all data manipulation. These systems are frequently distributed and records are created outside of SQL server before they are actually persisted into the SQL database. For example, in a fictitious library application, there would be classes/services for creating a new book in the library which would include services/classes for creating new authors - persisting these to the database can be done inside a transaction. In this case it is typically bad practice to rely on the database itself to create these keys - it is better to create these keys in the application layer that deals with the logic, and with today's technologies, using a GUID is a far better choice. It does not require inserting a new row before getting a key.

Another argument for GUIDs is that systems are now distributed, and not all of them are within a SQL database - GUIDs are not restricted to databases and are easily created by any application tier. Many large systems are now built with application tiers that are using workflow technologies implemented on top of servers such as BizTalk. GUIDs are much more portable between distributed applications, and do not require that a new composite key be created. In distributed systems, there is frequently a need to synchronize between multiple systems, and GUIDs allow for unique identification of objects, and easier synchronization between systems. A GUID also allows for an object to be initially created on any of the contributing application tiers while making it easier to maintain synchronization. If you need proof, simply examine the internal workings of any modern replication technology - GUIDs are used underneath anything.

There are also benefits of using GUIDs in repplication - identity columns make it very difficult to set up a replication environment, where GUIDs really assist.

GUIDs are also more portable between databases. Oracle uses sequences, Access uses auto numbering, and SQL uses identity columns.

Identity columns also make bulk operations more difficult - inserting data requires changing the database mode and special permissions. GUIDs do not have this problem.

There are also benefits to using GUIDs for the query writers - I have seen so many bad queries that make errors in the table joins that return bad data because there are overlapping key values between different tables. In the case with using GUIDs for keys, these errors are easy to spot because no data ever gets returned. In my opinion, it is safer practice to use GUIDs for primary keys for this very reason.

I used to use identity columns for primary keys, and have completely reverted to using GUIDs everywhere for surrogate primary keys. I have spent a lot of time fixing up old systems that need to be extended or integrated into other systems that used identity columns, but I have never had a single problem integrating systems that used GUIDs. I can speak from experience that an organization can suffer financially from using identity columns instead of GUIDs when designing large scale systems.

Thanks for your comments, Shan!

Conclusion
Database systems use primary keys to specify the column or columns that uniquely identify each record, and primary keys can be either natural or surrogate. A natural key is one that is composed of columns directly related to the data being captured; with surrogate keys, the key column is added just to serve as a unique identifier. Personally, I prefer to use surrogate keys when designing my databases. With surrogate keys, each record can be uniquely identified using GUIDs or auto-increment columns, both of which have their pros and cons.

Happy Programming!

  • By Scott Mitchell


    Suggested Readings:

  • Understanding Identity Columns
  • Identity and Primary Keys

  • Article Information
    Article Title: Techniques for Uniquely Identifying Database Records
    Article Author: Scott Mitchell
    Published Date: Wednesday, November 08, 2006
    Article URL: http://www.4GuysFromRolla.com/webtech/110806-1.shtml


    Copyright 2017 QuinStreet Inc. All Rights Reserved.
    Legal Notices, Licensing, Permissions, Privacy Policy.
    Advertise | Newsletters | E-mail Offers