Techniques for Uniquely Identifying Database Records
By Scott Mitchell
Introduction
When designing and creating a data model for a data-driven application, the records in tables must each be uniquely identifiable.
By having a unique value associated with each record, individual records can be selected, updated, or deleted. Being able
to uniquely identify records is so important and standard in a database system that databases allow those designing a database
table to specify what column (or columns) make up the primary key.
Once a primary key has been specified, the database system ensures that for each record in the table, the primary key value is
unique.
In some circumstances, the data being modeled as a column or some set of columns that will inherently have unique values across each record. The canonical example when modeling people (as employees, patients, students, and so on) is to use the social security number, which is a unique identification record used in the United States. Web applications often use the email address as a primary key. Primary keys that have a logical relationship to the data being modeled are referred to as natural keys.
Sometimes, the data being modeled doesn't have a natural key, or the natural key may be a less desirous choice because it can only be obtained by using many columns together as the primary key. In such a case, a surrogate key can be used. A surrogate key is an additional column added to the table whose sole purpose is to serve as a primary key. This might be a Globally Unique Identifier (GUID) or an auto-incremented numeric value (an IDENTITY column, in Microsoft SQL Server parlance, and an AutoNumber in Access parlance).
In this article we will compare and contrast natural keys and surrogate keys. We'll also look at the pros and cons of using GUIDs versus IDENTITY columns when deciding to use a surrogate key. Read on to learn more!
First, My Personal Bias...
Before we delve into the differences between natural keys and surrogate keys, and discussing using GUIDs versus IDENTITIY
columns, let me first state my personal habits and preferences. When designing a data model, I nearly always use a surrogate key
(even if there is a suitable, single-column natural key) and virtually always use IDENTITY columns rather than GUIDs. I'll
detail the rationale behind my choices throughout the article, but before we started I wanted to share this inclination of mine.
So if the article seems to prefer or recommend one approach over another, it's because I'm the one writing this article and
so my views and experience are the ones I'm putting forth.
Like many areas in computer science, the debate between natural and surrogate keys can border on zealotry. Back in 2001-2003 I went back to school and got a Masters degree in Computer Science with an emphasis on Information Systems and Database Design. I remember one of my professors was very adamant that database entities should always use a natural key and that surrogate keys should be eschewed. Yet in my real-world experience, I've found the proposed downsides of surrogate keys to be overhyped and the potential pitfalls of natural keys to be quite real. The point, though, is that there are people who believe very strongly that natural keys are the right way to do things, and there are others who believe surrogate keys are the only way to go. Personally, I'm not passionate about either approach, but in my experience have found surrogate keys to be the better choice. I hope this article will help you see both sides of the coin and make your own informed decision.
The Pros and Cons of Natural Keys
A natural key is a primary key composed from one or more columns in the table that are directly related to the information
being expressed by the entity. For example, if we needed to model an Employee entity, we might create a database table named
Employees
with columns like FirstName
, LastName
, SocialSecurityNumber
,
HireDate
, Salary
, and so on. Here, the SocialSecurityNumber
field would capture
the employee's social security number, which is a unique nine-digit number assigned to people in the United States. Therefore,
we might decide to use the SocialSecurityNumber
field as the table's primary key. This would be an example of
a natural key.
Since natural keys use an existing column (or columns) in the table, they do not clutter the table with additional columns (as surrogate keys do), which can lead to improved performance due to reduced overhead. Moreover, there's a more logical relationship between a record's unique identifier and the record's underlying data. That is, if someone says, "Hey, can you write a query that pulls up information about Bob, his social is 123-45-6789," the resulting query is simple and straightforward:
SELECT columnList
|
Having decided upon using the SocialSecurityNumber
field as the primary key, we continue on in designing the database.
In related tables, there are foreign keys using the SocialSecurityNumber
field. Over time, the company grows and grows,
and soon there are hundreds and then thousands of records in the Employees
table, and thousands of related records
in tables that use the SocialSecurityNumber
field as a foreign key. And then it happens. The company hires someone working in
another country who doesn't have a social security number! What now?
For this particular problem, the new employee can apply for a tax ID, which is composed of the same number of digits as a
social security number (but, conceptually, problems can arise because primary keys cannot contain NULL
values,
so what do you do if a new user doesn't have a value for the natural key)? So our new employee now has a tax ID and has that
entered into the system in the SocialSecurityNumber
field. Over the years, this employee ends up deciding to move
to the United States and become a US citizen. Upon doing so, he applies for a social security number and is granted one.
Now what do we do? Do we leave the SocialSecurityNumber
field set to their original tax ID or change it to their
new social security number? If we change it, we have to update all of the related tables, which won't be a walk in the park
because they all (hopefully) have foreign key constraints back to the SocialSecurityNumber
field. Urg.
While natural keys can consist of a single column - SocialSecurityNumber
, in this example - in many cases a natural
key can only be found by making a composite primary key that consists of numerous columns. This can complicate queries to the
data since all of the fields that participate in the primary key must be included in the WHERE
clause to
retrieve, update, or delete a particular record. Furthermore, related tables now need composite foreign keys, which can
quickly lead to clutter.
As you can probably guess, I am not a big fan of natural keys, but that's not to say that they don't have their time and place. Natural keys make sense, in my opinion, if the following conditions all hold true:
- The natural key is composed of exactly one column
- The natural key value will never, ever, not in a million years, never ever ever change. Social security numbers can change. Email addresses can change. If there's even the slightest chance that a natural key value could change, use a surrogate key.
- There is no way that a record might not have a natural key value. That means 100% of the records will have a natural key value (not 99.999999999999999%, but 100%). Employees might not have social security numbers. (They should have some kind of tax ID, though, unless they're being hired illegally.) Users might not have an email address.
Using Surrogate Keys with IDENTITY/AutoNumber Columns
A surrogate key is one that has been purposefully added to the table simply to uniquely identify each record and does not
have some intrinsic meaning to the data being captured. Rather than use the SocialSecurityNumber
field as the
natural key for the Employees
table, we could instead add a surrogate key, EmployeeID
. Surrogate
keys can be implemented as a Globally Unique Identifier (GUID) or an auto-incremented numeric value (an IDENTITY column, in
Microsoft SQL Server parlance, and an AutoNumber in Access parlance). Let's look at using IDENTITY columns for now; we'll turn
our attention to using GUIDs in a moment.
When designing a table in Microsoft SQL Server, you can mark a column as an IDENTITY column. When inserting a new record into the table, a value cannot be provided for the IDENTITY column; instead, the database system automatically uses an increasing numeric value. When designing the table you can specify the IDENTITY column value for the first record, as well as how much each successive record's IDENTITY value is incremented. (Typically IDENTITY columns are configured to use the default initial and increment values, which means that they start at 1 and increment by 1.) See Understanding Identity Columns for more on the technical specifics of creating and defining IDENTITY columns.
Using Surrogate Keys with Globally Unique Identifiers (GUIDs)
Alternatively, a surrogate key can be expressed using a Globally Unique Identifier, or GUID. A GUID is a 128-bit number that
is guaranteed to be globally unique. That is, if you create a GUID, no one else anywhere in the world or at anytime in the future,
will create a the same GUID. To use this approach with Microsoft SQL Server, add a column to the table of type uniqueidentifier
.
If you want the system to automatically add a GUID value into this column when inserting a new record, set the column's
default value to the system function NEWID()
(which creates and returns a new GUID).
Comparing and Contrasting IDENTITY Columns and Globally Unique Identifiers (GUIDs)
Whether you choose to use IDENTITY columns or GUIDs for the surrogate key is a matter of preference. Each has its own
set of pros and cons. From a performance standpoint, IDENTITY columns have the edge for a couple of reasons. First, IDENTITY
columns are typically an int
data type, which only takes up four bytes. A uniqueidentifier
, on the other
hand, needs 16 bytes. The larger size of uniqueidentifier
s could
cause more page splits in the index.
Furthermore, IDENTITY columns lead to more readable data and queries. It's much easier to communicate with a fellow DBA and say,
"Hey, check out the employee with EmployeeID
4591." This fellow DBA could simply run a query with a WHERE
clause like: WHERE EmployeeID = 4591
. It's much more of a mouthful to say,
"Hey, check out the employee with EmployeeID
8ce30856-8ab2-4d7a-a690-a9fdf954c2bc."
GUIDs offer two advantages over IDENTITY columns: they make it easier to add master and detail records simultaneously; and,
being globally unique, there's no chance of key values being reused.
To see the master/detail insertion benefit, imagine that in addition to the Employees
table we also had a PhoneNumbers
table that
had a PhoneNumberID
as a surrogate key, EmployeeID
and a foreign key back to Employees
,
and a PhoneNumber
field to capture a phone number. The idea here is that for each phone number an employee has,
she'll have a record in the PhoneNumbers
table.
Now, imagine that we have a web page interface that allows a visitor to enter the name, hire date, salary, and so on, of a new
employee, along with five textboxes for entering up to five phone numbers. Upon submitting the form, we want to add the new employee
to the database and then add the specified phone records to the PhoneNumbers
table. Since we need the employee's
EmployeeID
before we can add the records to PhoneNumbers
, we must first insert a new record into
Employees
, then get the newly inserted EmployeeID
value, and then insert a record into
the PhoneNumbers
table for each phone number entered on the web form.
Database systems provide functions for retrieving the just-inserted IDENTITY column value (for Microsoft SQL Server,
see SCOPE_IDENTITY()
), but if you are using
GUIDs you can use the following pattern in your web application code:
- Generate a new GUID programmatically (in .NET, for example, you can create a new GUID using the
System.GUID
class'sNewGUID()
method) - Insert the new
Employees
record, using the GUID from step 1 - For each phone number specified, add a new record to the
PhoneNumbers
table using the GUID from step 1 as theEmployeeID
value
The fact that GUIDs are globally unique, make them ideal for situations in which it's essential that the key values are not repeated. Michael Campbell shared this advantage of using GUIDs with me in a recent email: "GUIDs are very valuable in distributed applications that span more than one physical location - such as in enterprise environments where new records could be created at various branch offices/etc and then merged (typically via replication) into a centralized, single, repository/master copy of the data." Thanks, Mike!
And alert reader Rusty Alderson wrote in to share more on the benefit of GUIDs with regard to
replication scnearios: "One downside to IDENTITY [columns] ... is that they
become very problematic in replicated databases. If you use two-way replication,
you will find it very hard, if not impossible, to use IDENTITY [columns] and keep
them in synch. Even one-way replication requires additional effort. The
comment from Michael Campbell ... alludes to
this, but it [isn't] explicitly stated." Rusty also shares: "In creating very large databases, GUIDs (as generated by MS's
NEWID()
function)
WILL have collisions, or duplicates. In a previous project where I was
building a large Customer ODS (Operational Data Store), I typically found about
4 collisions per million rows. Code has to be developed to handle this. This
was in SQL Server 2000. Perhaps SQL Server 2005 has remedied this." Thanks, Rusty!
A Case for GUIDs... |
---|
Shortly after publishing this article, reader Shan McArthur wrote in to share his opinions on GUIDs vs. IDENTITY columns for
surrogate keys:
There are many ... benefits of using GUIDs. One argument is that systems are now being designed with a service-oriented-architecture (SOA), with business classes and web services being used to abstract all data manipulation. These systems are frequently distributed and records are created outside of SQL server before they are actually persisted into the SQL database. For example, in a fictitious library application, there would be classes/services for creating a new book in the library which would include services/classes for creating new authors - persisting these to the database can be done inside a transaction. In this case it is typically bad practice to rely on the database itself to create these keys - it is better to create these keys in the application layer that deals with the logic, and with today's technologies, using a GUID is a far better choice. It does not require inserting a new row before getting a key.Thanks for your comments, Shan! |
Conclusion
Database systems use primary keys to specify the column or columns that uniquely identify each record, and primary keys
can be either natural or surrogate. A natural key is one that is composed of columns directly related to the data being
captured; with surrogate keys, the key column is added just to serve as a unique identifier. Personally, I prefer to use
surrogate keys when designing my databases. With surrogate keys, each record can be uniquely identified using GUIDs or
auto-increment columns, both of which have their pros and cons.
Happy Programming!
Suggested Readings: