Maintaining a Log of Database Changes - Part 1
By Scott Mitchell
One of the benefits of using Source Code Control (SCC) software like SourceSafe, Perforce, Subversion, Vault, and others is that the software keeps a detailed history of all changes to the source code. With SCC, you can see how the code for a particular file has changed over time, when each change was made, and who made it. Moreover, with such a log in place it is easy to roll back the system to a previous state. A log that tracks changes to a system is sometimes referred to as an audit log, as it provides a repository of information that can be used to audit the system. In the case of SCC, your boss might want to perform an audit to determine how a particular bug was introduced. By examining the log, your boss could determine who entered the buggy line of code and when, as well as see what the code looked like before the bug was introduced.
The concept of an audit log can be extended to database data as well. With a little bit of effort it is possible to set up a system that records all inserts, updates, and deletes made to a specified set of tables in corresponding "history" tables. By logging all changes to the underlying data, it is possible to determine how, exactly, the data in the database was modified over time, as well as who made the changes and when. Database audit logs are especially useful in scenarios where the correctness of the data is essential or where there are many users all working on the same underlying data. In some scenarios audit logs are required by law.
In this article - the first of a two-part series - we will discuss different techniques for storing changes to database data and their advantages and disadvantages. If you have had first-hand experience implementing database audit logging, I invite your feedback, as I'd like to flesh out this article to include suggestions, anecdotes, and warnings from the community at large. Read on to learn more!
An Overview of Logging Database Data Changes
In order to maintain a history of changes to the database's data we need to record every insert, update, and delete in some sort of "history" table. In addition to capturing the data that was inserted, updated, or deleted, we also need to note what user made the modification, as well as the date and time it was made. Furthermore, there are likely lookup tables or other ancillary tables whose history of changes does not need to be tracked. Therefore, we need some way to indicate which tables need their changes tracked.
The two main challenges in creating an audit log of database changes are determining what mechanism will be used to log changes and how the history of changes will be persisted to the database. The remainder of this article looks at different techniques for persisting the history of changes to the database and their pros and cons. Maintaining a Log of Database Changes - Part 2 explores techniques for logging data modifications to the "history" tables.
|Your Feedback is Wanted!|
|Do you have first-hand experience in maintaining a log of database changes? If so, I would be interested in hearing your suggestions, comments, or anecdotes. Please don't hesitate to send in your feedback to mitchell@4GuysFromRolla.com.|
Techniques for Persisting a History of Changes
In order to correctly maintain a log of data changes we need to record change history whenever a data modification occurs. For example, if we have a table named
Products that we wish to add an audit log for, then each time a product is
inserted, updated, or deleted we need to record that data modification in some "history" table. The following list
looks at three different techniques for maintaining this history of changes.
|A Separate "History" Table for Each Table Being Audited|
With this approach, for every table whose history of changes needs to be tracked we need to create an additional
"history" table. This "history" table would have the same schema as the original table plus three additional
columns to capture who made the change, when it was made, and whether the change was the deletion of the record. These three
additional columns might be named something like ChangedBy, ChangedOn, and Deleted.
For example, imagine that we had a table named Products whose changes we wanted to audit; we would create a corresponding
Products_History table with the same columns plus the three audit columns. Given this example, imagine that we added a new record to the Products table: along with the insert itself, a copy of the new record and the accompanying audit information would be recorded in Products_History.
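As a sketch, the pair of tables might look like the following. (The audit column names ChangedBy, ChangedOn, and Deleted, and the Products schema itself, are illustrative assumptions, not a prescribed convention.)

```sql
-- Hypothetical example: the original table being audited.
CREATE TABLE Products (
    ProductID   int IDENTITY PRIMARY KEY,
    Name        nvarchar(50) NOT NULL,
    UnitPrice   money NOT NULL
);

-- The "history" table: same schema plus three audit columns.
CREATE TABLE Products_History (
    ProductID   int NOT NULL,            -- no IDENTITY; value copied from Products
    Name        nvarchar(50) NOT NULL,
    UnitPrice   money NOT NULL,
    ChangedBy   nvarchar(50) NOT NULL,   -- who made the change
    ChangedOn   datetime NOT NULL DEFAULT GETDATE(),  -- when it was made
    Deleted     bit NOT NULL DEFAULT 0   -- 1 if this change deleted the record
);
```

Note that the history table drops the IDENTITY property and the primary key constraint, since the same ProductID will appear once per recorded change.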
This approach is fairly easy to set up. If you are using a code generator like CodeSmith
you can simply write a template that will automatically generate the "history" tables' schemas. (See
CodeSmith-Generated Audit Log Triggers for such a template.)
This approach also makes searching the audit log very easy, since the "history" table's schema is a superset of
the original's. For example, if you want to see all changes that were made to a particular product, you can just run
a query like:
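A query along these lines would do the job (the Products_History table and its audit columns are hypothetical names used for illustration):

```sql
-- Sketch: all logged changes for a particular product, newest first.
SELECT *
FROM Products_History
WHERE ProductID = 42
ORDER BY ChangedOn DESC;
```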
Or if you want to see all changes made to the system by "Scott" between April 10th and 20th, you could run the following query:
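A sketch of such a query, again assuming hypothetical ChangedBy/ChangedOn audit columns and illustrative dates:

```sql
-- Sketch: all changes made by "Scott" between April 10th and 20th.
-- The half-open date range avoids missing changes made late on the 20th.
SELECT *
FROM Products_History
WHERE ChangedBy = 'Scott'
  AND ChangedOn >= '2006-04-10'
  AND ChangedOn <  '2006-04-21'
ORDER BY ChangedOn;
```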
|Cons||The major downside of this approach is that since the "history" table's schema is tied directly to the original's, anytime the original table's schema is modified you need to be certain to also modify the "history" table's. Additionally, since this approach creates one "history" table for each table whose changes need to be tracked, it can double the number of tables in your database.|
|Final Thoughts||This is the approach I've used in my projects that required tracking database changes. I have found that it works best with mature data models where changes to the original tables' schemas are infrequent.|
|A Consolidated "History" Table for All Tables Whose Changes are Being Tracked|
Rather than creating a "history" table for each table whose changes are being tracked, an alternative approach
is to create a single, consolidated "history" table. One challenge with a consolidated "history" table is that tables
with differing schemas all need to track their changes to this single table. This can be accomplished in a couple
of ways: each modified column can be logged as its own row in an amorphous table (the "Logging to an Amorphous Table" approach), or the entire modified record can be serialized into a single column, such as an XML string (the "Logging the Changes in a Single Column" approach).
Using the "Logging to an Amorphous Table" technique would yield the following records in the consolidated "history" table:
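As a sketch of the amorphous-table idea, the consolidated "history" table might look like the following, with one row per changed column. (All table and column names here are illustrative assumptions.)

```sql
-- Hypothetical consolidated "history" table: one row per changed column.
CREATE TABLE AuditLog (
    AuditLogID  int IDENTITY PRIMARY KEY,
    TableName   nvarchar(128) NOT NULL,  -- which audited table was modified
    PrimaryKey  nvarchar(50)  NOT NULL,  -- key of the modified record, as a string
    FieldName   nvarchar(128) NOT NULL,  -- which column changed
    OldValue    nvarchar(max) NULL,      -- value before the change (NULL on insert)
    NewValue    nvarchar(max) NULL,      -- value after the change (NULL on delete)
    ChangedBy   nvarchar(50)  NOT NULL,
    ChangedOn   datetime      NOT NULL DEFAULT GETDATE()
);

-- Renaming a product and lowering its price would yield one row per column:
INSERT INTO AuditLog (TableName, PrimaryKey, FieldName, OldValue, NewValue, ChangedBy)
VALUES ('Products', '42', 'Name',      'Widget', 'Deluxe Widget', 'Scott'),
       ('Products', '42', 'UnitPrice', '4.50',   '3.95',          'Scott');
```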
|Pros||This approach's main advantage is that the history information is consolidated into a single table. That means that in the future if you decide to start tracking the changes made to additional tables, you do not need to add more "history" tables.|
|Cons||The main drawback of this approach is that the table is not as searchable as with the
"A Separate 'History' Table for Each Table Being Audited" approach. While you can get all changes to a particular
table or column, it is not as easy to get a clear picture of how the data changed over time, nor as simple to return
data filtered by a particular column value, especially if you are using the approach where the changes are serialized
to a string and stored in a single column.|
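The searchability trade-off shows up in the queries themselves. With an amorphous table (hypothetical AuditLog schema with TableName/FieldName/OldValue/NewValue columns), every value is stored as a string, so typed comparisons require conversion:

```sql
-- Sketch: find all price decreases logged in an amorphous AuditLog table.
-- Values are stored as strings, so numeric comparison requires CAST.
SELECT TableName, PrimaryKey, OldValue, NewValue, ChangedBy, ChangedOn
FROM AuditLog
WHERE TableName = 'Products'
  AND FieldName = 'UnitPrice'
  AND CAST(NewValue AS money) < CAST(OldValue AS money)
ORDER BY ChangedOn;
```

A stray non-numeric value in OldValue or NewValue will break the CAST at query time, which is one concrete way the weak typing of this design costs you later.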
Reader Russell M. shared the following: "I've used the 'Logging to an Amorphous Table' approach. It is really easy to implement and has been very useful. However, it is nearly impossible to automatically recreate a record at some time in the past. This is a HUGE weakness!"
|Final Thoughts||I've not used this approach in a production environment. If anyone has used this approach, though, I would be interested in hearing your feedback.|
Several readers have written in with insightful comments and anecdotes, and I welcome your feedback, as well.
Phillip C. shared the following advice:
... Before you start logging changes, you should have a very clear picture of what you want to report on. Using the "Logging the Changes in a Single Column" approach is not really helpful if you want to see something like "all the price changes made to the system by 'Scott' between April 10th and 20th."
Keith B. offered his experience with both approaches:
I've used both approaches. The approach I use depends on how I want to use the history table.
In one situation I needed an exact audit log and the ability to quickly pull a record together based on a date. This allowed us to have an audit log app where an outside auditor could take the document he was auditing (which is always time stamped) and pull up the information from the system to see exactly what info was in the system at that point in time. As these documents were mortgage and deed based, the legal ramifications for wrong decisions were pretty large for this company. And yes - it really does require a fairly mature data model. The next complication was the workflow requirements. Instead of having just a working and a history table we also had to have a workflow table. In other words, when a SME changed a record it was only recorded in the workflow table. An internal auditor would be notified and they could compare the working data with the workflow data and pass/fail it. If passed, it went on to a manager (publisher) who also checked it. If it passed there, then the change was made in the working table and the old record was moved to the history table. So, lots of work keeping triggers in synch, working with workflow sprocs, etc. When finished, the response time for a dated search was only 5-10% slower than using the working table alone.
In another situation we knew the database would change over time as the client had no clue as to what they really wanted. Instead of a single history table we ended up with four history tables. We grouped the working tables into four different categories like lookup data, long term data, working data and something else. We had requirements to put together a data row based on a date window but it did not need to behave like a real time system - waiting five minutes was fine. The system was mainly a system with automated end points and very little human interaction. The reporting was run each night and stored in another reporting database. It was only for audit reasons that a UI was hooked to the transaction database. Each time a change (delete/update) was made, [an] XML based extract of the entire row was inserted into the correct history table - very much like your example. Since we used GUIDs as the alternate primary key on every table we stored that with the table name in the history table. Then, in a sproc, we could look up a record from the working table, use its GUID to look up all previous records from the history table, then use
sp_xml_preparedocument to turn the XML back into a table matching the working table. It is actually fairly quick. To help keep it from breaking we made a decision to not delete columns from a table unless we really, really had to. If we did, we had to go back to the history, port it out to a backup history and restream in the "corrected" structure. (Hope that makes sense.)
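Keith's technique of rehydrating a stored XML extract back into a relational row can be sketched with SQL Server's sp_xml_preparedocument and OPENXML. (The element and column names below are hypothetical; in practice the XML string would be read from the history table by the record's GUID.)

```sql
-- Sketch: turn a stored XML row extract back into a relational row.
DECLARE @xml nvarchar(max), @hDoc int;

-- Illustrative extract; normally fetched from the history table.
SET @xml = N'<row ProductID="42" Name="Widget" UnitPrice="4.50" />';

-- Parse the XML and get a document handle.
EXEC sp_xml_preparedocument @hDoc OUTPUT, @xml;

-- Map the XML attributes back onto typed columns.
SELECT *
FROM OPENXML(@hDoc, '/row', 1)           -- 1 = attribute-centric mapping
     WITH (ProductID int, Name nvarchar(50), UnitPrice money);

-- Always free the document handle when done.
EXEC sp_xml_removedocument @hDoc;
```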
Padgett R. pointed out an article titled A Point in Time Architecture, which examines a data architecture "that can be rolled back to any previous point in time." In short, the Point in Time Architecture allows the database to be examined as it was at any previous point in time without compromising the current state. Implementing this architecture is non-trivial and may be overkill for your scenario.
Lawrence M. shared his experiences using these two techniques:
... I'm using both techniques... The separate history tables allow me to very easily reconstruct the state of the data on any given date. However, a challenge that I continue to think about is related to normalization. My history tables are denormalized. Sometimes that's ok, other times it isn't. These tables are populated via a stored procedure that is invoked by the application save function. Keeping the schema of the history tables up to date was a problem early in development, when the schema of the main tables was changing more frequently than occurs under production, as you've stated in your "Final Thoughts".
The "consolidated table" allows me to easily provide an audit log function. I can easily report the exact changes that occurred. Providing that same function from separate history tables would require examination of many columns in many tables just to find that the value was changed from $4.50 to $3.95. This table is populated via triggers. I did not consider using XML to record the changes, so I have additional columns compared to your solution. A challenge I have not solved is data types of the changed data. This impacts reporting, for example, when I might want only monetary values to be displayed as right justified.
As opposed to rolling your own system, there are a number of third-party products designed to simplify tracking changes to the data in a database table. Philip L. wrote in to mention two products from Lumigent: Log Explorer and Entegra.
We only use Log Explorer, which we use specifically for SQL 2000; however, I believe both products work with SQL 2005. The greatest benefit of both products is that they take care of all logging. There are two sets of logs used:
- Lumigent creates a database and stores login information against SPIDs.
- SQL Server creates transaction log (.trn) files during normal backups.
These products then use SQL Server's transaction logs along with the user log – everything required comes from these two sources. Log Explorer gives us the following abilities:
- Observe the original source of any change to a row. If there are multiple changes we can see multiple changes. We can see what the change was, when it was made, and the user and application that made the change.
- Undo or redo transactions or a specific change.
- Take a row, or a set of rows, back to a specific point in time.
- View a particular application, table, user, etc.
- Recover or search through dropped tables.
- Perform basic load analysis.
All in all we find that Log Explorer is good enough for every change made to the database. The issues that it doesn't track are SELECTs (which Entegra apparently does) and SQL/transaction summary data; for instance, a listing of customers that had transactions deleted before being completed. Therefore we do log some deletions in "history" tables, based on triggers, for tracking. For SELECTs and SQL summary we log events to a hard drive and import them to a database at night, where every SQL command issued to the database is summarised, tabulated, and the aggregate statistics stored.
There is an additional benefit: that of instant notification of deadlocks or changes to objects (addition of procedures, alteration of scripts, dropping of objects, etc.). A program runs on the server, monitors for these events, then emails the full details when the events occur. All we need to do to keep an exceptionally long history is to zip our transaction logs (which reduces them by 80% or more). We keep over six months' worth of logs this way for our entire enterprise (over 60,000 individual customers). We could keep logs for longer, but the need has never arisen.
The logging of all activity sent to the database (a trace of completed SQL statements) provides a particularly useful tool. During the summarisation process I analyse the impact of procedures and applications on each other. I determine the duration distribution of one application when a second application is and is not running. I then apply some non-parametric distribution statistics to compare the results, to locate significant differences. This provides us with a list of the most "impacting" applications, which are those applications that have the greatest impact on other applications. This allows for determining applications and procedures that need optimisation or review, and also tracks whether a particular change to a procedure increased or decreased its impact. All this is done in T-SQL, which made for a fun little project.
Another popular third-party logging system is ApexSQL Audit.
Reader Russell M. recommended the coverage of audit trails in Data Modeling Essentials, Third Edition (specifically Chapter 15).
While we have looked at what data would be stored in the "history" tables for various approaches, we have yet to discuss how this data gets there in the first place. That is, when a record is inserted, updated, or deleted from the
table, how is a corresponding record added to the "history" table? There are a couple of options here, as well. Some
techniques are discussed in Maintaining a Log of Database Changes - Part 2.