Converting Flat, Comma-Delimited Values Into a Normalized Data Model
By Scott Mitchell
Introduction
In my job as an independent software developer I help a lot of small businesses enhance their existing company website or internal web applications to include new features or adopt best practices. Many of these businesses have vital line of business applications that were created many years ago by an employee who was not a professional software developer, but perhaps a member of the IT team or someone who was learning how to program or programmed as a hobby. A common mistake made by people without a solid background in creating data-driven applications is using flat, non-normalized data models.
Consider an application used in a healthcare setting may need to record each doctor's professional and educational degrees. Because there are a fixed
number of degrees - PhD, MD, DDS, OB/GYN, RN, etc. - these degrees should be spelled out in a separate database table. And because each doctor can have
multiple degrees, there should be a third table that maps what doctors are associated with what degrees. Such a data model would be normalized. A non-normalized
data model would instead try to capture each doctor's degrees within the same table that contains the doctor's other information (his name, address, DOB, etc.).
This might be implemented as several columns in the table (Degree1, Degree2, Degree3, and so on) or as a single
column that contains a comma-delimited list of degrees, like "PhD, MD, OB/GYN".
While there are certain circumstances where non-normalized data is ideal, in the vast majority of situations having the data expressed in a normalized manner is ideal. Normalized data is easier to work with, is easier to report against, is (usually) more efficient in terms of both disk space and time to execute queries, and is less likely to suffer from any data integrity issues, which are all too common in non-normalized data. I recently helped a client who had a many-to-many relationship implemented in a flat, non-normalized manner convert that data into a normalized data model through the use of a T-SQL script. This article discusses why it is worhtwhile to convert flat, non-normalized data into a normalized data model and steps through how this T-SQL script can be used to normalize your data. Read on to learn more!
An Example of Non-Normalized Data
I recently worked with a client that had a legacy custom web application created by a past employee that was used internally to manage customer and employee information. The application was used primarily to store information about employees and customers and to facilitate searching for customers and employees. For example, each employee can belong to one or more departments, where departments are things like IT, Executive, Sales, and so on. From the web page for editing employees, a user could pick an employee, select zero to many departments from a multi-select list box, and save the results to the database. The search page, which was the most heavily used page on the site, allowed the visitor to select zero to many departments from a list box to filter the results.
The employee who created the application had long since moved on to another job, but the web application was stable enough that it had continued to be
used for many years without issue... until recently! A new branch office opened up on the West coast and to differentiate departments between the two offices, the
departments were replicated for the new site with the new site's name. In other words, there were now departments like "IT" and "IT (West)," and "Sales"
and "Sales (West)." The idea was that the "Sales" department was the people in sales in the original branch office, whereas the "Sales (West)" department
was the sales team in the West coast office. After these new departments were added the search page no longer worked as expected. Searching for employees
in the "Sales" department returned employees in both the "Sales" and "Sales (West)" departments. The person whose job it was to manage this application
concluded (correctly) that the problem was that "Sales" was a substring found in both department names. In an attempt to fix this she changed the
"Sales" name in the Departments table to "Sales (East)", but after doing so searching for "Sales (East)" returned no employees even though
there were dozens of employees in the previously named "Sales" department. This was when I was called in.
The problem was that the developer who created the application had used two tables to model this relationship: Departments, which contained
a record for each department; and Employees, which contained a record for each employee. There should have been a third table used to
model the relationship between employees and departments (since the relationship is many-to-many), but the developer instead used a flat, comma-delimited
column value in the Employees table named Departments. If an employee belonged to the Executive and Sales
departments their Employees.Departments would contain the value "Executive, Sales". The search page filtered employees
by using a series of LIKE statements for each selected filter criteria. If a user searched for employees in the Sales department it
would run a query with a WHERE clause like: WHERE Employees.Department LIKE '%Sales%'. As a result, such a WHERE clause would
return employees in the Sales department and employees in the "Sales (West)" department.
When the "Sales" department was renamed to "Sales (East)", this didn't rename the corresponding value in the Employees.Departments column.
In other words, if the Employees.Departments column value was "Executive, Sales" when the department was named "Sales", it remained
"Executive, Sales" after the department was renamed to "Sales (East)." But after the department was renamed the search page now used the WHERE clause
WHERE Employees.Department LIKE '%Sales (East)%', which returned no records because those employees in the "Sales (East)" department has the string
"Sales" in their Employees.Departments column (and not "Sales (East)").
The Ideal Approach - A Normalized Data Model
Ideally, a many-to-many relationship between entities A and B involve three tables:
- Table A
- Table B
- The Join Table
In case the above sounds like jibberish, let's look at how the data model should have been constructed in the application described above. First we have
the Employees table (Table A), which models employees; each employee is uniquely identified by the EmployeeID column, which is an integer
IDENTITY column. We also have the Departments table (Table B), which likewise has an IDENTITY column named
DepartmentID. The third and final table we need (which was lacking from my customer's data model and was the source of the problems noted
above) is a table named, say, EmployeeDepartments with the following schema:
| EmployeeDepartments | ||
|---|---|---|
| Column | Data Type | Notes |
EmployeeID | int | Composite primary key |
DepartmentID | int | |
This is the table that ties together what employees belong to what departments (and, conversely, the list of employees that makeup a praticular department). Had this data model been in place, none of my client's problems would have surfaced. New departments could be added at any time and the names of departments could be changed as needed. There would be no naming or substring issues because the employees' departments are specified through unique integer values. The names of each department are not construed in this relationship.
After explaining this to my client his response was predictable: "Great, now how do we get this data model in place for our existing application."
Altering the web application to use the preferred normalized data model required three significant changes:
- Adding a new table,
EmployeeDepartments, and converting the existing data in theEmployees.Departmentscolumn into this new table. - Updating the web page where employee data was saved so that instead of saving the value as a comma-delimited value in the
Employeestable, a record was added to theEmployeeDepartmentstable for each selected department. - Updating the
WHEREclause used in the search page to go against theEmployeeDepartmentstable rather than to use a series ofLIKEclauses.
This article focuses on the first task - converting the existing flat, comma-delimited data model into a normalized one.
Creating the EmployeeDepartments Table
The first order of business in converting the existing flat data model into a normalized one is to create the many-to-many Join Table. The schema for this table is shown above. To create this table use the following T-SQL script, which generates the table and adds foreign key constraints to the two "parent" tables,
Employees and Departments:
CREATE TABLE [dbo].[EmployeeDepartments](
|
Adding a Record to EmployeeDepartments For Each Department Specified in Each Employee's Departments Column
The final step in this data convertion is to enumerate through each employee in the
Employees table, parsing the departments out of
the Departments column, and inserting a record into the EmployeeDepartments table for each department for the current
employee. This requires the use of a CURSOR, which is a T-SQL construct that allows us to work with each record returned by a SELECT
statement individually.
The following T-SQL shows the shell of the CURSOR.
DECLARE @EmployeeID int, @Departments nvarchar(4000), @DeptName nvarchar(50), @DepartmentID int
|
Note that the CURSOR starts by defining a SELECT statement. In the above script the values in the EmployeeID and Departments
columns are returned for all employees in the Employees table. The CURSOR then FETCHes one record at a time,
storing the values of the EmployeeID and Departments columns in the @EmployeeID and @Departments
variables. Each record is enumerated via the WHILE loop. For more background on CURSORs be sure to read
Cursors: An Overview.
With each employee record we need to parse the comma-delimited list of departments, which is stored in the @Departments variable.
A previous article here on 4Guys, Creating a User Defined Function in SQL for Comma-Delimited Searches,
shows how to parse a comma-delimited list and work with each value in the list as an individual unit using various T-SQL string processing functions.
I replicated the essence of the code in that article to use for this client. In short, I get each department name from the comma-delimited list and
use it to grab the corresponding DepartmentID value in the Departments table. If a corresponding DepartmentID value
is found I add a record to the EmployeeDepartments table using the @EmployeeID and the just-retrieved DepartmentID values.
...
|
Executing the Above Script
Let's run the script on a sample database to show its effectiveness. (Of course, before running any script on a production database make sure you make a backup beforehand!) The
Employees table contains six employees, five of which belong to a department. As you can see in the
screen shot below, Sam does not belong to any departments. Scott and Ernie belong to one department each, while Alice and Jisun belong to two, and
Tito belongs to three.
The Departments table defines five unique departments.
Running the above T-SQL script generates the following output in the Messages window:
Starting parsing for employee 1As you can see, employee 1 (Scott) and employee 6 (Ernie) each have one record added to the
Adding department IT
Starting parsing for employee 2
Adding department Executive
Adding department IT
Starting parsing for employee 3
Adding department Sales
Adding department Marketing
Starting parsing for employee 4
No departments for the employee!
Starting parsing for employee 5
Adding department Executive
Adding department Sales
Adding department Marketing
Starting parsing for employee 6
Adding department Human Resources
EmployeeDepartments table, whereas
employee 4 (Sam) has none added. Employee 5 (Tito) has three departments added, and so on.
The EmployeeDepartments table contains the appropriate records as evidenced by the output above and the screen shot below.
An Alternative, CURSOR-less Approach | |
|---|---|
Alert reader Leonid K. wrote in to share another technique for converting flat, comma-delimited values into a normalized data model
without the need of a CURSOR. The following INSERT statement does the trick:
One issue to keep in mind with Leonid's approach is that it is imperative that the |
Conclusion
Ideally all applications will use properly normalized data models. However, applications designed by junior level programmers or those who are not professional developers often include flat, non-normalized data models. With a bit of elbow grease it's possible to create a SQL script that will automatically convert the flat data model into a normalized one. This article showed a script for converting a many-to-many relationship modeled by a comma-delimited list into a three-table, normalized data model.
Happy Programming!
Attachments:
Further Reading




