Converting Flat, Comma-Delimited Values Into a Normalized Data ModelBy Scott Mitchell
In my job as an independent software developer I help a lot of small businesses enhance their existing company website or internal web applications to include new features or adopt best practices. Many of these businesses have vital line of business applications that were created many years ago by an employee who was not a professional software developer, but perhaps a member of the IT team or someone who was learning how to program or programmed as a hobby. A common mistake made by people without a solid background in creating data-driven applications is using flat, non-normalized data models.
Consider an application used in a healthcare setting may need to record each doctor's professional and educational degrees. Because there are a fixed
number of degrees - PhD, MD, DDS, OB/GYN, RN, etc. - these degrees should be spelled out in a separate database table. And because each doctor can have
multiple degrees, there should be a third table that maps what doctors are associated with what degrees. Such a data model would be normalized. A non-normalized
data model would instead try to capture each doctor's degrees within the same table that contains the doctor's other information (his name, address, DOB, etc.).
This might be implemented as several columns in the table (
Degree3, and so on) or as a single
column that contains a comma-delimited list of degrees, like "PhD, MD, OB/GYN".
While there are certain circumstances where non-normalized data is ideal, in the vast majority of situations having the data expressed in a normalized manner is ideal. Normalized data is easier to work with, is easier to report against, is (usually) more efficient in terms of both disk space and time to execute queries, and is less likely to suffer from any data integrity issues, which are all too common in non-normalized data. I recently helped a client who had a many-to-many relationship implemented in a flat, non-normalized manner convert that data into a normalized data model through the use of a T-SQL script. This article discusses why it is worhtwhile to convert flat, non-normalized data into a normalized data model and steps through how this T-SQL script can be used to normalize your data. Read on to learn more!
An Example of Non-Normalized Data
I recently worked with a client that had a legacy custom web application created by a past employee that was used internally to manage customer and employee information. The application was used primarily to store information about employees and customers and to facilitate searching for customers and employees. For example, each employee can belong to one or more departments, where departments are things like IT, Executive, Sales, and so on. From the web page for editing employees, a user could pick an employee, select zero to many departments from a multi-select list box, and save the results to the database. The search page, which was the most heavily used page on the site, allowed the visitor to select zero to many departments from a list box to filter the results.
The employee who created the application had long since moved on to another job, but the web application was stable enough that it had continued to be
used for many years without issue... until recently! A new branch office opened up on the West coast and to differentiate departments between the two offices, the
departments were replicated for the new site with the new site's name. In other words, there were now departments like "IT" and "IT (West)," and "Sales"
and "Sales (West)." The idea was that the "Sales" department was the people in sales in the original branch office, whereas the "Sales (West)" department
was the sales team in the West coast office. After these new departments were added the search page no longer worked as expected. Searching for employees
in the "Sales" department returned employees in both the "Sales" and "Sales (West)" departments. The person whose job it was to manage this application
concluded (correctly) that the problem was that "Sales" was a substring found in both department names. In an attempt to fix this she changed the
"Sales" name in the
Departments table to "Sales (East)", but after doing so searching for "Sales (East)" returned no employees even though
there were dozens of employees in the previously named "Sales" department. This was when I was called in.
The problem was that the developer who created the application had used two tables to model this relationship:
Departments, which contained
a record for each department; and
Employees, which contained a record for each employee. There should have been a third table used to
model the relationship between employees and departments (since the relationship is many-to-many), but the developer instead used a flat, comma-delimited
column value in the
Employees table named
Departments. If an employee belonged to the Executive and Sales
Employees.Departments would contain the value "Executive, Sales". The search page filtered employees
by using a series of
LIKE statements for each selected filter criteria. If a user searched for employees in the Sales department it
would run a query with a
WHERE clause like:
WHERE Employees.Department LIKE '%Sales%'. As a result, such a
WHERE clause would
return employees in the Sales department and employees in the "Sales (West)" department.
When the "Sales" department was renamed to "Sales (East)", this didn't rename the corresponding value in the
In other words, if the
Employees.Departments column value was "Executive, Sales" when the department was named "Sales", it remained
"Executive, Sales" after the department was renamed to "Sales (East)." But after the department was renamed the search page now used the
WHERE Employees.Department LIKE '%Sales (East)%', which returned no records because those employees in the "Sales (East)" department has the string
"Sales" in their
Employees.Departments column (and not "Sales (East)").
The Ideal Approach - A Normalized Data Model
Ideally, a many-to-many relationship between entities A and B involve three tables:
- Table A
- Table B
- The Join Table
In case the above sounds like jibberish, let's look at how the data model should have been constructed in the application described above. First we have
Employees table (Table A), which models employees; each employee is uniquely identified by the
EmployeeID column, which is an integer
IDENTITY column. We also have the
Departments table (Table B), which likewise has an
IDENTITY column named
DepartmentID. The third and final table we need (which was lacking from my customer's data model and was the source of the problems noted
above) is a table named, say,
EmployeeDepartments with the following schema:
|Composite primary key|
This is the table that ties together what employees belong to what departments (and, conversely, the list of employees that makeup a praticular department). Had this data model been in place, none of my client's problems would have surfaced. New departments could be added at any time and the names of departments could be changed as needed. There would be no naming or substring issues because the employees' departments are specified through unique integer values. The names of each department are not construed in this relationship.
After explaining this to my client his response was predictable: "Great, now how do we get this data model in place for our existing application."
Altering the web application to use the preferred normalized data model required three significant changes:
- Adding a new table,
EmployeeDepartments, and converting the existing data in the
Employees.Departmentscolumn into this new table.
- Updating the web page where employee data was saved so that instead of saving the value as a comma-delimited value in the
Employeestable, a record was added to the
EmployeeDepartmentstable for each selected department.
- Updating the
WHEREclause used in the search page to go against the
EmployeeDepartmentstable rather than to use a series of
This article focuses on the first task - converting the existing flat, comma-delimited data model into a normalized one.
The first order of business in converting the existing flat data model into a normalized one is to create the many-to-many Join Table. The schema for this table is shown above. To create this table use the following T-SQL script, which generates the table and adds foreign key constraints to the two "parent" tables,
Adding a Record to
EmployeeDepartments For Each Department Specified in Each Employee's
The final step in this data convertion is to enumerate through each employee in the
Employeestable, parsing the departments out of the
Departmentscolumn, and inserting a record into the
EmployeeDepartmentstable for each department for the current employee. This requires the use of a
CURSOR, which is a T-SQL construct that allows us to work with each record returned by a
The following T-SQL shows the shell of the
Note that the
CURSOR starts by defining a
SELECT statement. In the above script the values in the
columns are returned for all employees in the
Employees table. The
FETCHes one record at a time,
storing the values of the
Departments columns in the
variables. Each record is enumerated via the
WHILE loop. For more background on
CURSORs be sure to read
Cursors: An Overview.
With each employee record we need to parse the comma-delimited list of departments, which is stored in the
A previous article here on 4Guys, Creating a User Defined Function in SQL for Comma-Delimited Searches,
shows how to parse a comma-delimited list and work with each value in the list as an individual unit using various T-SQL string processing functions.
I replicated the essence of the code in that article to use for this client. In short, I get each department name from the comma-delimited list and
use it to grab the corresponding
DepartmentID value in the
Departments table. If a corresponding
is found I add a record to the
EmployeeDepartments table using the
@EmployeeID and the just-retrieved
Executing the Above Script
Let's run the script on a sample database to show its effectiveness. (Of course, before running any script on a production database make sure you make a backup beforehand!) The
Employeestable contains six employees, five of which belong to a department. As you can see in the screen shot below, Sam does not belong to any departments. Scott and Ernie belong to one department each, while Alice and Jisun belong to two, and Tito belongs to three.
Departments table defines five unique departments.
Running the above T-SQL script generates the following output in the Messages window:
Starting parsing for employee 1As you can see, employee 1 (Scott) and employee 6 (Ernie) each have one record added to the
Adding department IT
Starting parsing for employee 2
Adding department Executive
Adding department IT
Starting parsing for employee 3
Adding department Sales
Adding department Marketing
Starting parsing for employee 4
No departments for the employee!
Starting parsing for employee 5
Adding department Executive
Adding department Sales
Adding department Marketing
Starting parsing for employee 6
Adding department Human Resources
EmployeeDepartmentstable, whereas employee 4 (Sam) has none added. Employee 5 (Tito) has three departments added, and so on.
EmployeeDepartments table contains the appropriate records as evidenced by the output above and the screen shot below.
|An Alternative, |
Alert reader Leonid K. wrote in to share another technique for converting flat, comma-delimited values into a normalized data model
without the need of a |
One issue to keep in mind with Leonid's approach is that it is imperative that the
Ideally all applications will use properly normalized data models. However, applications designed by junior level programmers or those who are not professional developers often include flat, non-normalized data models. With a bit of elbow grease it's possible to create a SQL script that will automatically convert the flat data model into a normalized one. This article showed a script for converting a many-to-many relationship modeled by a comma-delimited list into a three-table, normalized data model.