To read the article online, visit http://www.4GuysFromRolla.com/articles/110508-1.aspx

Converting Flat, Comma-Delimited Values Into a Normalized Data Model

By Scott Mitchell


Introduction


In my job as an independent software developer I help a lot of small businesses enhance their existing company website or internal web applications to include new features or adopt best practices. Many of these businesses have vital line of business applications that were created many years ago by an employee who was not a professional software developer, but perhaps a member of the IT team or someone who was learning how to program or programmed as a hobby. A common mistake made by people without a solid background in creating data-driven applications is using flat, non-normalized data models.

Consider an application used in a healthcare setting may need to record each doctor's professional and educational degrees. Because there are a fixed number of degrees - PhD, MD, DDS, OB/GYN, RN, etc. - these degrees should be spelled out in a separate database table. And because each doctor can have multiple degrees, there should be a third table that maps what doctors are associated with what degrees. Such a data model would be normalized. A non-normalized data model would instead try to capture each doctor's degrees within the same table that contains the doctor's other information (his name, address, DOB, etc.). This might be implemented as several columns in the table (Degree1, Degree2, Degree3, and so on) or as a single column that contains a comma-delimited list of degrees, like "PhD, MD, OB/GYN".

While there are certain circumstances where non-normalized data is ideal, in the vast majority of situations having the data expressed in a normalized manner is ideal. Normalized data is easier to work with, is easier to report against, is (usually) more efficient in terms of both disk space and time to execute queries, and is less likely to suffer from any data integrity issues, which are all too common in non-normalized data. I recently helped a client who had a many-to-many relationship implemented in a flat, non-normalized manner convert that data into a normalized data model through the use of a T-SQL script. This article discusses why it is worhtwhile to convert flat, non-normalized data into a normalized data model and steps through how this T-SQL script can be used to normalize your data. Read on to learn more!

An Example of Non-Normalized Data


I recently worked with a client that had a legacy custom web application created by a past employee that was used internally to manage customer and employee information. The application was used primarily to store information about employees and customers and to facilitate searching for customers and employees. For example, each employee can belong to one or more departments, where departments are things like IT, Executive, Sales, and so on. From the web page for editing employees, a user could pick an employee, select zero to many departments from a multi-select list box, and save the results to the database. The search page, which was the most heavily used page on the site, allowed the visitor to select zero to many departments from a list box to filter the results.

The employee who created the application had long since moved on to another job, but the web application was stable enough that it had continued to be used for many years without issue... until recently! A new branch office opened up on the West coast and to differentiate departments between the two offices, the departments were replicated for the new site with the new site's name. In other words, there were now departments like "IT" and "IT (West)," and "Sales" and "Sales (West)." The idea was that the "Sales" department was the people in sales in the original branch office, whereas the "Sales (West)" department was the sales team in the West coast office. After these new departments were added the search page no longer worked as expected. Searching for employees in the "Sales" department returned employees in both the "Sales" and "Sales (West)" departments. The person whose job it was to manage this application concluded (correctly) that the problem was that "Sales" was a substring found in both department names. In an attempt to fix this she changed the "Sales" name in the Departments table to "Sales (East)", but after doing so searching for "Sales (East)" returned no employees even though there were dozens of employees in the previously named "Sales" department. This was when I was called in.

The problem was that the developer who created the application had used two tables to model this relationship: Departments, which contained a record for each department; and Employees, which contained a record for each employee. There should have been a third table used to model the relationship between employees and departments (since the relationship is many-to-many), but the developer instead used a flat, comma-delimited column value in the Employees table named Departments. If an employee belonged to the Executive and Sales departments their Employees.Departments would contain the value "Executive, Sales". The search page filtered employees by using a series of LIKE statements for each selected filter criteria. If a user searched for employees in the Sales department it would run a query with a WHERE clause like: WHERE Employees.Department LIKE '%Sales%'. As a result, such a WHERE clause would return employees in the Sales department and employees in the "Sales (West)" department.

When the "Sales" department was renamed to "Sales (East)", this didn't rename the corresponding value in the Employees.Departments column. In other words, if the Employees.Departments column value was "Executive, Sales" when the department was named "Sales", it remained "Executive, Sales" after the department was renamed to "Sales (East)." But after the department was renamed the search page now used the WHERE clause WHERE Employees.Department LIKE '%Sales (East)%', which returned no records because those employees in the "Sales (East)" department has the string "Sales" in their Employees.Departments column (and not "Sales (East)").

The Ideal Approach - A Normalized Data Model


Ideally, a many-to-many relationship between entities A and B involve three tables:
  • Table A
  • Table B
  • The Join Table
The Join Table is what maps elements in A to elements and B and vice-a-versa. It is typically comprised of the primary key(s) of Table A and Table B, and those joint primary keys in the Join Table form a composite primary key.

In case the above sounds like jibberish, let's look at how the data model should have been constructed in the application described above. First we have the Employees table (Table A), which models employees; each employee is uniquely identified by the EmployeeID column, which is an integer IDENTITY column. We also have the Departments table (Table B), which likewise has an IDENTITY column named DepartmentID. The third and final table we need (which was lacking from my customer's data model and was the source of the problems noted above) is a table named, say, EmployeeDepartments with the following schema:

EmployeeDepartments
ColumnData TypeNotes
EmployeeIDintComposite primary key
DepartmentIDint

This is the table that ties together what employees belong to what departments (and, conversely, the list of employees that makeup a praticular department). Had this data model been in place, none of my client's problems would have surfaced. New departments could be added at any time and the names of departments could be changed as needed. There would be no naming or substring issues because the employees' departments are specified through unique integer values. The names of each department are not construed in this relationship.

After explaining this to my client his response was predictable: "Great, now how do we get this data model in place for our existing application."

Altering the web application to use the preferred normalized data model required three significant changes:

  • Adding a new table, EmployeeDepartments, and converting the existing data in the Employees.Departments column into this new table.
  • Updating the web page where employee data was saved so that instead of saving the value as a comma-delimited value in the Employees table, a record was added to the EmployeeDepartments table for each selected department.
  • Updating the WHERE clause used in the search page to go against the EmployeeDepartments table rather than to use a series of LIKE clauses.

This article focuses on the first task - converting the existing flat, comma-delimited data model into a normalized one.

Creating the EmployeeDepartments Table


The first order of business in converting the existing flat data model into a normalized one is to create the many-to-many Join Table. The schema for this table is shown above. To create this table use the following T-SQL script, which generates the table and adds foreign key constraints to the two "parent" tables, Employees and Departments:

CREATE TABLE [dbo].[EmployeeDepartments](
   [EmployeeID] [int] NOT NULL,
   [DepartmentID] [int] NOT NULL,

   CONSTRAINT [PK_EmployeeDepartments] PRIMARY KEY CLUSTERED
   (
      [EmployeeID] ASC,
      [DepartmentID] ASC
   )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[EmployeeDepartments] WITH CHECK ADD CONSTRAINT [FK_EmployeeDepartments_Departments] FOREIGN KEY([DepartmentID])
REFERENCES [dbo].[Departments] ([DepartmentID])
GO

ALTER TABLE [dbo].[EmployeeDepartments] WITH CHECK ADD CONSTRAINT [FK_EmployeeDepartments_Employees] FOREIGN KEY([EmployeeID])
REFERENCES [dbo].[Employees] ([EmployeeID])
GO

Adding a Record to EmployeeDepartments For Each Department Specified in Each Employee's Departments Column


The final step in this data convertion is to enumerate through each employee in the Employees table, parsing the departments out of the Departments column, and inserting a record into the EmployeeDepartments table for each department for the current employee. This requires the use of a CURSOR, which is a T-SQL construct that allows us to work with each record returned by a SELECT statement individually.

The following T-SQL shows the shell of the CURSOR.

DECLARE @EmployeeID int, @Departments nvarchar(4000), @DeptName nvarchar(50), @DepartmentID int

DECLARE empCursor CURSOR LOCAL FAST_FORWARD FOR
SELECT EmployeeID, Departments
FROM Employees

OPEN empCursor
FETCH NEXT FROM empCursor INTO @EmployeeID, @Departments

WHILE @@FETCH_STATUS = 0
BEGIN
   PRINT 'Starting parsing for employee ' + CAST(@EmployeeID as nvarchar(10))

   ...

   FETCH NEXT FROM empCursor INTO @EmployeeID, @Departments
END

CLOSE empCursor
DEALLOCATE empCursor

Note that the CURSOR starts by defining a SELECT statement. In the above script the values in the EmployeeID and Departments columns are returned for all employees in the Employees table. The CURSOR then FETCHes one record at a time, storing the values of the EmployeeID and Departments columns in the @EmployeeID and @Departments variables. Each record is enumerated via the WHILE loop. For more background on CURSORs be sure to read Cursors: An Overview.

With each employee record we need to parse the comma-delimited list of departments, which is stored in the @Departments variable. A previous article here on 4Guys, Creating a User Defined Function in SQL for Comma-Delimited Searches, shows how to parse a comma-delimited list and work with each value in the list as an individual unit using various T-SQL string processing functions. I replicated the essence of the code in that article to use for this client. In short, I get each department name from the comma-delimited list and use it to grab the corresponding DepartmentID value in the Departments table. If a corresponding DepartmentID value is found I add a record to the EmployeeDepartments table using the @EmployeeID and the just-retrieved DepartmentID values.

...

WHILE @@FETCH_STATUS = 0
BEGIN
   PRINT 'Starting parsing for employee ' + CAST(@EmployeeID as nvarchar(10))

   IF @Department IS NOT NULL AND LEN(@Department) > 0
   BEGIN
      -- Add a comma to the end of @Department (for parsing reasons); this tip suggested by reader Wissam Bishouty
      SET @Department = @Department + ','

      -- Pick apart the @Department string
      WHILE CHARINDEX(',', @Department) > 0
      BEGIN
         SET @DeptName = LTRIM(RTRIM(SUBSTRING(@Department, 1, CHARINDEX(',', @Department) - 1)))

         -- Does @DeptName exist in Departments?
         SET @DepartmentID = (SELECT DepartmentID FROM Departments WHERE DepartmentName = @DeptName)
         IF @DepartmentID IS NULL
            PRINT ' !!! Employee ' + CAST(@EmployeeID as nvarchar(10)) + ' has a department value not found in Departments: ' + @DeptName
         ELSE
         BEGIN
            PRINT ' Adding department ' + @DeptName

            -- Add a record to EmployeeDepartments (if it does not already exist)
            IF NOT EXISTS(SELECT 1 FROM EmployeeDepartments WHERE DepartmentID = @DepartmentID AND EmployeeID = @EmployeeID)
               INSERT INTO EmployeeDepartments(DepartmentID, EmployeeID)
               VALUES(@DepartmentID, @EmployeeID)
            ELSE
               PRINT ' Record already found in EmployeeDepartments'
         END

         SET @Department = SUBSTRING(@Department, CHARINDEX(',', @Department) + 1, LEN(@Department))
      END
   END
   ELSE
      PRINT ' No departments for the employee!'

   FETCH NEXT FROM empCursor INTO @EmployeeID, @Departments
END

...

Executing the Above Script


Let's run the script on a sample database to show its effectiveness. (Of course, before running any script on a production database make sure you make a backup beforehand!) The Employees table contains six employees, five of which belong to a department. As you can see in the screen shot below, Sam does not belong to any departments. Scott and Ernie belong to one department each, while Alice and Jisun belong to two, and Tito belongs to three.

The Employees.Departments column defines the employee's departments in a flat, comma-delimited manner.

The Departments table defines five unique departments.

There are five departments in the database.

Running the above T-SQL script generates the following output in the Messages window:

Starting parsing for employee 1
     Adding department IT

Starting parsing for employee 2
     Adding department Executive
     Adding department IT

Starting parsing for employee 3
     Adding department Sales
     Adding department Marketing

Starting parsing for employee 4
     No departments for the employee!

Starting parsing for employee 5
     Adding department Executive
     Adding department Sales
     Adding department Marketing

Starting parsing for employee 6
     Adding department Human Resources
As you can see, employee 1 (Scott) and employee 6 (Ernie) each have one record added to the EmployeeDepartments table, whereas employee 4 (Sam) has none added. Employee 5 (Tito) has three departments added, and so on.

The EmployeeDepartments table contains the appropriate records as evidenced by the output above and the screen shot below.

The EmployeeDepartments models the many-to-many relationship between employees and departments.

An Alternative, CURSOR-less Approach
Alert reader Leonid K. wrote in to share another technique for converting flat, comma-delimited values into a normalized data model without the need of a CURSOR. The following INSERT statement does the trick:

INSERT EmployeeDepartments(EmployeeID, DepartmentID)
SELECT e.EmployeeID, d.DepartmentID
FROM Employees e
   INNER JOIN Departments d ON
      ',' + e.Departments + ',' LIKE '%,' + d.DepartmentName + ',%'

One issue to keep in mind with Leonid's approach is that it is imperative that the Employees.Departments column have no spaces before or after the department names. In other words, the above approach will work as expected for Employees.Departments values like "IT,Human Resources", but not for values like "IT, Human Resources" (note the space after the comma and before the "H" in "Human Resources").

Conclusion


Ideally all applications will use properly normalized data models. However, applications designed by junior level programmers or those who are not professional developers often include flat, non-normalized data models. With a bit of elbow grease it's possible to create a SQL script that will automatically convert the flat data model into a normalized one. This article showed a script for converting a many-to-many relationship modeled by a comma-delimited list into a three-table, normalized data model.

Happy Programming!

  • By Scott Mitchell


    Attachments:


  • Download the T-SQL Script Used in this article
  • Further Reading


  • Creating a User Defined Function in SQL for Comma-Delimited Searches
  • Returning a Comma-Delimited List of Related Records
  • Cursors: An Overview
  • Article Information
    Article Title: ASP.NET.Converting Flat, Comma-Delimited Values Into a Normalized Data Model
    Article Author: Scott Mitchell
    Published Date: November 5, 2008
    Article URL: http://www.4GuysFromRolla.com/articles/110508-1.aspx


    Copyright 2017 QuinStreet Inc. All Rights Reserved.
    Legal Notices, Licensing, Permissions, Privacy Policy.
    Advertise | Newsletters | E-mail Offers