
Common Applications of Regular Expressions, Part 3

By Richard Lowe



  • In Part 2 we looked at how to grab particular "chunks" of HTML from a Web page. In this part, we'll examine how to use regular expressions to parse data files!

    Parsing Data Files
    Data files come in a multitude of formats and descriptions. XML files, delimited text and even unstructured text are often the sources of the data our applications need. The example we'll look at below is a delimited text file that uses qualified strings: strings enclosed in a qualifier character, such as a quote, to indicate text that must be kept together even if it contains the delimiter character used to split the records into individual fields.

    A very plain and ordinary flat ASCII text data file might look like this:

    LAST NAME, FIRST NAME, PHONE, QUOTE
    Lowe, Richard, 312 555 1212, ASP is good
    Huston, John, 847 555 1212, I make movies
    

    In this file, the data is simply and atomically presented with a header (in caps) and two records, with each field delimited by a comma character. Parsing is a simple matter of first splitting the file into rows (on the newline characters) and then dividing each record up into its fields. But what happens when you want to include a comma in the data itself?

    LAST NAME, FIRST NAME, PHONE, QUOTE
    Lowe, Richard, 312 555 1212, I like ASP, VB and SQL
    Huston, John, 847 555 1212, I make movies
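
    To see the problem concretely, here is a minimal sketch of a comma-only parser applied to the file above. CountFields is a hypothetical helper introduced only for this illustration, and the sketch assumes the rows are separated by vbCrLf:

    ```vbscript
    ' CountFields is a hypothetical helper (not part of the article's code).
    ' It reports how many fields a comma-only split finds in one record.
    Function CountFields(strRecord)
      CountFields = UBound(Split(strRecord, ",")) + 1
    End Function

    Dim strData, arrRows
    strData = "LAST NAME, FIRST NAME, PHONE, QUOTE" & vbCrLf & _
              "Lowe, Richard, 312 555 1212, I like ASP, VB and SQL" & vbCrLf & _
              "Huston, John, 847 555 1212, I make movies"

    ' First split the file into rows (assumes vbCrLf line endings)
    arrRows = Split(strData, vbCrLf)

    ' CountFields(arrRows(0)) and CountFields(arrRows(2)) are both 4,
    ' but CountFields(arrRows(1)) is 5: the comma inside the quote
    ' is treated as one more delimiter.
    ```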
    

    Trying to parse the first record creates a problem because its last field will be considered to be two fields by a parser that only considers commas. In order to circumvent this problem, fields that contain the delimiter character are qualified, which usually means they are enclosed in quotes. A text-qualified version of the above data file would look like this:

    LAST NAME, FIRST NAME, PHONE, QUOTE
    Lowe, Richard, 312 555 1212, 'I like ASP, VB and SQL'
    Huston, John, 847 555 1212, 'I make movies'
    

    Now there is a way to tell which commas should be used to split the record up and which should be left as part of a field: every comma inside the single quotes should be treated as part of the text. All that remains is to implement a regular expression parser that can tell when to split on a comma and when not to. The challenge here is a bit different from most regular expressions. Typically, you will only be looking at a small portion of text and seeing if that matches your regular expression. But in this case, the only way to reliably tell what is inside the quotes is to consider the entire line at once. Here's an example of what I mean. Take this partial line of text from a fictional data file:

    1, Ford, Black, 21, ', dog, cat, duck, ',

    Since this is only part of a line, there may be data to the left of the 1, and that makes the above text quite ambiguous. We don't know how many single quotes have come before this segment of the data, and therefore we don't know which text is the qualified text (which we should not split up in our parsing). If there is an even number of single quotes (or none) before this text, then ', dog, cat, duck, ' is a qualified string and should be kept together. If there is an odd number, then 1, Ford, Black, 21, ' is the end portion of a qualified string and should be kept together.
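
    The even/odd rule can be sketched directly by counting quotes. IsInsideQuotes below is a hypothetical helper written for illustration, not the approach the article goes on to use; the regular expression applies the same parity test to the quotes to the right of a comma instead, which gives the same answer whenever the quotes on the line are balanced:

    ```vbscript
    ' IsInsideQuotes is a hypothetical helper (not part of the article's code).
    ' It reports whether the character at lngPos (1-based) lies inside a
    ' qualified string, judged by the number of single quotes to its left.
    Function IsInsideQuotes(strLine, lngPos)
      Dim i, lngQuotes
      lngQuotes = 0
      For i = 1 To lngPos - 1
        If Mid(strLine, i, 1) = "'" Then lngQuotes = lngQuotes + 1
      Next
      ' An odd count means a quote has been opened but not yet closed
      IsInsideQuotes = ((lngQuotes Mod 2) = 1)
    End Function
    ```

    If the fragment above were a complete line, the comma after 21 (position 19) would be outside the quotes and the comma after dog (position 27) would be inside them. Put a single quote in front of the fragment and both answers flip, which is exactly the ambiguity described above.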

    To solve this our regular expression must examine the entire line of text and consider how many quotes appear in it to determine whether we are inside or outside of a set of quotes:

    ,(?=([^']*'[^']*')*(?![^']*'))

    This regular expression first finds a comma, then looks ahead to make sure that the number of single quotes after the comma is either an even number or none at all. It works on the premise that an even number of quotes following a comma denotes that the comma is outside of a string. Here's how it breaks down:

    , Find a comma
    (?= lookahead to match this pattern:
    ( start a new pattern
    [^']*' [not a quote] 0 or many times then a quote
    [^']*' [not a quote] 0 or many times then a quote, combined with the one above it matches pairs of quotes
    )* end the pattern and match the whole pattern (pairs of quotes) zero, or multiple times
    (?! lookahead to exclude this pattern
    [^']*' [not a quote] 0 or many times then a quote
    ) end the pattern
    ) end the lookahead
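
    One way to check the breakdown is to ask the RegExp object which commas the pattern actually matches. This is a small sketch; FindDelimiters is a hypothetical helper that is not part of the article's code, and it assumes a scripting engine whose RegExp supports lookahead (VBScript 5.5 or later):

    ```vbscript
    ' FindDelimiters is a hypothetical helper (not part of the article's code).
    ' It returns the 0-based positions of the commas that the pattern
    ' treats as delimiters, as a space-separated string.
    Function FindDelimiters(strLine)
      Dim objRE, objMatch, strOut
      Set objRE = New RegExp
      objRE.Global = True
      objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"

      strOut = ""
      For Each objMatch In objRE.Execute(strLine)
        ' FirstIndex is the 0-based offset of the matched comma
        strOut = strOut & objMatch.FirstIndex & " "
      Next
      FindDelimiters = Trim(strOut)
    End Function
    ```

    For the line Lowe, Richard, 312 555 1212, 'I like ASP, VB and SQL' this returns "4 13 27", the first three commas. The comma after ASP is followed by exactly one quote, so the lookahead rejects it.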

    Here is a VBScript function that accepts a string and returns an array, splitting the string on commas as delimiters and treating the single quote as the text qualifier:

    Function SplitAdv(strInput)
      Dim objRE
      Set objRE = new RegExp
    
      ' Set up our RegExp object
      objRE.IgnoreCase = true
      objRE.Global = true
      objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
    
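      ' .Replace swaps each delimiter comma for Chr(8), the
      ' backspace character, which is extremely unlikely to
      ' appear in any string. The line is then split into an
      ' array on that character. Chr(8) is used rather than "\b"
      ' because VBScript string literals have no escape sequences:
      ' "\b" would be the two literal characters \ and b.

      SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))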
    End Function
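
    Note that the fields returned by a split like this still carry their leading spaces and their enclosing quotes. A possible clean-up step is sketched below. CleanField and ParseRecord are hypothetical helpers, not part of the original article, and SplitAdv is repeated (using Chr(8) as the temporary delimiter) only so the listing is self-contained:

    ```vbscript
    Function SplitAdv(strInput)
      Dim objRE
      Set objRE = New RegExp
      objRE.Global = True
      objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
      ' Chr(8), the backspace character, is the temporary delimiter
      SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))
    End Function

    ' CleanField (hypothetical): trims surrounding spaces and removes
    ' one pair of enclosing single quotes, if present.
    Function CleanField(strField)
      Dim strOut
      strOut = Trim(strField)
      If Len(strOut) >= 2 And Left(strOut, 1) = "'" And Right(strOut, 1) = "'" Then
        strOut = Mid(strOut, 2, Len(strOut) - 2)
      End If
      CleanField = strOut
    End Function

    ' ParseRecord (hypothetical): splits one line and cleans every field.
    Function ParseRecord(strLine)
      Dim arrFields, i
      arrFields = SplitAdv(strLine)
      For i = 0 To UBound(arrFields)
        arrFields(i) = CleanField(arrFields(i))
      Next
      ParseRecord = arrFields
    End Function
    ```

    Calling ParseRecord on the Lowe record from the text-qualified file yields four fields, the last of which is I like ASP, VB and SQL with its comma intact and its quotes removed.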
    

    In summary, parsing text data files with regular expressions is efficient and saves you development time, because you're spared from looping through your text to pick out the complex patterns needed to break the file up. In a highly transitional time where there is still plenty of legacy data floating around (data that is still very important to the businesses that use it), knowing how to create an efficient parsing routine is a valued skill.

    In Part 4 we will conclude our look at regular expression usage by examining how to use regular expressions to replace strings (providing much more power than the simple VBScript Replace function)!



  • Article Information
    Article Title: Common Applications of Regular Expressions
    Article Author: Richard Lowe
    Published Date: Monday, December 04, 2000
    Article URL: http://www.4GuysFromRolla.com/webtech/120400-1.3.shtml


    Copyright 2017 QuinStreet Inc. All Rights Reserved.