Published: Monday, December 04, 2000

Common Applications of Regular Expressions, Part 3
By Richard Lowe


  • Read Part 1
  • Read Part 2

    In Part 2 we looked at how to grab particular "chunks" of HTML from a Web page. In this part, we'll examine how to use regular expressions to parse data files!


    Parsing Data Files
    Data files come in a multitude of formats and descriptions. XML files, delimited text and even unstructured text are often the sources of the data our applications need. The example we'll look at below is a delimited text file that uses text qualifiers: characters such as quotes that mark strings which must be kept together even if they contain the delimiter character used to split the records into individual fields.

    A very plain and ordinary flat ASCII text data file might look like this:

    LAST NAME, FIRST NAME, PHONE, QUOTE
    Lowe, Richard, 312 555 1212, ASP is good
    Huston, John, 847 555 1212, I make movies
    

    In this file, the data is simply and atomically presented with a header (in caps) and two records with each field delimited by a comma character. Parsing is a simple matter of first splitting the file by rows (newline chars) and then dividing each record up into its fields. But what happens when you want to include a comma in the data itself:

    LAST NAME, FIRST NAME, PHONE, QUOTE
    Lowe, Richard, 312 555 1212, I like ASP, VB and SQL
    Huston, John, 847 555 1212, I make movies
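
    The row-then-field approach described above can be sketched as follows. This is an illustration in Python rather than the article's VBScript, and parse_naive is a hypothetical name. Run against both sample files, it shows where a comma-only parser goes wrong:

```python
# Naive parser: split the file into rows, then split each row on commas.
def parse_naive(text):
    rows = [line for line in text.splitlines() if line.strip()]
    return [[field.strip() for field in row.split(",")] for row in rows]

plain = """LAST NAME, FIRST NAME, PHONE, QUOTE
Lowe, Richard, 312 555 1212, ASP is good
Huston, John, 847 555 1212, I make movies"""

with_comma = """LAST NAME, FIRST NAME, PHONE, QUOTE
Lowe, Richard, 312 555 1212, I like ASP, VB and SQL
Huston, John, 847 555 1212, I make movies"""

naive_plain = parse_naive(plain)
naive_comma = parse_naive(with_comma)

print([len(row) for row in naive_plain])  # [4, 4, 4]
print([len(row) for row in naive_comma])  # [4, 5, 4]
```

    The second file yields five fields for the Lowe record, because the comma inside the quote text is treated as a delimiter.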
    

    Trying to parse the first record creates a problem because its last field will be considered to be two fields by a parser that only considers commas. To circumvent this problem, fields that contain the delimiter character are qualified, usually by being enclosed in quotes. A text-qualified version of the above data file would look like this:

    LAST NAME, FIRST NAME, PHONE, QUOTE
    Lowe, Richard, 312 555 1212, 'I like ASP, VB and SQL'
    Huston, John, 847 555 1212, 'I make movies'
    

    Now there is a way to tell which commas should be used to split the record and which should be left as part of a field: every comma inside the single quotes is treated as part of the text. All that remains is to implement a regular expression parser that can tell when to split on a comma and when not to. The challenge here is a bit different from most regular expressions. Typically, you look at only a small portion of text and see whether it matches your regular expression. In this case, however, the only way to reliably tell what is inside the quotes is to consider the entire line at once. Here's an example of what I mean. Take this partial line of text from a fictional data file:

    1, Ford, Black, 21, ', dog, cat, duck, ',

    Since there is data to the left of the 1, the above line is really quite ambiguous: we don't know how many single quotes have come before this segment of the data, and therefore we don't know which text is the qualified text (which we should not split up in our parsing). If there is an even number of single quotes (or none) before this text, then ', dog, cat, duck, ' is a qualified string and should be kept together. If there is an odd number, then 1, Ford, Black, 21, ' is the end portion of a qualified string and should be kept together.
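
    The even/odd argument can be checked with a small left-to-right scan. The sketch below is in Python for illustration only. comma_states is a hypothetical helper, and the lone quote prepended in the second case stands in for whatever unseen text might precede the fragment:

```python
def comma_states(line, qualifier="'"):
    """Return (index, inside_quotes) for each comma, scanning left to right."""
    inside = False
    states = []
    for i, ch in enumerate(line):
        if ch == qualifier:
            inside = not inside          # every quote toggles the state
        elif ch == ",":
            states.append((i, inside))
    return states

fragment = "1, Ford, Black, 21, ', dog, cat, duck, ',"

# Even number of quotes before the fragment (here: none).
even_case = [inside for _, inside in comma_states(fragment)]

# Odd number of quotes before the fragment (assumed: one open quote).
odd_case = [inside for _, inside in comma_states("'" + fragment)]

print(even_case)  # [False, False, False, False, True, True, True, True, False]
print(odd_case)   # [True, True, True, True, False, False, False, False, True]
```

    The same nine commas flip between "delimiter" and "part of the text" depending only on how many quotes came before them, which is why the fragment alone cannot be parsed reliably.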

    To solve this, our regular expression must examine the entire line of text and consider how many quotes appear in it to determine whether we are inside or outside of a set of quotes:

    ,(?=([^']*'[^']*')*(?![^']*'))

    This regular expression first finds a comma, then looks ahead to make sure that the number of single quotes after the comma is either an even number or none at all. It works on the premise that an even number of quotes following a comma denotes that the comma is outside of a string. Here's how it breaks down:

    , Find a comma
    (?= lookahead to match this pattern:
    ( start a new pattern
    [^']*' [not a quote] 0 or many times then a quote
    [^']*' [not a quote] 0 or many times then a quote; combined with the one above, it matches a pair of quotes
    )* end the pattern and match the whole pattern (pairs of quotes) zero or multiple times
    (?! lookahead to exclude this pattern:
    [^']*' [not a quote] 0 or many times then a quote
    ) end the excluded pattern
    ) end the lookahead
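
    To see the breakdown in action, here is a sketch that writes the same pattern in Python's verbose regex syntax, so each piece carries its own comment. The name SPLIT_COMMA and the choice of Python are assumptions for illustration. The group is written as non-capturing, which does not change what the pattern matches:

```python
import re

SPLIT_COMMA = re.compile(r"""
    ,                 # find a comma
    (?=               # lookahead: the rest of the text must consist of
        (?:           #   a group holding
            [^']*'    #     non-quotes, then an opening quote
            [^']*'    #     non-quotes, then a closing quote
        )*            #   repeated zero or more times (pairs of quotes)
        (?![^']*')    #   followed by no further quote at all
    )
""", re.VERBOSE)

line = "Lowe, Richard, 312 555 1212, 'I like ASP, VB and SQL'"
positions = [m.start() for m in SPLIT_COMMA.finditer(line)]
print(positions)  # [4, 13, 27]
```

    Only the three commas outside the quotes are matched. The comma after "ASP" is followed by a single quote, so the lookahead fails there.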

    Here is a VBScript function that accepts a string and returns an array which is split using commas as delimiters and the single quote as the text qualifier:

    Function SplitAdv(strInput)
      Dim objRE
      Set objRE = new RegExp
    
      ' Set up our RegExp object
      objRE.IgnoreCase = true
      objRE.Global = true
      objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
    
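      ' .Replace swaps each delimiter comma for Chr(8), the
      ' backspace character, which is extremely unlikely to
      ' appear in the data; Split then breaks the line into an
      ' array on that character. VBScript has no string escapes,
      ' so "\b" would be the two literal characters \ and b.
      Dim strSep
      strSep = Chr(8)

      SplitAdv = Split(objRE.Replace(strInput, strSep), strSep)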
    End Function
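
    For comparison, here is a rough Python counterpart, offered as a sketch and not as a translation of the author's code. split_adv, clean and the variable names are hypothetical. Python's re.split also returns the text of any capturing group, so the group is written as non-capturing, and a small helper trims the spaces and qualifier quotes that the split leaves on each field:

```python
import re

# Same pattern as in the article, with a non-capturing group.
_DELIM = re.compile(r",(?=(?:[^']*'[^']*')*(?![^']*'))")

def split_adv(line):
    """Split on commas that fall outside single-quoted text."""
    return _DELIM.split(line)

def clean(field):
    """Trim whitespace and remove one pair of surrounding single quotes."""
    field = field.strip()
    if len(field) >= 2 and field[0] == field[-1] == "'":
        field = field[1:-1]
    return field

data = """LAST NAME, FIRST NAME, PHONE, QUOTE
Lowe, Richard, 312 555 1212, 'I like ASP, VB and SQL'
Huston, John, 847 555 1212, 'I make movies'"""

header, *rows = [[clean(f) for f in split_adv(line)]
                 for line in data.splitlines()]
records = [dict(zip(header, row)) for row in rows]

print(records[0]["QUOTE"])  # I like ASP, VB and SQL
```

    Splitting line by line keeps each lookahead confined to one record, which matches how the VBScript function is meant to be called.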
    

    In summary, parsing text data files with regular expressions is efficient and saves you development time, because you're spared from looping through your text to pick out complex patterns to break the file up with. In a highly transitional time where there is still plenty of legacy data floating around (data that is still very important to the businesses that use it), knowing how to create an efficient parsing routine is a valued skill.

    In Part 4 we will conclude our look at regular expression usage by examining how to use regular expressions to replace strings (providing much more power than the simple VBScript Replace function)!

  • Read Part 4!

