![]() |
|
|
Published: Monday, December 04, 2000 By Richard Lowe
In Part 2 we looked at how to grab particular "chunks" of HTML from a Web page. In this part, we'll examine how to use regular expressions to parse data files!
Parsing Data Files A very plain and ordinary flat ASCII text data file might look like this:
In this file, the data is simply and atomically presented with a header (in caps) and two records with each field delimited by a comma character. Parsing is a simple matter of first splitting the file by rows (newline chars) and then dividing each record up into its fields. But what happens when you want to include a comma in the data itself:
Trying to parse the first record creates a problem because the last record will be considered to be two fields by a parser that only considers commas. In order to circumvent this problem, fields that contain the delimiter character are qualified - distinguished usually by being enclosed in quotes. A text qualified version of the above data file would look like this:
Now there is way to tell which commas should be used to split the record up and which should be left as part of a field, every comma inside the single quotes should be treated as part of the text. All that remains is to implement a regular expression parser that can tell when to split based on the comma and when not to. The challenge here is a bit different from most regular expressions. Typically, you will only be looking at a small portion of text and seeing if that matches your regular expression. But in this case, the only way to reliably tell what is inside the quotes is to consider the entire line at once. Here's an example of what I mean, take this partial line of text from a fictional data file:
Since there is data to the left of the To solve this our regular expression must examine the entire line of text and consider how many quotes appear in it to determine whether we are inside or outside of a set of quotes:
This regular expression first finds a comma, then looks to make sure there that the number of single quotes after the comma is either an even number or none at all. It works on the premise that an even number of quotes following a comma denotes that the comma is outside of a string. Here's how it breaks down:
Here is a VBScript function that accepts a string and retuns an array which is split based on using commas as delimiters and the single quote as the text qualifier:
In summary, parsing text data files with regular expressions is efficient and saves your development time, because you're spared from looping through your text to pick out complex patterns to break the file up with. In a highly transitional time where there is still plenty of legacy data floating around (data that is still very imporant to the businesses that use it), knowing how to create an efficient parsing routing is a valued skill.
In Part 4 we will conclude our examination of regular expression usage
with an examination of using regular expressions to replace strings (providing much more power than the simple
VBScript
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
![]() |
![]() |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||