To read the article online, visit http://www.4GuysFromRolla.com/webtech/103100-1.shtml

Picking Out Delimited Text with Regular Expressions


Problems with Windows Scripting Host 5.5
Some people have reported problems installing WSH 5.5, which is required for the non-greedy repetition regular expressions discussed in this article. For more information on these installation problems, consult this ASPMessageboard post.

Have you ever wanted to parse an HTML document and be able to easily grab the text between certain text delimiters? For example, imagine that we wanted to list all the text in an HTML document that falls within any bold tags (<b> ... </b>). Or say that we wanted to grab the text (if any) that was the HTML TITLE (i.e., the text between the TITLE tags: <TITLE> ... </TITLE>). While this can be done with standard VBScript string functions, the needed code is often messy, usually requiring multiple variables to hold the various indexes where certain substrings start and end.
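To make the "messy" claim concrete, here is a rough sketch of the string-function approach, using only InStr and Mid to pull out the first bold span. The sample string and the variable names (strHTML, iStart, iEnd, strBold) are made up for illustration, and the code assumes it runs inside an ASP page:

```vbscript
' Sketch: grab the text inside the FIRST <b>...</b> pair using only
' standard string functions. strHTML is a hypothetical sample value.
Dim strHTML, iStart, iEnd, strBold
strHTML = "Hello <b>there</b>!  How are <b>you</b> today?"

' vbTextCompare makes the search case-insensitive (<b> or <B>)
iStart = InStr(1, strHTML, "<b>", vbTextCompare)
If iStart > 0 Then
  iStart = iStart + Len("<b>")    ' move past the opening tag
  iEnd = InStr(iStart, strHTML, "</b>", vbTextCompare)
  If iEnd > 0 Then
    ' everything between the end of <b> and the start of </b>
    strBold = Mid(strHTML, iStart, iEnd - iStart)
  End If
End If

Response.Write Server.HTMLEncode(strBold)   ' writes: there
```

Note that this only finds the first bold span. Finding every span would mean wrapping the whole thing in a loop and tracking yet another position variable.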

With regular expressions, however, this is a very easy task! This article will not delve into what, exactly, regular expressions are or the specifics of using them in an ASP page. For more information on these topics be sure to read the articles recommended in the Regular Expressions Article Index and read some of the posts at the Regular Expression Forum. If you are familiar with regular expressions, you might think the challenge I propose is easily solvable with the following regular expression:

-- in general terms
delimiter(.*)delimiter

-- to find text between bold tags:
<b>(.*)</b>

Well, you're kind of right. For those not familiar with the above regular expression, it is basically saying, "Search for the first delimiter (<b>), look for zero or more characters, and then look for the closing delimiter (</b>)." While this may seem like the right thing to ask for, consider the following HTML document:

<HTML>
<BODY>
  Hello <B>there</B>!  How are <b>you</b> today?
</BODY>
</HTML>

The above regular expression (<b>(.*)</b>) is a bit ambiguous in this scenario. Do you want to return <B>there</B> and <b>you</b> (as two separate strings), or <B>there</B>! How are <b>you</b> (as one lengthier string)? Realize that both of the possible results follow the English explanation given above. The strings in both results begin with the starting delimiter <b>, contain zero or more characters, and end with the closing delimiter, </b>. The first result returns two strings, while the second result returns just one. If you use the regular expression <b>(.*)</b> (with case-insensitive matching, since the document mixes <B> and <b>), it will return the second result, the longer string, <B>there</B>! How are <b>you</b>.
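A quick way to see the greedy behavior for yourself is to run the pattern against the sample line. The following is only a sketch: the variable names are my own, the sample string is the middle line of the HTML document above, and the code assumes it runs inside an ASP page with the VBScript RegExp object available:

```vbscript
' Sketch: greedy repetition returns ONE long match.
Dim objRegExp, objMatches, objMatch, strHTML
strHTML = "Hello <B>there</B>!  How are <b>you</b> today?"

Set objRegExp = New RegExp
objRegExp.Pattern = "<b>(.*)</b>"
objRegExp.IgnoreCase = True   ' treat <B> and <b> the same
objRegExp.Global = True       ' look for every match, not just the first

Set objMatches = objRegExp.Execute(strHTML)

' Only one match comes back: it starts at the first <B> and
' runs all the way to the LAST </b> on the line.
Response.Write "Matches found: " & objMatches.Count & "<br>"
For Each objMatch In objMatches
  Response.Write Server.HTMLEncode(objMatch.Value) & "<br>"
Next
```

Keep in mind that the period in a VBScript regular expression does not match a newline character, so a greedy match like this one will not run past the end of a line.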

To get the two shorter strings, we've got to tell the regular expression engine that, when searching for zero or more characters between our two delimiters, it should return the match that has the fewest characters between the delimiters. This is done by using non-greedy repetition. The .* represents greedy repetition: it looks for zero or more characters (emphasis on more). We can specify non-greedy repetition, which will return matches that have the fewest characters between the delimiters, by using .*? (note the addition of the question mark).
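Here is a sketch of the non-greedy pattern run against the sample line from the HTML document above. As before, the variable names are my own and the code assumes an ASP page. It also uses the SubMatches collection to get at just the text inside the parentheses, which, like non-greedy repetition, requires version 5.5 of the scripting engine:

```vbscript
' Sketch: non-greedy repetition returns TWO short matches.
Dim objRegExp, objMatches, objMatch, strHTML
strHTML = "Hello <B>there</B>!  How are <b>you</b> today?"

Set objRegExp = New RegExp
objRegExp.Pattern = "<b>(.*?)</b>"   ' note the ? after the *
objRegExp.IgnoreCase = True          ' treat <B> and <b> the same
objRegExp.Global = True              ' look for every match

Set objMatches = objRegExp.Execute(strHTML)

' Two matches come back: <B>there</B> and <b>you</b>.
' SubMatches(0) holds the text captured by the parentheses,
' i.e. the text between the delimiters: "there" and "you".
Response.Write "Matches found: " & objMatches.Count & "<br>"
For Each objMatch In objMatches
  Response.Write Server.HTMLEncode(objMatch.SubMatches(0)) & "<br>"
Next
```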

Regular expressions became available in VBScript with the 5.0 release. Non-greedy repetition, however, wasn't supported! With the latest release of Microsoft's scripting engines (version 5.5), non-greedy repetition is supported. So, before you can use the code we will examine shortly, you must ensure that you have the VBScript Scripting Engine 5.5 or greater installed on your system. To download the latest version of the VBScript Scripting Engine visit http://msdn.microsoft.com/scripting/; to determine what server-side scripting engine version you're using, be sure to read: Determining the Server-Side Scripting Language and Version!
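If you just want a quick check from within an ASP page, VBScript's built-in ScriptEngine, ScriptEngineMajorVersion, ScriptEngineMinorVersion, and ScriptEngineBuildVersion functions report what is installed. The snippet below is a minimal sketch; the blnNonGreedyOK variable is my own name, and the 5.5 threshold comes from the requirement described above:

```vbscript
' Sketch: report the scripting engine version and check for 5.5+
Dim blnNonGreedyOK

Response.Write ScriptEngine & " " & ScriptEngineMajorVersion & "." & _
               ScriptEngineMinorVersion & "." & _
               ScriptEngineBuildVersion & "<br>"

' Non-greedy repetition needs version 5.5 or greater
blnNonGreedyOK = (ScriptEngineMajorVersion > 5) Or _
                 (ScriptEngineMajorVersion = 5 And _
                  ScriptEngineMinorVersion >= 5)

If blnNonGreedyOK Then
  Response.Write "Non-greedy repetition should be available."
Else
  Response.Write "You need to upgrade to version 5.5 or greater."
End If
```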

Let's look at a quick code example that returns the TITLE of an HTML page on the Web server. We will read in the contents of a local Web page using the FileSystemObject, and then use a non-greedy repetition regular expression to pick out the text between <TITLE> and </TITLE>. (For this example, greedy repetition would work, in theory, since there should only be one TITLE tag per Web page.) First, we'll grab the contents of the HTML file:

'Open an HTML page and read its contents into
'the variable strContents
Dim objFSO
Set objFSO = Server.CreateObject("Scripting.FileSystemObject")

Dim objFile
Set objFile = objFSO.OpenTextFile(Server.MapPath("/SomePage.htm"))

Dim strContents
strContents = objFile.ReadAll

'Clean up...
objFile.Close
Set objFile = Nothing
Set objFSO = Nothing

OK, at this point we have the contents of the HTML page /SomePage.htm read into a local variable, strContents. Now we need to set up our regular expression, which we'll look at in Part 2 of this article! If you are new to regular expressions, I highly recommend that you take a moment to read some of the articles suggested at our Regular Expressions Article Index before continuing on to Part 2.

Read Part 2!


Article Information
Article Title: Picking Out Delimited Text with Regular Expressions
Article Author: Scott Mitchell
Published Date: Tuesday, October 31, 2000
Article URL: http://www.4GuysFromRolla.com/webtech/103100-1.shtml


Copyright 2017 QuinStreet Inc. All Rights Reserved.