Published: Tuesday, October 31, 2000
Picking Out Delimited Text with Regular Expressions
Problems with Windows Scripting Host 5.5
Some people have reported problems installing WSH 5.5, which is required for the non-greedy repetition
regular expressions discussed in this article. For more information on these installation problems,
consult this ASPMessageboard post.
Have you ever wanted to parse an HTML document and be able to easily grab the text between certain text
delimiters? For example, imagine that we wanted to list all the text in an HTML document that
falls within any bold tags (<b> ... </b>). Or say that we wanted to grab the
HTML TITLE, if any (i.e., the text between the TITLE tags: <TITLE> ...
</TITLE>). While this can be done with standard VBScript string functions, the needed code
is often messy, usually requiring multiple variables to hold the indexes where certain substrings
start and end.
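To make the "messy" point concrete, here is a minimal sketch of the string-function approach. The helper name TextBetween is hypothetical (it does not come from this article), and it only finds the first delimited run:

```vbscript
' Hypothetical helper: returns the text between the first occurrence of
' strStart and the next occurrence of strEnd, or "" if either is missing.
Function TextBetween(strSource, strStart, strEnd)
    Dim iStart, iEnd
    TextBetween = ""

    ' vbTextCompare makes the search case-insensitive (<b> also finds <B>)
    iStart = InStr(1, strSource, strStart, vbTextCompare)
    If iStart = 0 Then Exit Function

    ' Move past the opening delimiter itself
    iStart = iStart + Len(strStart)

    ' Look for the closing delimiter, starting after the opening one
    iEnd = InStr(iStart, strSource, strEnd, vbTextCompare)
    If iEnd = 0 Then Exit Function

    TextBetween = Mid(strSource, iStart, iEnd - iStart)
End Function
```

For example, TextBetween("Hello <B>there</B>!", "<b>", "</b>") returns there. Collecting every bold run would need a loop and yet another index variable to remember where the previous search stopped.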
With regular expressions, however, this is a very easy task! This article will not delve into what, exactly,
regular expressions are or the specifics of using them in an ASP page. For more information on these topics
be sure to read the articles recommended in the Regular Expressions
Article Index and read some of the posts at the Regular
Expression Forum. If you are familiar with regular expressions, you might think the challenge I propose is
easily solvable with the following regular expression:
-- in general terms:
delimiter(.*)delimiter
-- to find text between bold tags:
<b>(.*)</b>
Well, you're kind of right. For those not familiar with the above regular expression, it is, basically,
saying, "Search for the first delimiter (<b>), look for zero or more characters, and then
look for the closing delimiter (</b>)." While this may seem like the right thing to ask
for, consider the following HTML document:
<HTML>
<BODY>
Hello <B>there</B>! How are <b>you</b> today?
</BODY>
</HTML>
The above regular expression (<b>(.*)</b>) is a bit ambiguous in this scenario. (Assume
the match is case-insensitive, so that <b> also matches <B>.) Do you
want to return <B>there</B> and <b>you</b> (as two separate
strings), or
<B>there</B>! How are <b>you</b> (as one lengthier string)? Realize
that both of the possible results follow the English explanation given above. The strings in both results
begin with the starting delimiter <b>, contain zero or more characters, and end with
the closing delimiter, </b>. The first result returns two strings, while the
second result returns just one. If you use the regular expression <b>(.*)</b> it will
return the second result, the longer string, <B>there</B>! How are <b>you</b>.
To get the two shorter strings, we've got to tell the regular expression engine that, when searching for
zero or more characters between our two delimiters, it should return the match that has the fewest characters
between the delimiters. This is done by using non-greedy repetition. The .* represents
greedy repetition - it looks for zero or more characters (emphasis on more). We can specify non-greedy
repetition, which will return matches that have the fewest characters between the delimiters, by
using .*? (note the addition of the question mark).
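The difference between the two forms can be seen with a short sketch. The helper name ListMatches and the " | " separator are illustrative choices, not part of this article's code; the sketch assumes a scripting engine that understands .*? and turns on IgnoreCase so that <b> also matches <B>:

```vbscript
' Hypothetical helper: runs strPattern against strText and returns every
' matched string, joined with " | " so the results are easy to compare.
Function ListMatches(strPattern, strText)
    Dim objRegExp, objMatches, objMatch, strResult

    Set objRegExp = New RegExp
    objRegExp.Pattern = strPattern
    objRegExp.IgnoreCase = True   ' so <b> also matches <B>
    objRegExp.Global = True       ' find every match, not just the first

    Set objMatches = objRegExp.Execute(strText)

    strResult = ""
    For Each objMatch In objMatches
        If Len(strResult) > 0 Then strResult = strResult & " | "
        strResult = strResult & objMatch.Value
    Next

    Set objMatches = Nothing
    Set objRegExp = Nothing
    ListMatches = strResult
End Function

Dim strHTML
strHTML = "Hello <B>there</B>! How are <b>you</b> today?"

Dim strGreedy, strNonGreedy
' Greedy: one long match, <B>there</B>! How are <b>you</b>
strGreedy = ListMatches("<b>(.*)</b>", strHTML)
' Non-greedy: two short matches, <B>there</B> | <b>you</b>
strNonGreedy = ListMatches("<b>(.*?)</b>", strHTML)
```

In an ASP page you would typically pass each result through Server.HTMLEncode before writing it out, so the tags show up as text instead of being rendered by the browser.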
Regular expressions became available in VBScript with the 5.0 release. Non-greedy repetition, however,
wasn't supported! With the latest release of Microsoft's scripting engines (version 5.5), non-greedy repetition
is supported. So, before you can use the code we will examine shortly, you must ensure that you have
the VBScript Scripting Engine 5.5 or greater installed on your system. To download the latest version of
the VBScript Scripting Engine visit http://msdn.microsoft.com/scripting/;
to determine what server-side scripting engine version you're using, be sure to read:
Determining the Server-Side Scripting Language and Version!
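As a quick check, VBScript's built-in ScriptEngine, ScriptEngineMajorVersion, ScriptEngineMinorVersion, and ScriptEngineBuildVersion functions report which engine is running. The following is a minimal sketch; the helper name EngineAtLeast and the variable names are hypothetical:

```vbscript
' Hypothetical helper: returns True when the running script engine is at
' least version iMajor.iMinor.
Function EngineAtLeast(iMajor, iMinor)
    EngineAtLeast = (ScriptEngineMajorVersion > iMajor) Or _
        (ScriptEngineMajorVersion = iMajor And ScriptEngineMinorVersion >= iMinor)
End Function

' Build a readable description of the engine: name, then major.minor.build
Dim strEngineInfo
strEngineInfo = ScriptEngine & " " & ScriptEngineMajorVersion & "." & _
    ScriptEngineMinorVersion & "." & ScriptEngineBuildVersion

' Non-greedy repetition needs version 5.5 or later
Dim blnNonGreedyOK
blnNonGreedyOK = EngineAtLeast(5, 5)
```

In an ASP page, writing strEngineInfo out with Response.Write shows the version at a glance, and blnNonGreedyOK can guard any code that relies on .*?.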
Let's look at a quick code example that returns the TITLE of an HTML page on the Web server. First we will
use the FileSystemObject to read
in the contents of a local Web page. Next, we will use a non-greedy repetition regular expression to pick out
the text between <TITLE> and </TITLE>. (For this example, greedy
repetition would work, in theory, since there should only be one TITLE tag per Web page.)
First, we'll grab the contents of an HTML file using the FileSystemObject:
'Open an HTML page and read its contents into
'the variable strContents
Dim objFSO
Set objFSO = Server.CreateObject("Scripting.FileSystemObject")
'Server.MapPath converts the virtual path into a physical file path
Dim objFile
Set objFile = objFSO.OpenTextFile(Server.MapPath("/SomePage.htm"))
'ReadAll returns the entire file as a single string
Dim strContents
strContents = objFile.ReadAll
'Clean up...
objFile.Close
Set objFile = Nothing
Set objFSO = Nothing
OK, at this point we have the contents of the HTML page /SomePage.htm read into a local variable,
strContents. Now we need to set up our regular expression, which we'll look at in
Part 2 of this article! If you are new to regular
expressions, I highly recommend that you take a moment to read some of the articles suggested at our
Regular Expressions Article Index before continuing on to
Part 2.
Read Part 2!