| Problems with Windows Scripting Host 5.5 |
|---|
| Some people have reported problems installing WSH 5.5, which is required for the non-greedy repetition regular expressions discussed in this article. For more information on these installation problems, consult this ASPMessageboard post. |
Have you ever wanted to parse an HTML document and be able to easily grab the text between certain text
delimiters? For example, imagine that we wanted to list all the text in an HTML document that
falls within any bold tags (<b> ... </b>). Or say that we wanted to grab the
text (if any) that was the HTML TITLE (i.e., the text was between the TITLE tags: <TITLE> ...
</TITLE>). While this can be done with standard VBScript string functions, oftentimes
the needed code is messy, usually requiring multiple variables to hold various indexes where certain substrings
start and end.
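To make the "messy" approach concrete, here is a minimal sketch that pulls out the TITLE text using only InStr and Mid. The sample string and the variable names (iStart, iEnd, strTitle) are assumptions made up for illustration; they do not come from a real page.

```vbscript
<%
' Sketch only: extract the text between <TITLE> and </TITLE>
' with plain string functions. strContents is a made-up sample.
Dim strContents, iStart, iEnd, strTitle
strContents = "<HTML><HEAD><TITLE>My Page</TITLE></HEAD><BODY></BODY></HTML>"
strTitle = ""

' Find the opening tag (vbTextCompare makes the search case-insensitive)
iStart = InStr(1, strContents, "<TITLE>", vbTextCompare)
If iStart > 0 Then
  ' Skip past the opening tag to the first character of the title
  iStart = iStart + Len("<TITLE>")

  ' Find the closing tag, searching from where the title text begins
  iEnd = InStr(iStart, strContents, "</TITLE>", vbTextCompare)
  If iEnd > 0 Then
    strTitle = Mid(strContents, iStart, iEnd - iStart)
  End If
End If

Response.Write Server.HTMLEncode(strTitle)
%>
```

Even for a single pair of tags we need two index variables and two nested checks; collecting every bold tag on a page this way would also need a loop that keeps track of where the previous match ended.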
With regular expressions, however, this is a very easy task! This article will not delve into what, exactly, regular expressions are or the specifics of using them in an ASP page. For more information on these topics be sure to read the articles recommended in the Regular Expressions Article Index and read some of the posts at the Regular Expression Forum. If you are familiar with regular expressions, you might think the challenge I propose is easily solvable with the following regular expression:
<b>(.*)</b>
-- in general terms: the starting delimiter, then (.*), then the closing delimiter
Well, you're kind of right. For those not familiar with the above regular expression, it is, basically,
saying, "Search for the first delimiter (<b>), look for zero or more characters, and then
look for the closing delimiter (</b>)." While this may seem like the right thing to ask
for, consider an HTML document containing the following text:
<B>there</B>! How are <b>you</b>
The above regular expression (<b>(.*)</b>) is a bit ambiguous in this scenario. Do you
want to return <B>there</B> and <b>you</b> (as two separate
strings), or
<B>there</B>! How are <b>you</b> (as one lengthier string)? Realize
that both of the possible results follow the English explanation given above. The strings in both results
begin with the starting delimiter <b>, contain zero or more characters, and end with
the closing delimiter, </b>. The first result returns two strings, while the
second result returns just one. If you use the regular expression <b>(.*)</b> it will
return the second result, the longer string, <B>there</B>! How are <b>you</b>.
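To see the greedy behavior for ourselves, here is a minimal sketch for an ASP page. The variable names (objRegExp, objMatches, objMatch, strText) are illustrative assumptions, and the sample string is the fragment discussed above. IgnoreCase is switched on so that both <B> and <b> are treated as the delimiter.

```vbscript
<%
' Sketch only: run the greedy pattern against the sample text.
Dim objRegExp, objMatches, objMatch, strText
strText = "<B>there</B>! How are <b>you</b>"

Set objRegExp = New RegExp
objRegExp.Pattern = "<b>(.*)</b>"   ' greedy: .* grabs as much as it can
objRegExp.IgnoreCase = True         ' match <B> as well as <b>
objRegExp.Global = True             ' return every match, not just the first

Set objMatches = objRegExp.Execute(strText)
Response.Write "Matches found: " & objMatches.Count & "<br>"

For Each objMatch In objMatches
  ' HTMLEncode so the tags are displayed rather than rendered
  Response.Write Server.HTMLEncode(objMatch.Value) & "<br>"
Next

Set objMatches = Nothing
Set objRegExp = Nothing
%>
```

With this input the greedy pattern finds a single match that runs from the first <B> all the way to the last </b>.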
To get the two shorter strings, we've got to tell the regular expression engine that, when searching for
zero or more characters between our two delimiters, it should return the match that has the fewest characters
between the delimiters. This is done by using non-greedy repetition. The .* represents
greedy repetition - it looks for zero or more characters (emphasis on more). We can specify non-greedy
repetition, which will return matches that have the fewest number of characters between the delimiters, by
using .*? (note the addition of the question mark).
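Here is a minimal sketch of the non-greedy pattern in an ASP page, assuming the VBScript 5.5 engine or later is installed. The variable names and the sample string are illustrative assumptions; SubMatches(0) is used to read just the text captured by the parentheses.

```vbscript
<%
' Sketch only: the same kind of search, but with non-greedy repetition.
' Requires the VBScript scripting engine version 5.5 or later.
Dim objRegExp, objMatches, objMatch, strText
strText = "<B>there</B>! How are <b>you</b>"

Set objRegExp = New RegExp
objRegExp.Pattern = "<b>(.*?)</b>"  ' non-greedy: .*? grabs as little as it can
objRegExp.IgnoreCase = True         ' match <B> as well as <b>
objRegExp.Global = True             ' return every match, not just the first

Set objMatches = objRegExp.Execute(strText)
Response.Write "Matches found: " & objMatches.Count & "<br>"

For Each objMatch In objMatches
  ' Value is the whole match, delimiters included;
  ' SubMatches(0) is only the text between the delimiters
  Response.Write Server.HTMLEncode(objMatch.Value) & " contains " & _
                 Server.HTMLEncode(objMatch.SubMatches(0)) & "<br>"
Next

Set objMatches = Nothing
Set objRegExp = Nothing
%>
```

With this input the pattern finds two matches, <B>there</B> and <b>you</b>, whose captured text is "there" and "you" respectively.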
Regular expressions became available in VBScript with the 5.0 release. Non-greedy repetition, however, wasn't supported! With the latest release of Microsoft's scripting engines (version 5.5), non-greedy repetition is supported. So, before you can use the code we will examine shortly, you must ensure that you have the VBScript Scripting Engine 5.5 or greater installed on your system. To download the latest version of the VBScript Scripting Engine visit http://msdn.microsoft.com/scripting/; to determine what server-side scripting engine version you're using, be sure to read: Determining the Server-Side Scripting Language and Version!
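As a quick check, VBScript's built-in ScriptEngine functions can report which engine version a page is running under. The snippet below is a minimal sketch; the variable name blnNonGreedyOK and the wording of the messages are assumptions made for illustration.

```vbscript
<%
' Sketch only: report the scripting engine version and whether it is
' new enough (5.5 or later) for non-greedy repetition.
Dim blnNonGreedyOK

Response.Write ScriptEngine & " " & ScriptEngineMajorVersion & "." & _
               ScriptEngineMinorVersion & "." & ScriptEngineBuildVersion & "<br>"

blnNonGreedyOK = (ScriptEngineMajorVersion > 5) Or _
                 (ScriptEngineMajorVersion = 5 And ScriptEngineMinorVersion >= 5)

If blnNonGreedyOK Then
  Response.Write "Non-greedy repetition should be available."
Else
  Response.Write "Please upgrade to version 5.5 or later of the scripting engine."
End If
%>
```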
Let's look at a quick code example that returns the TITLE of an HTML page on the Web server. First we will
open the HTML page using the FileSystemObject to read
in the contents of a local Web page. Next, we will use a non-greedy repetition regular expression to pick out
the text between <TITLE> and </TITLE>. (For this example, greedy
repetition would work, in theory, since there should only be one TITLE tag per Web page.)
First, we'll grab the contents of an HTML file using the FileSystemObject:
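A minimal sketch of this step might look like the following. The file path /SomePage.htm and the variable strContents are the ones referred to in this article; the object names objFSO and objFile are assumptions, and the code assumes the file exists and is readable by the Web server.

```vbscript
<%
' Sketch only: read an entire local Web page into a string.
Const ForReading = 1

Dim objFSO, objFile, strContents

Set objFSO = Server.CreateObject("Scripting.FileSystemObject")

' Server.MapPath turns the virtual path into a physical file path
Set objFile = objFSO.OpenTextFile(Server.MapPath("/SomePage.htm"), ForReading)

' Read the whole file in one go, then clean up
strContents = objFile.ReadAll
objFile.Close

Set objFile = Nothing
Set objFSO = Nothing
%>
```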
OK, at this point we have the contents of the HTML page /SomePage.htm read into a local variable,
strContents. Now we need to set up our regular expression, which we'll look at in
Part 2 of this article! If you are new to regular
expressions, I highly recommend that you take a moment to read some of the articles suggested at our
Regular Expressions Article Index before continuing on to
Part 2.
Read Part 2!




