Published: Monday, December 04, 2000
Common Applications of Regular Expressions, Part 2
By Richard Lowe
Read Part 1
In Part 1 we looked at how to validate passwords and email addresses
using regular expressions. In this part, we'll look at how to extract specific "chunks" of HTML from a
Web page.
Extracting Specific Sections From An HTML Page
The main challenge in extracting data from an HTML page is finding a way to uniquely identify the section of
data you want to extract. Take, for example, this (fictional) HTML snippet, which shows the
headline for a particular site:
<table border="0" width="11%" class="Somestory">
<tr>
<td width="100%">
<p align="center">In the news...</td>
</tr>
</table>
<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">It's War!</td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">In the news...</td>
</tr>
</table>
Seeing this snippet makes it pretty obvious that the headline is presented in the middle table, the one whose
class attribute is set to Headline. (If you are working with HTML that you don't control,
you might find it handy to use an optional feature of IE that lets you view the source of just the portion
you highlight: http://www.microsoft.com/Windows/ie/WebAccess/default.ASP).
For this exercise, we'll assume that this is the only table with the class attribute set to Headline.
(This example won't delve into the mechanics of grabbing information from another Web page - rather, this example
will focus on picking out particular HTML from a page. To learn how to grab the HTML contents from another
Web page through an ASP page, be sure to read: Grabbing Information From Other
Web Servers and the FAQ: How can I treat some other web
page as data on my own site?)
Now we need to create a regular expression that will find this and only this table to include in our pages.
First, add the code supporting the regular expression:
<%
Dim re, strHTML, Matches, Item
Set re = new RegExp ' creates the RegExp object
re.IgnoreCase = true
re.Global = false ' quits searching after first match
%>
Then we need to consider the area we want to capture: In this case we want the entire <table>
structure with the ending tag and headline text (no matter what it is) intact. Start by finding the opening
<table> tag:
re.Pattern = "<table.*(?=Headline)"
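This will match the opening table tag and everything after it on the same line (the . never matches a
newline) up to, but not including, the text Headline. Because .* cannot cross a line break, the opening tags of
the other tables cannot match: neither has Headline on the same line. Here is how you return the matched HTML: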
' Puts all matched HTML in a collection called Matches:
Set Matches = re.Execute(strHTML)
' Show all the matching HTML:
For Each Item In Matches
Response.Write Item.Value
Next
' Show one specific piece (only if something matched):
If Matches.Count > 0 Then
Response.Write Matches.Item(0).Value
End If
Executed against our HTML snippet above, this expression returns one match that looks
like this:
<table border="0" width="11%" class="
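If you'd like to try this partial match outside of an ASP page, here is a minimal sketch in JavaScript, which uses essentially the same pattern syntax as the VBScript RegExp object. The sampleTable helper and the sourceHtml variable are hypothetical stand-ins for the snippet above and for strHTML; they are not part of the original code.

```javascript
// Hypothetical helper: rebuilds one table from the article's HTML snippet.
function sampleTable(cls, text) {
  return [
    '<table border="0" width="11%" class="' + cls + '">',
    "<tr>",
    '<td width="100%">',
    '<p align="center">' + text + "</td>",
    "</tr>",
    "</table>",
  ].join("\n");
}

// Stand-in for strHTML: the three tables, one after another.
const sourceHtml = [
  sampleTable("Somestory", "In the news..."),
  sampleTable("Headline", "It's War!"),
  sampleTable("Someotherstory", "In the news..."),
].join("\n");

// The "i" flag plays the role of re.IgnoreCase = true. With no "g" flag,
// exec() returns only the first match, much like re.Global = false.
const openingTag = /<table.*(?=Headline)/i;
const partial = openingTag.exec(sourceHtml);

console.log(partial[0]); // <table border="0" width="11%" class="
```

The first table is skipped because its opening line never reaches the text Headline, so the match starts at the second table.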
The (?=Headline) portion of the expression is a positive lookahead: it requires that Headline come next but
does not consume those characters, so you won't see the class of the table in this partial match. Capturing the
remainder of the table is quite simple for Version 5.5 (and greater) of VBScript (and JScript):
re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>"
To break it down: (.|\n) matches ANY character. Translated, it means "any character except newline,
OR a newline", which together cover every character. The * that follows repeats that match
0 to many times, and the ? makes the * non-greedy. Non-greedy means that
the expression should match as little as possible before the next part of the expression is found. The
</table> is the end of the Headline table.
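The breakdown above can be checked with a short JavaScript sketch (JavaScript uses essentially the same pattern syntax). It first shows that . alone stops at a line break while (.|\n) does not, then runs the full pattern. The sampleTable helper and the pageHtml variable are hypothetical stand-ins for the snippet and for strHTML.

```javascript
// Hypothetical helper: rebuilds one table from the article's HTML snippet.
function sampleTable(cls, text) {
  return [
    '<table border="0" width="11%" class="' + cls + '">',
    "<tr>",
    '<td width="100%">',
    '<p align="center">' + text + "</td>",
    "</tr>",
    "</table>",
  ].join("\n");
}

const pageHtml = [
  sampleTable("Somestory", "In the news..."),
  sampleTable("Headline", "It's War!"),
  sampleTable("Someotherstory", "In the news..."),
].join("\n");

// "." alone cannot get past the line break in "a\nb"; "(.|\n)" can.
const dotOnly = /^.*$/.test("a\nb");           // false
const dotOrNewline = /^(.|\n)*$/.test("a\nb"); // true

// The full pattern: lookahead for Headline, then as little as possible
// up to the first closing </table> tag.
const headlinePattern = /<table.*(?=Headline)(.|\n)*?<\/table>/i;
const headlineMatch = headlinePattern.exec(pageHtml);

console.log(dotOnly, dotOrNewline);
console.log(headlineMatch[0]); // the six lines of the Headline table only
```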
The ? qualifier is important because it prevents the regular expression from returning the
contents of other tables. For example, given the HTML snippet above, removing the ? that
follows the * would return this:
<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">It's War!</td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">In the news...</td>
</tr>
</table>
It not only captured the ending </table> tag from the Headline table, but ran on to the one
from the Someotherstory table as well, thus the need for the non-greedy qualifier (?).
(For more information on the non-greedy qualifier, be sure to read:
Picking Out Delimited Text with Regular Expressions!)
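To see the difference side by side, the following JavaScript sketch runs the pattern with and without the ? and counts the closing tags in each result. The sampleTable helper, the threeTables variable, and countClosingTags are hypothetical names used only for this illustration.

```javascript
// Hypothetical helper: rebuilds one table from the article's HTML snippet.
function sampleTable(cls, text) {
  return [
    '<table border="0" width="11%" class="' + cls + '">',
    "<tr>",
    '<td width="100%">',
    '<p align="center">' + text + "</td>",
    "</tr>",
    "</table>",
  ].join("\n");
}

const threeTables = [
  sampleTable("Somestory", "In the news..."),
  sampleTable("Headline", "It's War!"),
  sampleTable("Someotherstory", "In the news..."),
].join("\n");

// Non-greedy: stops at the first </table> after Headline.
const lazyMatch = /<table.*(?=Headline)(.|\n)*?<\/table>/i.exec(threeTables)[0];
// Greedy: runs on to the last </table> in the input.
const greedyMatch = /<table.*(?=Headline)(.|\n)*<\/table>/i.exec(threeTables)[0];

// Count closing tags to see how far each match ran.
function countClosingTags(html) {
  return html.split("</table>").length - 1;
}

console.log(countClosingTags(lazyMatch));            // 1
console.log(countClosingTags(greedyMatch));          // 2
console.log(greedyMatch.includes("Someotherstory")); // true
```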
This example described a fairly ideal condition for returning portions of HTML. In the real world it is often
more complicated, especially in cases where you don't have any influence over the source of the HTML you are
pulling. The best approach is to examine small amounts of HTML surrounding the content you want to extract
and build a regular expression slowly, testing often, to ensure you're getting only the matches you want.
It's also important to handle the case where your regular expression doesn't match anything from the source
HTML. Content can change quickly, and you want to ensure your page is not displaying an unprofessional-looking
error simply because someone else changed the format of their content.
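One way to handle the no-match case is to fall back to some default HTML whenever the pattern finds nothing. Here is a minimal sketch in JavaScript; the extractHeadline function, the fallback text, and the two sample pages are all hypothetical. In VBScript the equivalent test would be checking whether Matches.Count is 0 before reading Matches.Item(0).

```javascript
// Hypothetical helper: returns the Headline table, or fallbackHtml when the
// source page no longer contains a table marked Headline.
function extractHeadline(html, fallbackHtml) {
  const match = /<table.*(?=Headline)(.|\n)*?<\/table>/i.exec(html);
  return match === null ? fallbackHtml : match[0];
}

const fallback = "<p>Headline currently unavailable.</p>";

// A page that still uses the expected format...
const goodPage = '<table class="Headline">\n<tr><td>It\'s War!</td></tr>\n</table>';
// ...and one whose owner switched to a different layout.
const changedPage = '<div class="top-story">It\'s War!</div>';

console.log(extractHeadline(goodPage, fallback));    // the whole table
console.log(extractHeadline(changedPage, fallback)); // the fallback paragraph
```

Returning a fallback keeps the decision about what to display in one place, so the rest of the page never has to deal with a missing match.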
In Part 3 we'll look at another real-world use for regular expressions:
parsing data files!
Read Part 3!