To read the article online, visit http://www.4GuysFromRolla.com/webtech/120400-1.2.shtml

Common Applications of Regular Expressions, Part 2

By Richard Lowe


  • Read Part 1

  • In Part 1 we looked at how to validate passwords and email addresses using regular expressions. In this part, we'll look at how to extract specific "chunks" of HTML from a Web page.

    Extracting Specific Sections From An HTML Page
    The main challenge in extracting data from an HTML page is finding a way to uniquely identify the section of data you want to extract. Take, for example, this (fictional) HTML snippet, which shows the headline for a particular site:

    <table border="0" width="11%" class="Somestory">
      <tr>
        <td width="100%">
          <p align="center">In the news...</td>
      </tr>
    </table>
    <table border="0" width="11%" class="Headline">
      <tr>
        <td width="100%">
          <p align="center">It's War!</td>
      </tr>
    </table>
    <table border="0" width="11%" class="Someotherstory">
      <tr>
        <td width="100%">
          <p align="center">In the news...</td>
      </tr>
    </table>
    

    Seeing this snippet makes it pretty obvious that the headline is presented in the middle table that has a class attribute set to Headline. (If you are getting HTML that is not directly controlled by you, you might find it handy to use an optional feature of IE that allows you to view partial source based on what you highlight: http://www.microsoft.com/Windows/ie/WebAccess/default.ASP). For this exercise, we'll assume that this is the only table with the class attribute set to Headline. (This example won't delve into the mechanics of grabbing information from another Web page - rather, this example will focus on picking out particular HTML from a page. To learn how to grab the HTML contents from another Web page through an ASP page, be sure to read: Grabbing Information From Other Web Servers and the FAQ: How can I treat some other web page as data on my own site?)

    Now we need to create a regular expression that will find this and only this table to include in our pages. First, add the code supporting the regular expression:

    <%
      Dim re, strHTML         ' strHTML is assumed to already hold the page's HTML
      Set re = New RegExp     ' creates the RegExp object
    
      re.IgnoreCase = True    ' match regardless of letter case
      re.Global = False       ' quits searching after first match
    %>
    

    Then we need to consider the area we want to capture: In this case we want the entire <table> structure with the ending tag and headline text (no matter what it is) intact. Start by finding the opening <table> tag:

    re.Pattern = "<table.*(?=Headline)"

    This will match the opening table tag and everything after it on the same line (the . does not match newlines), stopping just before the text Headline. Here is how you return the matched HTML:

    ' Puts all matched HTML in a collection called Matches:
    Set Matches  = re.Execute(strHTML)
    
    ' Show all the matching HTML:
    For Each Item in Matches
      Response.Write Item.Value
    Next
    
    ' Show one specific piece (only valid when Matches.Count > 0):
    Response.Write Matches.Item(0).Value
    

    Executed against our HTML snippet above, this expression returns one match that looks like this:

    <table border="0" width="11%" class="

    The (?=Headline) portion of the expression is a positive lookahead: it requires that the text Headline come next but doesn't consume those characters, so you won't see the class of the table in this partial match. Capturing the remainder of the table is quite simple in version 5.5 (and later) of VBScript (and JScript):

    re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>"

    To break it down: (.|\n) matches ANY character. Read literally, it means "any character except a newline, OR a newline," which together cover every character. The * that follows matches that character zero or more times, and the ? makes the * non-greedy. Non-greedy means that the expression should match as little as possible before the next part of the expression is found. The </table> is the end of the Headline table.
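
    To see why the (.|\n) group matters, here is a minimal sketch that tests two patterns against a one-table sample. The PatternMatches helper and the variable names are hypothetical, invented for this illustration, and the sample assumes the HTML uses line-feed (vbLf) line breaks.

```vbscript
' Hypothetical helper (not from the article): reports whether a pattern
' finds a match anywhere in the given text.
Function PatternMatches(strPattern, strText)
  Dim re
  Set re = New RegExp
  re.IgnoreCase = True
  re.Global = False
  re.Pattern = strPattern
  PatternMatches = re.Test(strText)
End Function

' A one-table sample spread over three lines (assumed vbLf line breaks).
Dim strTable
strTable = "<table class=""Headline"">" & vbLf & _
           "<tr><td>It's War!</td></tr>" & vbLf & _
           "</table>"

Dim blnDotOnly, blnDotOrNewline

' False: the . cannot cross a line break, so </table> is never reached.
blnDotOnly = PatternMatches("<table.*(?=Headline).*?</table>", strTable)

' True: (.|\n) also accepts the line breaks between the tags.
blnDotOrNewline = PatternMatches("<table.*(?=Headline)(.|\n)*?</table>", strTable)
```

    Because the sketch only defines a function and sets variables, it can be pasted inside <% %> on an ASP page and the two results written out with Response.Write.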

    The ? qualifier is important because it prevents the regular expression from returning the contents of other tables. For example, given the HTML snippet above, removing the ? that follows the * would return this:

    <table border="0" width="11%" class="Headline">
      <tr>
        <td width="100%">
          <p align="center">It's War!</td>
      </tr>
    </table>
    <table border="0" width="11%" class="Someotherstory">
      <tr>
        <td width="100%">
          <p align="center">In the news...</td>
      </tr>
    </table>
    

    It captured not only the closing </table> tag of the Headline table but also that of the Someotherstory table; hence the need for the non-greedy qualifier (?). (For more information on the non-greedy qualifier, be sure to read: Picking Out Delimited Text with Regular Expressions!)
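
    The difference between the two forms can be checked side by side. The following is a rough sketch rather than production code: the FirstMatch helper, the variable names, and the trimmed-down sample page (three tables separated by vbLf line breaks) are all assumptions made for the illustration.

```vbscript
' Hypothetical helper (not from the article): returns the text of the first
' match, or an empty string when the pattern matches nothing.
Function FirstMatch(strPattern, strText)
  Dim re, Matches
  Set re = New RegExp
  re.IgnoreCase = True
  re.Global = False
  re.Pattern = strPattern
  Set Matches = re.Execute(strText)
  FirstMatch = ""
  If Matches.Count > 0 Then FirstMatch = Matches(0).Value
End Function

' A shortened version of the three-table snippet (assumed vbLf line breaks).
Dim strPage
strPage = "<table class=""Somestory"">" & vbLf & _
          "<tr><td>In the news...</td></tr>" & vbLf & _
          "</table>" & vbLf & _
          "<table class=""Headline"">" & vbLf & _
          "<tr><td>It's War!</td></tr>" & vbLf & _
          "</table>" & vbLf & _
          "<table class=""Someotherstory"">" & vbLf & _
          "<tr><td>In the news...</td></tr>" & vbLf & _
          "</table>"

Dim strLazy, strGreedy

' Non-greedy: stops at the first </table> after the Headline table opens.
strLazy = FirstMatch("<table.*(?=Headline)(.|\n)*?</table>", strPage)

' Greedy: runs on to the last </table> in the page.
strGreedy = FirstMatch("<table.*(?=Headline)(.|\n)*</table>", strPage)

' InStr returns 0 when the text is absent, so:
'   InStr(strLazy, "Someotherstory")   is 0
'   InStr(strGreedy, "Someotherstory") is greater than 0
```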

    This example described a fairly ideal condition for returning portions of HTML. In the real world it is often more complicated, especially in cases where you don't have any influence over the source of the HTML you are pulling. The best approach is to examine small amounts of HTML surrounding the content you want to extract and build a regular expression slowly, testing often, to ensure you're getting only the matches you want. It's also important to handle the case where your regular expression doesn't match anything from the source HTML. Content can change quickly, and you want to ensure your page is not displaying an unprofessional-looking error simply because someone else changed their content's format.
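
    One way to guard against a missing match is to check the Count property of the collection that Execute returns before reading from it. The sketch below assumes a hypothetical GetHeadlineOrDefault function and a made-up fallback message; neither comes from the original example.

```vbscript
' Hypothetical helper (not from the article): returns the Headline table,
' or the supplied fallback text when the pattern finds nothing.
Function GetHeadlineOrDefault(strHTML, strDefault)
  Dim re, Matches
  Set re = New RegExp
  re.IgnoreCase = True
  re.Global = False
  re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>"

  Set Matches = re.Execute(strHTML)
  If Matches.Count = 0 Then
    ' Nothing matched, so avoid touching Matches(0) and use the fallback.
    GetHeadlineOrDefault = strDefault
  Else
    GetHeadlineOrDefault = Matches(0).Value
  End If
End Function

' Simulate a source page whose markup no longer uses a Headline table.
Dim strChanged, strOut
strChanged = "<div class=""TopStory"">It's War!</div>"
strOut = GetHeadlineOrDefault(strChanged, "<p>Headline currently unavailable.</p>")
' strOut now holds the fallback paragraph instead of raising an error.
```

    On an ASP page, Response.Write strOut would then show either the extracted table or the fallback text.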

    In Part 3 we'll look at another real-world use for regular expressions: parsing data files!

  • Read Part 3!


  • Article Information
    Article Title: Common Applications of Regular Expressions
    Article Author: Richard Lowe
    Published Date: Monday, December 04, 2000
    Article URL: http://www.4GuysFromRolla.com/webtech/120400-1.2.shtml


    Copyright 2017 QuinStreet Inc. All Rights Reserved.