Published: Monday, December 04, 2000

Common Applications of Regular Expressions, Part 2
By Richard Lowe


  • Read Part 1

  • In Part 1 we looked at how to validate passwords and email addresses using regular expressions. In this part, we'll look at how to extract specific "chunks" of HTML from a Web page.

    Extracting Specific Sections From An HTML Page
    The main challenge in extracting data from an HTML page is finding a way to uniquely identify the section of data you want to extract. Take, for example, this (fictional) HTML snippet, which shows the headline for a particular site:

    <table border="0" width="11%" class="Somestory">
      <tr>
        <td width="100%">
          <p align="center">In the news...</td>
      </tr>
    </table>
    <table border="0" width="11%" class="Headline">
      <tr>
        <td width="100%">
          <p align="center">It's War!</td>
      </tr>
    </table>
    <table border="0" width="11%" class="Someotherstory">
      <tr>
        <td width="100%">
          <p align="center">In the news...</td>
      </tr>
    </table>
    

    The snippet makes it clear that the headline sits in the middle table, the one whose class attribute is set to Headline. (If you are working with HTML you don't control, you may find it handy to use an optional feature of IE that lets you view the partial source for whatever you highlight: http://www.microsoft.com/Windows/ie/WebAccess/default.ASP.) For this exercise, we'll assume that this is the only table with its class attribute set to Headline. (This example won't delve into the mechanics of grabbing information from another Web page; it focuses on picking out particular HTML from a page. To learn how to grab the HTML contents of another Web page from an ASP page, be sure to read: Grabbing Information From Other Web Servers and the FAQ: How can I treat some other web page as data on my own site?)

    Now we need to create a regular expression that will find this and only this table to include in our pages. First, add the code supporting the regular expression:

    <%
      Dim re, strHTML, Matches, Item
      Set re = New RegExp      ' creates the RegExp object

      re.IgnoreCase = True
      re.Global = False        ' stop searching after the first match
    %>
    

    Then we need to consider the area we want to capture. In this case we want the entire <table> structure, with its ending tag and headline text (whatever that text is) intact. Start by finding the opening <table> tag:

    re.Pattern = "<table.*(?=Headline)"

    This matches the opening table tag and everything after it on the same line (the . does not match newlines), stopping just before the text Headline. Here is how you return the matched HTML:

    ' Puts all matched HTML in a collection called Matches:
    Set Matches = re.Execute(strHTML)

    ' Show all the matching HTML:
    For Each Item In Matches
      Response.Write Item.Value
    Next

    ' Show one specific piece (only if something matched):
    If Matches.Count > 0 Then
      Response.Write Matches.Item(0).Value
    End If
    

    Executed against our HTML snippet above, this expression returns one match that looks like this:

    <table border="0" width="11%" class="
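    The same partial match can be reproduced outside of ASP. The sketch below uses JavaScript rather than VBScript, on the assumption that the pattern behaves the same way in both engines; the html variable is a hypothetical stand-in for strHTML and holds a shortened copy of the snippet above.

```javascript
// Hypothetical stand-in for strHTML: the first two tables from the snippet.
const html = [
  '<table border="0" width="11%" class="Somestory">',
  '  <tr><td width="100%"><p align="center">In the news...</td></tr>',
  '</table>',
  '<table border="0" width="11%" class="Headline">',
  '  <tr><td width="100%"><p align="center">It\'s War!</td></tr>',
  '</table>',
].join('\n');

// The "i" flag plays the role of re.IgnoreCase = true; leaving out "g"
// mirrors re.Global = false (stop after the first match).
const openTag = /<table.*(?=Headline)/i;

const match = html.match(openTag);
const partial = match ? match[0] : null;

// The lookahead only checks for "Headline", so the match stops right before it.
console.log(partial);
```

    The first table is skipped because no Headline text follows its opening tag on the same line.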

    The (?=Headline) portion of the expression is a lookahead: it checks that Headline comes next but doesn't consume any characters, so you won't see the class of the table in this partial match. Capturing the remainder of the table is quite simple in Version 5.5 (and greater) of VBScript (and JScript):

    re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>"

    To break it down: (.|\n) matches ANY character; read literally, it means (any character except newline, OR a newline). The * that follows repeats that match zero or more times, and the ? makes the * non-greedy. Non-greedy means that the expression should match as little as possible before the next part of the expression is found. The </table> is the end of the Headline table.
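    To see why (.|\n) is needed instead of a plain dot, here is a rough sketch that compares the two. It is written in JavaScript rather than VBScript, assuming the pattern syntax carries over unchanged; html is a hypothetical stand-in for strHTML.

```javascript
// Hypothetical stand-in for strHTML: a multi-line Headline table.
const html = [
  '<table border="0" width="11%" class="Somestory">',
  '  <tr><td width="100%"><p align="center">In the news...</td></tr>',
  '</table>',
  '<table border="0" width="11%" class="Headline">',
  '  <tr><td width="100%"><p align="center">It\'s War!</td></tr>',
  '</table>',
].join('\n');

// A plain dot cannot cross line breaks, so this version never reaches </table>.
const dotOnly = /<table.*(?=Headline).*?<\/table>/i;

// (.|\n) also accepts newlines, so the match can span the whole table.
const anyChar = /<table.*(?=Headline)(.|\n)*?<\/table>/i;

const dotOnlyMatch = html.match(dotOnly);
const anyCharMatch = html.match(anyChar);
const headlineTable = anyCharMatch ? anyCharMatch[0] : null;

console.log(dotOnlyMatch === null);
console.log(headlineTable);
```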

    The ? modifier is important because it prevents the regular expression from returning the contents of other tables. For example, given the HTML snippet above, removing the ? that follows the * would return this:

    <table border="0" width="11%" class="Headline">
      <tr>
        <td width="100%">
          <p align="center">It's War!</td>
      </tr>
    </table>
    <table border="0" width="11%" class="Someotherstory">
      <tr>
        <td width="100%">
          <p align="center">In the news...</td>
      </tr>
    </table>
    

    It captured not only the ending </table> tag of the Headline table, but also that of the Someotherstory table, hence the need for the non-greedy qualifier (?). (For more information on the non-greedy qualifier, be sure to read: Picking Out Delimited Text with Regular Expressions!)
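    The difference between the greedy and non-greedy forms can be checked by counting how many closing tags each match contains. This is a minimal JavaScript sketch (not VBScript), assuming the two engines treat * and *? alike; html and countTables are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for strHTML: the Headline table followed by another table.
const html = [
  '<table class="Headline">',
  '  <tr><td>It\'s War!</td></tr>',
  '</table>',
  '<table class="Someotherstory">',
  '  <tr><td>In the news...</td></tr>',
  '</table>',
].join('\n');

const lazy = /<table.*(?=Headline)(.|\n)*?<\/table>/i;   // stops at the first </table>
const greedy = /<table.*(?=Headline)(.|\n)*<\/table>/i;  // runs on to the last </table>

// Illustrative helper: counts closing table tags in a string.
const countTables = (s) => (s.match(/<\/table>/gi) || []).length;

const lazyHit = html.match(lazy)[0];
const greedyHit = html.match(greedy)[0];

console.log(countTables(lazyHit));   // 1
console.log(countTables(greedyHit)); // 2
```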

    This example described a fairly ideal condition for returning portions of HTML. In the real world it is often more complicated, especially when you have no influence over the source of the HTML you are pulling. The best approach is to examine small amounts of HTML surrounding the content you want to extract and build a regular expression slowly, testing often, to ensure you're getting only the matches you want. It's also important to handle the case where your regular expression doesn't match anything in the source HTML. Content can change quickly, and you don't want your page to display an unprofessional-looking error simply because someone else changed their content's format.
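    One way to guard against a pattern that no longer matches is to wrap the extraction in a helper that falls back to a default. The sketch below is in JavaScript rather than VBScript; extractHeadline and its fallback parameter are hypothetical names, not part of the original code. In VBScript the comparable check would be testing Matches.Count before reading Matches.Item(0).

```javascript
// Hypothetical helper: returns the Headline table, or the fallback string
// when the pattern finds nothing (for example, after the source site
// changes its markup).
function extractHeadline(html, fallback) {
  const pattern = /<table.*(?=Headline)(.|\n)*?<\/table>/i;
  const match = pattern.exec(html);
  return match ? match[0] : fallback;
}

// Markup the pattern was written for.
const current = '<table class="Headline">\n<tr><td>It\'s War!</td></tr>\n</table>';
// Markup after an imagined redesign that drops the table entirely.
const redesigned = '<div class="top-story">It\'s War!</div>';

console.log(extractHeadline(current, 'No headline available'));
console.log(extractHeadline(redesigned, 'No headline available'));
```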

    In Part 3 we'll look at another real-world use for regular expressions: parsing data files!

  • Read Part 3!

