Published: Friday, March 10, 2000

Grabbing Table Columns from Other Web Pages
By Thomas Winningham


Here's a way to extract data from a web page. As you know, HTML delimits table cells with <TD> ... </TD>. So although a table is laid out in two dimensions on the page, its cells appear in the HTML source in a fixed, linear order.
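To make the linear ordering concrete, here is a minimal sketch that walks a small table and prints each cell with its position in the source. The table contents (the symbol ABC and its price) and the variable names are made up for illustration; they are not from the demo page.

```vbscript
<%
' Hypothetical two-row, two-column table used only for illustration
html = "<TABLE>" & _
       "<TR><TD>Symbol</TD><TD>Price</TD></TR>" & _
       "<TR><TD>ABC</TD><TD>12.34</TD></TR>" & _
       "</TABLE>"

pos = 1          ' where to resume searching in the HTML text
cellCount = 0    ' how many <TD tags we have seen so far

Do
   ' Find the next <TD; stop when there are no more cells
   pos = InStr(pos, UCase(html), "<TD")
   If pos = 0 Then Exit Do
   cellCount = cellCount + 1

   ' The cell text runs from just after the tag's ">" to the next </TD>
   closeTag = InStr(pos, html, ">")
   endCell = InStr(closeTag, UCase(html), "</TD>")
   Response.Write cellCount & ": " & _
        Mid(html, closeTag + 1, endCell - closeTag - 1) & "<BR>"

   pos = closeTag + 1
Loop
%>
```

The cells come out in document order (Symbol, Price, ABC, 12.34), numbered 1 through 4, regardless of which row or column they sit in on screen.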

Database-generated web pages such as stock quotes, weather, and news often put changing data into a design that never changes. By numbering the cells in the HTML document, we can find the cell that holds the data we want, extract it, and manipulate it (err... that is, if you are the content owner or have permission to do so).

My example of <TD> ... </TD> parsing uses the SoftWing ASPTear component, which is freely available. Heck, there's even a tutorial on ASPTear on 4Guys! This component lets you fetch a web page on the server side and put its HTML into a string.
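Any component that returns a page as a string will do in place of ASPTear. As one possible alternative, here is a sketch that uses the MSXML ServerXMLHTTP object instead. This assumes MSXML 3.0 or later is installed on the web server, and FetchPage is a hypothetical helper name, not something from the article's code.

```vbscript
<%
' Hypothetical helper (not part of the original article):
' returns the HTML of extracturl as a string, or "" on failure.
Function FetchPage(extracturl)
   Dim http
   Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")

   http.open "GET", extracturl, False   ' False = wait for the response
   http.send

   If http.status = 200 Then
      FetchPage = http.responseText
   Else
      FetchPage = ""
   End If

   Set http = Nothing
End Function
%>
```

With a helper like this, the two ASPTear lines in each Function could be swapped for a single call such as strRetVal = FetchPage(extracturl).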

I wrote two Functions: one to number the cells of a web page for reference (NumberCells) and one to actually extract a cell's content (GetCell). NumberCells is useful for determining the exact number of the table cell you are interested in. Once you have this number, you can use GetCell to grab that cell.

Here is the ASP code for these two Functions:

<%
'extract.asp (c) 2000 Thomas Winningham - use freely!

Function GetCell(cellnumber, extracturl)
   'Variables: Cellnumber: Number of the cell to get data from
   'ExtractURL: complete url that contains the cells and data
   'Returns a string of the cell's data (including html)

   'Use the SOFTWING.AspTear component to get the HTML (but any 
   'will do... just rewrite this part). The connection is included in 
   'the Function so that it is self-contained
   Const Request_POST = 1
   Const Request_GET = 2

   Set xObj = Server.CreateObject("SOFTWING.AspTear")
   strRetVal = xObj.Retrieve(extracturl,Request_GET,"","","")
   set xobj = nothing

   i = 1              ' HTML Text Location Start
   q = 1              ' Cell Number Start

   ' Loop until we have processed the cell we're looking for   
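   Do until q > cellnumber
      ' Look for <TD, the start of a cell
      i = InStr(i, UCase(strRetVal), "<TD")

      ' Stop if the page has fewer cells than the number requested
      ' (otherwise the next InStr would be called with a start of 0)
      If i = 0 Then
         GetCell = ""
         Exit Function
      End If
      
      ' Find the location of the end of the <TD tag
      r = InStr(i, strRetVal, ">")          
      
      ' Let the next loop start looking after this <TD tag we found
      i = r + 1                             
      
      ' increase the count of which cell we're at
      q = q + 1                             
   Loop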

   ' The start of our cell text is right after the last found tag
   StartCellText = i                    

   ' Now... to find the end of this cell's text, we look for either <TABLE 
   ' or </TD> - whichever comes first (but we have to check if they exist or not)
   ' We don't include nested tables in the cell data because those tables have 
   ' cells of their own.
   posTable = InStr(r, UCase(strRetVal), "<TABLE")
   posEnd = InStr(r, UCase(strRetVal), "</TD>")

   ' A nested table that starts before the closing </TD> ends the cell text
   If posTable > 0 And (posEnd = 0 Or posTable < posEnd) Then
      posEnd = posTable
   End If

   ' If neither marker exists, take everything to the end of the document
   If posEnd = 0 Then posEnd = Len(strRetVal) + 1

   ThisCellText = Mid(strRetVal, StartCellText, posEnd - StartCellText)

   GetCell = ThisCellText
End Function



Function NumberCells(extracturl)
   'Variables: ExtractUrl: The URL (eg http://www.cnn.com) to number
   'returns a string of the entire HTML document with numbers at the beginning
   'of each cell

   'Use the SOFTWING.AspTear component to get the HTML (but any will do... 
   'just rewrite this part).
   'The connection is included in the Function so that it is self-contained
   Const Request_POST = 1
   Const Request_GET = 2

   Set xObj = Server.CreateObject("SOFTWING.AspTear")
   strRetVal = xObj.Retrieve(extracturl,Request_GET,"","","")
   set xobj = nothing

   i = 1           ' HTML Text Location Start
   q = 1           ' Cell Number Start

   ' So long as <TD cells exist, number them
   Do while InStr(i, UCase(strRetVal), "<TD") > 0
       ' find the next <TD
       i = InStr(i, UCase(strRetVal), "<TD")
       
       ' find the end of the <TD tag (give up if the tag is never closed)
       r = InStr(i, strRetVal, ">")
       If r = 0 Then Exit Do

       ' Number the cell: the string becomes all the html we've checked,
       ' our cell number, and then the html we've yet to check
       strRetVal =  left(strRetVal, r) & q & _
             right(strRetVal, len(strRetVal) - r)

       ' Let the next loop start looking after this <TD tag we found
       i = r + 1
 
       ' increase the count of which cell we're at
       q = q + 1
   Loop

   NumberCells = strRetVal
End Function
%>
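As a sketch of how the two Functions might be called from a page, assuming they are defined above in the same file: the URL, the mode query-string parameter, and cell number 12 below are all hypothetical placeholders, so substitute the page and cell you actually have permission to use.

```vbscript
<%
' Hypothetical URL and cell number, for illustration only
Const SourceUrl = "http://www.example.com/quotes.html"

If Request.QueryString("mode") = "number" Then
   ' Step 1: view the page with every cell numbered,
   ' and note the number of the cell you want
   Response.Write NumberCells(SourceUrl)
Else
   ' Step 2: grab just that cell (12 is a made-up example)
   Response.Write "Cell 12 contains: " & GetCell(12, SourceUrl)
End If
%>
```

The idea is to run the page once with ?mode=number to find the cell, then hard-code that number in the GetCell call.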

Hopefully this will prove simple and valuable enough. Personally, I use this to extract weather forecasts from a web page and format them for sending to my pager (it's a cheap pager service, hehe).

Try It Out!
A demo has been set up to grab the contents of the Go.com Money Page. The current quote for Microsoft stock is yanked from this page. The top part of the demo displays the output of the NumberCells function, showing the number of each table cell. The bottom half of the demo illustrates the use of GetCell, grabbing the value of Microsoft's quote. This is a cached page to increase performance.
  • Run the demo!

Feel free to optimize this thrown-together code. My biggest concerns are the speed at which this thing handles a whole HTML document as a string, and whether my UCase() calls slow things down... I recently wrote a more complex version of this idea that gets multiple cells in one pass and puts them into a dictionary object (e.g., weather(Temp_Today) might return 70)...
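The more complex version is not shown here, so the following is only a guess at how a one-pass variant could look, not the author's code. GetCells, wanted, and the sample HTML are hypothetical names and data. The sketch upper-cases the document once instead of on every search, and it uses string keys in the Scripting.Dictionary to avoid any numeric-type mismatch. Unlike GetCell, it does not stop at nested tables.

```vbscript
<%
' Hypothetical helper: a sketch of a one-pass variant.
' html   : page source already fetched into a string
' wanted : Scripting.Dictionary mapping cell number (as a string) -> label
' Returns a Scripting.Dictionary mapping label -> cell text
Function GetCells(html, wanted)
   Dim upperHtml, result, pos, closeTag, endCell, q
   upperHtml = UCase(html)       ' upper-case once, not on every search
   Set result = Server.CreateObject("Scripting.Dictionary")
   pos = 1
   q = 0

   Do
      pos = InStr(pos, upperHtml, "<TD")
      If pos = 0 Then Exit Do
      q = q + 1

      closeTag = InStr(pos, upperHtml, ">")
      If closeTag = 0 Then Exit Do

      ' Only extract the text of cells the caller asked for
      If wanted.Exists(CStr(q)) Then
         endCell = InStr(closeTag, upperHtml, "</TD>")
         If endCell = 0 Then endCell = Len(html) + 1
         result(wanted(CStr(q))) = _
              Mid(html, closeTag + 1, endCell - closeTag - 1)
      End If

      pos = closeTag + 1
   Loop

   Set GetCells = result
End Function

' Made-up page and cell numbers, for illustration only
pageHtml = "<TABLE><TR><TD>Today</TD><TD>70</TD></TR>" & _
           "<TR><TD>Tomorrow</TD><TD>65</TD></TR></TABLE>"

Set wanted = Server.CreateObject("Scripting.Dictionary")
wanted.Add "2", "Temp_Today"
wanted.Add "4", "Temp_Tomorrow"

Set weather = GetCells(pageHtml, wanted)
Response.Write weather("Temp_Today")
%>
```

With the sample page above, cell 2 holds 70, so that is what weather("Temp_Today") returns.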

Good luck, and get permission to use the content you steal. Maybe someday everything will be XML and we won't have to pull things out of their layout!

Happy Programming!


Attachments:

  • Download extract.asp in text format
  • Run the demo!

