To read the article online, visit http://www.4GuysFromRolla.com/webtech/031000-1.shtml

Grabbing Table Columns from Other Web Pages

By Thomas Winningham


Here's a way to extract data from a web page. As you know, HTML delimits table cells with <TD> ... </TD>. So, although a table is laid out in two dimensions on the page, the order in which its cells appear in the HTML source is in fact linear.
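
To make the linear ordering concrete, here is a small sketch that counts the cells in a hard-coded 2 x 2 table. The CountCells helper and the sample markup are illustrative only and are not part of extract.asp.

```vbscript
<%
' Illustrative helper (not part of extract.asp): counts the <TD tags in a
' string of HTML, in the order they appear in the source.
Function CountCells(html)
   Dim upperHtml, pos, total

   upperHtml = UCase(html)              ' uppercase once so the search ignores case
   total = 0
   pos = InStr(1, upperHtml, "<TD")

   Do While pos > 0
      total = total + 1
      pos = InStr(pos + 3, upperHtml, "<TD")   ' continue after this "<TD"
   Loop

   CountCells = total
End Function

Dim sample
sample = "<table><tr><td>A</td><td>B</td></tr>" & _
         "<tr><td>C</td><td>D</td></tr></table>"

' The grid is 2 x 2, but in the source the cells simply follow one another:
' A is cell 1, B is cell 2, C is cell 3 and D is cell 4.
Response.Write CountCells(sample)       ' writes 4
%>
```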

Database-generated web pages such as stock quotes, weather, and news usually have server-side scripts that drop changing data into a design that never changes. By numbering the cells in an HTML document, we can pick out the cell that holds the data we want, extract it, and manipulate it (err... that is, if you are the content owner or have permission to do so).

My example of <TD> ... </TD> parsing uses the SoftWing ASPTear component, which is freely available. Heck, there's even a tutorial on ASPTear on 4Guys! This component lets you retrieve a web page on the server side and put its HTML into a string.
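
ASPTear is not the only way to fetch a page; the parsing only needs the HTML in a string. As a rough sketch, assuming the MSXML ServerXMLHTTP component is installed on the server, something like the following could stand in for the ASPTear call. FetchPage is a made-up name, and the choice to return an empty string on a non-200 response is an assumption, not something the article specifies.

```vbscript
<%
' Hypothetical helper (not part of extract.asp): fetches a URL with
' MSXML2.ServerXMLHTTP and returns the response body as a string.
' Assumes MSXML is installed on the web server.
Function FetchPage(url)
   Dim http

   Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
   http.open "GET", url, False          ' False = wait for the response
   http.send

   If http.status = 200 Then
      FetchPage = http.responseText
   Else
      FetchPage = ""                    ' assumption: treat any other status as "no page"
   End If

   Set http = Nothing
End Function
%>
```

With a helper like this, the line that calls xObj.Retrieve could be replaced by strRetVal = FetchPage(extracturl).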

I wrote two Functions: one to number the cells of a web page for reference (NumberCells) and one to actually extract a cell's content (GetCell). NumberCells is useful for determining the exact number of the table column (cell) that you are interested in extracting. Once you have this number, you can use the GetCell Function to grab that cell.

Here is an ASP page that defines these two Functions:

<%
'extract.asp (c) 2000 Thomas Winningham - use freely!

Function GetCell(cellnumber, extracturl)
   'Variables: CellNumber: number of the cell to get data from
   'ExtractURL: complete URL of the page that contains the cells and data
   'Returns a string of the cell's data (including HTML), or an empty
   'string if the page has no cell with that number

   'Use the SOFTWING.AspTear component to get the HTML (but any
   'will do... just rewrite this part). The connection is included in
   'the Function so that it is self-contained.
   Const Request_POST = 1
   Const Request_GET = 2

   Dim xObj, strRetVal, i, q, r
   Dim StartCellText, EndCellText, NestedTable

   GetCell = ""
   If cellnumber < 1 Then Exit Function

   Set xObj = Server.CreateObject("SOFTWING.AspTear")
   strRetVal = xObj.Retrieve(extracturl, Request_GET, "", "", "")
   Set xObj = Nothing

   i = 1              ' HTML Text Location Start
   q = 1              ' Cell Number Start

   ' Loop until we have processed the cell we're looking for
   Do Until q > cellnumber
      ' Look for <TD, the start of a cell
      i = InStr(i, UCase(strRetVal), "<TD")

      ' Stop if the page has fewer cells than requested; InStr raises
      ' an error if asked to search from position 0
      If i = 0 Then Exit Function

      ' Find the location of the end of the <TD tag
      r = InStr(i, strRetVal, ">")
      If r = 0 Then Exit Function       ' unterminated tag

      ' Let the next loop start looking after this <TD tag we found
      i = r + 1

      ' Increase the count of which cell we're at
      q = q + 1
   Loop

   ' The start of our cell text is right after the last found tag
   StartCellText = i

   ' Now... to find the end of this cell's text, we look for either </TD>
   ' or <TABLE, whichever comes first (but we have to check if they exist
   ' or not). We don't include nested tables in the cell data because
   ' those tables have cells of their own.
   EndCellText = InStr(r, UCase(strRetVal), "</TD>")
   NestedTable = InStr(r, UCase(strRetVal), "<TABLE")
   If NestedTable > 0 And (EndCellText = 0 Or NestedTable < EndCellText) Then
      EndCellText = NestedTable
   End If

   ' If neither marker exists, take everything up to the end of the document
   If EndCellText = 0 Then EndCellText = Len(strRetVal) + 1

   GetCell = Mid(strRetVal, StartCellText, EndCellText - StartCellText)
End Function



Function NumberCells(extracturl)
   'Variables: ExtractURL: the URL (e.g. http://www.cnn.com) to number
   'Returns a string of the entire HTML document with a number inserted
   'at the beginning of each cell

   'Use the SOFTWING.AspTear component to get the HTML (but any will do...
   'just rewrite this part). The connection is included in the Function
   'so that it is self-contained.
   Const Request_POST = 1
   Const Request_GET = 2

   Dim xObj, strRetVal, i, q, r

   Set xObj = Server.CreateObject("SOFTWING.AspTear")
   strRetVal = xObj.Retrieve(extracturl, Request_GET, "", "", "")
   Set xObj = Nothing

   i = 1           ' HTML Text Location Start
   q = 1           ' Cell Number Start

   ' So long as <TD cells exist, number them
   Do While InStr(i, UCase(strRetVal), "<TD") > 0
       ' Find the next <TD
       i = InStr(i, UCase(strRetVal), "<TD")

       ' Find the end of the <TD tag
       r = InStr(i, strRetVal, ">")

       ' Stop on an unterminated tag rather than loop forever
       If r = 0 Then Exit Do

       ' Number the cell: the string becomes all the HTML we've checked,
       ' our cell number, and then the HTML we've yet to check
       strRetVal = Left(strRetVal, r) & q & _
             Right(strRetVal, Len(strRetVal) - r)

       ' Let the next loop start looking after this <TD tag we found
       i = r + 1

       ' Increase the count of which cell we're at
       q = q + 1
   Loop

   NumberCells = strRetVal
End Function
%>
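
The listing above only defines the two Functions. A minimal sketch of calling them might look like the following, assuming it is placed in the same ASP page below the listing. The URL and the cell number 42 are placeholders, not values from the article.

```vbscript
<%
' Assumes NumberCells and GetCell are defined earlier in this same page.
Dim pageUrl
pageUrl = "http://www.example.com/quotes.html"   ' placeholder URL

' Step 1 (done once, by hand): show the page with every cell numbered,
' then read off the number of the cell you care about.
Response.Write NumberCells(pageUrl)

' Step 2: extract that cell. 42 is a placeholder; use the number
' you found in step 1.
Response.Write "<hr>Cell 42 contains: " & GetCell(42, pageUrl)
%>
```

Note that each call fetches the page again, so a production page would normally run only the GetCell step.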

Hopefully this will prove simple and valuable enough. Personally, I use this to extract weather forecasts from a web page and format them for sending to my pager (it's a cheap pager service, hehe).

Try It Out!
A demo has been set up to grab content from the Go.com Money Page. The current quote for Microsoft stock is yanked from this page. The top part of the demo displays the output of the NumberCells function, which shows the number assigned to each table column. The bottom half of the demo illustrates the use of GetCell, grabbing the value for Microsoft's quote. This is a cached page to increase performance.
  • Run the demo!
Feel free to optimize this thrown-together code. My biggest concerns are the speed at which it handles a whole HTML document as a string and whether my UCase() calls slow things down... I recently wrote a more complex version of this idea that gets multiple cells in one pass and puts them into a dictionary object (e.g., weather(Temp_Today) might return 70)...

Good luck, and get permission to use the content you steal. Maybe someday everything will be XML and we won't have to pull things out of their layout!

Happy Programming!
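
The single-pass, multi-cell idea mentioned above is not shown in the article. The following is only a guess at how it could look, not the author's version. It uppercases the HTML once, walks the cells once, and stores the requested ones in a Scripting.Dictionary. GetCells, the name/number pair convention for the wanted argument, and the sample markup are all assumptions made for illustration.

```vbscript
<%
' Illustrative sketch (not the author's code). html is the page source as a
' string; wanted is an array of name/number pairs, for example
' Array("Temp_Today", 2, "Temp_Tomorrow", 4).
' Returns a Scripting.Dictionary keyed by the given names.
Function GetCells(html, wanted)
   Dim cells, upperHtml, i, r, q, k, cellEnd, nested

   Set cells = Server.CreateObject("Scripting.Dictionary")
   upperHtml = UCase(html)              ' uppercase once, search many times

   q = 0
   i = InStr(1, upperHtml, "<TD")
   Do While i > 0
      q = q + 1                         ' q is the number of the current cell
      r = InStr(i, html, ">")           ' end of the <TD ...> tag
      If r = 0 Then Exit Do             ' unterminated tag: stop

      For k = 0 To UBound(wanted) - 1 Step 2
         If wanted(k + 1) = q Then
            ' End of cell: the closing </TD> or a nested <TABLE,
            ' whichever comes first; otherwise the end of the document.
            cellEnd = InStr(r, upperHtml, "</TD>")
            nested = InStr(r, upperHtml, "<TABLE")
            If nested > 0 And (cellEnd = 0 Or nested < cellEnd) Then cellEnd = nested
            If cellEnd = 0 Then cellEnd = Len(html) + 1
            cells(wanted(k)) = Mid(html, r + 1, cellEnd - r - 1)
         End If
      Next

      i = InStr(r + 1, upperHtml, "<TD")
   Loop

   Set GetCells = cells
End Function

Dim sample, weather
sample = "<table><tr><td>Today</td><td>70</td></tr>" & _
         "<tr><td>Tomorrow</td><td>65</td></tr></table>"

Set weather = GetCells(sample, Array("Temp_Today", 2, "Temp_Tomorrow", 4))
Response.Write weather("Temp_Today")    ' writes 70
%>
```

Keying the dictionary by name rather than by cell number keeps the calling code readable and means only one place needs updating if the source page's layout shifts.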


    Attachments:

  • Download extract.asp in text format


  • Article Information
    Article Title: Grabbing Table Columns from Other Web Pages
    Article Author: Thomas Winningham
    Published Date: Friday, March 10, 2000
    Article URL: http://www.4GuysFromRolla.com/webtech/031000-1.shtml


    Copyright 2017 QuinStreet Inc. All Rights Reserved.