Grabbing Table Columns from Other Web Pages
By Thomas Winningham
Here's a way to extract data from a web page. As you know, HTML delimits
table cells with <TD> ... </TD>. So, although a web page is
two-dimensional, the order in which table cells appear in an HTML document
is in fact linear.
With database-generated web pages like stock quotes, weather, and news, the
server-side scripting often puts changing data into a design that never
changes.
By numbering the cells in an HTML document, we can extract that data and
manipulate it (err... that is if you are the content owner or have
permission to do so.)
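To make the numbering idea concrete, here is a minimal sketch in Python (not the ASP code from this article). The name number_cells and the sample table are made up for illustration; the sketch simply inserts a running number after each opening <td> tag, in the order the tags appear in the document.

```python
import re

# Matches the opening tag of a cell, e.g. <td> or <TD align="right">.
TD_OPEN = re.compile(r"<td\b[^>]*>", re.IGNORECASE)

def number_cells(html):
    """Return html with a running cell number inserted after each <td ...> tag."""
    count = 0

    def add_number(match):
        nonlocal count
        count += 1
        # Keep the original tag and append this cell's number right after it.
        return match.group(0) + str(count)

    return TD_OPEN.sub(add_number, html)

# Hypothetical two-row table; the cells are numbered row by row, left to right.
sample = (
    "<table>"
    "<tr><td>City</td><td>Forecast</td></tr>"
    "<tr><td>Austin</td><td>Sunny</td></tr>"
    "</table>"
)
numbered = number_cells(sample)
print(numbered)
```

In this sample the "Sunny" cell comes out as cell 4, even though it sits in row 2, column 2 of the layout.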
My example of <TD> ... </TD> parsing uses the SoftWing ASPTear component,
which is freely available.
Heck, there's even a tutorial on ASPTear on 4Guys!
This component lets you fetch a web page on the server side and put its HTML into a string.
I wrote two functions: one to number a web page's cells for reference
(NumberCells) and one to actually extract a cell's content (GetCell).
NumberCells is very useful for determining the exact table column (cell)
number that you are interested in extracting. Once you have this number, you
can just use the GetCell function to grab that cell.
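As a rough illustration of the extraction step, here is a small Python sketch that works on an HTML string already in memory. It is not the ASP code from this article: the name get_cell, the regular expressions, and the sample pages are assumptions made for the example. It counts opening <td> tags until it reaches the requested one, then cuts the content at the next </td> or nested <table, whichever comes first.

```python
import re

# Opening tag of a cell, and the two markers that end a cell's own content.
TD_OPEN = re.compile(r"<td\b[^>]*>", re.IGNORECASE)
CELL_END = re.compile(r"</td>|<table\b", re.IGNORECASE)

def get_cell(html, cell_number):
    """Return the content of the cell_number-th <td> (1-based), or None if absent."""
    if cell_number < 1:
        return None
    pos = 0
    for _ in range(cell_number):
        match = TD_OPEN.search(html, pos)
        if match is None:
            return None  # the page has fewer cells than requested
        pos = match.end()  # continue scanning right after this <td ...> tag
    end = CELL_END.search(html, pos)
    if end is None:
        return None  # unterminated cell
    return html[pos:end.start()]

# Hypothetical pages used only to exercise the sketch.
page = "<table><tr><td>Austin</td><td><b>Sunny</b></td></tr></table>"
nested = "<table><tr><td>Outer<table><tr><td>Inner</td></tr></table></td></tr></table>"

print(get_cell(page, 2))
print(get_cell(nested, 1))
print(get_cell(nested, 2))
```

Returning None for a missing cell is a choice made for this sketch; it lets the caller tell "no such cell" apart from an empty cell.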
Here is an example ASP page that uses these two Functions:
<%
'extract.asp (c) 2000 Thomas Winningham - use freely!
Function GetCell(cellnumber, extracturl)
    'Variables: CellNumber: number of the cell to get data from
    '           ExtractURL: complete URL that contains the cells and data
    'Returns a string of the cell's data (including HTML), or an empty
    'string if the page has fewer cells than cellnumber

    'Use the SOFTWING.AspTear component to get the HTML (but any
    'will do... just rewrite this part). The connection is included in
    'the Function so that it is self-contained
    Const Request_POST = 1
    Const Request_GET = 2
    Set xObj = Server.CreateObject("SOFTWING.AspTear")
    strRetVal = xObj.Retrieve(extracturl,Request_GET,"","","")
    Set xObj = Nothing

    i = 1 ' HTML Text Location Start
    q = 1 ' Cell Number Start

    ' Loop until we have processed the cell we're looking for
    Do Until q > cellnumber
        ' Look for <TD the start of a cell
        i = InStr(i, UCase(strRetVal), "<TD")
        ' No more cells: stop here, because InStr fails with a start of 0
        If i = 0 Then
            GetCell = ""
            Exit Function
        End If
        ' Find the location of the end of the <TD tag
        r = InStr(i, strRetVal, ">")
        ' Let the next loop start looking after this <TD tag we found
        i = r + 1
        ' increase the count of which cell we're at
        q = q + 1
    Loop

    ' The start of our cell text is right after the last found tag
    StartCellText = i

    ' Now... to find the end of this cell's text, we look for either <TABLE
    ' or </TD> - whichever comes first (but we have to check if <TABLE exists)
    ' We don't include nested tables in the cell data because those tables have
    ' cells of their own.
    If (InStr(r, UCase(strRetVal), "<TABLE") > 0) And _
       (InStr(r, UCase(strRetVal), "<TABLE") < _
        InStr(r, UCase(strRetVal), "</TD>")) Then
        ThisCellText = Mid(strRetVal, StartCellText, _
            InStr(r, UCase(strRetVal), "<TABLE") - StartCellText)
    Else
        ThisCellText = Mid(strRetVal, StartCellText, _
            InStr(r, UCase(strRetVal), "</TD>") - StartCellText)
    End If

    GetCell = ThisCellText
End Function

Function NumberCells(extracturl)
    'Variables: ExtractURL: the URL (e.g. http://www.cnn.com) to number
    'Returns a string of the entire HTML document with numbers at the
    'beginning of each cell

    'Use the SOFTWING.AspTear component to get the HTML (but any will do...
    'just rewrite this part). The connection is included in the Function
    'so that it is self-contained
    Const Request_POST = 1
    Const Request_GET = 2
    Set xObj = Server.CreateObject("SOFTWING.AspTear")
    strRetVal = xObj.Retrieve(extracturl,Request_GET,"","","")
    Set xObj = Nothing

    i = 1 ' HTML Text Location Start
    q = 1 ' Cell Number Start

    ' So long as <TD cells exist - number them
    Do While InStr(i, UCase(strRetVal), "<TD") > 0
        ' find next <TD
        i = InStr(i, UCase(strRetVal), "<TD")
        ' find the end of the <TD tag
        r = InStr(i, strRetVal, ">")
        ' Number the cell: the string equals all the HTML we've checked,
        ' our cell number, and then the HTML we've yet to check
        strRetVal = Left(strRetVal, r) & q & _
            Right(strRetVal, Len(strRetVal) - r)
        ' Let the next loop start looking after this <TD tag we found
        i = r + 1
        ' increase the count of which cell we're at
        q = q + 1
    Loop

    NumberCells = strRetVal
End Function
%>
Hopefully this will prove simple and valuable enough. Personally I use this
to extract weather forecasts from a web page and format them for sending to
my pager (it's a cheap pager service hehe).
Try It Out!
A demo has been set up to grab the contents of the Go.com Money page. The
current quote for Microsoft stock is yanked from this page. The top part of
the demo displays the output of the NumberCells function, showing the value
of the various table columns. The bottom half of the demo illustrates the use
of GetCell, grabbing the value for Microsoft's quote. This is a cached page
to increase performance.
Run the demo!
Feel free to optimize this thrown-together code. My biggest concerns are the
speed at which this thing handles a whole HTML document as a string, and
whether my UCase() statements slow things down... I recently wrote a more
complex version of this idea that gets multiple cells in one pass and puts
them into a dictionary object (e.g., weather(Temp_Today) might return 70)...
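The multi-cell version is not shown in this article, so the following is only a guess at the general shape, sketched in Python rather than ASP. The name get_cells, the mapping of cell numbers to keys, and the sample table with its values are all made up for illustration. The page is scanned once, and each wanted cell is stored under its key as the scan passes it.

```python
import re

# Opening tag of a cell, and the two markers that end a cell's own content.
TD_OPEN = re.compile(r"<td\b[^>]*>", re.IGNORECASE)
CELL_END = re.compile(r"</td>|<table\b", re.IGNORECASE)

def get_cells(html, wanted):
    """wanted maps a 1-based cell number to a key; returns {key: cell content}."""
    found = {}
    last_wanted = max(wanted, default=0)
    # Walk the opening <td> tags once, in document order.
    for number, match in enumerate(TD_OPEN.finditer(html), start=1):
        if number > last_wanted:
            break  # nothing left to collect
        if number in wanted:
            end = CELL_END.search(html, match.end())
            if end is not None:
                found[wanted[number]] = html[match.end():end.start()]
    return found

# Hypothetical forecast table; the numbers are sample data, not real readings.
page = (
    "<table>"
    "<tr><td>Today</td><td>70</td></tr>"
    "<tr><td>Tomorrow</td><td>65</td></tr>"
    "</table>"
)
weather = get_cells(page, {2: "Temp_Today", 4: "Temp_Tomorrow"})
print(weather["Temp_Today"])
```

Cells that the page does not contain are simply left out of the result, so a caller can check for a key before using it.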
Good luck, and get permission to use the content you steal. Maybe someday
everything will be XML and we won't have to pull things out of their layout!
Happy Programming!
Attachments:
Download extract.asp in text format