Safe Screenscraping with Microsoft's XMLHTTP Component
By Paul Kosmas
Introduction
There are many times when, as a Web developer, you need to grab HTML or XML data from a remote Web site.
In fact, there is a plethora of articles on the subject here on 4Guys and on ASPFAQs.com.
In my article, I am going to examine some of the more advanced features of a free Microsoft component that
can be used to grab either XML or HTML data from a remote Web site: XMLHttp. For a general discussion on
screen scraping in ASP using XMLHttp (and other free components), read this
FAQ. For this article I am going to assume you are familiar with grabbing HTML or XML data from a remote
server through XMLHttp.
Timeout Limitations
One problem that can arise when grabbing data from a remote server is that the remote server you are calling
may be experiencing problems. For example, the data you are trying to retrieve may be unavailable or
moved, or the remote server may be down. Any of these could crash your script, returning an unwanted error message
to the client browser or bringing down your page altogether.
Handling this situation gracefully is simple once you are aware of a few under-documented members of the XMLHttp component. Note that for this article I will be discussing the features available in version 3 SP1 of the Microsoft XMLHttp component. You can download the latest version of the XMLHttp component at http://msdn.microsoft.com/downloads/default.asp?url=/downloads/topic.asp?url=/msdn-files/028/000/072/topic.xml.
On the Web site that I run, we wanted to grab some weather data from a local source (AccuWeather's nearest reporting sources were over 60 miles away, and at different elevations, so it wasn't quite "accurate" after all). Incidentally, the data we needed was not XML, so we requested the information as a string, but you can apply this solution to a pure XML request as well.
Now, how do we handle a scenario in which the Web site takes a long time to respond to our request?
XMLHttp contains a readyState property which returns the state of the data being requested.
This property has one of five possible values that indicate the progress of the request:
readyState Property Values
| Value | Description |
|---|---|
| 0 | The object has been created but has not been initialized because the open method has not been called. |
| 1 | The object has been created but the send method has not been called. |
| 2 | The send method has been called and the status and headers are available, but the response is not yet available. |
| 3 | Some data has been received. You can call responseBody and responseText to get the current partial results. |
| 4 | All the data has been received, and the complete data is available in responseBody and responseText. |
Of course, the value of the readyState property will change over time. Assuming a successful
request, the readyState property will, throughout the lifetime of our request, pass through the values
0 through 4 in turn. What we need to guard against is the request
taking too long to reach the final state.
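When you are tracking down a slow request, it can help to print what the current readyState value means. The following is only a small sketch: DescribeReadyState is a hypothetical helper name, not something XMLHttp provides, and the labels simply paraphrase the table above.

```vbscript
<%
' Hypothetical helper (not part of XMLHttp): maps a readyState value
' to a short description based on the table above.
Function DescribeReadyState(intState)
    Select Case intState
        Case 0
            DescribeReadyState = "Uninitialized: open has not been called"
        Case 1
            DescribeReadyState = "Open: send has not been called"
        Case 2
            DescribeReadyState = "Sent: status and headers are available"
        Case 3
            DescribeReadyState = "Receiving: partial data is available"
        Case 4
            DescribeReadyState = "Complete: all data has been received"
        Case Else
            DescribeReadyState = "Unknown state"
    End Select
End Function
%>
```

Given a request object in a variable of your own (say, objXMLHTTP), a line such as `Response.Write DescribeReadyState(objXMLHTTP.readyState)` would show how far the request has progressed.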
Fortunately, XMLHttp contains a waitForResponse method (exposed by the server-side ServerXMLHTTP object) that accepts a parameter specifying
how many seconds to wait for the remote server to reply. We can use this method to say something along
the lines of, "Wait two seconds to get the data from the remote server. If you don't have it by then,
give up on waiting for it..." There are some things to note, though, when using this method. First, you must
make the request to the remote server asynchronous. Also, if the timeout expires
when trying to access a page, the XMLHttp component will raise an error, meaning that you must use
On Error Resume Next so that the error does not halt your script, and then specifically check for a raised error
after the call to the waitForResponse method. If this sounds confusing, don't worry, a code sample
will help clear everything up!
The code sample below illustrates how to use the XMLHttp component to access data from a remote Web server. If the request takes longer than three seconds, "default" data is used in place of the data that should have been retrieved from the remote Web server.
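A minimal sketch of the first step, creating the object and sending an asynchronous request, might look like the following. It is not the author's original listing: the URL, the default text, and the variable names are placeholders, and the MSXML2.ServerXMLHTTP ProgID is an assumption based on waitForResponse being a member of the server-side object.

```vbscript
<%
' Placeholder values (assumptions): substitute your own data source and fallback text.
Const REMOTE_URL = "http://www.example.com/weather.txt"
Const DEFAULT_DATA = "Weather data is currently unavailable."

Dim objXMLHTTP, strData

' Start with the default so strData always holds something usable.
strData = DEFAULT_DATA

' Assumed ProgID: the server-side XMLHttp object installed with MSXML 3.
Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")

' The third argument (True) makes the request asynchronous,
' so send returns immediately instead of blocking until the reply arrives.
objXMLHTTP.open "GET", REMOTE_URL, True
objXMLHTTP.send
%>
```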
At this point in the script, the local server is waiting for the remote server to send back some information.
We want to give the remote server a reasonable time to respond, but not so much time that we lose our visitor while
waiting for the data. Let's wait up to three seconds for this information. Remember, if the request takes
longer than three seconds, the XMLHttp component will raise an error. Hence we must first use On Error Resume Next
to keep that error from halting the script, call the waitForResponse method, and then check if an error has occurred
(for more information on error handling in VBScript be sure to read:
Error Handling in ASP):
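One way this wait could be written is sketched below, wrapped in a hypothetical helper function (ReplyArrived) so the fragment stands on its own. It assumes the object passed in is a ServerXMLHTTP request whose asynchronous send has already been called. Microsoft's documentation describes waitForResponse as returning a Boolean (False on a time-out), so the sketch checks the return value as well as Err.Number rather than relying on either one alone.

```vbscript
<%
' Hypothetical helper: returns True only if a reply arrived within intSeconds.
' objRequest is assumed to be a ServerXMLHTTP object on which an
' asynchronous send has already been called.
Function ReplyArrived(objRequest, intSeconds)
    Dim blnGotReply
    blnGotReply = False

    On Error Resume Next        ' keep a raised error from halting the page
    blnGotReply = objRequest.waitForResponse(intSeconds)
    If Err.Number <> 0 Then
        blnGotReply = False     ' treat any raised error as "no reply"
        Err.Clear
    End If
    On Error GoTo 0             ' restore normal error behavior

    ReplyArrived = blnGotReply
End Function
%>
```

A page would then branch on the result, for example `If Not ReplyArrived(objXMLHTTP, 3) Then` to use the default data and `Else` to go on to inspect the response, where objXMLHTTP is whatever name you gave your request object.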
If we reach this Else code, we know that the server responded. However, what if we were requesting
a 404 page? Or what if not all of the data was returned for some reason? To account for these unexpected
behaviors that could seriously screw up our end results, we need to ensure that the readyState property
equals 4 and that the Status property, which returns the HTTP response status,
equals 200 (meaning that the HTTP request was successful). If both checks pass, we have a successfully
completed request; if either one fails, the response should not be trusted.
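The two checks can be sketched as a small helper. ResponseOrDefault is a hypothetical name, and the function assumes it is handed an XMLHttp request object along with whatever fallback text you want when the reply is incomplete or unsuccessful.

```vbscript
<%
' Hypothetical helper: returns the response text only for a complete,
' successful reply; otherwise returns strDefault.
Function ResponseOrDefault(objRequest, strDefault)
    ResponseOrDefault = strDefault

    ' The Ifs are nested on purpose: VBScript's And does not short-circuit,
    ' so a combined test would read status even before the request is complete.
    If objRequest.readyState = 4 Then
        If objRequest.status = 200 Then
            ResponseOrDefault = objRequest.responseText
        End If
    End If
End Function
%>
```

Starting from the default and overwriting it only on success means every failure path (a 404, a partial reply, a server error) ends up with the same safe value.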
Conclusion
At this point we have the data (or lack of it) stored in a variable that can be used anywhere in the remainder
of the ASP page. I use this information on my front page and all over my site, so to reduce stress
on both servers I used the technique described in another 4Guys article to store the data in an
application-level variable, which I'd recommend you do as well. For more information on this technique, be sure to
read: A Real-World Example of Caching Data in the Application Object.
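As a rough illustration of the idea (not the code from that article), the retrieved data can be kept in the Application object along with a timestamp and refreshed only when it is older than some interval. The key names, the 30-minute interval, and the FetchWeather stub below are all assumptions made for the sketch.

```vbscript
<%
Const CACHE_MINUTES = 30    ' assumed refresh interval

' Stub standing in for the remote request described in this article.
Function FetchWeather()
    FetchWeather = "Sample weather data"
End Function

' Returns the cached data, refreshing it first if it is missing or stale.
' Whatever FetchWeather returns, including fallback text, is cached
' for the full interval.
Function GetCachedWeather()
    Dim blnStale, strFresh
    blnStale = True

    ' An unset Application key is Empty, and IsDate(Empty) is False.
    If IsDate(Application("WeatherFetchedAt")) Then
        blnStale = (DateDiff("n", Application("WeatherFetchedAt"), Now()) >= CACHE_MINUTES)
    End If

    If blnStale Then
        strFresh = FetchWeather()
        Application.Lock        ' serialize writes from concurrent requests
        Application("WeatherData") = strFresh
        Application("WeatherFetchedAt") = Now()
        Application.Unlock
    End If

    GetCachedWeather = Application("WeatherData")
End Function
%>
```

With this in place, most page views read the stored string and never touch the remote server at all.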
I hope you found this article both interesting and illuminating! If you have any questions or problems, please don't hesitate to email me.
Happy Programming!




