Safe Screenscraping with Microsoft's XMLHTTP Component
By Paul Kosmas
There are many times when, as a Web developer, you need to grab HTML or XML data from a remote Web site. In fact, there is a plethora of such articles here on 4Guys and on ASPFAQs.com. In this article, I am going to examine some of the more advanced features of a free Microsoft component that can be used to grab either XML or HTML data from a remote Web site: XMLHttp. For a general discussion on screen scraping in ASP using XMLHttp (and other free components), read this FAQ. For this article I am going to assume you are familiar with grabbing HTML or XML data from a remote server through XMLHttp.
One problem that can arise when grabbing data from a remote server is that the remote server may be experiencing problems. For example, the data you are trying to retrieve may be unavailable or moved, or the remote server may be down. Any of these could crash your script, returning an unwanted error message to the client browser or bringing down your page altogether.
Handling this situation gracefully is simple once you are aware of a few under-documented properties of the XMLHttp component. Note that for this article I will be discussing the features available in version 3 SP1 of the MS XMLHttp component. You can download the latest version of the XMLHttp component at http://msdn.microsoft.com/downloads/default.asp?url=/downloads/topic.asp?url=/msdn-files/028/000/072/topic.xml.
On the Web site that I run, we wanted to grab some weather data from a local source (AccuWeather's nearest reporting sources were over 60 miles away, and at different elevations, so the data wasn't quite "accurate" after all). Incidentally, the data we needed was not XML, so we requested the information as a string, but you can apply this solution to a pure XML request as well.
Now, how do we handle a scenario in which the Web site takes a long time to respond to our request?
XMLHttp contains a readyState property which returns the state of the data being requested.
This property has one of five possible values that indicate the progress of the request:
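|Value|Description|
|---|---|
|0|The object has been created but has not been initialized because the open method has not been called.|
|1|The object has been created but the send method has not been called.|
|2|The send method has been called and the status and headers are available, but the response is not yet available.|
|3|Some data has been received. You can call responseBody and responseText to get the current partial results.|
|4|All the data has been received, and the complete data is available in responseBody and responseText.|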
Of course, the value of the readyState property will change over time. Assuming a successful request, the readyState property will, throughout the lifetime of our request, have the values 0 through 4 at one point or another. The thing we need to be wary of is the request taking too long to reach the final stage.
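When debugging a slow request it can help to log which stage it has reached. The following is a minimal sketch, not part of the original script: ReadyStateName is a hypothetical helper, and the state names are the ones used in the MSXML documentation.

```vbscript
' Hypothetical helper: translates a readyState value into the state
' name used in the MSXML documentation, for logging or debugging.
Function ReadyStateName(intState)
    Select Case intState
        Case 0
            ReadyStateName = "UNINITIALIZED"
        Case 1
            ReadyStateName = "LOADING"
        Case 2
            ReadyStateName = "LOADED"
        Case 3
            ReadyStateName = "INTERACTIVE"
        Case 4
            ReadyStateName = "COMPLETED"
        Case Else
            ReadyStateName = "UNKNOWN"
    End Select
End Function
```

For example, Response.Write ReadyStateName(objXMLHTTP.readyState) would print the current stage of a request object named objXMLHTTP (an assumed variable name).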
Fortunately, XMLHttp contains a waitForResponse method that accepts a parameter specifying how many seconds to wait for the remote server to reply. We can use this method to say something along the lines of, "Wait two seconds to get the data from the remote server. If you don't have it by then, give up on waiting for it..." There are some things to note, though, when using this method. First, you must make an asynchronous request to the remote server. Also, if the timeout expires when trying to access a page, the XMLHttp component will raise an error, meaning that you must use On Error Resume Next to disable error trapping and then specifically check for a raised error after calling the waitForResponse method. If this sounds confusing, don't worry, a code sample will help clear everything up!
The code sample below illustrates how to use the XMLHttp component to access data from a remote Web server. If the request takes longer than three seconds, "default" data is used to simulate the data that should have been retrieved from the remote Web server.
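The opening of such a script might look like the sketch below. It assumes the server-side MSXML2.ServerXMLHTTP class, which is the MSXML class that documents waitForResponse. The URL, the default text, and the variable names are placeholders, not values from the original script.

```vbscript
<%
' Sketch only: the URL, default text, and variable names are assumptions.
Dim objXMLHTTP, strURL, strData

strURL  = "http://www.example.com/weather.txt"   ' hypothetical remote page
strData = "Weather data is currently unavailable." ' "default" data

Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")

' The third argument (True) makes the request asynchronous, so the
' script keeps running instead of blocking until the server replies.
objXMLHTTP.open "GET", strURL, True
objXMLHTTP.send
%>
```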
At this point in the script, the local server is waiting for the remote server to send back some information. We want to give the remote server a reasonable time to respond, but not so much time that we lose our visitor while waiting for the data. Let's wait up to three seconds for this information. Remember, if the request takes longer than three seconds, the XMLHttp component will raise an error. Hence we must turn off error handling first, call the waitForResponse method, and then check if an error has occurred (for more information on error handling in VBScript be sure to read: Error Handling in ASP):
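The wait-and-check step can be sketched as a small helper. ResponseArrived is a hypothetical name, and the request object is assumed to be an MSXML2.ServerXMLHTTP instance on which send has already been called asynchronously. The MSXML documentation describes waitForResponse as returning True when the response arrives in time and False on a timeout, so this sketch checks both that return value and Err.Number.

```vbscript
' Hypothetical helper: returns True only if the response arrived
' within intSeconds and no error was raised while waiting.
Function ResponseArrived(objRequest, intSeconds)
    Dim blnDone
    blnDone = False

    On Error Resume Next   ' keep a raised error from halting the page

    If objRequest.readyState = 4 Then
        blnDone = True     ' already complete, nothing to wait for
    Else
        ' If this call raises an error, the assignment is skipped
        ' and blnDone stays False.
        blnDone = objRequest.waitForResponse(intSeconds)
    End If

    If Err.Number <> 0 Then
        blnDone = False
        Err.Clear
    End If

    On Error GoTo 0        ' restore normal error handling
    ResponseArrived = blnDone
End Function
```

A page could then branch on the result, for example If ResponseArrived(objXMLHTTP, 3) Then ... Else ... End If, using the default data in the Else branch.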
If we reach this Else code, we know that the server responded. However, what if we were requesting a 404 page? Or what if all the data didn't get returned for some reason? To accommodate these unexpected behaviors, which could seriously skew our end results, we need to ensure that the readyState property is 4 and that the Status property, which returns the HTTP response status, is 200 (meaning that the HTTP request was successful). Only when both conditions hold do we have a successfully retrieved response.
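These two checks can be sketched as follows. ResponseOrDefault is a hypothetical helper, and falling back to a caller-supplied default string is an assumption based on the "default" data mentioned earlier.

```vbscript
' Hypothetical helper: returns the response text only when the request
' completed (readyState 4) with HTTP status 200; otherwise strDefault.
Function ResponseOrDefault(objRequest, strDefault)
    ResponseOrDefault = strDefault

    ' Nested Ifs are used because VBScript's And does not short-circuit;
    ' this way Status is read only after the request has completed.
    If objRequest.readyState = 4 Then
        If objRequest.status = 200 Then
            ResponseOrDefault = objRequest.responseText
        End If
    Else
        objRequest.abort   ' give up on the outstanding request
    End If
End Function
```

With this shape, a 404 response or an incomplete request both end up using the default data rather than partial or misleading content.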
At this point we have the data (or lack of it) stored in a variable that can be used anywhere in the remainder of the ASP page. I use this information on my front page, and all over my site. So to reduce stress on both servers I used the technique described in another 4Guys article to store the data in an application level variable, which I'd recommend you do as well. For more information on this technique, be sure to read: A Real-World Example of Caching Data in the Application Object.
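One way to cache the result at the application level is sketched below. This is an assumption-laden illustration and not necessarily the approach the referenced article takes: the keys "WeatherData" and "WeatherFetched" and both helper names are hypothetical.

```vbscript
<%
' Hypothetical helper: True if cached data exists and is younger
' than intMinutes minutes.
Function CacheIsFresh(intMinutes)
    CacheIsFresh = False
    ' An unset Application variable is Empty, for which IsDate is False.
    If IsDate(Application("WeatherFetched")) Then
        If DateDiff("n", Application("WeatherFetched"), Now()) < intMinutes Then
            CacheIsFresh = True
        End If
    End If
End Function

' Hypothetical helper: stores the data and the time it was fetched.
Sub SaveToCache(strData)
    Application.Lock       ' prevent concurrent writes from other requests
    Application("WeatherData") = strData
    Application("WeatherFetched") = Now()
    Application.Unlock
End Sub
%>
```

A page would call the remote server only when CacheIsFresh(30) returns False, pass the retrieved string to SaveToCache, and otherwise read Application("WeatherData") directly. The 30-minute lifetime is an arbitrary example value.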
I hope you found this article both interesting and illuminating! If you have any questions or problems, please don't hesitate to email me.