Published: Wednesday, October 24, 2001

Safe Screenscraping with Microsoft's XMLHTTP Component

By Paul Kosmas

There are many times when, as a Web developer, you need to grab HTML or XML data from a remote Web site. In fact, there is a plethora of such articles here on 4Guys and on ASPFAQs.com. In this article, I am going to examine some of the more advanced features of a free Microsoft component that can be used to grab either XML or HTML data from a remote Web site: XMLHttp. For a general discussion on screen scraping in ASP using XMLHttp (and other free components), read this FAQ. For this article I am going to assume you are familiar with grabbing HTML or XML data from a remote server through XMLHttp.


Timeout Limitations
One problem that can arise when grabbing data from a remote server is that the remote server may be experiencing problems. For example, the data you are trying to retrieve may be unavailable or moved, or the remote server may be down. Any of these could crash your script, returning an unwanted error message to the client browser or bringing down your page altogether.

Handling this situation gracefully is simple once you are aware of a few under-documented features of the XMLHttp component. Note that for this article I will be discussing the features available in version 3 SP1 of Microsoft's XMLHttp component. You can download the latest version of the XMLHttp component at http://msdn.microsoft.com/downloads/default.asp?url=/downloads/topic.asp?url=/msdn-files/028/000/072/topic.xml.

On the Web site that I run, we wanted to grab some weather data from a local source (AccuWeather's nearest reporting stations were over 60 miles away, and at different elevations, so their data wasn't quite "accurate" for us after all). Incidentally, the data we needed was not XML, so we requested the information as a string, but you can apply this solution to a pure XML request as well.

Now, how do we handle a scenario in which the Web site takes a long time to respond to our request? XMLHttp contains a readyState property which returns the state of the data being requested. This property has one of five possible values that indicate the progress of the request:

readyState Property Values

Value  Description
0      The object has been created but has not been initialized because the open method has not been called.
1      The object has been created but the send method has not been called.
2      The send method has been called and the status and headers are available, but the response is not yet available.
3      Some data has been received. You can call responseBody and responseText to get the current partial results.
4      All the data has been received, and the complete data is available in responseBody and responseText.
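When debugging a slow request, it can help to log the readyState value in words rather than as a bare number. Here is a minimal sketch of such a helper. The function name ReadyStateName and the short labels are my own, made up for illustration and condensed from the table above; they are not part of the XMLHttp component.

```vbscript
' Hypothetical helper (not part of XMLHttp): turns a readyState
' number into a short label based on the table above.
Function ReadyStateName(state)
  Select Case state
    Case 0
      ReadyStateName = "open not yet called"
    Case 1
      ReadyStateName = "send not yet called"
    Case 2
      ReadyStateName = "headers available"
    Case 3
      ReadyStateName = "partial data received"
    Case 4
      ReadyStateName = "all data received"
    Case Else
      ReadyStateName = "unknown state"
  End Select
End Function
```

In an ASP page you might then write something like Response.Write ReadyStateName(xml.readyState) while testing, and remove it once the page behaves.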

Of course, the value of the readyState property will change over time. Assuming a successful request, the readyState property will, over the lifetime of our request, move through the values 0 to 4. What we need to be wary of is the request taking too long to reach the final stage.

Fortunately, XMLHttp contains a waitForResponse method that accepts a parameter specifying how many seconds to wait for the remote server to reply. We can use this method to say something along the lines of, "Wait two seconds to get the data from the remote server. If you don't have it by then, give up on waiting for it..." There are some things to note, though, when using this method. First, you must make an asynchronous request to the remote server. Also, if the timeout expires when trying to access a page, the XMLHttp component will raise an error, meaning that you must use On Error Resume Next so that the error does not halt the script, and then specifically check for a raised error after calling the waitForResponse method. If this sounds confusing, don't worry: a code sample will help clear everything up!

The code sample below illustrates how to use the XMLHttp component to access data from a remote Web server. If the request takes longer than three seconds, "default" data is used in place of the data that should have been retrieved from the remote Web server.
  'Declare the variables, and set the value of the variable url as 
  'the full URL of the requested page.
  Dim xml, strData, url
  url = "http://www.someserver.com/somepage.asp"

  'Next, create an instance of the MS XMLhttp component.
  'old version, unstable server side performance so upgrade if you can
  ' *** OLD *** Set xml = Server.CreateObject("Microsoft.XMLHTTP")
  'new version, better server side performance
  Set xml = Server.CreateObject("MSXML2.ServerXMLHTTP")

  'Now open the connection, and send the request to the remote server. You 
  'need to set the optional Async parameter to True on the open method for 
  'this script to work. Otherwise, the waitForResponse method used 
  'below will have no effect.
  xml.Open "GET", url, true   ' the True specifies an asynchronous request
  Call xml.Send()

At this point in the script, the local server is waiting for the remote server to send back some information. We want to give the remote server a reasonable time to respond, but not so much time that we lose the visitor while waiting for the data. Let's wait up to three seconds for this information. Remember, if the request takes longer than three seconds, the XMLHttp component will raise an error. Hence we must first tell VBScript to continue past errors with On Error Resume Next, call the waitForResponse method, and then check if an error has occurred (for more information on error handling in VBScript be sure to read: Error Handling in ASP):

  'Turn off error handling
  On Error Resume Next
  'Wait for up to 3 seconds if we've not gotten the data yet
  If xml.readyState <> 4 then
    xml.waitForResponse 3
  End If

  'Did an error occur?  If so, use a default value for our data
  If Err.Number <> 0 then
    strData = "some default text..."
  Else

If we reach this Else code, we know that no error was raised while waiting. However, what if the page we requested returned a 404? Or what if not all of the data was returned for some reason? To account for these unexpected behaviors, which could seriously screw up our end results, we need to ensure that the readyState property equals 4 and that the Status property, which returns the HTTP response status, equals 200 (meaning that the HTTP request was successful). If either check fails, we abort the request and fall back to an error message. Otherwise, we have a successfully completed request and can read the ResponseText.

    If (xml.readyState <> 4) Or (xml.Status <> 200) Then
      'Abort the XMLHttp request
      xml.Abort
      strData = "Problem communicating with remote server..."
    Else
      'The request completed successfully, so use the returned data
      strData = xml.ResponseText
    End If
  End If
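Since the script above is presented in pieces, here is a rough sketch of how the same steps might be folded into one reusable function. This is only one way to arrange it, under a few assumptions: the name FetchWithTimeout and its three parameters are hypothetical, and I use plain CreateObject so that the function can also be tried outside of an ASP page (inside ASP you would normally use Server.CreateObject, as shown earlier).

```vbscript
' Hypothetical wrapper around the steps shown in this article.
' Returns the remote page's text, or defaultText if anything goes wrong.
Function FetchWithTimeout(url, timeoutSeconds, defaultText)
  Dim http, state, httpStatus
  FetchWithTimeout = defaultText   ' assume failure until proven otherwise
  state = -1
  httpStatus = 0

  ' Errors are suppressed only inside this function
  On Error Resume Next
  Set http = CreateObject("MSXML2.ServerXMLHTTP")
  http.Open "GET", url, True       ' True = asynchronous request
  http.Send
  http.waitForResponse timeoutSeconds

  ' Copy the values into local variables first, so that the If tests
  ' below never touch the object and cannot themselves raise an error
  state = http.readyState
  If state = 4 Then httpStatus = http.Status

  If Err.Number = 0 And state = 4 And httpStatus = 200 Then
    FetchWithTimeout = http.responseText
  Else
    http.Abort                     ' give up on the pending request
  End If

  Err.Clear
  On Error GoTo 0
  Set http = Nothing
End Function
```

A call might look like strData = FetchWithTimeout("http://www.someserver.com/somepage.asp", 3, "some default text..."). One reason to consider a function here is that On Error Resume Next only applies within the procedure where it appears, so the rest of the page keeps its normal error behavior.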

At this point we have the data (or lack of it) stored in a variable that can be used anywhere in the remainder of the ASP page. I use this information on my front page, and all over my site. So to reduce stress on both servers I use the technique described in another 4Guys article to store the data in an application-level variable, which I'd recommend you do as well. For more information on this technique, be sure to read: A Real-World Example of Caching Data in the Application Object.
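To give a feel for what such caching might look like, here is a minimal sketch. It is not the code from the article mentioned above. The names CacheIsStale and StoreWeather, the Application keys "WeatherData" and "WeatherFetchedAt", and the 15 minute lifetime used in the comments are all assumptions made up for this illustration.

```vbscript
' Hypothetical helper: True when the cached copy is missing or is
' at least maxAgeMinutes old. rightNow is passed in so that the
' function is easy to check with fixed dates.
Function CacheIsStale(fetchedAt, maxAgeMinutes, rightNow)
  If IsDate(fetchedAt) Then
    CacheIsStale = (DateDiff("n", fetchedAt, rightNow) >= maxAgeMinutes)
  Else
    ' Nothing has been cached yet
    CacheIsStale = True
  End If
End Function

' Hypothetical helper: saves the data and the time it was fetched.
' Lock and Unlock keep two requests from writing at the same time.
Sub StoreWeather(data)
  Application.Lock
  Application("WeatherData") = data
  Application("WeatherFetchedAt") = Now()
  Application.Unlock
End Sub

' Possible use in a page (shown as comments because it needs ASP):
'   If CacheIsStale(Application("WeatherFetchedAt"), 15, Now()) Then
'     ... make the XMLHttp request described above to fill strData ...
'     StoreWeather strData
'   End If
'   Response.Write Application("WeatherData")
```

With something along these lines, the remote server is only contacted when the cached copy has aged out, rather than on every page view. You may also want to skip the StoreWeather call when the request failed, so that a temporary outage does not get cached.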

I hope you found this article both interesting and illuminating! If you have any questions or problems, please don't hesitate to email me.

Happy Programming!
