Parsing HTML Documents with the Html Agility Pack
By Scott Mitchell
Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.
The .NET Framework offers a variety of classes for accessing data from a remote website, most notably the WebClient and HttpWebRequest classes. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.Substring and the like, or on regular expressions.
Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.)
This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this article. Read on to learn more!
Getting Started: Downloading and Using the Html Agility Pack
The Html Agility Pack is a free, open-source library that parses an HTML document and constructs a Document Object Model (DOM) that can be traversed manually or by using XPath expressions. (To use the Html Agility Pack you must be using ASP.NET version 3.5 or later.) In a nutshell, the Html Agility Pack makes it easy to examine an HTML document for particular content, and to extract or modify that markup.
The Html Agility Pack is wrapped inside a single assembly, HtmlAgilityPack.dll. To use the Html Agility Pack from your website you'll need to copy this assembly into your website's Bin folder. You can download the latest version of HtmlAgilityPack.dll from the Html Agility Pack project page; alternatively, you can download the demo available at the end of this article, which includes HtmlAgilityPack.dll version 1.4.0 in its Bin folder.
With the Html Agility Pack assembly in the
Bin folder you're ready to start downloading and parsing HTML documents. This article shows how to use the
Html Agility Pack to perform three different HTML parsing tasks.
Listing the Meta Tags on a Remote Web Page
Screen scraping usually involves downloading the HTML for a specific web page and picking out particular pieces of information. This first demo shows how to use the Html Agility Pack to download a remote web page and enumerate the <meta> tags, displaying those <meta> tags that contain both a name attribute and a content attribute.
The Html Agility Pack contains a number of classes, all in the
HtmlAgilityPack namespace. Therefore, start by adding a
using statement (or
Imports statement if you are using VB) to the top of your code-behind class:
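The directive simply imports the library's namespace:

```csharp
using HtmlAgilityPack;
```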
To download a web page from a remote server, create an HtmlWeb object and call its Load method, passing in the URL to download. The Load method returns an HtmlDocument object, which we assign to a local variable. The HtmlDocument class represents a complete HTML document and contains a DocumentNode property, which returns an HtmlNode object that represents the root node of the document.
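A sketch of that download step (the variable names htmlWeb and htmlDoc are my own; the article's original snippet was not preserved):

```csharp
// Download the page at the requested URL and build a DOM from its markup.
var htmlWeb = new HtmlWeb();
HtmlDocument htmlDoc = htmlWeb.Load("http://www.4guysfromrolla.com/");

// DocumentNode is the root HtmlNode of the parsed document.
HtmlNode root = htmlDoc.DocumentNode;
```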
The HtmlNode class has properties for traversing the DOM - such as ParentNode, ChildNodes, NextSibling, and PreviousSibling - along with several other germane properties worth noting:

- Name - gets or sets the node's name. For HTML elements this property returns (or assigns) the name of the tag - "body" for the <body> tag, "p" for a <p> tag, and so on.
- Attributes - returns the collection of attributes for this element, if any.
- InnerHtml - gets or sets the HTML content within the node.
- InnerText - returns the text within the node.
- NodeType - indicates the type of the node. Can be Document, Element, Comment, or Text.

The HtmlNode class also offers a number of useful methods. The Ancestors method returns a collection of all ancestor nodes. And the SelectNodes method returns a collection of nodes that match a specified XPath expression.
Given all of these methods and properties, there are a variety of ways you could get a list of all
<meta> tags in the HTML document. For this demo I
decided to use the
SelectNodes method. The statement below calls the
SelectNodes method of the
DocumentNode property, using the XPath expression "//meta", which returns all of the
<meta> tags in the document.
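Assuming the htmlDoc variable from the download step, that statement might look like:

```csharp
// Select every <meta> element anywhere in the document.
HtmlNodeCollection metaTags = htmlDoc.DocumentNode.SelectNodes("//meta");
```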
If there are no <meta> tags in the document then, at this point, metaTags will be null. But if there are one or more <meta> tags then metaTags will be a collection of matching HtmlNode objects. We can enumerate these matching nodes and display their attributes.
(For more on XPath, consult an XPath tutorial or reference. In brief, the expression "//meta" selects all <meta> elements anywhere in the document.)
The foreach loop enumerates the items in metaTags (if it's not null) and checks to see that both the name and content attributes exist. Presuming these attributes exist, the <meta> tag information is emitted. (Note how the value of an attribute is accessed using the syntax node.Attributes["attributeName"].Value.)
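A sketch of such a loop (the BulletedList control and its ID are my assumptions; the demo's actual output code was not preserved):

```csharp
if (metaTags != null)
{
    foreach (HtmlNode metaTag in metaTags)
    {
        // Skip <meta> tags missing either a name or a content attribute.
        if (metaTag.Attributes["name"] == null || metaTag.Attributes["content"] == null)
            continue;

        // blMetaTags is a hypothetical BulletedList control on the page.
        blMetaTags.Items.Add(new ListItem(string.Format("{0}: {1}",
            metaTag.Attributes["name"].Value,
            metaTag.Attributes["content"].Value)));
    }
}
```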
And that's all there is to it! No messy regular expressions, no tangle of string parsing method calls, but rather a concise, readable syntax for accessing the HTML document's contents.
The following screen shot shows the above code snippet in action. Here, the user enters a URL into the textbox and clicks the Get Meta Tags button. Clicking this button causes a postback, and on postback the code we examined above is executed. Namely, the Html Agility Pack is used to download the content from the specified URL, and the SelectNodes method is used to get back all <meta> tags. Those <meta> tags with name and content attributes are displayed in a bulleted list.
Listing the Links on a Remote Web Page
The previous demo showed how to use the SelectNodes method and an XPath expression to search the document for a particular set of nodes. Another approach is to use LINQ. The HtmlNode class's methods that return a collection of nodes - such as Descendants - return the collection as IEnumerable<HtmlNode> objects. If you are familiar with LINQ you know that LINQ is set up to work with any object of type IEnumerable<T>. Consequently, we can use LINQ to query an HTML document's nodes.
To demonstrate accessing node information using LINQ, I created a demo that retrieves the text and href values for all hyperlinks (<a> elements) on a page. The code starts out the same way as the previous demo - create an HtmlWeb object and call its Load method. But then it uses the Descendants method and LINQ's query syntax to get all of the hyperlinks on the page. More specifically, it gets all <a> tags on the page that have an href attribute and contain something other than white-space for their inner text, and returns a new, anonymous type that has two properties: the hyperlink's URL and its text.
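The query might look something like the following (the anonymous type's property names, Url and Text, are my own choice; the original snippet was not preserved):

```csharp
// Query every <a> descendant that has an href attribute and non-blank text.
var linksOnPage = from link in htmlDoc.DocumentNode.Descendants("a")
                  where link.Attributes["href"] != null &&
                        link.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = link.Attributes["href"].Value,
                      Text = link.InnerText.Trim()
                  };
```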
At this point you can enumerate over linksOnPage to see all of the links on the specified web page. In the demo available for download, I display this information by binding linksOnPage to a ListView control. The ListView's template is simple enough - it displays each item in a bulleted list:
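A minimal template along those lines (the control ID and the bound property names Url and Text are my assumptions):

```aspx
<asp:ListView ID="lvLinks" runat="server">
    <LayoutTemplate>
        <ul>
            <asp:PlaceHolder ID="itemPlaceholder" runat="server" />
        </ul>
    </LayoutTemplate>
    <ItemTemplate>
        <li><%# Eval("Text") %> - <%# Eval("Url") %></li>
    </ItemTemplate>
</asp:ListView>
```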
The screen shot below shows the output when run on the 4GuysFromRolla.com homepage.
Modifying and Saving an HTML Document
The previous two demos illustrated how the Html Agility Pack takes HTML from a remote website and constructs a DOM that can be read from, but it's also possible to modify the DOM and save the updated DOM to disk (or to any stream, for that matter). This third and final demo starts like the other two - the user is prompted to enter a URL and that HTML document is downloaded. Once downloaded, it is modified in two ways:
- A new element is constructed programmatically and added as the first child of the <body> element.
- All of the hyperlinks in the page are updated so that, when clicked, they are opened in another window. This is accomplished by setting each link's target attribute to _blank.
This demo starts the same way as the previous two, by creating an HtmlWeb object and calling its Load method. Next, the <body> element is accessed. This is done using LINQ, but this time using the extension methods (rather than the query syntax). In English, this lookup says, "From all of the descendants of the document node, give me the first node whose name equals 'body'. If no such node exists, give me back null."
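That lookup can be written as follows (assuming the htmlDoc variable from the download step and a using System.Linq directive):

```csharp
// Find the first <body> element, or null if the document has none.
HtmlNode body = htmlDoc.DocumentNode
                       .Descendants()
                       .FirstOrDefault(node => node.Name == "body");
```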
If there is a <body> element then we next need to create an HTML element and add it as the first child element of the <body>. The following code creates a new HTML element node (messageElement), adds a style attribute, specifies the new element's name ("div"), and then assigns its inner HTML. After this, the new element is inserted at the beginning of the <body> element's child nodes.
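A sketch of that step (the style value and message text are placeholders of my own; the demo's actual values were not preserved):

```csharp
if (body != null)
{
    // Create the new <div>, give it a style attribute, and assign its markup.
    HtmlNode messageElement = htmlDoc.CreateElement("div");
    messageElement.Attributes.Add("style", "background-color: #ffffcc; padding: 10px;");
    messageElement.InnerHtml = "This page was modified by the Html Agility Pack.";

    // Insert it as the first child of the <body> element.
    body.PrependChild(messageElement);
}
```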
Next, the SelectNodes method is used to retrieve all <a> tags that have an href attribute specified. Presuming any such tags were found, they are enumerated. For each link a check is performed to see if there is already a target attribute defined. If not, the attribute is added with a value of _blank; if the target attribute already exists, it is set to _blank.
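That logic might be implemented like so (a sketch; the "//a[@href]" XPath expression selects every <a> element with an href attribute):

```csharp
HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        if (link.Attributes["target"] == null)
            link.Attributes.Add("target", "_blank");    // No target - add one.
        else
            link.Attributes["target"].Value = "_blank"; // Overwrite the existing value.
    }
}
```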
At this point the document has been modified, but all of these modifications have occurred in memory. To save the modified document we call the HtmlDocument object's Save method, passing in the file name. In this demo I place the modified markup in the ~/ModifiedPages folder using a file name of the form guid.htm, where guid is a globally unique identifier.
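A sketch of the save step, run from an ASP.NET code-behind:

```csharp
// Save the modified DOM to ~/ModifiedPages/{guid}.htm, per the demo's naming scheme.
string virtualPath = string.Format("~/ModifiedPages/{0}.htm", Guid.NewGuid());
htmlDoc.Save(Server.MapPath(virtualPath));
```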
The following screen shot shows the contents of the saved, modified 4GuysFromRolla.com homepage. The big block of text at the top is the HTML element we added at the
start of the
<body>, and clicking on any link in the page opens the link in a new window. (The modified version, when viewed through a browser, has
many broken images and styling issues because the 4Guys homepage, like many other sites, uses relative paths for images and external resources. Because I didn't also
download the associated images and external resources, these are not found when viewing the modified page.)
If you do a View/Source on the modified web page you'll see that the HTML content we added and modified is reflected there. Here is the markup emitted by the
messageElement node we added:
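The emitted markup has this general shape (the attribute value and text shown are illustrative; they depend on what was assigned to messageElement):

```html
<div style="background-color: #ffffcc; padding: 10px;">This page was modified by the Html Agility Pack.</div>
```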
And here is the markup of one of the many links on the page. Note the presence of the
target="_blank" attribute - this isn't found in the original markup.
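One such modified link might look like this (the href value and link text are illustrative placeholders):

```html
<a href="..." target="_blank">link text</a>
```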