Published: Wednesday, January 12, 2011

Parsing HTML Documents with the Html Agility Pack

By Scott Mitchell


Introduction


Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.

The .NET Framework offers several classes for accessing data from a remote website, most notably the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf and String.Substring, or on regular expressions.

Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.)

This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this article. Read on to learn more!


Getting Started: Downloading and Using the Html Agility Pack


The Html Agility Pack is a free, open-source library that parses an HTML document and constructs a Document Object Model (DOM) that can be traversed manually or by using XPath expressions. (To use the Html Agility Pack you must be using ASP.NET version 3.5 or later.) In a nutshell, the Html Agility Pack makes it easy to examine an HTML document for particular content, and to extract or modify that markup.

The Html Agility Pack is wrapped inside a single assembly, HtmlAgilityPack.dll. To use the Html Agility Pack from your website you'll need to copy this assembly into your website's Bin folder. You can download the latest version of HtmlAgilityPack.dll from the Html Agility Pack project page; alternatively, you can download the demo available at the end of this article, which includes HtmlAgilityPack.dll version 1.4.0 in the Bin folder.

With the Html Agility Pack assembly in the Bin folder you're ready to start downloading and parsing HTML documents. This article shows how to use the Html Agility Pack to perform three different HTML parsing tasks.

Listing the Meta Tags on a Remote Web Page


Screen scraping usually involves downloading the HTML for a specific web page and picking out particular pieces of information. This first demo shows how to use the Html Agility Pack to download a remote web page and enumerate the <meta> tags, displaying those <meta> tags that contain both a name and content attribute.

The Html Agility Pack contains a number of classes, all in the HtmlAgilityPack namespace. Therefore, start by adding a using statement (or Imports statement if you are using VB) to the top of your code-behind class:

using HtmlAgilityPack;

To download a web page from a remote server, use the HtmlWeb class's Load method, passing in the URL to download.

var webGet = new HtmlWeb();
var document = webGet.Load(url);

The Load method returns an HtmlDocument object. In the above code snippet we've assigned this returned object to the local variable document. The HtmlDocument class represents a complete HTML document and contains a DocumentNode property, which returns an HtmlNode object that represents the root node of the document.

The HtmlNode class has several germane properties worth noting. There are properties for traversing the DOM, including:

  • ParentNode,
  • ChildNodes,
  • NextSibling, and
  • PreviousSibling
There are properties for determining information about the node itself, such as:
  • Name - gets or sets the node's name. For HTML elements this property returns (or assigns) the name of the tag - "body" for the <body> tag, "p" for a <p> tag, and so on.
  • Attributes - returns the collection of attributes for this element, if any.
  • InnerHtml - gets or sets the HTML content within the node.
  • InnerText - returns the text within the node.
  • NodeType - indicates the type of the node. Can be Document, Element, Comment, or Text.
There are also methods for retrieving particular nodes relative to this one. For instance, the Ancestors method returns a collection of all ancestor nodes. And the SelectNodes method returns a collection of nodes that match a specified XPath expression.
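The traversal properties above can be combined into a short recursive walk. The following is a minimal sketch, not part of the downloadable demo: the DomWalker class and its Outline method are hypothetical names, and the markup is parsed from a string with HtmlDocument.LoadHtml rather than downloaded, so that the example stands on its own.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper (not from the article's demo) that lists the
// element nodes of a document, indented by their depth in the DOM.
public static class DomWalker
{
    public static List<string> Outline(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);   // parse markup from a string instead of a URL

        var lines = new List<string>();
        Walk(document.DocumentNode, 0, lines);
        return lines;
    }

    private static void Walk(HtmlNode node, int depth, List<string> lines)
    {
        // Record element nodes only; the document root, text nodes and
        // comments are visited but not recorded.
        if (node.NodeType == HtmlNodeType.Element)
        {
            lines.Add(new string(' ', depth * 2) + node.Name);
            depth++;
        }

        // ChildNodes is empty (not null) for leaf nodes, so no null check is needed.
        foreach (var child in node.ChildNodes)
            Walk(child, depth, lines);
    }
}
```

Calling DomWalker.Outline with a page's markup returns one line per element, which is a handy way to get a feel for how the Html Agility Pack has structured a document before writing queries against it.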

Given all of these methods and properties, there are a variety of ways you could get a list of all <meta> tags in the HTML document. For this demo I decided to use the SelectNodes method. The statement below calls the SelectNodes method of the document object's DocumentNode property, using the XPath expression "//meta", which returns all of the <meta> tags in the document.

var metaTags = document.DocumentNode.SelectNodes("//meta");

If there are no <meta> tags in the document then, at this point, metaTags will be null. But if there are one or more <meta> tags then metaTags will be a collection of matching HtmlNode objects. We can enumerate these matching nodes and display their attributes.

For More On XPath...
If you are not familiar with XPath then the syntax - //meta - may look a little Greek. XPath is a special syntax used to navigate through elements and attributes in an XML document. The expression "//meta" says, in English, "give me every node in the document that has the name 'meta', no matter where it appears in the DOM." (The leading // tells XPath to search the entire document rather than only the immediate children of the current node.) The XPath tutorial at w3schools.com offers a good overview of the XPath standard. If you are new to XPath or a bit rusty, you'll find the XPath Syntax tutorial invaluable.

The following foreach loop enumerates the items in metaTags (if it's not null) and checks that each tag has both a name and a content attribute. If both attributes exist, the <meta> tag information is emitted. (Note how the value of an attribute is accessed using the syntax tag.Attributes["attributeName"].Value.)

if (metaTags != null)
{
   foreach (var tag in metaTags)
   {
      if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
      {
         // ... output tag.Attributes["name"].Value and tag.Attributes["content"].Value ...
      }
   }
}

And that's all there is to it! No messy regular expressions, no tangle of string parsing method calls, but rather a concise, readable syntax for accessing the HTML document's contents.

The following screen shot shows the above code snippet in action. Here, the user enters a URL into the textbox and clicks the Get Meta Tags button. Clicking this button causes a postback and on postback the code we examined above is executed. Namely, the Html Agility Pack is used to download the content from the specified URL and the SelectNodes method is used to get back all <meta> tags. Those <meta> tags with name and content attributes are displayed in a bulleted list.

The meta tags on 4GuysFromRolla.com are listed.
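The attribute check performed inside the loop can also be expressed in the XPath expression itself. The sketch below is an alternative to the demo code, not the demo itself: MetaTagReader and Read are hypothetical names, the markup is supplied as a string via LoadHtml, and the results are collected in a dictionary (so a later <meta> tag with a repeated name overwrites an earlier one).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper (not from the article's demo) that returns the
// name/content pairs of a page's <meta> tags.
public static class MetaTagReader
{
    public static Dictionary<string, string> Read(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new Dictionary<string, string>();

        // The predicate keeps only <meta> tags carrying both attributes,
        // so the loop body no longer needs its own null checks.
        var metaTags = document.DocumentNode.SelectNodes("//meta[@name and @content]");
        if (metaTags == null)
            return result;   // SelectNodes returns null when nothing matches

        foreach (var tag in metaTags)
            result[tag.Attributes["name"].Value] = tag.Attributes["content"].Value;

        return result;
    }
}
```

Whether to filter in XPath or in C# is largely a matter of taste; the XPath version is shorter, while the explicit checks in the loop are easier to step through in a debugger.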

Listing the Links on a Remote Web Page


The previous demo showed how to use the SelectNodes method and an XPath expression to search the document for a particular set of nodes. Another approach is to use LINQ. The HtmlNode class's methods that return a collection of nodes - such as Ancestors and Descendants - return the collection as IEnumerable<HtmlNode> objects. If you are familiar with LINQ you are aware that LINQ is set up to work with any object of type IEnumerable<T>. Consequently, we can use LINQ to query an HTML document's nodes.

To demonstrate accessing node information using LINQ, I created a demo that retrieves the text and href values for all hyperlinks (<a> tags) on a page. The code starts out the same way as the previous demo - create an HtmlWeb object and call its Load method:

var webGet = new HtmlWeb();
var document = webGet.Load(url);

But then it uses the Descendants method of the document's DocumentNode, together with LINQ's query syntax, to get all of the hyperlinks on the page. More specifically, it gets all <a> tags on the page that have an href attribute and contain something other than white-space for their inner text, and returns a new, anonymous type that has two properties: Url and Text.

var linksOnPage = from lnks in document.DocumentNode.Descendants()
                  where lnks.Name == "a" &&
                       lnks.Attributes["href"] != null &&
                       lnks.InnerText.Trim().Length > 0
                  select new
                  {
                     Url = lnks.Attributes["href"].Value,
                     Text = lnks.InnerText
                  };

At this point you can enumerate over linksOnPage to see all of the links on the specified web page. In the demo available for download, I displayed this information by binding linksOnPage to a ListView control named lvLinks:

lvLinks.DataSource = linksOnPage;
lvLinks.DataBind();

The ListView's template is simple enough - it displays each item in a bulleted list:

<asp:ListView ID="lvLinks" runat="server">
   <LayoutTemplate>
      <ul>
         <asp:PlaceHolder runat="server" ID="itemPlaceholder" />
      </ul>
   </LayoutTemplate>
   
   <ItemTemplate>
      <li>
         <%# Eval("Text") %> - <%# Eval("Url") %>
      </li>
   </ItemTemplate>
</asp:ListView>

The screen shot below shows the output when run on the 4GuysFromRolla.com homepage.

A list of links found on the 4GuysFromRolla.com homepage.
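The same query can be written with LINQ's extension methods. The following sketch makes a few assumptions that go beyond the demo: LinkLister and Links are hypothetical names, the markup is parsed from a string, the Descendants overload that accepts an element name is used in place of the Name comparison, and HtmlEntity.DeEntitize converts entities such as &amp;amp; in the link text back into plain characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (not from the article's demo) that returns the
// href (Key) and text (Value) of each hyperlink in the markup.
public static class LinkLister
{
    public static List<KeyValuePair<string, string>> Links(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("a")                                   // only <a> elements
            .Where(a => a.Attributes["href"] != null            // must have an href
                     && a.InnerText.Trim().Length > 0)          // and visible text
            .Select(a => new KeyValuePair<string, string>(
                a.Attributes["href"].Value,
                HtmlEntity.DeEntitize(a.InnerText).Trim()))     // decode entities in the text
            .ToList();
    }
}
```

Because the decoded text is plain text rather than markup, it should be HTML encoded again (for example, with Server.HtmlEncode) before being written into a page.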

Modifying and Saving an HTML Document


The previous two demos illustrated how the Html Agility Pack takes HTML from a remote website and constructs a DOM that can be read from, but it's also possible to modify the DOM and save the updated DOM to disk (or to any stream, for that matter). This third and final demo starts like the other two - the user is prompted to enter a URL and that HTML document is downloaded. Once downloaded, it is modified in two ways:
  1. A new element is constructed programmatically and added as the first child of the <body> element, and
  2. All of the hyperlinks in the page are updated so that, when clicked, they are opened in another window. This is accomplished by setting each link's target attribute to _blank.
After the document has been modified it is saved to disk. (Of course, this is saved to the local file system - that is, to the file system of the computer running the code - and not to the file system of the remote web server from where the HTML document was downloaded.)

This demo starts the same way as the previous two, by creating an HtmlWeb object and calling its Load method:

var webGet = new HtmlWeb();
var document = webGet.Load(url);

Next, the <body> element is accessed. This is done using LINQ, but this time using the extension methods rather than the query syntax. The line of code below says, in English, "From all of the descendants of the document node, give me the first node whose name equals 'body'. If no such node exists, give me back the value null."

var body = document.DocumentNode.Descendants()
                                .Where(n => n.Name == "body")
                                .FirstOrDefault();

If there is a <body> element then we next need to create an HTML element and add it as the first child element of the <body>. The following code creates a new HTML element node (messageElement), adds a style attribute, specifies the new element's name ("div"), and then assigns its inner HTML. After this, the new element is inserted at the beginning of body's ChildNodes collection.

if (body != null)
{
   var messageElement = new HtmlNode(HtmlNodeType.Element, document, 0);
   messageElement.Attributes.Add("style", "width:95%;border:solid black 2px;background-color:#ffc;font-size:xx-large;text-align:center");
   messageElement.Name = "div";
   messageElement.InnerHtml = "<p>Hello! This page was modified by the Html Agility Pack!</p><p>Click on a link below... it should open in a new window!</p>";

   body.ChildNodes.Insert(0, messageElement);
}

Next, the SelectNodes method is used to retrieve all <a> tags that have an href attribute specified. Presuming any such tags were found, they are enumerated. For each link a check is performed to see if there is already a target attribute defined. If not, the target attribute is added with a value of _blank. If the target attribute already exists it is set to _blank.

var linksThatDoNotOpenInNewWindow = document.DocumentNode.SelectNodes("//a[@href]");
if (linksThatDoNotOpenInNewWindow != null)
{
   foreach (var link in linksThatDoNotOpenInNewWindow)
      if (link.Attributes["target"] == null)
         link.Attributes.Add("target", "_blank");
      else
         link.Attributes["target"].Value = "_blank";
}

At this point the document has been modified, but all of these modifications have occurred in memory. To save the modified document we call the document object's Save method, passing in the file name. In this demo I place the modified markup in the ~/ModifiedPages folder using a file name of the form guid.htm where guid is a globally unique identifier (e.g., a value like 02cdb8d8-3a01-4076-baaa-f7a8bd6b22ea).

var fileName = string.Format("~/ModifiedPages/{0}.htm", Guid.NewGuid().ToString());
document.Save(Server.MapPath(fileName));

The following screen shot shows the contents of the saved, modified 4GuysFromRolla.com homepage. The big block of text at the top is the HTML element we added at the start of the <body>, and clicking on any link in the page opens the link in a new window. (The modified version, when viewed through a browser, has many broken images and styling issues because the 4Guys homepage, like many other sites, uses relative paths for images and external resources. Because I didn't also download the associated images and external resources, these are not found when viewing the modified page.)

A modified version of the 4GuysFromRolla.com homepage has been saved.

If you do a View/Source on the modified web page you'll see that the HTML content we added and modified is reflected there. Here is the markup emitted by the messageElement node we added:

<div style="width:95%;border:solid black 2px;background-color:#ffc;font-size:xx-large;text-align:center"><p>Hello! This page was modified by the Html Agility Pack!</p><p>Click on a link below... it should open in a new window!</p></div>

And here is the markup of one of the many links on the page. Note the presence of the target="_blank" attribute - this isn't found in the original markup.

<a href="http://www.4guysfromrolla.com/articles/122910-1.aspx" class="headlines" target="_blank">2010's Most Popular Articles</a>
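As noted earlier, the modified DOM can be saved to any stream, not just to a file. The sketch below illustrates this by writing the document to a StringWriter and returning the markup as a string. It is not the demo's code: LinkRetargeter and OpenLinksInNewWindow are hypothetical names, and it assumes the version of the Html Agility Pack in use exposes HtmlNode.SetAttributeValue, which adds the attribute when it is missing and overwrites it otherwise.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper (not from the article's demo) that sets
// target="_blank" on every link and returns the modified markup.
public static class LinkRetargeter
{
    public static string OpenLinksInNewWindow(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var links = document.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)   // null when the page has no matching links
        {
            foreach (var link in links)
                link.SetAttributeValue("target", "_blank");   // add or overwrite
        }

        // Save accepts a TextWriter, so the markup can be captured in memory
        // instead of being written to the file system.
        using (var writer = new StringWriter())
        {
            document.Save(writer);
            return writer.ToString();
        }
    }
}
```

Returning a string is convenient when the modified markup is to be sent straight back to the browser or stored in a database rather than written to the ~/ModifiedPages folder.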

Happy Programming!

  • By Scott Mitchell


    Attachments:

  • Download the Demo Code Used in this Article

    Further Reading

  • Html Agility Pack Project Page
  • XPath Tutorial | XPath Syntax Tutorial
  • Screen Scrapes in ASP.NET
  • A Deeper Look at Performing HTTP Requests in an ASP.NET Page

