What is the difference between HtmlDocument and HtmlNode in Html Agility Pack?

The Html Agility Pack (HAP) is a .NET library for parsing HTML, commonly used in web scraping. Its API lets you work with a document as a whole or with individual elements. Two of its core classes are HtmlDocument and HtmlNode, and understanding how they differ is essential for using HAP effectively.

HtmlDocument

The HtmlDocument class represents an entire HTML document. It is the entry point for parsing HTML content and holds the complete parsed node tree. You reach that tree through the document's DocumentNode property, which is itself an HtmlNode and the root from which you navigate and query everything else.

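When you load HTML content into an HtmlDocument, you create an in-memory representation of the entire page, which can then be traversed and manipulated. The class loads HTML from a string with LoadHtml, and from a file path, Stream, or TextReader with Load. Fetching a page directly from a URL is handled by the separate HtmlWeb class, whose Load method returns an HtmlDocument.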

Here's an example of how to load HTML into an HtmlDocument:

var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml("<html><body><p>Hello, World!</p></body></html>");
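The other loading paths look much the same. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and the code runs as a C# top-level program; a StringReader stands in for a file or network stream so the example needs no external resources, and the variable names are illustrative only.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

const string html = "<html><body><p>Hello, World!</p></body></html>";

// From a string, as in the example above.
var fromString = new HtmlDocument();
fromString.LoadHtml(html);

// From a TextReader; Load also has overloads for a file path or a Stream.
var fromReader = new HtmlDocument();
using (var reader = new StringReader(html))
{
    fromReader.Load(reader);
}

// Either way, the parsed tree is reached through DocumentNode.
Console.WriteLine(fromString.DocumentNode.SelectSingleNode("//p").InnerText);
Console.WriteLine(fromReader.DocumentNode.SelectSingleNode("//p").InnerText);
```

Whichever source you use, the result is the same kind of object, so the rest of your code does not need to know where the HTML came from.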

HtmlNode

The HtmlNode class represents a single node within the HTML document. A node is not necessarily an element such as an <a> tag or a <div> block: it can also be a text node, a comment node, or the document root itself, and its NodeType property tells you which. Each HtmlNode can have child nodes, creating a hierarchy that mirrors the structure of the HTML document.
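To make the node kinds concrete, here is a small sketch that lists the children of a <div> together with their types. It assumes the HtmlAgilityPack NuGet package is referenced and uses a made-up HTML fragment purely for illustration.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div><!-- note --><a href=\"/home\">Home</a> text</div>");

// The root of the tree is itself an HtmlNode, of type Document.
Console.WriteLine(doc.DocumentNode.NodeType);

var div = doc.DocumentNode.SelectSingleNode("//div");
foreach (HtmlNode child in div.ChildNodes)
{
    // NodeType distinguishes Element, Text and Comment nodes.
    Console.WriteLine($"{child.NodeType}: {child.Name}");
}
```

Checking NodeType before treating a node as an element is a sensible habit when iterating ChildNodes, since whitespace and comments appear there as nodes too.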

HtmlNode objects are used to manipulate individual parts of the document. For example, you can change an element's attributes or content, or remove the element altogether. You can also navigate the tree from any node through its ParentNode, PreviousSibling, NextSibling, and ChildNodes properties.

Here's an example of how to access and manipulate an HtmlNode:

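// Assume htmlDoc is an already loaded HtmlDocument as shown above
// SelectSingleNode returns the first match for the XPath expression, or null if there is none
HtmlAgilityPack.HtmlNode pNode = htmlDoc.DocumentNode.SelectSingleNode("//p");
if (pNode != null)
{
    // Replace the paragraph's content
    pNode.InnerHtml = "Goodbye, World!";
}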

In this example, we call SelectSingleNode on the DocumentNode property of HtmlDocument to find the first paragraph (<p>) element in the document. Because the method returns null when nothing matches, we check the result before changing the node's inner HTML.
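Navigation and other edits work through the same node object. The sketch below, which assumes the HtmlAgilityPack NuGet package is referenced and uses an illustrative document with an extra heading, reads a node's parent and sibling, sets an attribute, removes the sibling, and then serializes the document to show that the changes are reflected in it.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><h1>Title</h1><p>Hello, World!</p></body></html>");

HtmlNode p = doc.DocumentNode.SelectSingleNode("//p");
if (p != null)
{
    // Navigation: every node knows its parent and siblings.
    Console.WriteLine(p.ParentNode.Name);
    Console.WriteLine(p.PreviousSibling.Name);

    // Manipulation: set an attribute, then remove the sibling heading.
    p.SetAttributeValue("class", "greeting");
    p.PreviousSibling.Remove();
}

// Changes made through nodes show up when the document is serialized.
Console.WriteLine(doc.DocumentNode.OuterHtml);
```

Note that the edits are made on HtmlNode instances, while the HtmlDocument only supplies the root and the final serialized output.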

Summary

  • HtmlDocument represents the entire HTML document and is the starting point for parsing and manipulating the HTML content.
  • HtmlNode represents a single node within the document, which could be an element, text, or comment. It is used for element-level manipulation.

Together, the HtmlDocument and HtmlNode classes let you navigate and edit HTML content, whether for web scraping or for any other situation where you need to manipulate HTML programmatically in .NET.
