The Html Agility Pack (HAP) is a .NET library for parsing HTML documents and performing web scraping tasks. It provides a flexible and robust API for manipulating HTML, both as a whole document and at the level of individual elements. Two of its core classes are HtmlDocument and HtmlNode. Understanding the difference between them is essential for using HAP effectively.
HtmlDocument
The HtmlDocument class represents an entire HTML document. It serves as the entry point for parsing HTML content and provides access to the document's overall structure. An instance of HtmlDocument contains the complete DOM (Document Object Model) tree, exposed through its DocumentNode property, and lets you navigate and query the document through its methods and properties.
When you load HTML content into an HtmlDocument, you create a representation of the entire web page, which can then be traversed and manipulated. The HtmlDocument class can load HTML from a string (LoadHtml) or from a file, stream, or TextReader (Load). Pages fetched over HTTP are usually loaded through the companion HtmlWeb class, which returns an HtmlDocument.
Here's an example of how to load HTML into an HtmlDocument:
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml("<html><body><p>Hello, World!</p></body></html>");
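The other loading paths can be sketched in the same way. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class name LoadingSketch, its helper methods, and the temporary file are illustrative and not part of the library; only LoadHtml, Load, and HtmlWeb.Load are HAP calls.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class LoadingSketch
{
    // Parse markup that is already in memory as a string.
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Parse markup from any readable Stream (file, network, memory).
    public static HtmlDocument FromStream(Stream stream)
    {
        var doc = new HtmlDocument();
        doc.Load(stream, Encoding.UTF8);
        return doc;
    }

    // Parse markup stored in a file on disk.
    public static HtmlDocument FromFile(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        return doc;
    }

    // Download and parse a page. Requires network access, so Main does not call it.
    public static HtmlDocument FromUrl(string url)
    {
        return new HtmlWeb().Load(url);
    }

    public static void Main()
    {
        const string html = "<html><body><p>Hello, World!</p></body></html>";

        // An in-memory stream stands in for any other Stream source.
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            Console.WriteLine(FromStream(stream).DocumentNode.SelectSingleNode("//p").InnerText);
        }

        // A scratch file (hypothetical path chosen by the OS) stands in for a saved page.
        string tempPath = Path.GetTempFileName();
        File.WriteAllText(tempPath, html);
        Console.WriteLine(FromFile(tempPath).DocumentNode.SelectSingleNode("//p").InnerText);
        File.Delete(tempPath);
    }
}
```

Whichever source is used, the result is an ordinary HtmlDocument, so the querying code that follows does not need to know where the markup came from.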
HtmlNode
The HtmlNode class represents a single node within the HTML document, such as an <a> tag, a <div> block, or a text node. An HtmlNode can be an element, a comment, or a text node, and the document root itself is also a node. The NodeType property tells you which kind you have. Each HtmlNode can have child nodes, creating a hierarchy that mirrors the structure of the HTML document.
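To make the different node kinds concrete, the following sketch walks every node below the document root and labels it with its NodeType and Name. It assumes the HtmlAgilityPack NuGet package is referenced. NodeTypeSketch and DescribeNodes are hypothetical names introduced for illustration, and the sample markup is made up.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeTypeSketch
{
    // Returns one "NodeType:Name" label per node below the document root.
    public static string[] DescribeNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants()
                  .Select(n => $"{n.NodeType}:{n.Name}")
                  .ToArray();
    }

    public static void Main()
    {
        // Sample markup with an element, a comment, and a text node.
        const string html = "<div><!-- note --><a href=\"/x\">link</a></div>";

        foreach (string label in DescribeNodes(html))
        {
            Console.WriteLine(label);
        }
    }
}
```

Elements report their tag name, while comment and text nodes report placeholder names, which is why checking NodeType is the more reliable way to tell them apart.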
HtmlNode objects are used to manipulate individual elements within the document. For example, you can change an element's attributes or inner text, or remove the element altogether. You can also use HtmlNode instances to navigate the DOM tree by accessing parent, sibling, or child nodes.
Here's an example of how to access and manipulate an HtmlNode:
// Assume htmlDoc is an already loaded HtmlDocument as shown above
// (the unqualified HtmlNode type requires "using HtmlAgilityPack;")
HtmlNode pNode = htmlDoc.DocumentNode.SelectSingleNode("//p");
if (pNode != null)
{
    pNode.InnerHtml = "Goodbye, World!";
}
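In this example, we call the SelectSingleNode method on the DocumentNode property of HtmlDocument with the XPath expression //p to find the first paragraph (<p>) element in the document. SelectSingleNode returns null when nothing matches, hence the null check. We then replace the paragraph's inner HTML content.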
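The other operations described in this section (changing attributes, moving to parent or sibling nodes, and removing an element) can be sketched along the same lines. This is an illustrative sketch, assuming the HtmlAgilityPack NuGet package is referenced. NodeManipulationSketch, RewriteFirstLink, the sample markup, and the https://example.com prefix are all hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeManipulationSketch
{
    // Prefixes the first link's href with baseUrl, removes the node that
    // immediately follows the link, and returns the resulting markup.
    public static string RewriteFirstLink(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");
        if (link == null)
        {
            return doc.DocumentNode.OuterHtml; // nothing to change
        }

        // Read an attribute (with a fallback value) and write it back.
        string oldHref = link.GetAttributeValue("href", string.Empty);
        link.SetAttributeValue("href", baseUrl + oldHref);

        // Navigate: the parent element and the next sibling of the link.
        HtmlNode parent = link.ParentNode;
        HtmlNode next = link.NextSibling;
        Console.WriteLine($"Parent: {parent.Name}, next sibling: {next?.Name ?? "(none)"}");

        // Remove the sibling from the tree, if there is one.
        next?.Remove();

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        const string html = "<div><a href=\"/docs\">Docs</a><span>ad</span></div>";
        Console.WriteLine(RewriteFirstLink(html, "https://example.com"));
    }
}
```

Because every edit goes through the HtmlNode objects that belong to the document, the changes are visible as soon as the markup is read back from DocumentNode.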
Summary
HtmlDocument represents the entire HTML document and is the starting point for parsing and manipulating the HTML content.
HtmlNode represents a single node within the document, which could be an element, text, or comment. It is used for element-level manipulation.
Together, the HtmlDocument and HtmlNode classes provide a powerful way to navigate and edit HTML content, whether for web scraping or for any other situation where you need to manipulate HTML programmatically in .NET.