What is the HtmlDocument Class and How Do I Use It?
The HtmlDocument class is the core component of Html Agility Pack, a powerful .NET library for parsing and manipulating HTML documents. This class provides a comprehensive DOM (Document Object Model) representation of HTML content, allowing developers to navigate, query, and modify HTML structures programmatically without requiring a web browser.
Understanding the HtmlDocument Class
The HtmlDocument class serves as the entry point for all HTML parsing operations in Html Agility Pack. It represents an entire HTML document and provides methods to load HTML content from strings, files, and streams; pages at a URL are fetched through the companion HtmlWeb class, which returns an HtmlDocument. Once loaded, the document exposes a hierarchical tree of HTML nodes that can be traversed and manipulated.
Key Features
- Fault-tolerant parsing: Handles malformed HTML gracefully
- XPath support: Query elements using XPath expressions
- CSS selector support: Select elements using CSS-like selectors through add-on packages such as Fizzler (the core library itself queries with XPath and LINQ)
- Document modification: Add, remove, or modify HTML elements
- Multiple input sources: Load from strings, files, URLs, or streams
Basic Usage and Setup
Installation
First, install Html Agility Pack via NuGet Package Manager:
Install-Package HtmlAgilityPack
Or using .NET CLI:
dotnet add package HtmlAgilityPack
Creating an HtmlDocument Instance
using HtmlAgilityPack;
// Create a new HtmlDocument instance
var doc = new HtmlDocument();
Loading HTML Content
Loading from String
The most common scenario involves parsing HTML content from a string:
var html = @"
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div class='container'>
<h1>Welcome to My Site</h1>
<p>This is a sample paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</div>
</body>
</html>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Loading from File
var doc = new HtmlDocument();
doc.Load("path/to/your/file.html");
Loading from URL
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
Loading from Stream
using var stream = new FileStream("document.html", FileMode.Open);
var doc = new HtmlDocument();
doc.Load(stream);
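When the bytes carry no reliable encoding information, it can help to state the encoding explicitly. The following is a minimal sketch, assuming UTF-8 input: the in-memory stream is only a stand-in for a real file or network stream, and the Load(Stream, Encoding) overload is used in place of automatic detection.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

// Stand-in for a real file or network stream: UTF-8 bytes held in memory.
var bytes = Encoding.UTF8.GetBytes(
    "<html><head><title>Café Menu</title></head><body></body></html>");

using var stream = new MemoryStream(bytes);

var doc = new HtmlDocument();
// Name the encoding explicitly instead of relying on detection.
doc.Load(stream, Encoding.UTF8);

var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
Console.WriteLine(title);
```

If the accented character comes out garbled in your own data, the source is probably not UTF-8 and a different Encoding should be passed.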
Navigating the Document Structure
Accessing Document Properties
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Access the document node (root)
var documentNode = doc.DocumentNode;
// Get all child nodes
var childNodes = documentNode.ChildNodes;
// Access specific elements
var htmlNode = doc.DocumentNode.SelectSingleNode("//html");
var headNode = doc.DocumentNode.SelectSingleNode("//head");
var bodyNode = doc.DocumentNode.SelectSingleNode("//body");
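The tree exposed by DocumentNode can also be walked recursively through ChildNodes. The sketch below collects an indented outline of element names; CollectOutline is a hypothetical helper written for this illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><div><h1>Title</h1><p>Text</p></div></body></html>");

var outline = new List<string>();
CollectOutline(doc.DocumentNode, 0, outline);

foreach (var line in outline)
{
    Console.WriteLine(line);
}

// Hypothetical helper: adds one entry per element, indented two spaces per level.
static void CollectOutline(HtmlNode node, int depth, List<string> lines)
{
    foreach (var child in node.ChildNodes)
    {
        // Skip text and comment nodes; only elements are listed.
        if (child.NodeType != HtmlNodeType.Element) continue;

        lines.Add(new string(' ', depth * 2) + child.Name);
        CollectOutline(child, depth + 1, lines);
    }
}
```

Filtering on NodeType matters because whitespace between tags is stored as text nodes alongside the elements.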
Finding Elements
Using XPath Expressions
// Find single element
var titleNode = doc.DocumentNode.SelectSingleNode("//title");
var titleText = titleNode?.InnerText;
// Find multiple elements (SelectNodes returns null when nothing matches)
var listItems = doc.DocumentNode.SelectNodes("//li") ?? new HtmlNodeCollection(null);
foreach (var item in listItems)
{
    Console.WriteLine(item.InnerText);
}
// Find elements with specific attributes
var containerDiv = doc.DocumentNode.SelectSingleNode("//div[@class='container']");
Using LINQ to XML Style Queries
// Find elements by tag name
var paragraphs = doc.DocumentNode.Descendants("p");
foreach (var p in paragraphs)
{
    Console.WriteLine(p.InnerText);
}
// Find elements with specific attributes (Where requires using System.Linq;)
var divsWithClass = doc.DocumentNode
    .Descendants("div")
    .Where(div => div.GetAttributeValue("class", "") == "container");
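The CSS-like selectors listed under Key Features are typically provided by an add-on package rather than by the core library. The sketch below assumes the Fizzler.Systems.HtmlAgilityPack NuGet package is installed; its QuerySelectorAll extension method is not part of Html Agility Pack itself.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // assumption: separate add-on package

var doc = new HtmlDocument();
doc.LoadHtml("<div class='container'><ul><li>Item 1</li><li>Item 2</li></ul></div>");

// QuerySelectorAll is an extension method supplied by Fizzler.
var items = doc.DocumentNode
    .QuerySelectorAll("div.container li")
    .Select(li => li.InnerText)
    .ToList();

Console.WriteLine(string.Join(", ", items));
```

The selector div.container li expresses the same query as the XPath //div[@class='container']//li for this sample markup.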
Extracting Data
Getting Text Content
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Get inner text (without HTML tags)
var heading = doc.DocumentNode.SelectSingleNode("//h1");
var headingText = heading.InnerText; // "Welcome to My Site"
// Get inner HTML (with HTML tags)
var containerDiv = doc.DocumentNode.SelectSingleNode("//div[@class='container']");
var containerHtml = containerDiv.InnerHtml;
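Text extracted with InnerText may still contain HTML entities such as &amp;amp; exactly as they appear in the markup. A minimal sketch of decoding them with HtmlEntity.DeEntitize follows; the sample markup is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<p>Fish &amp; Chips &lt;daily&gt;</p>");

var paragraph = doc.DocumentNode.SelectSingleNode("//p");

// Text as returned by the parser; entities may still be encoded here.
var raw = paragraph.InnerText;

// DeEntitize converts entities such as &amp; back to plain characters.
var decoded = HtmlEntity.DeEntitize(raw);

Console.WriteLine(raw);
Console.WriteLine(decoded);
```

Decoding is usually worth doing before storing or comparing scraped text, so that values match what a reader sees in the browser.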
Getting Attribute Values
// Get attribute value
var divClass = containerDiv.GetAttributeValue("class", "default-value");
// Check if attribute exists
var hasId = containerDiv.Attributes.Contains("id");
// Get all attributes
var attributes = containerDiv.Attributes;
foreach (var attr in attributes)
{
    Console.WriteLine($"{attr.Name}: {attr.Value}");
}
Modifying HTML Content
Adding Elements
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Create a new element
var newParagraph = HtmlNode.CreateNode("<p>This is a new paragraph.</p>");
// Add to existing element
var containerDiv = doc.DocumentNode.SelectSingleNode("//div[@class='container']");
containerDiv.AppendChild(newParagraph);
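AppendChild always places the new node last. When position matters, a node can be built with the document's factory methods and inserted relative to a sibling. The following sketch uses made-up markup and shows InsertBefore and PrependChild.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div class='container'><h1>Welcome</h1><ul><li>Item 1</li></ul></div>");

var container = doc.DocumentNode.SelectSingleNode("//div[@class='container']");
var list = container.SelectSingleNode("./ul");

// Build a node through the document's factory methods.
var intro = doc.CreateElement("p");
intro.SetAttributeValue("class", "intro");
intro.AppendChild(doc.CreateTextNode("Inserted before the list."));

// Place it immediately before the <ul> rather than at the end.
container.InsertBefore(intro, list);

// Add another list item at the start of the list.
list.PrependChild(HtmlNode.CreateNode("<li>Item 0</li>"));

Console.WriteLine(container.OuterHtml);
```

CreateElement with SetAttributeValue avoids assembling markup strings by hand, which can be safer when the values come from user input.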
Modifying Existing Elements
// Change text content
var heading = doc.DocumentNode.SelectSingleNode("//h1");
heading.InnerHtml = "Updated Heading";
// Change attribute values
heading.SetAttributeValue("class", "main-heading");
// Add new attributes
heading.SetAttributeValue("id", "page-title");
Removing Elements
// Remove specific element
var paragraph = doc.DocumentNode.SelectSingleNode("//p");
paragraph.Remove();
// Remove all elements of a type (SelectNodes returns null when nothing matches)
var listItems = doc.DocumentNode.SelectNodes("//li") ?? new HtmlNodeCollection(null);
foreach (var item in listItems)
{
    item.Remove();
}
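Remove() discards an element together with everything inside it. To drop only the wrapper and keep its contents, RemoveChild accepts a keepGrandChildren flag. The sketch below, on made-up markup, unwraps a span and then serializes the modified document with Save.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<p>Keep <span>this text</span> but drop the span.</p>");

var span = doc.DocumentNode.SelectSingleNode("//span");

// Passing true for keepGrandChildren removes the <span> element itself
// while moving its children up into the parent <p>.
span.ParentNode.RemoveChild(span, true);

// Serialize the modified document to a string.
using var writer = new StringWriter();
doc.Save(writer);
var result = writer.ToString();

Console.WriteLine(result);
```

Save also has overloads that write to a file path or stream, which is the usual way to persist changes made to the tree.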
Advanced Usage Patterns
Error Handling and Validation
try
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Check for parse errors (Any() requires using System.Linq;)
    if (doc.ParseErrors != null && doc.ParseErrors.Any())
    {
        foreach (var error in doc.ParseErrors)
        {
            Console.WriteLine($"Parse Error: {error.Reason} at line {error.Line}");
        }
    }
    // Validate element existence before accessing
    var element = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
    if (element != null)
    {
        // Safe to access element properties
        var content = element.InnerText;
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Error processing HTML: {ex.Message}");
}
Working with Forms
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Find form elements (SelectNodes returns null when the page has no forms)
var forms = doc.DocumentNode.SelectNodes("//form") ?? new HtmlNodeCollection(null);
foreach (var form in forms)
{
    var action = form.GetAttributeValue("action", "");
    var method = form.GetAttributeValue("method", "GET");
    Console.WriteLine($"Form: {action} ({method})");
    // Find input fields inside this form
    var inputs = form.SelectNodes(".//input");
    if (inputs != null)
    {
        foreach (var input in inputs)
        {
            var name = input.GetAttributeValue("name", "");
            var type = input.GetAttributeValue("type", "text");
            var value = input.GetAttributeValue("value", "");
            Console.WriteLine($" Input: {name} ({type}) = {value}");
        }
    }
}
Extracting Links and Images
// Extract all links
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        var url = link.GetAttributeValue("href", "");
        var text = link.InnerText.Trim();
        Console.WriteLine($"Link: {text} -> {url}");
    }
}
// Extract all images
var images = doc.DocumentNode.SelectNodes("//img");
if (images != null)
{
    foreach (var img in images)
    {
        var src = img.GetAttributeValue("src", "");
        var alt = img.GetAttributeValue("alt", "");
        Console.WriteLine($"Image: {alt} -> {src}");
    }
}
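Extracted href values are often relative to the page they came from. A minimal sketch of resolving them to absolute URLs with System.Uri follows; the base address and the sample links are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Assumed address the HTML was downloaded from; replace with the real page URL.
var baseUri = new Uri("https://example.com/articles/index.html");

var doc = new HtmlDocument();
doc.LoadHtml(
    "<a href='/about'>About</a>" +
    "<a href='page2.html'>Next</a>" +
    "<a href='https://other.example/x'>External</a>");

var absoluteLinks = new List<string>();
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var href = anchor.GetAttributeValue("href", "");
        // Uri.TryCreate resolves relative references against the base
        // and leaves absolute URLs unchanged.
        if (Uri.TryCreate(baseUri, href, out var absolute))
        {
            absoluteLinks.Add(absolute.AbsoluteUri);
        }
    }
}

foreach (var link in absoluteLinks)
{
    Console.WriteLine(link);
}
```

Using TryCreate instead of the Uri constructor means a malformed href is skipped rather than throwing.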
Integration with Web Scraping Workflows
The HtmlDocument class fits into larger web scraping workflows as the parsing step. Html Agility Pack does not execute JavaScript, so content that a page renders in the browser will be missing from the HTML it parses. For JavaScript-heavy websites, render the page first with a headless browser or another browser automation tool, then pass the rendered HTML to HtmlDocument.
For more robust loading, apply the same timeout and retry principles used in browser automation to the HTTP requests that fetch the HTML: set a request timeout, retry transient failures, and parse only once a response has arrived.
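One way to apply retry handling when loading HTML is sketched below. LoadWithRetryAsync is a hypothetical helper, not part of Html Agility Pack, and the attempt count and delay are arbitrary. The download is passed in as a delegate; here it is a simulated source that fails twice, so the sketch runs without a network connection. In real code the delegate could be () => httpClient.GetStringAsync(url).

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Simulated flaky source: fails twice, then returns a page.
var calls = 0;
Func<Task<string>> fetchHtml = () =>
{
    calls++;
    if (calls < 3) throw new HttpRequestException("simulated transient failure");
    return Task.FromResult("<html><head><title>Loaded</title></head></html>");
};

var page = await LoadWithRetryAsync(fetchHtml, 3, TimeSpan.FromMilliseconds(100));
var pageTitle = page.DocumentNode.SelectSingleNode("//title")?.InnerText;
Console.WriteLine($"{pageTitle} after {calls} calls");

// Hypothetical helper: retries the download, then parses the result.
static async Task<HtmlDocument> LoadWithRetryAsync(
    Func<Task<string>> fetch, int attempts, TimeSpan delay)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            var html = await fetch();
            var parsed = new HtmlDocument();
            parsed.LoadHtml(html);
            return parsed;
        }
        // HttpClient surfaces timeouts as TaskCanceledException.
        // On the last attempt the filter is false and the exception propagates.
        catch (Exception ex) when (
            (ex is HttpRequestException || ex is TaskCanceledException)
            && attempt < attempts)
        {
            // Wait longer after each failed attempt before trying again.
            await Task.Delay(TimeSpan.FromMilliseconds(delay.TotalMilliseconds * attempt));
        }
    }
}
```

Keeping the download separate from the parsing step also makes it straightforward to swap in a headless browser as the HTML source later.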
Performance Considerations
Memory Management
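// HtmlDocument does not implement IDisposable, so it cannot go in a using statement.
// Dispose the streams you load from; the garbage collector reclaims the document itself.
using var stream = File.OpenRead("document.html");
var doc = new HtmlDocument();
doc.Load(stream);
// The stream is closed automatically at the end of the enclosing scope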
Optimizing Large Documents
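// HtmlDocument builds the full DOM in memory; it does not stream the input
var doc = new HtmlDocument();
// Parsing options must be set before loading
doc.OptionFixNestedTags = true; // Repair nesting errors in tags such as LI, TR, TH, and TD
doc.OptionAutoCloseOnEnd = true; // Close unclosed nodes at the end of the document
doc.OptionCheckSyntax = false; // Skip syntax checking, which may reduce parsing work
doc.LoadHtml(html);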
Common Pitfalls and Solutions
Handling Malformed HTML
Html Agility Pack is designed to handle malformed HTML, but understanding its behavior helps:
var malformedHtml = "<div><p>Unclosed paragraph<div>Nested div</div>";
var doc = new HtmlDocument();
doc.LoadHtml(malformedHtml);
// Html Agility Pack will attempt to fix the structure
var fixedHtml = doc.DocumentNode.OuterHtml;
Console.WriteLine(fixedHtml);
Case Sensitivity
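// Html Agility Pack stores element names in lowercase, and XPath name tests are case-sensitive
var element1 = doc.DocumentNode.SelectSingleNode("//DIV"); // Typically null: no node is named "DIV"
var element2 = doc.DocumentNode.SelectSingleNode("//div"); // Matches <div>, <DIV>, or <Div> in the source
// So write element names in lowercase in XPath expressions
// Attribute values are compared case-sensitively
var classElements = doc.DocumentNode.SelectNodes("//div[@class='Container']"); // Won't match 'container'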
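An exact @class comparison also misses elements that carry several classes. A common XPath 1.0 workaround, sketched below on made-up markup, lowercases the attribute with translate() and pads it with spaces so that a single class token can be matched regardless of case.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div class='Container wide'>A</div>" +
    "<div class='container'>B</div>" +
    "<div class='other'>C</div>");

// Exact comparison: matches only the element whose class attribute
// is exactly 'container'.
var exact = doc.DocumentNode.SelectNodes("//div[@class='container']");

// XPath 1.0 has no lower-case function, so translate() maps A-Z to a-z.
// Padding with spaces makes 'container' match as a whole class token.
const string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const string lower = "abcdefghijklmnopqrstuvwxyz";
var xpath =
    "//div[contains(concat(' ', normalize-space(translate(@class, '" +
    upper + "', '" + lower + "')), ' '), ' container ')]";
var relaxed = doc.DocumentNode.SelectNodes(xpath);

Console.WriteLine(exact?.Count ?? 0);
Console.WriteLine(relaxed?.Count ?? 0);
```

The relaxed query finds both the 'Container wide' and 'container' elements, while the exact one finds only the second.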
Conclusion
The HtmlDocument class in Html Agility Pack provides a robust and flexible foundation for HTML parsing and manipulation in .NET applications. Its fault-tolerant parsing capabilities, combined with powerful querying methods using XPath and LINQ, make it an excellent choice for web scraping, content extraction, and HTML processing tasks.
Whether you're building a simple HTML parser or a complex web scraping system, understanding how to effectively use the HtmlDocument class will significantly enhance your ability to work with HTML content programmatically. Remember to handle errors gracefully, check query results for null, dispose of the streams you load from, and consider performance implications when working with large documents.
For more advanced scenarios involving dynamic content or complex user interactions, consider combining Html Agility Pack with browser automation tools or specialized web scraping APIs that can handle JavaScript-rendered content before parsing with HtmlDocument.