What is the HtmlDocument Class and How Do I Use It?

The HtmlDocument class is the core component of Html Agility Pack, a powerful .NET library for parsing and manipulating HTML documents. This class provides a comprehensive DOM (Document Object Model) representation of HTML content, allowing developers to navigate, query, and modify HTML structures programmatically without requiring a web browser.

Understanding the HtmlDocument Class

The HtmlDocument class serves as the entry point for all HTML parsing operations in Html Agility Pack. It represents an entire HTML document and provides methods to load HTML content from strings, files, and streams; fetching a page from a URL goes through the companion HtmlWeb class, which returns an HtmlDocument. Once loaded, the document exposes a hierarchical tree of HTML nodes that can be traversed and manipulated.

Key Features

Fault-tolerant parsing: Handles malformed HTML gracefully
XPath support: Query elements using XPath expressions
CSS selector support: Not built in, but available through add-on packages such as Fizzler that extend Html Agility Pack nodes with CSS-style queries
Document modification: Add, remove, or modify HTML elements
Multiple input sources: Load from strings, files, or streams, or from URLs via HtmlWeb

Basic Usage and Setup

Installation

First, install Html Agility Pack from the NuGet Package Manager Console:

Install-Package HtmlAgilityPack

Or using the .NET CLI:

dotnet add package HtmlAgilityPack

Creating an HtmlDocument Instance

using HtmlAgilityPack;

// Create a new HtmlDocument instance
var doc = new HtmlDocument();

Loading HTML Content

Loading from String

The most common scenario involves parsing HTML content from a string:

var html = @"
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <div class='container'>
        <h1>Welcome to My Site</h1>
        <p>This is a sample paragraph.</p>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
    </div>
</body>
</html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

Loading from File

var doc = new HtmlDocument();
doc.Load("path/to/your/file.html");

Loading from URL

var web = new HtmlWeb();
var doc = web.Load("https://example.com");

Loading from Stream

// Requires: using System.IO;
// File.OpenRead opens the file read-only, so read-only files load too
using var stream = File.OpenRead("document.html");
var doc = new HtmlDocument();
doc.Load(stream);

Navigating the Document Structure

Accessing Document Properties

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Access the document node (root)
var documentNode = doc.DocumentNode;

// Get all child nodes
var childNodes = documentNode.ChildNodes;

// Access specific elements
var htmlNode = doc.DocumentNode.SelectSingleNode("//html");
var headNode = doc.DocumentNode.SelectSingleNode("//head");
var bodyNode = doc.DocumentNode.SelectSingleNode("//body");

Finding Elements

Using XPath Expressions

// Find single element
var titleNode = doc.DocumentNode.SelectSingleNode("//title");
var titleText = titleNode?.InnerText;

// Find multiple elements (SelectNodes returns null when nothing matches)
var listItems = doc.DocumentNode.SelectNodes("//li");
if (listItems != null)
{
    foreach (var item in listItems)
    {
        Console.WriteLine(item.InnerText);
    }
}

// Find elements with specific attributes
var containerDiv = doc.DocumentNode.SelectSingleNode("//div[@class='container']");

Using LINQ to XML Style Queries

// Find elements by tag name
var paragraphs = doc.DocumentNode.Descendants("p");
foreach (var p in paragraphs)
{
    Console.WriteLine(p.InnerText);
}

// Find elements with specific attributes (Where requires: using System.Linq;)
var divsWithClass = doc.DocumentNode
    .Descendants("div")
    .Where(div => div.GetAttributeValue("class", "") == "container");
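
Html Agility Pack itself ships no CSS selector engine, so CSS-style queries come from an add-on package. The sketch below assumes the Fizzler.Systems.HtmlAgilityPack NuGet package is installed alongside HtmlAgilityPack; QuerySelector and QuerySelectorAll are extension methods from that package, not members of HtmlDocument. CssSelectorSketch and SelectText are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack; // assumption: add-on package, installed separately
using HtmlAgilityPack;

public static class CssSelectorSketch
{
    // Returns the inner text of every node matching the CSS selector
    public static List<string> SelectText(string html, string cssSelector)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll yields an empty sequence (not null) when nothing matches
        return doc.DocumentNode
            .QuerySelectorAll(cssSelector)
            .Select(node => node.InnerText)
            .ToList();
    }

    public static void Main()
    {
        var html = "<div class='container'><h1>Welcome</h1><ul><li>Item 1</li><li>Item 2</li></ul></div>";

        foreach (var text in SelectText(html, "div.container li"))
        {
            Console.WriteLine(text);
        }
    }
}
```

Unlike SelectNodes, the sequence returned here can be iterated without a null check, which is one reason some projects prefer the add-on for simple queries.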

Extracting Data

Getting Text Content

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Get inner text (without HTML tags)
var heading = doc.DocumentNode.SelectSingleNode("//h1");
var headingText = heading.InnerText; // "Welcome to My Site"

// Get inner HTML (with HTML tags)
var containerDiv = doc.DocumentNode.SelectSingleNode("//div[@class='container']");
var containerHtml = containerDiv.InnerHtml;

Getting Attribute Values

// Get attribute value
var divClass = containerDiv.GetAttributeValue("class", "default-value");

// Check if attribute exists
var hasId = containerDiv.Attributes.Contains("id");

// Get all attributes
var attributes = containerDiv.Attributes;
foreach (var attr in attributes)
{
    Console.WriteLine($"{attr.Name}: {attr.Value}");
}

Modifying HTML Content

Adding Elements

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Create a new element
var newParagraph = HtmlNode.CreateNode("<p>This is a new paragraph.</p>");

// Add to existing element
var containerDiv = doc.DocumentNode.SelectSingleNode("//div[@class='container']");
containerDiv.AppendChild(newParagraph);

Modifying Existing Elements

// Change text content
var heading = doc.DocumentNode.SelectSingleNode("//h1");
heading.InnerHtml = "Updated Heading";

// Change attribute values
heading.SetAttributeValue("class", "main-heading");

// Add new attributes
heading.SetAttributeValue("id", "page-title");

Removing Elements

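// Remove specific element (the first <p> in the document, if there is one)
var paragraph = doc.DocumentNode.SelectSingleNode("//p");
paragraph?.Remove();

// Remove all elements of a type (SelectNodes returns null when nothing matches)
var listItems = doc.DocumentNode.SelectNodes("//li");
if (listItems != null)
{
    foreach (var item in listItems)
    {
        item.Remove();
    }
}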
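
Additions, changes, and removals affect only the in-memory tree, so the document has to be serialized before the result can be used elsewhere. A minimal sketch, assuming the output is wanted as a string; SaveSketch and RenameHeading are hypothetical names.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class SaveSketch
{
    // Replaces the content of the first <h1> and returns the resulting markup
    public static string RenameHeading(string html, string newContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var heading = doc.DocumentNode.SelectSingleNode("//h1");
        if (heading != null)
        {
            // InnerHtml parses its value as markup; it is not escaped
            heading.InnerHtml = newContent;
        }

        // Save(TextWriter) serializes the current state of the tree
        using var writer = new StringWriter();
        doc.Save(writer);
        return writer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RenameHeading("<h1>Old</h1><p>Body</p>", "New"));
    }
}
```

Save also accepts a file path, and doc.DocumentNode.OuterHtml exposes the serialized markup as a property when no writer is needed.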

Advanced Usage Patterns

Error Handling and Validation

try
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Check for parse errors (Any() requires: using System.Linq;)
    if (doc.ParseErrors != null && doc.ParseErrors.Any())
    {
        foreach (var error in doc.ParseErrors)
        {
            Console.WriteLine($"Parse Error: {error.Reason} at line {error.Line}");
        }
    }

    // Validate element existence before accessing
    var element = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
    if (element != null)
    {
        // Safe to access element properties
        var content = element.InnerText;
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Error processing HTML: {ex.Message}");
}

Working with Forms

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Find form elements (SelectNodes returns null when the page has no forms)
var forms = doc.DocumentNode.SelectNodes("//form");
if (forms != null)
{
    foreach (var form in forms)
    {
        var action = form.GetAttributeValue("action", "");
        var method = form.GetAttributeValue("method", "GET");

        Console.WriteLine($"Form: {action} ({method})");

        // Find input fields inside this form (the leading dot keeps the query relative)
        var inputs = form.SelectNodes(".//input");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                var name = input.GetAttributeValue("name", "");
                var type = input.GetAttributeValue("type", "text");
                var value = input.GetAttributeValue("value", "");

                Console.WriteLine($"  Input: {name} ({type}) = {value}");
            }
        }
    }
}

Extracting Links and Images

// Extract all links
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        var url = link.GetAttributeValue("href", "");
        var text = link.InnerText.Trim();
        Console.WriteLine($"Link: {text} -> {url}");
    }
}

// Extract all images
var images = doc.DocumentNode.SelectNodes("//img");
if (images != null)
{
    foreach (var img in images)
    {
        var src = img.GetAttributeValue("src", "");
        var alt = img.GetAttributeValue("alt", "");
        Console.WriteLine($"Image: {alt} -> {src}");
    }
}

Integration with Web Scraping Workflows

HtmlDocument parses the markup it is given and does not execute JavaScript, so it fits best as the parsing stage of a larger scraping workflow. For pages that build their content client-side, render the page first with a browser automation tool or headless browser, then parse the rendered HTML with HtmlDocument.

For more robust loading, wrap the HTTP request that fetches the HTML in timeout and retry logic, the same pattern commonly used to handle timeouts in browser automation.
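
The retry idea can be sketched as follows. This is an illustration under stated assumptions, not part of Html Agility Pack: RetryingLoader, WithRetryAsync, and LoadAsync are hypothetical names, and the attempt count, delays, and 10-second timeout are arbitrary values.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class RetryingLoader
{
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(10) // assumption: per-request timeout
    };

    // Runs fetch up to maxAttempts times, doubling the delay after each failure
    public static async Task<T> WithRetryAsync<T>(
        Func<Task<T>> fetch, int maxAttempts, TimeSpan initialDelay)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await fetch();
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                (ex is HttpRequestException || ex is TaskCanceledException))
            {
                // Transient failure: wait, then try again with a longer delay
                await Task.Delay(delay);
                delay += delay;
            }
        }
    }

    // Downloads the page with retries, then parses it
    public static async Task<HtmlDocument> LoadAsync(string url)
    {
        var html = await WithRetryAsync(
            () => Client.GetStringAsync(url), 3, TimeSpan.FromSeconds(1));

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```

Once the last attempt fails, the exception filter no longer matches and the error propagates to the caller, so a permanent failure is not silently swallowed.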

Performance Considerations

Memory Management

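// HtmlDocument does not implement IDisposable, so it cannot appear in a using statement.
// Dispose the streams or readers you load from, and let the document be
// garbage-collected once nothing references it.
var doc = new HtmlDocument();
doc.LoadHtml(html);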

Optimizing Large Documents

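// Html Agility Pack builds the entire DOM in memory; it does not stream,
// so memory use grows with document size
var doc = new HtmlDocument();

// Configure parser options before loading
doc.OptionFixNestedTags = true;   // Repair nesting errors in tags such as <li>, <tr>, <td>
doc.OptionAutoCloseOnEnd = true;  // Close unclosed tags at the end of the document
doc.OptionCheckSyntax = false;    // Skip the check for unclosed tags to save time

doc.LoadHtml(html);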

Common Pitfalls and Solutions

Handling Malformed HTML

Html Agility Pack is designed to handle malformed HTML, but understanding its behavior helps:

var malformedHtml = "<div><p>Unclosed paragraph<div>Nested div</div>";
var doc = new HtmlDocument();
doc.LoadHtml(malformedHtml);

// Html Agility Pack will attempt to fix the structure
var fixedHtml = doc.DocumentNode.OuterHtml;
Console.WriteLine(fixedHtml);

Case Sensitivity

// Html Agility Pack lowercases element names while parsing, and XPath matching
// is case-sensitive, so write element names in lowercase
var element1 = doc.DocumentNode.SelectSingleNode("//DIV"); // Does not match
var element2 = doc.DocumentNode.SelectSingleNode("//div"); // Matches <div> and <DIV> in the source

// Attribute values are compared case-sensitively
var classElements = doc.DocumentNode.SelectNodes("//div[@class='Container']"); // Won't match 'container'
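
When attribute values may differ in case, one option is to do the comparison in C# rather than in XPath. A minimal sketch using the Descendants query style shown earlier; CaseInsensitiveMatch and DivsWithClass are hypothetical names. It compares the whole class attribute, so an element carrying several classes would not match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CaseInsensitiveMatch
{
    // Returns <div> elements whose class attribute equals className, ignoring case
    public static List<HtmlNode> DivsWithClass(HtmlDocument doc, string className)
    {
        return doc.DocumentNode
            .Descendants("div")
            .Where(div => string.Equals(
                div.GetAttributeValue("class", ""),
                className,
                StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='container'>a</div>" +
            "<div class='Container'>b</div>" +
            "<div class='other'>c</div>");

        // Both 'container' and 'Container' are counted
        Console.WriteLine(DivsWithClass(doc, "CONTAINER").Count);
    }
}
```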

Conclusion

The HtmlDocument class in Html Agility Pack provides a robust and flexible foundation for HTML parsing and manipulation in .NET applications. Its fault-tolerant parsing capabilities, combined with powerful querying methods using XPath and LINQ, make it an excellent choice for web scraping, content extraction, and HTML processing tasks.

Whether you're building a simple HTML parser or a complex web scraping system, understanding how to effectively use the HtmlDocument class will significantly enhance your ability to work with HTML content programmatically. Remember to check query results for null, dispose of the streams you load from, and consider performance implications when working with large documents.

For more advanced scenarios involving dynamic content or complex user interactions, consider combining Html Agility Pack with browser automation tools or specialized web scraping APIs that can handle JavaScript-rendered content before parsing with HtmlDocument.
