How do I parse HTML from a string using Html Agility Pack?

Html Agility Pack is one of the most popular and powerful HTML parsing libraries for .NET developers. Unlike browser-based automation tools, it provides a lightweight solution for parsing HTML content directly from strings, making it ideal for web scraping, data extraction, and HTML manipulation tasks.

What is Html Agility Pack?

Html Agility Pack is a .NET library that parses HTML into a DOM-like tree of HtmlNode objects. It tolerates malformed HTML and can be queried with XPath or with LINQ (for example through Descendants()), in a style similar to LINQ to XML, which makes it suitable for a wide range of HTML parsing scenarios.

Installation and Setup

Before parsing HTML strings, you need to install Html Agility Pack in your .NET project:

Using NuGet Package Manager

Install-Package HtmlAgilityPack

Using .NET CLI

dotnet add package HtmlAgilityPack

Using PackageReference in .csproj

<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />

Basic HTML String Parsing

Simple String Parsing

The most straightforward way to parse HTML from a string is using the HtmlDocument class:

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string htmlString = @"
            <html>
                <head><title>Sample Page</title></head>
                <body>
                    <div class='container'>
                        <h1 id='main-title'>Welcome to My Site</h1>
                        <p>This is a sample paragraph.</p>
                        <ul>
                            <li>Item 1</li>
                            <li>Item 2</li>
                            <li>Item 3</li>
                        </ul>
                    </div>
                </body>
            </html>";

        // Create HtmlDocument instance
        HtmlDocument doc = new HtmlDocument();

        // Load HTML from string
        doc.LoadHtml(htmlString);

        // Access the document root
        HtmlNode rootNode = doc.DocumentNode;

        Console.WriteLine("Document parsed successfully!");
    }
}

Extracting Specific Elements

Once you've loaded the HTML string, you can extract specific elements using various selection methods:

// Extract title
HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
string title = titleNode?.InnerText ?? "No title found";
Console.WriteLine($"Title: {title}");

// Extract main heading
HtmlNode h1Node = doc.DocumentNode.SelectSingleNode("//h1[@id='main-title']");
string heading = h1Node?.InnerText ?? "No heading found";
Console.WriteLine($"Heading: {heading}");

// Extract all list items
HtmlNodeCollection listItems = doc.DocumentNode.SelectNodes("//li");
if (listItems != null)
{
    foreach (HtmlNode item in listItems)
    {
        Console.WriteLine($"List item: {item.InnerText}");
    }
}
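
The same elements can also be reached with LINQ instead of XPath. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the names LinqQueries and ListItemTexts are illustrative and not part of the library. One practical difference is that Descendants() yields an empty sequence when nothing matches, whereas SelectNodes() returns null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqQueries
{
    // Returns the trimmed, non-empty text of every <li> element in the HTML.
    public static List<string> ListItemTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("li")                      // all <li> nodes at any depth
            .Select(li => li.InnerText.Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        string html = "<ul><li>Item 1</li><li> Item 2 </li><li>Item 3</li></ul>";
        foreach (string text in ListItemTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Because the result is an ordinary List<string>, any further filtering or grouping can use standard LINQ operators without touching the parsed document again.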

Advanced Parsing Techniques

Using CSS Selectors with QuerySelector

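Html Agility Pack itself ships with XPath and LINQ support only. CSS selectors require an add-on NuGet package such as Fizzler.Systems.HtmlAgilityPack, which adds QuerySelector and QuerySelectorAll extension methods to HtmlNode:

// Requires the Fizzler.Systems.HtmlAgilityPack package
using Fizzler.Systems.HtmlAgilityPack;

// Select by class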
HtmlNode containerDiv = doc.DocumentNode.QuerySelector(".container");

// Select by ID
HtmlNode mainTitle = doc.DocumentNode.QuerySelector("#main-title");

// Select multiple elements
IEnumerable<HtmlNode> paragraphs = doc.DocumentNode.QuerySelectorAll("p");

// Complex selectors
HtmlNode firstListItem = doc.DocumentNode.QuerySelector("ul li:first-child");
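
The selectors above also have XPath counterparts that rely only on Html Agility Pack's built-in XPath support. This is a sketch rather than an exact translation; XPathEquivalents and its members are hypothetical names, and the last expression is close to, but not identical to, the CSS selector it stands in for.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathEquivalents
{
    // .container : match one token inside a space-separated class attribute
    public const string ByClass =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' container ')]";

    // #main-title
    public const string ById = "//*[@id='main-title']";

    // Approximates "ul li:first-child": the first <li> child of each <ul>
    public const string FirstListItem = "//ul/li[1]";

    // Returns the trimmed text of the first node matching the XPath, or "".
    public static string TextOf(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim() ?? string.Empty;
    }

    public static void Main()
    {
        string html = "<div class='box container'><h1 id='main-title'>Hi</h1>"
                    + "<ul><li>One</li><li>Two</li></ul></div>";
        Console.WriteLine(TextOf(html, ById));
        Console.WriteLine(TextOf(html, FirstListItem));
    }
}
```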

Extracting Attributes

You can easily extract HTML attributes from parsed elements:

string htmlWithAttributes = @"
    <div>
        <img src='image1.jpg' alt='Sample Image' class='responsive' />
        <a href='https://example.com' target='_blank'>External Link</a>
    </div>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlWithAttributes);

// Extract image attributes
HtmlNode imgNode = doc.DocumentNode.SelectSingleNode("//img");
if (imgNode != null)
{
    string src = imgNode.GetAttributeValue("src", "");
    string alt = imgNode.GetAttributeValue("alt", "");
    string cssClass = imgNode.GetAttributeValue("class", "");

    Console.WriteLine($"Image: src={src}, alt={alt}, class={cssClass}");
}

// Extract link attributes
HtmlNode linkNode = doc.DocumentNode.SelectSingleNode("//a");
if (linkNode != null)
{
    string href = linkNode.GetAttributeValue("href", "");
    string target = linkNode.GetAttributeValue("target", "");

    Console.WriteLine($"Link: href={href}, target={target}");
}
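
A common follow-up is to collect every link on a page and turn relative href values into absolute URLs. The sketch below assumes you know the URL the HTML came from; LinkExtractor and AbsoluteLinks are illustrative names, not library members.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns an absolute URL for every <a> element that has an href attribute.
    public static List<string> AbsoluteLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "").Trim();

            // Resolve relative values against the base; skip values that are not valid URIs
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }

    public static void Main()
    {
        string html = "<a href='page.html'>Relative</a><a href='https://other.example/x'>Absolute</a>";
        foreach (string url in AbsoluteLinks(html, new Uri("https://example.com/docs/")))
        {
            Console.WriteLine(url);
        }
    }
}
```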

Handling Tables and Structured Data

Html Agility Pack excels at parsing structured data like tables:

string tableHtml = @"
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>City</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>John Doe</td>
                <td>30</td>
                <td>New York</td>
            </tr>
            <tr>
                <td>Jane Smith</td>
                <td>25</td>
                <td>London</td>
            </tr>
        </tbody>
    </table>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(tableHtml);

// Extract table headers (Select and ToList require using System.Linq)
var headers = doc.DocumentNode
    .SelectNodes("//th")
    ?.Select(th => th.InnerText.Trim())
    .ToList();

// Extract table data
var rows = doc.DocumentNode.SelectNodes("//tbody/tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("td")
            ?.Select(td => td.InnerText.Trim())
            .ToArray();

        if (cells != null && headers != null)
        {
            for (int i = 0; i < Math.Min(headers.Count, cells.Length); i++)
            {
                Console.WriteLine($"{headers[i]}: {cells[i]}");
            }
            Console.WriteLine("---");
        }
    }
}
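
For further processing it is often handier to turn each row into a record instead of printing it. The sketch below, with the hypothetical names TableReader and ReadRows, builds one dictionary per row keyed by header text. It also passes cell text through HtmlEntity.DeEntitize, because InnerText keeps entities such as &amp; exactly as they appear in the markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableReader
{
    // Reads HTML containing a single table into one dictionary per data row,
    // keyed by the header text.
    public static List<Dictionary<string, string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<Dictionary<string, string>>();
        HtmlNodeCollection headerNodes = doc.DocumentNode.SelectNodes("//th");
        HtmlNodeCollection rowNodes = doc.DocumentNode.SelectNodes("//tr[td]"); // rows with data cells
        if (headerNodes == null || rowNodes == null)
        {
            return rows;
        }

        List<string> headers = headerNodes.Select(th => CellText(th)).ToList();
        foreach (HtmlNode row in rowNodes)
        {
            List<string> cells = row.SelectNodes("td").Select(td => CellText(td)).ToList();
            var record = new Dictionary<string, string>();
            for (int i = 0; i < Math.Min(headers.Count, cells.Count); i++)
            {
                record[headers[i]] = cells[i];
            }
            rows.Add(record);
        }
        return rows;
    }

    // Decodes HTML entities in the cell text and trims surrounding whitespace.
    private static string CellText(HtmlNode cell)
    {
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        string html = "<table><tr><th>Name</th><th>Shop</th></tr>"
                    + "<tr><td>Ann</td><td>Fish &amp; Chips</td></tr></table>";
        foreach (var record in ReadRows(html))
        {
            Console.WriteLine(string.Join(", ", record.Select(kv => kv.Key + "=" + kv.Value)));
        }
    }
}
```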

Error Handling and Robustness

Handling Malformed HTML

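One of Html Agility Pack's strengths is that it tolerates malformed HTML: rather than rejecting broken markup, it records the problems it finds in the ParseErrors collection, which is worth inspecting after loading:

using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class HtmlParser
{
    public static HtmlDocument ParseHtmlString(string html)
    {
        try
        {
            var doc = new HtmlDocument();

            // Configure parser options
            doc.OptionFixNestedTags = true;
            doc.OptionAutoCloseOnEnd = true;
            // Has no effect on decoding here, since LoadHtml takes an already decoded string
            doc.OptionDefaultStreamEncoding = Encoding.UTF8;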

            doc.LoadHtml(html);

            // Check for parsing errors
            if (doc.ParseErrors != null && doc.ParseErrors.Any())
            {
                foreach (var error in doc.ParseErrors)
                {
                    Console.WriteLine($"Parse warning: {error.Reason} at line {error.Line}");
                }
            }

            return doc;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
            throw;
        }
    }
}
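
To see this tolerance in practice, the following sketch feeds the parser a fragment whose <b> tag is never closed. MalformedDemo and Inspect are hypothetical names; the number and wording of the reported parse errors depend on the library version, so the example prints them rather than assuming a particular result.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class MalformedDemo
{
    // Parses the fragment and returns its text content plus the number of
    // problems the parser recorded.
    public static (string Text, int ErrorCount) Inspect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        int errorCount = doc.ParseErrors.Count();
        return (doc.DocumentNode.InnerText, errorCount);
    }

    public static void Main()
    {
        // The <b> element is opened but never closed
        var result = Inspect("<div><b>bold</div>");
        Console.WriteLine($"Text: {result.Text}");
        Console.WriteLine($"Parse errors reported: {result.ErrorCount}");
    }
}
```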

Safe Element Extraction

Implement safe extraction methods to prevent null reference exceptions:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlExtensions
{
    public static string SafeInnerText(this HtmlNode node)
    {
        return node?.InnerText?.Trim() ?? string.Empty;
    }

    public static string SafeGetAttribute(this HtmlNode node, string attributeName, string defaultValue = "")
    {
        return node?.GetAttributeValue(attributeName, defaultValue) ?? defaultValue;
    }

    public static List<HtmlNode> SafeSelectNodes(this HtmlNode node, string xpath)
    {
        return node?.SelectNodes(xpath)?.ToList() ?? new List<HtmlNode>();
    }
}

// Usage example
string title = doc.DocumentNode.SelectSingleNode("//title").SafeInnerText();
string metaDescription = doc.DocumentNode
    .SelectSingleNode("//meta[@name='description']")
    .SafeGetAttribute("content");

Performance Optimization

Memory Management

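HtmlDocument does not implement IDisposable, so it cannot be wrapped in a using statement; a document is reclaimed by the garbage collector once nothing references it. For large-scale parsing, keep each document's scope short and avoid holding on to its nodes:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class OptimizedHtmlParser
{
    public void ParseMultipleHtmlStrings(IEnumerable<string> htmlStrings)
    {
        foreach (string html in htmlStrings)
        {
            // Declared inside the loop so each document becomes unreachable,
            // and therefore collectible, once its iteration ends
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Process document
            ProcessDocument(doc);

            // Optional last resort for very large batches: the runtime normally
            // collects on its own, and forced collections are themselves costly
            if (Environment.WorkingSet > 500_000_000) // 500MB threshold
            {
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }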
        }
    }

    private void ProcessDocument(HtmlDocument doc)
    {
        // Your processing logic here
    }
}
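
One way to keep memory use low is to copy the values you need into plain strings and let the parsed tree go out of scope immediately. The sketch below does this lazily with an iterator; TitleStream and Titles are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TitleStream
{
    // Lazily yields one title per HTML string. Only plain strings leave the
    // method, so no HtmlNode keeps a parsed document alive.
    public static IEnumerable<string> Titles(IEnumerable<string> htmlStrings)
    {
        foreach (string html in htmlStrings)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
            yield return title?.InnerText.Trim() ?? string.Empty;
        }
    }

    public static void Main()
    {
        var pages = new[]
        {
            "<html><head><title> First </title></head><body></body></html>",
            "<html><body>No title here</body></html>"
        };

        foreach (string title in Titles(pages))
        {
            Console.WriteLine(title.Length > 0 ? title : "(no title)");
        }
    }
}
```

Because the iterator is lazy, at most one document is being parsed at a time as long as the caller consumes the sequence with foreach rather than materializing intermediate node collections.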

Reusing HtmlDocument Instances

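An HtmlDocument instance can be reused across calls, which avoids reapplying the parser options each time. Each LoadHtml call replaces the previously parsed content, so finish working with the returned node before parsing the next string, and do not share one instance between threads: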

public class ReusableHtmlParser
{
    private readonly HtmlDocument _document;

    public ReusableHtmlParser()
    {
        _document = new HtmlDocument();
        _document.OptionFixNestedTags = true;
        _document.OptionAutoCloseOnEnd = true;
    }

    public HtmlNode ParseString(string html)
    {
        _document.LoadHtml(html);
        return _document.DocumentNode;
    }
}

Comparison with Other Parsing Methods

Html Agility Pack parses only the markup it is given and does not execute JavaScript, so content that a page generates in the browser will not appear in the parsed tree. For JavaScript-heavy pages, a browser automation tool that renders the page before extraction may be more appropriate.

However, for pure HTML parsing from strings, Html Agility Pack offers several advantages:

Performance: Faster than browser automation for static HTML
Memory efficiency: Lower resource usage
Simplicity: No browser dependencies
Reliability: Handles malformed HTML gracefully

Best Practices

Always check for null values when working with selected nodes
Use specific XPath or CSS selectors to improve performance
Configure parser options based on your HTML quality expectations
Implement proper error handling for production applications
Consider encoding issues when dealing with international content
Use LINQ for complex data transformations after parsing
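
On the encoding point: LoadHtml works on a .NET string, so any decoding has to happen before it is called. A minimal sketch, assuming you already know (or have detected) the encoding of the raw bytes; EncodedHtml and Parse are hypothetical names.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class EncodedHtml
{
    // Decodes raw bytes with the given encoding, then parses the resulting string.
    public static HtmlDocument Parse(byte[] bytes, Encoding encoding)
    {
        string html = encoding.GetString(bytes);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static void Main()
    {
        // "<p>café</p>" encoded as ISO-8859-1, where é is the single byte 0xE9
        byte[] latin1 = { 0x3C, 0x70, 0x3E, 0x63, 0x61, 0x66, 0xE9, 0x3C, 0x2F, 0x70, 0x3E };

        HtmlDocument doc = Parse(latin1, Encoding.GetEncoding("iso-8859-1"));
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p")?.InnerText);
    }
}
```

Decoding the same bytes as UTF-8 would replace the é with a substitution character, which is the typical symptom of a mismatched encoding in scraped text.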

Conclusion

Html Agility Pack provides a robust and efficient solution for parsing HTML from strings in .NET applications. Its ability to handle malformed HTML, combined with powerful selection methods and excellent performance characteristics, makes it an ideal choice for web scraping and HTML processing tasks. Whether you're extracting data from web responses, processing HTML templates, or building content analysis tools, Html Agility Pack offers the flexibility and reliability needed for professional development.

For more complex scenarios involving dynamic content or JavaScript execution, consider complementing Html Agility Pack with browser automation tools, but for pure HTML string parsing, it remains one of the best choices available for .NET developers.
