How do I parse HTML from a string using Html Agility Pack?

Html Agility Pack is one of the most popular and powerful HTML parsing libraries for .NET developers. Unlike browser-based automation tools, it provides a lightweight solution for parsing HTML content directly from strings, making it ideal for web scraping, data extraction, and HTML manipulation tasks.

What is Html Agility Pack?

Html Agility Pack is a .NET library that parses HTML into a DOM-like tree you can read and modify. It tolerates malformed HTML and can be queried with XPath or with LINQ over its node collections (it does not use LINQ to XML), which makes it suitable for a wide range of HTML parsing scenarios.

Installation and Setup

Before parsing HTML strings, you need to install Html Agility Pack in your .NET project:

Using NuGet Package Manager Console

Install-Package HtmlAgilityPack

Using .NET CLI

dotnet add package HtmlAgilityPack

Using PackageReference in .csproj

<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />

Basic HTML String Parsing

Simple String Parsing

The most straightforward way to parse HTML from a string is using the HtmlDocument class:

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string htmlString = @"
            <html>
                <head><title>Sample Page</title></head>
                <body>
                    <div class='container'>
                        <h1 id='main-title'>Welcome to My Site</h1>
                        <p>This is a sample paragraph.</p>
                        <ul>
                            <li>Item 1</li>
                            <li>Item 2</li>
                            <li>Item 3</li>
                        </ul>
                    </div>
                </body>
            </html>";

        // Create HtmlDocument instance
        HtmlDocument doc = new HtmlDocument();

        // Load HTML from string
        doc.LoadHtml(htmlString);

        // Access the document root
        HtmlNode rootNode = doc.DocumentNode;

        Console.WriteLine("Document parsed successfully!");
    }
}

Extracting Specific Elements

Once you've loaded the HTML string, you can extract specific elements using various selection methods:

// Extract title
HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
string title = titleNode?.InnerText ?? "No title found";
Console.WriteLine($"Title: {title}");

// Extract main heading
HtmlNode h1Node = doc.DocumentNode.SelectSingleNode("//h1[@id='main-title']");
string heading = h1Node?.InnerText ?? "No heading found";
Console.WriteLine($"Heading: {heading}");

// Extract all list items
HtmlNodeCollection listItems = doc.DocumentNode.SelectNodes("//li");
if (listItems != null)
{
    foreach (HtmlNode item in listItems)
    {
        Console.WriteLine($"List item: {item.InnerText}");
    }
}
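
XPath is not the only way to select nodes. As a minimal sketch, the LINQ-style methods on HtmlNode (such as Descendants) can be combined with standard LINQ operators. The markup and variable names below are made up for illustration, and the snippet is written as top-level statements (C# 9 or later):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample markup, used only for this illustration.
var linqDoc = new HtmlDocument();
linqDoc.LoadHtml("<ul><li class='done'>Write</li><li>Review</li><li class='done'>Ship</li></ul>");

// Descendants("li") enumerates every <li> below the document root;
// the rest is ordinary LINQ to Objects.
var doneItems = linqDoc.DocumentNode
    .Descendants("li")
    .Where(li => li.GetAttributeValue("class", "") == "done")
    .Select(li => li.InnerText.Trim())
    .ToList();

Console.WriteLine(string.Join(", ", doneItems));
```

Unlike SelectNodes, Descendants returns an empty sequence rather than null when nothing matches, so no null check is needed here.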

Advanced Parsing Techniques

Using CSS Selectors with QuerySelector

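Html Agility Pack does not include CSS selector support on its own. The QuerySelector and QuerySelectorAll extension methods come from add-on NuGet packages such as Fizzler.Systems.HtmlAgilityPack (namespace Fizzler.Systems.HtmlAgilityPack) or HtmlAgilityPack.CssSelectors. With one of them installed and its namespace imported, you can write: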

// Select by class
HtmlNode containerDiv = doc.DocumentNode.QuerySelector(".container");

// Select by ID
HtmlNode mainTitle = doc.DocumentNode.QuerySelector("#main-title");

// Select multiple elements
IEnumerable<HtmlNode> paragraphs = doc.DocumentNode.QuerySelectorAll("p");

// Complex selectors
HtmlNode firstListItem = doc.DocumentNode.QuerySelector("ul li:first-child");
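
The selections above can also be written in plain XPath, which needs only the core HtmlAgilityPack package. The following is a sketch of rough equivalents, not an exact translation of CSS semantics; the sample markup and variable names are assumptions for illustration:

```csharp
using System;
using HtmlAgilityPack;

var xpathDoc = new HtmlDocument();
xpathDoc.LoadHtml(@"
    <div class='container wide'>
        <h1 id='main-title'>Welcome</h1>
        <ul><li>Item 1</li><li>Item 2</li></ul>
    </div>");

HtmlNode root = xpathDoc.DocumentNode;

// Roughly ".container": match one token inside a space-separated class list.
HtmlNode container = root.SelectSingleNode(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' container ')]");

// Roughly "#main-title": any element with that id.
HtmlNode mainTitle = root.SelectSingleNode("//*[@id='main-title']");

// Roughly "ul li:first-child": the first <li> directly under a <ul>.
HtmlNode firstItem = root.SelectSingleNode("//ul/li[1]");

Console.WriteLine(container?.Name);
Console.WriteLine(mainTitle?.InnerText);
Console.WriteLine(firstItem?.InnerText);
```

The concat/normalize-space pattern is used instead of @class='container' so that elements with several classes still match.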

Extracting Attributes

You can easily extract HTML attributes from parsed elements:

string htmlWithAttributes = @"
    <div>
        <img src='image1.jpg' alt='Sample Image' class='responsive' />
        <a href='https://example.com' target='_blank'>External Link</a>
    </div>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlWithAttributes);

// Extract image attributes
HtmlNode imgNode = doc.DocumentNode.SelectSingleNode("//img");
if (imgNode != null)
{
    string src = imgNode.GetAttributeValue("src", "");
    string alt = imgNode.GetAttributeValue("alt", "");
    string cssClass = imgNode.GetAttributeValue("class", "");

    Console.WriteLine($"Image: src={src}, alt={alt}, class={cssClass}");
}

// Extract link attributes
HtmlNode linkNode = doc.DocumentNode.SelectSingleNode("//a");
if (linkNode != null)
{
    string href = linkNode.GetAttributeValue("href", "");
    string target = linkNode.GetAttributeValue("target", "");

    Console.WriteLine($"Link: href={href}, target={target}");
}
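
When you need the same attribute from many elements, it can be simpler to filter in the XPath expression and project the values in one pass. A minimal sketch, assuming hypothetical markup in which one anchor has no href:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var linkDoc = new HtmlDocument();
linkDoc.LoadHtml(@"
    <nav>
        <a href='/home'>Home</a>
        <a name='anchor-only'>No href here</a>
        <a href='https://example.com/docs'>Docs</a>
    </nav>");

// //a[@href] keeps only anchors that actually carry an href attribute.
// SelectNodes returns null when nothing matches, hence the ?. and ?? fallback.
List<string> hrefs = linkDoc.DocumentNode
    .SelectNodes("//a[@href]")
    ?.Select(a => a.GetAttributeValue("href", string.Empty))
    .ToList() ?? new List<string>();

foreach (string href in hrefs)
{
    Console.WriteLine(href);
}
```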

Handling Tables and Structured Data

Html Agility Pack excels at parsing structured data like tables:

string tableHtml = @"
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>City</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>John Doe</td>
                <td>30</td>
                <td>New York</td>
            </tr>
            <tr>
                <td>Jane Smith</td>
                <td>25</td>
                <td>London</td>
            </tr>
        </tbody>
    </table>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(tableHtml);

// Extract table headers
var headers = doc.DocumentNode
    .SelectNodes("//th")
    ?.Select(th => th.InnerText.Trim())
    .ToList();

// Extract table data
var rows = doc.DocumentNode.SelectNodes("//tbody/tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("td")
            ?.Select(td => td.InnerText.Trim())
            .ToArray();

        if (cells != null && headers != null)
        {
            for (int i = 0; i < Math.Min(headers.Count, cells.Length); i++)
            {
                Console.WriteLine($"{headers[i]}: {cells[i]}");
            }
            Console.WriteLine("---");
        }
    }
}

Error Handling and Robustness

Handling Malformed HTML

One of Html Agility Pack's strengths is its ability to handle malformed HTML gracefully:

using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class HtmlParser
{
    public static HtmlDocument ParseHtmlString(string html)
    {
        try
        {
            var doc = new HtmlDocument();

            // Configure parser options
            doc.OptionFixNestedTags = true;
            doc.OptionAutoCloseOnEnd = true;
            doc.OptionDefaultStreamEncoding = Encoding.UTF8;

            doc.LoadHtml(html);

            // Check for parsing errors
            if (doc.ParseErrors != null && doc.ParseErrors.Any())
            {
                foreach (var error in doc.ParseErrors)
                {
                    Console.WriteLine($"Parse warning: {error.Reason} at line {error.Line}");
                }
            }

            return doc;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
            throw;
        }
    }
}

Safe Element Extraction

Implement safe extraction methods to prevent null reference exceptions:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlExtensions
{
    public static string SafeInnerText(this HtmlNode node)
    {
        return node?.InnerText?.Trim() ?? string.Empty;
    }

    public static string SafeGetAttribute(this HtmlNode node, string attributeName, string defaultValue = "")
    {
        return node?.GetAttributeValue(attributeName, defaultValue) ?? defaultValue;
    }

    public static List<HtmlNode> SafeSelectNodes(this HtmlNode node, string xpath)
    {
        return node?.SelectNodes(xpath)?.ToList() ?? new List<HtmlNode>();
    }
}

// Usage example
string title = doc.DocumentNode.SelectSingleNode("//title").SafeInnerText();
string metaDescription = doc.DocumentNode
    .SelectSingleNode("//meta[@name='description']")
    .SafeGetAttribute("content");

Performance Optimization

Memory Management

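For large-scale HTML parsing, keep memory use down by processing one document at a time and holding on to the values you extract rather than to HtmlNode references. HtmlDocument does not implement IDisposable, so it cannot be wrapped in a using statement; once nothing references the document or its nodes, the garbage collector reclaims it:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class OptimizedHtmlParser
{
    public void ParseMultipleHtmlStrings(IEnumerable<string> htmlStrings)
    {
        foreach (string html in htmlStrings)
        {
            // A fresh document per string; it becomes eligible for
            // garbage collection once this iteration ends.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Process document
            ProcessDocument(doc);

            // Optional: forcing a collection is rarely necessary and can
            // reduce throughput, so measure before keeping this check.
            if (Environment.WorkingSet > 500_000_000) // 500MB threshold
            {
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
        }
    }

    private void ProcessDocument(HtmlDocument doc)
    {
        // Your processing logic here
    }
}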

Reusing HtmlDocument Instances

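When parsing many strings with the same parser options, you can reuse a single HtmlDocument instance so the options are configured only once. Each LoadHtml call replaces the previously parsed tree, so extract what you need before parsing the next string, and do not share the instance across threads: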

public class ReusableHtmlParser
{
    private readonly HtmlDocument _document;

    public ReusableHtmlParser()
    {
        _document = new HtmlDocument();
        _document.OptionFixNestedTags = true;
        _document.OptionAutoCloseOnEnd = true;
    }

    public HtmlNode ParseString(string html)
    {
        _document.LoadHtml(html);
        return _document.DocumentNode;
    }
}

Comparison with Other Parsing Methods

Html Agility Pack parses the markup it is given and does not execute JavaScript. For pages that build their content client-side, you need a browser automation tool (or another tool that renders the page) to produce the final HTML first; that HTML can then be passed to Html Agility Pack as a string.

However, for pure HTML parsing from strings, Html Agility Pack offers several advantages:

  • Performance: Faster than browser automation for static HTML
  • Memory efficiency: Lower resource usage
  • Simplicity: No browser dependencies
  • Reliability: Handles malformed HTML gracefully

Best Practices

  1. Always check for null values when working with selected nodes
  2. Use specific XPath or CSS selectors to improve performance
  3. Configure parser options based on your HTML quality expectations
  4. Implement proper error handling for production applications
  5. Consider encoding issues when dealing with international content
  6. Use LINQ for complex data transformations after parsing
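
Related to item 5, text from international pages often contains character entities such as &eacute;. As far as I am aware, InnerText returns the text as it appears in the markup, so entities may still need decoding; the sketch below uses HtmlEntity.DeEntitize for that. The sample string and variable names are assumptions for illustration:

```csharp
using System;
using HtmlAgilityPack;

var entityDoc = new HtmlDocument();
entityDoc.LoadHtml("<p>Caf&eacute; &amp; Bar</p>");

HtmlNode paragraph = entityDoc.DocumentNode.SelectSingleNode("//p");

// InnerText is read as written in the markup (entities may still be encoded);
// DeEntitize converts entities to the characters they stand for.
string rawText = paragraph.InnerText;
string decodedText = HtmlEntity.DeEntitize(rawText);

Console.WriteLine(decodedText);
```

System.Net.WebUtility.HtmlDecode is a framework alternative if you prefer not to depend on HtmlEntity.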

Conclusion

Html Agility Pack provides a robust and efficient solution for parsing HTML from strings in .NET applications. Its ability to handle malformed HTML, combined with powerful selection methods and excellent performance characteristics, makes it an ideal choice for web scraping and HTML processing tasks. Whether you're extracting data from web responses, processing HTML templates, or building content analysis tools, Html Agility Pack offers the flexibility and reliability needed for professional development.

For more complex scenarios involving dynamic content or JavaScript execution, consider complementing Html Agility Pack with browser automation tools, but for pure HTML string parsing, it remains one of the best choices available for .NET developers.
