How do I select elements by class or ID using Html Agility Pack?

Html Agility Pack (HAP) is a .NET library for parsing and manipulating HTML documents. To select elements by class or ID you typically write XPath expressions, because HAP has no built-in CSS selector support. This guide shows how to target elements reliably with XPath.

Setting Up Html Agility Pack

First, add the Html Agility Pack NuGet package to your project, for example with a PackageReference in your .csproj file:

<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />

Selecting Elements by ID

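IDs are meant to be unique within a document, which usually makes them the most reliable way to select a specific element. Real-world pages sometimes repeat an ID; in that case SelectSingleNode returns the first match in document order.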

Basic ID Selection

using HtmlAgilityPack;

var html = @"
<html>
<body>
    <div id='header'>Header Content</div>
    <p id='main-paragraph'>Main content here</p>
</body>
</html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select element by ID
HtmlNode headerNode = doc.DocumentNode.SelectSingleNode("//div[@id='header']");
// Or use this more general approach
HtmlNode paragraphNode = doc.DocumentNode.SelectSingleNode("//*[@id='main-paragraph']");

if (headerNode != null)
{
    Console.WriteLine($"Header text: {headerNode.InnerText}");
    Console.WriteLine($"Header HTML: {headerNode.OuterHtml}");
}
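
For plain ID lookups, HAP also provides a built-in HtmlDocument.GetElementbyId method (note the lowercase "b"), which avoids writing XPath at all. The following is a minimal sketch that reuses the sample markup from above; the "(not found)" fallback text is only illustrative:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div id='header'>Header Content</div><p id='main-paragraph'>Main content here</p>");

// Built-in lookup by id; returns null when no element carries that id.
HtmlNode header = doc.GetElementbyId("header");
HtmlNode missing = doc.GetElementbyId("does-not-exist");

Console.WriteLine(header?.InnerText ?? "(not found)");
Console.WriteLine(missing?.InnerText ?? "(not found)");
```

XPath remains the better fit when the ID test has to be combined with a tag name or other conditions.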

Loading from Web or File

// From URL
var web = new HtmlWeb();
var doc = web.Load("https://example.com");

// From file
var doc2 = new HtmlDocument();
doc2.Load("path/to/file.html");

// From string
var doc3 = new HtmlDocument();
doc3.LoadHtml(htmlString);

Selecting Elements by Class

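Class selection is more complex because the class attribute holds a space-separated list, so an element can carry several classes. A plain contains(@class, 'container') would also match names such as 'container-fluid'. Padding both the normalized attribute value and the class name with spaces, as in the expressions below, matches whole class names only.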

Single Class Selection

var html = @"
<html>
<body>
    <div class='container'>Content 1</div>
    <div class='container active'>Content 2</div>
    <p class='text-primary'>Paragraph</p>
</body>
</html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select all elements with 'container' class
var containerNodes = doc.DocumentNode
    .SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' container ')]");

foreach (var node in containerNodes ?? new HtmlNodeCollection(null))
{
    Console.WriteLine($"Class: {node.GetAttributeValue("class", "")}");
    Console.WriteLine($"Content: {node.InnerText}");
}

Multiple Class Selection

// Select elements that have both 'container' AND 'active' classes
var activeContainers = doc.DocumentNode.SelectNodes(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' container ') and " +
    "contains(concat(' ', normalize-space(@class), ' '), ' active ')]");

// Select elements that have 'container' OR 'text-primary' classes
var multipleClassNodes = doc.DocumentNode.SelectNodes(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' container ') or " +
    "contains(concat(' ', normalize-space(@class), ' '), ' text-primary ')]");

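If the XPath class expressions above feel heavy, a LINQ query over the node tree is an alternative. The sketch below assumes a reasonably recent HAP release in which HtmlNode.HasClass is available; the 'container-fluid' element is added only to show that whole class names are compared, not substrings:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
<div class='container'>Content 1</div>
<div class='container active'>Content 2</div>
<div class='container-fluid'>Content 3</div>");

// Elements whose class list contains the token 'container'.
var containers = doc.DocumentNode.Descendants()
    .Where(n => n.HasClass("container"))
    .ToList();

// Narrow further: 'container' AND 'active'.
var activeContainers = containers
    .Where(n => n.HasClass("active"))
    .ToList();

foreach (var node in containers)
{
    Console.WriteLine($"{node.GetAttributeValue("class", "")}: {node.InnerText}");
}
```

Unlike SelectNodes, the LINQ query yields an empty sequence rather than null when nothing matches, so no null check is needed before iterating.
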
Practical Examples

Extracting Data from Real HTML

var html = @"
<div class='product-card'>
    <h3 id='product-title'>Laptop Computer</h3>
    <span class='price'>$999.99</span>
    <div class='description'>High-performance laptop</div>
</div>
<div class='product-card'>
    <h3 id='product-title-2'>Desktop Computer</h3>
    <span class='price'>$1299.99</span>
    <div class='description'>Gaming desktop</div>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Get all product cards
var products = doc.DocumentNode
    .SelectNodes("//div[contains(concat(' ', normalize-space(@class), ' '), ' product-card ')]");

foreach (var product in products ?? new HtmlNodeCollection(null))
{
    var title = product.SelectSingleNode(".//h3")?.InnerText;
    var price = product.SelectSingleNode(".//span[@class='price']")?.InnerText;
    var description = product.SelectSingleNode(".//div[@class='description']")?.InnerText;

    Console.WriteLine($"Title: {title}");
    Console.WriteLine($"Price: {price}");
    Console.WriteLine($"Description: {description}");
    Console.WriteLine("---");
}
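
The loop above prints raw strings. When the values feed further processing, it can help to decode entities and parse numbers during extraction. This is a sketch under stated assumptions: prices are US-style with a leading '$', cards with an unparseable price are skipped, and the tuple shape (Title, Price) is purely illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
<div class='product-card'>
    <h3>Laptop Computer</h3>
    <span class='price'>$999.99</span>
</div>
<div class='product-card'>
    <h3>Desktop &amp; Monitor</h3>
    <span class='price'>$1299.99</span>
</div>");

var items = new List<(string Title, decimal Price)>();
var cards = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' product-card ')]");

if (cards != null)
{
    foreach (var card in cards)
    {
        // Decode entities such as &amp; explicitly before using the text.
        var title = HtmlEntity.DeEntitize(card.SelectSingleNode(".//h3")?.InnerText ?? "").Trim();
        var priceText = card.SelectSingleNode(".//span[@class='price']")?.InnerText ?? "";

        // Assumption: a leading '$' and invariant-culture number formatting.
        if (decimal.TryParse(priceText.Trim().TrimStart('$'), NumberStyles.Number,
                CultureInfo.InvariantCulture, out var price))
        {
            items.Add((title, price));
        }
    }
}

foreach (var item in items)
{
    Console.WriteLine($"{item.Title}: {item.Price}");
}
```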

Helper Methods for Cleaner Code

public static class HtmlAgilityPackExtensions
{
    public static HtmlNode GetElementById(this HtmlDocument doc, string id)
    {
        return doc.DocumentNode.SelectSingleNode($"//*[@id='{id}']");
    }

    public static HtmlNodeCollection GetElementsByClass(this HtmlDocument doc, string className)
    {
        return doc.DocumentNode.SelectNodes(
            $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]");
    }

    public static HtmlNode GetFirstElementByClass(this HtmlDocument doc, string className)
    {
        return doc.DocumentNode.SelectSingleNode(
            $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]");
    }
}

// Usage
var titleElement = doc.GetElementById("product-title");
var priceElements = doc.GetElementsByClass("price");
var firstContainer = doc.GetFirstElementByClass("container");
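
One caveat with helpers like these: the id or class name is interpolated straight into the XPath string, so a value containing a single quote produces an invalid expression. XPath 1.0 has no escape character, and a common workaround is to build the literal with concat(). ToXPathLiteral below is a hypothetical helper, not part of HAP, sketched to show the idea:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: turns any string into a valid XPath 1.0 string literal.
static string ToXPathLiteral(string value)
{
    // No single quote: wrap in single quotes.
    if (value.IndexOf('\'') < 0) return "'" + value + "'";
    // No double quote: wrap in double quotes.
    if (value.IndexOf('"') < 0) return "\"" + value + "\"";
    // Both quote types: split on single quotes and rejoin the pieces via concat().
    var parts = value.Split('\'').Select(p => "'" + p + "'");
    return "concat(" + string.Join(", \"'\", ", parts) + ")";
}

var doc = new HtmlDocument();
doc.LoadHtml("<p id=\"it's-here\">Found</p>");

var id = "it's-here";
var xpath = "//*[@id=" + ToXPathLiteral(id) + "]";
var node = doc.DocumentNode.SelectSingleNode(xpath);

Console.WriteLine(xpath);
Console.WriteLine(node?.InnerText ?? "(not found)");
```

The same helper could be used inside GetElementsByClass, although class names rarely contain quotes in practice.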

Common XPath Patterns

| Pattern | XPath Expression | Use Case |
|---------|------------------|----------|
| By ID | //*[@id='myId'] | Select unique element |
| By Class | //*[contains(@class, 'myClass')] | Simple substring match; also matches names such as 'myClass2' |
| Exact Class | //*[contains(concat(' ', normalize-space(@class), ' '), ' myClass ')] | Precise class matching |
| Tag + Class | //div[contains(@class, 'myClass')] | Specific tag with class (substring match) |
| Tag + ID | //div[@id='myId'] | Specific tag with ID |

Error Handling and Best Practices

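using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.XPath; // XPathException
using HtmlAgilityPack;

public static class SafeHtmlParsing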
{
    public static string GetTextById(HtmlDocument doc, string id)
    {
        try
        {
            var node = doc.DocumentNode.SelectSingleNode($"//*[@id='{id}']");
            return node?.InnerText?.Trim() ?? string.Empty;
        }
        catch (XPathException ex)
        {
            Console.WriteLine($"Invalid XPath expression: {ex.Message}");
            return string.Empty;
        }
    }

    public static List<string> GetTextsByClass(HtmlDocument doc, string className)
    {
        var results = new List<string>();

        try
        {
            var nodes = doc.DocumentNode.SelectNodes(
                $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]");

            if (nodes != null)
            {
                results.AddRange(nodes.Select(n => n.InnerText?.Trim() ?? string.Empty));
            }
        }
        catch (XPathException ex)
        {
            Console.WriteLine($"Invalid XPath expression: {ex.Message}");
        }

        return results;
    }
}

Performance Tips

  1. Use specific selectors: Instead of //*[@class='myClass'], use //div[@class='myClass'] if you know the tag, so that only div elements are tested
  2. Cache frequently used nodes: Store references to avoid repeated queries
  3. Use SelectSingleNode when possible: It's faster than SelectNodes when you only need the first match
  4. Avoid complex XPath: Simple expressions perform better than nested conditions
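
Tips 2 and 3 can be combined: look a container up once, keep the reference, and run relative queries (with a leading '.') against it. The markup and ids below are made up for illustration, and any speed-up depends on document size, so measure before relying on it:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
<div id='nav'><a href='/home'>Home</a></div>
<div id='content'>
    <a href='/a'>A</a>
    <a href='/b'>B</a>
</div>");

// Look the container up once and keep the reference.
var content = doc.DocumentNode.SelectSingleNode("//div[@id='content']");

// Relative queries only walk the cached subtree, not the whole document.
var links = content?.SelectNodes(".//a[@href]");

// SelectSingleNode is enough when only the first match is needed.
var firstLink = content?.SelectSingleNode(".//a[@href]");

Console.WriteLine(links?.Count ?? 0);
Console.WriteLine(firstLink?.GetAttributeValue("href", ""));
```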

Limitations and Alternatives

Html Agility Pack doesn't execute JavaScript, so dynamically generated content won't be available. For JavaScript-heavy sites, consider:

  • Selenium WebDriver: Full browser automation
  • Playwright: Modern browser automation
  • PuppeteerSharp: Headless Chrome control for .NET

These tools can handle dynamic content but are heavier and slower than Html Agility Pack for static HTML parsing.
