How do I select nodes using XPath with Html Agility Pack?

Html Agility Pack (HAP) is a powerful .NET library for parsing HTML documents, particularly useful for web scraping applications that need to handle malformed or imperfect HTML. XPath (XML Path Language) provides a robust query syntax for selecting nodes from HTML documents when combined with HAP.

Installation

First, install Html Agility Pack from the NuGet Package Manager Console:

Install-Package HtmlAgilityPack

Or using the .NET CLI:

dotnet add package HtmlAgilityPack

Basic Node Selection

Html Agility Pack provides two primary methods for XPath node selection:

  • SelectSingleNode(xpath): Returns the first matching node, or null if nothing matches
  • SelectNodes(xpath): Returns a collection of all matching nodes, or null if nothing matches

Complete Example

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var html = @"<html>
                        <body>
                            <div id='content'>
                                <h1>Main Title</h1>
                                <p class='para highlight'>First paragraph</p>
                                <p class='para'>Second paragraph</p>
                                <ul>
                                    <li>Item 1</li>
                                    <li>Item 2</li>
                                </ul>
                            </div>
                            <footer>
                                <p class='footer-text'>Footer content</p>
                            </footer>
                        </body>
                     </html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Select single node by ID
        var contentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[@id='content']");
        Console.WriteLine($"Content div found: {contentDiv != null}");

        // Select by exact class attribute value (matches class='para' only,
        // not class='para highlight')
        var paragraphs = htmlDoc.DocumentNode.SelectNodes("//p[@class='para']");
        if (paragraphs != null)
        {
            foreach (var paragraph in paragraphs)
            {
                Console.WriteLine($"Paragraph: {paragraph.InnerText}");
            }
        }

        // Select an element whose class attribute contains 'highlight' (substring match)
        var highlighted = htmlDoc.DocumentNode.SelectSingleNode("//p[contains(@class, 'highlight')]");
        if (highlighted != null)
        {
            Console.WriteLine($"Highlighted text: {highlighted.InnerText}");
        }

        // Select all list items
        var listItems = htmlDoc.DocumentNode.SelectNodes("//ul/li");
        if (listItems != null)
        {
            Console.WriteLine($"Found {listItems.Count} list items");
        }
    }
}

Common XPath Patterns

Basic Selectors

// Select all paragraphs
var allParagraphs = doc.DocumentNode.SelectNodes("//p");

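// Select the first paragraph in the document
var firstParagraph = doc.DocumentNode.SelectSingleNode("(//p)[1]");

// Select the last paragraph in the document
// (//p[last()] would instead match the last p within each parent)
var lastParagraph = doc.DocumentNode.SelectSingleNode("(//p)[last()]");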

// Select by exact attribute value
var specificDiv = doc.DocumentNode.SelectSingleNode("//div[@id='header']");

// Select by partial attribute value
var partialClass = doc.DocumentNode.SelectNodes("//div[contains(@class, 'nav')]");
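
Because contains(@class, 'nav') is a substring test, it also matches class names such as navbar. A minimal sketch of the usual XPath 1.0 workaround, which matches a whole class token instead, is shown here; the sample markup and the Count helper are assumptions made for illustration and are not part of Html Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassTokenDemo
{
    // Hypothetical helper: number of nodes matching an XPath expression (0 if none)
    public static int Count(HtmlDocument doc, string xpath) =>
        doc.DocumentNode.SelectNodes(xpath)?.Count ?? 0;

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='nav'>A</div><div class='main nav'>B</div><div class='navbar'>C</div>");

        // Substring match: 'nav' also occurs inside 'navbar'
        Console.WriteLine(Count(doc, "//div[contains(@class, 'nav')]")); // 3

        // Token match: pad the normalized class list with spaces, then search for ' nav '
        Console.WriteLine(Count(doc,
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' nav ')]")); // 2
    }
}
```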

Advanced Selectors

// Select by text content
var linkByText = doc.DocumentNode.SelectSingleNode("//a[text()='Home']");

// Select by partial text content
var linkByPartialText = doc.DocumentNode.SelectSingleNode("//a[contains(text(), 'Contact')]");

// Select following sibling
var nextSibling = doc.DocumentNode.SelectSingleNode("//h1/following-sibling::p[1]");

// Select parent element
var parentDiv = doc.DocumentNode.SelectSingleNode("//p[@class='content']/..");

// Select with multiple conditions
var complexSelect = doc.DocumentNode.SelectNodes("//div[@class='item' and @data-id]");
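
Text predicates are sensitive to whitespace and to nested markup. The sketch below contrasts text() with normalize-space() and with the element's full string value (.); the sample HTML and the Matches helper are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class TextMatchDemo
{
    // Hypothetical helper: true when the XPath expression matches at least one node
    public static bool Matches(HtmlDocument doc, string xpath) =>
        doc.DocumentNode.SelectSingleNode(xpath) != null;

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<a href='/'> Home </a><a href='/contact'><b>Contact</b> us</a>");

        // text() compares the raw text node, including surrounding whitespace
        Console.WriteLine(Matches(doc, "//a[text()='Home']"));               // False
        // normalize-space() trims and collapses whitespace before comparing
        Console.WriteLine(Matches(doc, "//a[normalize-space()='Home']"));    // True

        // text() only sees direct text children, so the text inside <b> is missed
        Console.WriteLine(Matches(doc, "//a[contains(text(), 'Contact')]")); // False
        // '.' is the element's full string value, including descendant text
        Console.WriteLine(Matches(doc, "//a[contains(., 'Contact')]"));      // True
    }
}
```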

Working with Web Content

Here's a practical example for scraping web content:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class WebScraper
{
    private static readonly HttpClient client = new HttpClient();

    public static async Task Main()
    {
        try
        {
            // Load HTML from web
            var url = "https://example.com";
            var html = await client.GetStringAsync(url);

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Extract page title
            var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
            Console.WriteLine($"Page Title: {title}");

            // Extract all links
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                {
                    var href = link.GetAttributeValue("href", "");
                    var text = link.InnerText.Trim();
                    Console.WriteLine($"Link: {text} -> {href}");
                }
            }

            // Extract meta description
            var metaDesc = doc.DocumentNode
                .SelectSingleNode("//meta[@name='description']")
                ?.GetAttributeValue("content", "");
            Console.WriteLine($"Meta Description: {metaDesc}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

Essential XPath Syntax Reference

| Syntax | Description | Example |
|--------|-------------|---------|
| // | Select anywhere in document | //div selects all div elements |
| / | Select direct children | /html/body selects body directly under html |
| . | Current context node | ./p selects p children of the current node |
| .. | Parent node | ../div selects div children of the parent node |
| [@attr='value'] | Attribute exact match | //div[@id='main'] |
| [contains(@attr, 'value')] | Attribute partial match | //div[contains(@class, 'nav')] |
| [n] or [position()=n] | Position-based selection | //p[1] selects each p that is the first p child of its parent |
| [last()] | Last element among its siblings | //li[last()] selects the last li of each list |
| [text()='value'] | Text content match | //a[text()='Home'] |
| * | Any element | //*[@id='test'] selects any element with id 'test' |
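
Positional predicates such as [1] and [last()] are evaluated relative to each parent, which is easy to misread. A small sketch of the difference between //li[last()] and (//li)[last()] follows; the two-list markup and the FirstText helper are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class PositionDemo
{
    // Hypothetical helper: inner text of the first node matching the expression, or null
    public static string FirstText(HtmlDocument doc, string xpath) =>
        doc.DocumentNode.SelectSingleNode(xpath)?.InnerText;

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li>a</li><li>b</li></ul><ul><li>c</li><li>d</li></ul>");

        // The predicate applies per parent: the last li of each list matches (b and d)
        Console.WriteLine(doc.DocumentNode.SelectNodes("//li[last()]")?.Count ?? 0); // 2
        Console.WriteLine(FirstText(doc, "//li[last()]"));   // b

        // Parentheses gather every li first, then index into that whole set
        Console.WriteLine(FirstText(doc, "(//li)[last()]")); // d
        Console.WriteLine(FirstText(doc, "(//li)[3]"));      // c
    }
}
```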

Error Handling and Best Practices

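using System.Collections.Generic;
using System.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class XPathHelper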
{
    public static string SafeGetText(this HtmlNode node, string xpath)
    {
        try
        {
            return node?.SelectSingleNode(xpath)?.InnerText?.Trim() ?? "";
        }
        catch (XPathException)
        {
            return "";
        }
    }

    public static string SafeGetAttribute(this HtmlNode node, string xpath, string attribute)
    {
        try
        {
            return node?.SelectSingleNode(xpath)?.GetAttributeValue(attribute, "") ?? "";
        }
        catch (XPathException)
        {
            return "";
        }
    }

    public static IEnumerable<HtmlNode> SafeSelectNodes(this HtmlNode node, string xpath)
    {
        try
        {
            return node?.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
        }
        catch (XPathException)
        {
            return Enumerable.Empty<HtmlNode>();
        }
    }
}
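
The helpers above guard against two different failure modes: a valid expression that matches nothing (the result is null) and an expression that cannot be parsed (an XPathException is thrown). The following sketch shows both cases in isolation; the IsValidXPath helper and the sample markup are assumptions made for illustration.

```csharp
using System;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class ErrorCasesDemo
{
    // Hypothetical helper: false when the expression cannot be parsed as XPath
    public static bool IsValidXPath(HtmlDocument doc, string xpath)
    {
        try
        {
            doc.DocumentNode.SelectSingleNode(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='a'><span>hi</span></div>");

        // Valid XPath with no match: the result is null and nothing is thrown
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//table") == null); // True

        // Well-formed expression
        Console.WriteLine(IsValidXPath(doc, "//div[@id='a']")); // True

        // Unclosed predicate: parsing fails with XPathException
        Console.WriteLine(IsValidXPath(doc, "//div[@id='a'"));  // False
    }
}
```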

Performance Tips

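  1. Scope queries to a container: //div[@id='content']//p returns only the paragraphs inside the content div, whereas //p returns every paragraph in the document, so there are usually fewer nodes to process afterwards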
  2. Cache frequently used nodes: Store commonly accessed nodes in variables
  3. Prefer SelectSingleNode when you only need the first match
  4. Handle null results: Always check if SelectNodes returns null before iteration
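
As a sketch of tips 1 and 2 together, a container node can be looked up once and then queried with relative expressions. Note the leading dot: an expression that starts with // is evaluated from the document root even when it is called on a child node. The sample markup and the Count helper are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class ScopedQueryDemo
{
    // Hypothetical helper: match count for an XPath evaluated from a given node
    public static int Count(HtmlNode node, string xpath) =>
        node.SelectNodes(xpath)?.Count ?? 0;

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='content'><p>one</p><p>two</p></div><footer><p>three</p></footer>");

        // Look up the container once and reuse it for later queries
        var content = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
        if (content == null) return;

        // Relative query: only paragraphs inside the container
        Console.WriteLine(Count(content, ".//p")); // 2

        // Absolute query: searches the whole document despite the context node
        Console.WriteLine(Count(content, "//p"));  // 3
    }
}
```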

Common Pitfalls

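  • Case sensitivity: XPath comparisons are case-sensitive. HAP lowercases element and attribute names by default when parsing, so write them in lowercase in XPath; attribute values keep their original case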
  • Null reference exceptions: Always check if SelectNodes returns null
  • Malformed HTML: While HAP handles broken HTML well, very malformed documents may produce unexpected XPath results
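  • Namespace prefixes: HAP does not resolve XML namespaces, so elements with prefixed names (for example svg:rect) typically cannot be selected by name directly; matching on name(), as in //*[name()='svg:rect'], is a common workaround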
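
To illustrate the case-sensitivity pitfall: with default settings HAP exposes element and attribute names in lowercase, while attribute values are left untouched. XPath 1.0 has no lower-case function, so translate() is the usual way to compare a value case-insensitively. The markup and the Matches helper below are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class CaseDemo
{
    // Hypothetical helper: true when the XPath expression matches at least one node
    public static bool Matches(HtmlDocument doc, string xpath) =>
        doc.DocumentNode.SelectSingleNode(xpath) != null;

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<DIV ID='Main'><A REL='NoFollow' href='/x'>x</A></DIV>");

        // Names are lowercased during parsing, so lowercase names match
        Console.WriteLine(Matches(doc, "//div[@id='Main']")); // True

        // Attribute values keep their case: 'main' is not 'Main'
        Console.WriteLine(Matches(doc, "//div[@id='main']")); // False

        // translate() lowercases the value for a case-insensitive comparison
        Console.WriteLine(Matches(doc,
            "//a[translate(@rel, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='nofollow']")); // True
    }
}
```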

Html Agility Pack's XPath support makes it an excellent choice for robust HTML parsing and web scraping tasks, providing both flexibility and reliability when working with real-world web content.
