Html Agility Pack (HAP) is a powerful .NET library for parsing HTML documents, particularly useful for web scraping applications that need to handle malformed or imperfect HTML. XPath (XML Path Language) provides a robust query syntax for selecting nodes from HTML documents when combined with HAP.
## Installation

First, install Html Agility Pack via the NuGet Package Manager:

```powershell
Install-Package HtmlAgilityPack
```

Or using the .NET CLI:

```bash
dotnet add package HtmlAgilityPack
```
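To verify the install, and to see the fault tolerance mentioned above, here is a minimal sketch that feeds HAP deliberately broken markup. `LoadHtml` reports problems through the `ParseErrors` collection rather than throwing:

```csharp
using System;
using HtmlAgilityPack;

class SmokeTest
{
    static void Main()
    {
        // Deliberately imperfect HTML: the <div> and <b> are never closed
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div id='main'><b>bold, unclosed</body></html>");

        // LoadHtml builds a best-effort tree instead of throwing,
        // so XPath queries work even against broken markup
        var div = doc.DocumentNode.SelectSingleNode("//div[@id='main']");
        Console.WriteLine($"div found: {div != null}");

        // Parse problems are collected, not raised as exceptions
        foreach (var error in doc.ParseErrors)
            Console.WriteLine($"Parse issue: {error.Reason}");
    }
}
```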
## Basic Node Selection
Html Agility Pack provides two primary methods for XPath node selection:
- `SelectSingleNode(xpath)`: returns the first matching node, or `null` if nothing matches
- `SelectNodes(xpath)`: returns a collection of all matching nodes, or `null` if nothing matches
### Complete Example
```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var html = @"<html>
  <body>
    <div id='content'>
      <h1>Main Title</h1>
      <p class='para highlight'>First paragraph</p>
      <p class='para'>Second paragraph</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
      </ul>
    </div>
    <footer>
      <p class='footer-text'>Footer content</p>
    </footer>
  </body>
</html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Select a single node by ID
        var contentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[@id='content']");
        Console.WriteLine($"Content div found: {contentDiv != null}");

        // @class='para' is an exact string match, so this finds only the
        // second paragraph; the first paragraph's class is 'para highlight'
        var paragraphs = htmlDoc.DocumentNode.SelectNodes("//p[@class='para']");
        if (paragraphs != null)
        {
            foreach (var paragraph in paragraphs)
            {
                Console.WriteLine($"Paragraph: {paragraph.InnerText}");
            }
        }

        // contains() matches one class among several
        var highlighted = htmlDoc.DocumentNode.SelectSingleNode("//p[contains(@class, 'highlight')]");
        if (highlighted != null)
        {
            Console.WriteLine($"Highlighted text: {highlighted.InnerText}");
        }

        // Select all list items
        var listItems = htmlDoc.DocumentNode.SelectNodes("//ul/li");
        if (listItems != null)
        {
            Console.WriteLine($"Found {listItems.Count} list items");
        }
    }
}
```
## Common XPath Patterns

### Basic Selectors
```csharp
// Select all paragraphs
var allParagraphs = doc.DocumentNode.SelectNodes("//p");

// Select every <p> that is the first <p> within its parent;
// use (//p)[1] for the first <p> in the entire document
var firstParagraph = doc.DocumentNode.SelectSingleNode("//p[1]");

// Select every <p> that is the last <p> within its parent
var lastParagraph = doc.DocumentNode.SelectSingleNode("//p[last()]");

// Select by exact attribute value
var specificDiv = doc.DocumentNode.SelectSingleNode("//div[@id='header']");

// Select by partial attribute value
var partialClass = doc.DocumentNode.SelectNodes("//div[contains(@class, 'nav')]");
```
### Advanced Selectors
```csharp
// Select by exact text content
var linkByText = doc.DocumentNode.SelectSingleNode("//a[text()='Home']");

// Select by partial text content
var linkByPartialText = doc.DocumentNode.SelectSingleNode("//a[contains(text(), 'Contact')]");

// Select the first <p> following an <h1> among its siblings
var nextSibling = doc.DocumentNode.SelectSingleNode("//h1/following-sibling::p[1]");

// Select the parent of a matching element
var parentDiv = doc.DocumentNode.SelectSingleNode("//p[@class='content']/..");

// Combine multiple conditions
var complexSelect = doc.DocumentNode.SelectNodes("//div[@class='item' and @data-id]");
```
## Working with Web Content
Here's a practical example for scraping web content:
```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class WebScraper
{
    private static readonly HttpClient client = new HttpClient();

    public static async Task Main()
    {
        try
        {
            // Load HTML from the web
            var url = "https://example.com";
            var html = await client.GetStringAsync(url);

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Extract the page title
            var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
            Console.WriteLine($"Page Title: {title}");

            // Extract all links
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                {
                    var href = link.GetAttributeValue("href", "");
                    var text = link.InnerText.Trim();
                    Console.WriteLine($"Link: {text} -> {href}");
                }
            }

            // Extract the meta description
            var metaDesc = doc.DocumentNode
                .SelectSingleNode("//meta[@name='description']")
                ?.GetAttributeValue("content", "");
            Console.WriteLine($"Meta Description: {metaDesc}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}
```
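If a dedicated `HttpClient` is not needed, HAP also bundles its own loader, `HtmlWeb`, which downloads and parses in one step. A minimal sketch of the same title extraction:

```csharp
using System;
using HtmlAgilityPack;

class HtmlWebExample
{
    static void Main()
    {
        // HtmlWeb combines the download and parse steps from the example above
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com");

        var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
        Console.WriteLine($"Page Title: {title}");
    }
}
```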
## Essential XPath Syntax Reference
| Syntax | Description | Example |
|--------|-------------|---------|
| `//` | Select anywhere in the document | `//div` selects all `div` elements |
| `/` | Select direct children | `/html/body` selects `body` directly under `html` |
| `.` | Current context node | `./p` selects `p` children of the current context |
| `..` | Parent node | `../div` selects `div` children of the parent (i.e., `div` siblings of the context node) |
| `[@attr='value']` | Attribute exact match | `//div[@id='main']` |
| `[contains(@attr, 'value')]` | Attribute partial match | `//div[contains(@class, 'nav')]` |
| `[position()]` | Position-based selection | `//p[1]` selects each parent's first `p` element |
| `[last()]` | Last element | `//li[last()]` selects each list's last item |
| `[text()='value']` | Text content match | `//a[text()='Home']` |
| `*` | Any element | `//*[@id='test']` matches any element with that id |
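To make a few of these rows concrete, here is a short, self-contained sketch contrasting `//` with `/` and showing how positional predicates behave:

```csharp
using System;
using HtmlAgilityPack;

class SyntaxDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><ul><li>Item 1</li><li>Item 2</li></ul></body></html>");
        var root = doc.DocumentNode;

        // '//' searches anywhere in the document; '/' steps through direct children only
        Console.WriteLine(root.SelectNodes("//li").Count);                     // 2
        Console.WriteLine(root.SelectSingleNode("/html/body/ul") != null);     // True

        // [1] and [last()] index the <li> elements within their parent <ul>
        Console.WriteLine(root.SelectSingleNode("//ul/li[1]").InnerText);      // Item 1
        Console.WriteLine(root.SelectSingleNode("//ul/li[last()]").InnerText); // Item 2
    }
}
```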
## Error Handling and Best Practices
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class XPathHelper
{
    // Returns the trimmed inner text of the first match, or "" when nothing matches
    public static string SafeGetText(this HtmlNode node, string xpath)
    {
        try
        {
            return node?.SelectSingleNode(xpath)?.InnerText?.Trim() ?? "";
        }
        catch (XPathException)
        {
            return "";
        }
    }

    // Returns an attribute of the first match, or "" when nothing matches
    public static string SafeGetAttribute(this HtmlNode node, string xpath, string attribute)
    {
        try
        {
            return node?.SelectSingleNode(xpath)?.GetAttributeValue(attribute, "") ?? "";
        }
        catch (XPathException)
        {
            return "";
        }
    }

    // Returns an empty sequence instead of null when nothing matches
    public static IEnumerable<HtmlNode> SafeSelectNodes(this HtmlNode node, string xpath)
    {
        try
        {
            return node?.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
        }
        catch (XPathException)
        {
            return Enumerable.Empty<HtmlNode>();
        }
    }
}
```
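Usage then stays free of null checks and try/catch at the call site. A brief sketch against a tiny document:

```csharp
var doc = new HtmlDocument();
doc.LoadHtml("<div id='content'><h1>Main Title</h1></div>");

// The helpers return "" or an empty sequence instead of null or an exception
Console.WriteLine(doc.DocumentNode.SafeGetText("//h1"));              // Main Title
Console.WriteLine(doc.DocumentNode.SafeGetAttribute("//img", "src")); // (empty string)

foreach (var p in doc.DocumentNode.SafeSelectNodes("//p"))
{
    Console.WriteLine(p.InnerText); // never runs here: no <p> in the document
}
```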
## Performance Tips
- **Use specific XPath expressions**: `//div[@id='content']//p` is more efficient than `//p`
- **Cache frequently used nodes**: store commonly accessed nodes in variables instead of re-querying the document (see the sketch below)
- **Prefer `SelectSingleNode`** when you only need the first match
- **Handle null results**: always check whether `SelectNodes` returned `null` before iterating
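A minimal sketch of the first two tips combined, assuming `doc` is an already-loaded `HtmlDocument`: cache a container node once, then scope later queries to it. Note the leading `.//`; a bare `//` is anchored at the document root even when called on a child node.

```csharp
// Cache the container once instead of re-querying the whole document
var content = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
if (content != null)
{
    // './/' restricts the search to descendants of the cached node;
    // '//' would start from the document root again
    var paragraphs = content.SelectNodes(".//p");
    var firstItem = content.SelectSingleNode(".//ul/li[1]");
}
```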
## Common Pitfalls
- **Case sensitivity**: XPath is case-sensitive for element names and attribute values (see the sketch below)
- **Null reference exceptions**: always check whether `SelectNodes` returned `null`
- **Malformed HTML**: while HAP handles broken HTML well, severely malformed documents may still produce unexpected XPath results
- **Namespace issues**: HTML5 documents may require namespace-aware XPath expressions for certain embedded content, such as inline SVG or MathML
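For the first pitfall, note that HAP normalizes parsed element names to lowercase by default, while attribute values keep their original case. A small sketch, with results assumed from those defaults:

```csharp
var doc = new HtmlDocument();
doc.LoadHtml("<DIV id='Main'>text</DIV>");

// Element names are stored lowercase, so match on 'div', not 'DIV'
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//DIV") != null);             // False
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//div") != null);             // True

// Attribute values are not normalized: 'Main' and 'main' differ
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//div[@id='main']") != null); // False
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//div[@id='Main']") != null); // True
```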
Html Agility Pack's XPath support makes it an excellent choice for robust HTML parsing and web scraping tasks, providing both flexibility and reliability when working with real-world web content.