The Html Agility Pack (HAP) is a .NET library for parsing HTML documents. It is particularly useful for web scraping because it tolerates documents with less-than-perfect syntax. XPath (XML Path Language) is a query language for selecting nodes from an XML document; HAP exposes parsed HTML as a node tree that can be queried the same way.
Here's how you can select nodes using XPath with Html Agility Pack in C#:
- First, ensure you have Html Agility Pack installed in your project. If you're using NuGet, you can install it from the Package Manager Console with the following command:

    Install-Package HtmlAgilityPack

- Next, load the HTML document you want to scrape into an HtmlDocument object.
- Then, call the SelectNodes method on an HtmlNode (for the whole document, use the DocumentNode property of the HtmlDocument) to retrieve a collection of nodes that match the XPath query.
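The loading step can take a few forms depending on where the markup comes from. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the class LoadingSketch and its method names are made up for illustration and are not part of the library. It uses LoadHtml for an in-memory string, Load for a file on disk, and HtmlWeb.Load for a URL.

```csharp
using HtmlAgilityPack;

// Illustrative helper; the class and method names are hypothetical.
public static class LoadingSketch
{
    // Parse markup that is already in memory.
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Parse markup stored in a file on disk.
    public static HtmlDocument FromFile(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        return doc;
    }

    // Download a page and parse it (requires network access).
    public static HtmlDocument FromUrl(string url)
    {
        var web = new HtmlWeb();
        return web.Load(url);
    }

    // Count the nodes matching an XPath query.
    // SelectNodes returns null, not an empty collection, when nothing matches.
    public static int Count(HtmlDocument doc, string xpath)
    {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

Whichever way the document is loaded, queries start from doc.DocumentNode, so a call such as LoadingSketch.Count(doc, "//p") works the same for all three sources.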
Here's an example of how to use XPath with HAP to select nodes:
    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            var html = @"<html>
                            <body>
                                <div id='content'>
                                    <p class='para'>First paragraph</p>
                                    <p class='para'>Second paragraph</p>
                                </div>
                            </body>
                        </html>";

            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            // Select the div with id 'content'
            var contentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[@id='content']");

            // Select all paragraph nodes within the div
            var paragraphs = contentDiv.SelectNodes(".//p[@class='para']");

            if (paragraphs != null)
            {
                foreach (var paragraph in paragraphs)
                {
                    Console.WriteLine(paragraph.InnerText);
                }
            }
        }
    }
In this example, the SelectSingleNode method is used to select the first node that matches the XPath query, which here is looking for a div with an id of content. The SelectNodes method is then used to select all p elements with a class of para within that div.
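Both SelectSingleNode and SelectNodes return null when nothing matches, which is why the example checks paragraphs before looping. A defensive version of the same two-step selection might look like the sketch below; the class SafeSelect and the method TextsWithin are hypothetical names chosen for illustration, not library members.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Illustrative helper; the class and method names are hypothetical.
public static class SafeSelect
{
    // Returns the trimmed inner text of every node matching xpath, evaluated
    // on the first node matching containerXPath. Returns an empty list if
    // either query finds nothing.
    public static List<string> TextsWithin(string html, string containerXPath, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectSingleNode returns null when no node matches.
        var container = doc.DocumentNode.SelectSingleNode(containerXPath);
        if (container == null)
        {
            return result;
        }

        // SelectNodes also returns null, not an empty collection.
        var nodes = container.SelectNodes(xpath);
        if (nodes == null)
        {
            return result;
        }

        foreach (var node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }
}
```

Calling SafeSelect.TextsWithin(html, "//div[@id='content']", ".//p[@class='para']") on the markup above yields the two paragraph texts, while a container query that matches nothing yields an empty list instead of a NullReferenceException.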
Here are some key points of XPath syntax that you may find useful when using Html Agility Pack:
- // : Selects matching nodes anywhere in the document, no matter where they are. For example, //p selects all p elements. An expression that begins with // searches the whole document even when it is evaluated on a child node; use .//p to search only beneath the current node.
- . : Selects the current node. This is useful when you're already working within a context and want to apply the XPath relative to that context.
- / : Selects from the root node.
- [@attrib='value'] : Selects all nodes with a given attribute value. For example, //div[@id='content'] selects all div elements with an id of content.
- * : Matches any element node.
- .. : Selects the parent of the current node.
- node() : Matches any node of any kind.
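To see how these pieces behave when a query is evaluated on a node other than the document root, the sketch below runs several expressions against the same context node. The class XPathSyntaxSketch, its members, and the sample markup are assumptions made for illustration only.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Illustrative helper; the class, its members, and the sample markup are hypothetical.
public static class XPathSyntaxSketch
{
    // The div is written without whitespace between its children, so its
    // child nodes are the text "Intro" followed by two p elements.
    public const string Html =
        "<html><body>" +
        "<div id='content'>Intro<p class='para'>First</p><p class='para'>Second</p></div>" +
        "<p>Outside</p>" +
        "</body></html>";

    public static HtmlNode LoadContentDiv()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//div[@id='content']");
    }

    // SelectNodes returns null when nothing matches, so map that to zero.
    public static int Count(HtmlNode context, string xpath)
    {
        var nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Evaluates each expression with the content div as the context node.
    public static Dictionary<string, int> CountsFromContentDiv()
    {
        var div = LoadContentDiv();
        return new Dictionary<string, int>
        {
            ["//p"] = Count(div, "//p"),                   // every p in the document
            [".//p"] = Count(div, ".//p"),                 // only p elements beneath the div
            ["*"] = Count(div, "*"),                       // element children only
            ["node()"] = Count(div, "node()"),             // all children, text included
            ["/html/body/p"] = Count(div, "/html/body/p")  // absolute path from the root
        };
    }

    // Uses .. to step from the first p child back up to its parent.
    public static string ParentNameOfFirstParagraph()
    {
        var firstParagraph = LoadContentDiv().SelectSingleNode("p");
        return firstParagraph.SelectSingleNode("..").Name;
    }
}
```

The difference between //p and .//p is the one most likely to cause surprises in scraping code: the first ignores the context node, the second respects it.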
Remember that XPath is case-sensitive. HAP does not require well-formed HTML, because it is designed to deal with the quirks of real-world markup. Malformed input can, however, produce a node tree that differs from what the source suggests, so your XPath expressions may need to be more complex or more defensive.
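Because string comparisons in XPath are case-sensitive, attribute values written with inconsistent casing will not match a literal predicate. One common workaround, sketched here under the assumption that only ASCII letters matter, is the XPath 1.0 translate() function. The class CaseSketch and its methods are hypothetical names, and the value passed in must not contain a single quote, since it is spliced directly into the expression.

```csharp
using HtmlAgilityPack;

// Illustrative helper; the class and method names are hypothetical.
public static class CaseSketch
{
    private const string Upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private const string Lower = "abcdefghijklmnopqrstuvwxyz";

    public static HtmlNode Load(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode;
    }

    // Counts p elements whose class attribute equals cls exactly.
    public static int CountExact(HtmlNode root, string cls)
    {
        return Count(root, "//p[@class='" + cls + "']");
    }

    // Counts p elements whose class attribute equals cls, ignoring ASCII case.
    // translate() lowercases the attribute value before the comparison.
    public static int CountIgnoreCase(HtmlNode root, string cls)
    {
        var xpath = "//p[translate(@class,'" + Upper + "','" + Lower + "')='"
                    + cls.ToLowerInvariant() + "']";
        return Count(root, xpath);
    }

    // SelectNodes returns null when nothing matches, so map that to zero.
    private static int Count(HtmlNode root, string xpath)
    {
        var nodes = root.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

The translate() approach keeps the comparison inside the query; filtering the selected nodes in C# with a case-insensitive string comparison is an equally reasonable alternative.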