Html Agility Pack (HAP) is a .NET library for reading, manipulating, and writing HTML documents, including malformed ones. It is particularly useful for web scraping because it lets you navigate the document tree and select nodes with XPath or LINQ. CSS selectors are not built in; they require a third-party extension such as Fizzler.
To load an HTML document using Html Agility Pack, you first need to install the package in your .NET project. You can do this via NuGet Package Manager in Visual Studio or by running the following command in the NuGet Package Manager Console:
Install-Package HtmlAgilityPack
Once you have Html Agility Pack installed, you can use it to load an HTML document from a string, a file, or a web URL. Here are examples of how to do each:
Load HTML from a String
using System;
using HtmlAgilityPack;
// HTML content as a string
string htmlContent = "<html><head><title>Test</title></head><body><p>Hello, World!</p></body></html>";
// Create an instance of HtmlDocument from Html Agility Pack
HtmlDocument document = new HtmlDocument();
// Load the HTML content
document.LoadHtml(htmlContent);
// Now you can manipulate the document or extract information from it
var node = document.DocumentNode.SelectSingleNode("//p");
Console.WriteLine(node.InnerText); // Outputs: Hello, World!
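One detail worth handling when querying a loaded document: SelectNodes and SelectSingleNode return null when nothing matches, rather than an empty result. The sketch below shows null-safe XPath queries next to the LINQ-style Descendants alternative. The sample markup and variable names are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Illustrative markup (not from a real page)
string html = "<html><body><p>First</p><p>Second</p><a href=\"/about\">About</a></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null, not an empty collection, when there is no match
var paragraphs = doc.DocumentNode.SelectNodes("//p");
string[] paragraphTexts = paragraphs == null
    ? Array.Empty<string>()
    : paragraphs.Select(p => p.InnerText).ToArray();

// LINQ alternative: Descendants never returns null
var hrefs = doc.DocumentNode.Descendants("a")
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();

// SelectSingleNode also returns null when the node is absent
var missing = doc.DocumentNode.SelectSingleNode("//h1");
Console.WriteLine(missing?.InnerText ?? "(no <h1> found)");
```

Whether to use XPath or LINQ is largely a matter of taste; the LINQ form avoids the null check at the cost of less expressive queries.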
Load HTML from a File
using HtmlAgilityPack;
using System.IO;
// Path to the HTML file
string filePath = "path/to/your/file.html";
// Create an instance of HtmlDocument
HtmlDocument document = new HtmlDocument();
// Load the HTML file
document.Load(filePath);
// Use the document...
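If the file's encoding is known, it can be passed explicitly through the Load(path, encoding) overload. The following is a minimal sketch under that assumption; it writes a small throwaway file to the temp directory so that it runs on its own, and the file name and contents are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

// Hypothetical sample file, created only so the example is self-contained
string samplePath = Path.Combine(Path.GetTempPath(), "hap-sample.html");
var utf8NoBom = new UTF8Encoding(false);
File.WriteAllText(samplePath,
    "<html><head><title>Caf\u00e9</title></head><body></body></html>", utf8NoBom);

var fileDocument = new HtmlDocument();

// Guard against a missing file, then load with an explicit encoding
if (File.Exists(samplePath))
{
    fileDocument.Load(samplePath, utf8NoBom);
}

// Null-safe read of the <title> text
string title = fileDocument.DocumentNode.SelectSingleNode("//title")?.InnerText ?? "";
Console.WriteLine(title);

// Clean up the throwaway file
File.Delete(samplePath);
```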
Load HTML from a Web URL
To load an HTML document from a web URL, you would typically use HttpClient to fetch the content and then load it into an HtmlDocument.
using HtmlAgilityPack;
using System.Net.Http;
using System.Threading.Tasks;
// URL of the page to scrape
string url = "http://example.com";
// Use HttpClient to fetch the webpage
HttpClient httpClient = new HttpClient();
// Use async/await pattern to asynchronously fetch the web page
async Task<HtmlDocument> FetchHtmlDocumentAsync(string webUrl)
{
    // Send GET request to fetch the page
    string pageHtml = await httpClient.GetStringAsync(webUrl);
    // Create an instance of HtmlDocument
    HtmlDocument document = new HtmlDocument();
    // Load the HTML content
    document.LoadHtml(pageHtml);
    return document;
}
// Call the method and process the document
HtmlDocument htmlDocument = await FetchHtmlDocumentAsync(url);
// Use the htmlDocument...
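As an alternative to HttpClient, Html Agility Pack ships its own HtmlWeb class, which downloads and parses a page in one call. The helpers below are a sketch of that approach; the names LoadPage and LoadPageAsync are hypothetical wrappers, and neither is called here, so the snippet makes no network request on its own.

```csharp
using System.Threading.Tasks;
using HtmlAgilityPack;

// Synchronous: HtmlWeb issues the request and returns the parsed document
HtmlDocument LoadPage(string pageUrl)
{
    var web = new HtmlWeb();
    return web.Load(pageUrl);
}

// Asynchronous counterpart
Task<HtmlDocument> LoadPageAsync(string pageUrl)
{
    var web = new HtmlWeb();
    return web.LoadFromWebAsync(pageUrl);
}

// Usage (requires network access):
// HtmlDocument page = await LoadPageAsync("http://example.com");
```

HtmlWeb is convenient for quick scripts, while HttpClient gives more control over headers, timeouts, and connection reuse.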
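Remember to include the proper using directives at the top of your file. If you're writing a console application with an explicit Main method and you use async/await, declare it as async Task Main (supported since C# 7.1). With top-level statements (C# 9 and later), as in the HttpClient example, await can be used directly.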
Html Agility Pack handles different encodings and invalid markup gracefully, and it offers a great deal of flexibility for parsing and manipulating HTML documents. It's a powerful tool for any .NET developer working with HTML content.
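To illustrate the tolerance for invalid markup, the sketch below loads a deliberately broken fragment and lists whatever the parser recorded in the document's ParseErrors collection. The markup is made up for the example, and the exact errors reported may vary between library versions, so no specific output is shown.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Deliberately malformed: the <p> and the outer <div> are never closed
string brokenHtml = "<div><p>Unclosed paragraph<div>Nested</div>";

var brokenDocument = new HtmlDocument();
brokenDocument.LoadHtml(brokenHtml);

// LoadHtml does not throw on bad markup; problems are collected instead
foreach (var error in brokenDocument.ParseErrors)
{
    Console.WriteLine($"Line {error.Line}, column {error.LinePosition}: {error.Reason}");
}

// The resulting tree can still be queried
bool hasDiv = brokenDocument.DocumentNode.Descendants("div").Any();
Console.WriteLine(hasDiv);
```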