What are the main classes and methods in Html Agility Pack?
Html Agility Pack is a powerful .NET library that provides a convenient way to parse, manipulate, and query HTML documents. Understanding its core classes and methods is essential for effective web scraping and HTML processing in .NET applications. This comprehensive guide covers the main components of Html Agility Pack and demonstrates how to use them effectively.
Core Classes Overview
Html Agility Pack's architecture revolves around several key classes that work together to provide a complete HTML parsing solution. The primary classes are HtmlDocument, HtmlNode, HtmlNodeCollection, and HtmlAttribute.
HtmlDocument Class
The HtmlDocument class is the entry point for working with HTML content. It represents an entire HTML document and provides methods for loading, parsing, and manipulating HTML.
Key Properties and Methods
using System.Collections.Generic;
using HtmlAgilityPack;
// Create a new HtmlDocument instance
HtmlDocument doc = new HtmlDocument();
// Load HTML from various sources
doc.LoadHtml("<html><body><h1>Hello World</h1></body></html>");
doc.Load("path/to/file.html");
// Access the document node (root)
HtmlNode documentNode = doc.DocumentNode;
// Get parse errors
IEnumerable<HtmlParseError> errors = doc.ParseErrors;
// Save the document
doc.Save("output.html");
Loading HTML from Web
using System.Net.Http;
using HtmlAgilityPack;
// Load HTML from a URL
var web = new HtmlWeb();
HtmlDocument doc = web.Load("https://example.com");
// Alternative with HttpClient
using var client = new HttpClient();
string html = await client.GetStringAsync("https://example.com");
doc.LoadHtml(html);
HtmlNode Class
The HtmlNode class represents individual HTML elements, text nodes, comments, and other HTML components. It's the most frequently used class when working with Html Agility Pack.
Essential Properties
HtmlNode node = doc.DocumentNode.SelectSingleNode("//h1");
// Node properties
string nodeName = node.Name; // Element tag name
string nodeValue = node.InnerText; // Text content
string innerHTML = node.InnerHtml; // Inner HTML
string outerHTML = node.OuterHtml; // Outer HTML including the element
HtmlNodeType nodeType = node.NodeType; // Type of node
HtmlNode parentNode = node.ParentNode; // Parent element
HtmlNodeCollection children = node.ChildNodes; // Child nodes (elements, text, comments)
Navigation Methods
// Navigate through the DOM tree
HtmlNode firstChild = node.FirstChild;
HtmlNode lastChild = node.LastChild;
HtmlNode nextSibling = node.NextSibling;
HtmlNode previousSibling = node.PreviousSibling;
// Get ancestors and descendants
IEnumerable<HtmlNode> ancestors = node.Ancestors();
IEnumerable<HtmlNode> descendants = node.Descendants();
Selection Methods
// XPath selections relative to this node
// (the leading "." matters: a bare "//" searches the whole document)
HtmlNode singleNode = node.SelectSingleNode(".//div[@class='content']");
HtmlNodeCollection nodes = node.SelectNodes(".//a[@href]"); // null if nothing matches
// Direct child elements by tag name
IEnumerable<HtmlNode> divElements = node.Elements("div");
HtmlNode firstDiv = node.Element("div");
// Descendant selection at any depth
IEnumerable<HtmlNode> allLinks = node.Descendants("a");
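One detail about the XPath methods above is worth illustrating: an expression that begins with "//" is evaluated from the document root even when it is called on a child node, while ".//" stays within that node's subtree. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the sample markup and variable names are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Sample markup invented for this sketch: one <p> in the first div, two in the second
doc.LoadHtml("<div id='a'><p>one</p></div><div id='b'><p>two</p><p>three</p></div>");

HtmlNode secondDiv = doc.DocumentNode.SelectSingleNode("//div[@id='b']");

// "//p" is anchored at the document root, so the context node is ignored
int fromRoot = secondDiv.SelectNodes("//p").Count;

// ".//p" is anchored at secondDiv and only sees its descendants
int fromNode = secondDiv.SelectNodes(".//p").Count;

Console.WriteLine($"//p matched {fromRoot}, .//p matched {fromNode}");
```

With this markup, the first query should match all three paragraphs and the second only the two inside the second div.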
HtmlNodeCollection Class
The HtmlNodeCollection class represents a collection of HtmlNode objects and provides methods for iterating and manipulating multiple nodes.
// Working with node collections
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    // Iterate through nodes
    foreach (HtmlNode link in links)
    {
        string href = link.GetAttributeValue("href", "");
        string text = link.InnerText;
        Console.WriteLine($"Link: {text} -> {href}");
    }
    // Collection properties
    int count = links.Count;
    HtmlNode firstLink = links[0];
    HtmlNode lastLink = links[links.Count - 1];
}
HtmlAttribute Class
The HtmlAttribute class represents HTML element attributes and provides methods for accessing and modifying attribute values.
HtmlNode element = doc.DocumentNode.SelectSingleNode("//img");
// Access attributes
HtmlAttributeCollection attributes = element.Attributes;
HtmlAttribute srcAttribute = element.Attributes["src"];
// Get attribute values
string src = element.GetAttributeValue("src", "");
string alt = element.GetAttributeValue("alt", "No alt text");
// Set attribute values
element.SetAttributeValue("src", "new-image.jpg");
element.SetAttributeValue("class", "responsive-image");
// Remove attributes
element.Attributes.Remove("onclick");
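The snippet above goes through the node's helper methods; the HtmlAttribute objects themselves can also be enumerated. The sketch below, which assumes the HtmlAgilityPack NuGet package and uses invented sample markup, reads each attribute's Name and Value and shows the integer overload of GetAttributeValue falling back to its default.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Sample markup invented for this sketch
doc.LoadHtml("<img src='logo.png' alt='Logo' width='120'>");
HtmlNode img = doc.DocumentNode.SelectSingleNode("//img");

// Each HtmlAttribute exposes the attribute's Name and Value
foreach (HtmlAttribute attr in img.Attributes)
{
    Console.WriteLine($"{attr.Name} = {attr.Value}");
}

// The int overload parses the value and returns the default
// when the attribute is missing
int width = img.GetAttributeValue("width", 0);
int height = img.GetAttributeValue("height", -1);

Console.WriteLine($"width={width}, height={height}");
```

Here width should be read from the markup, while height is absent and so takes the supplied default.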
Advanced HTML Manipulation
Creating and Modifying Elements
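// Start from a document that already contains a <body>;
// on an empty HtmlDocument the //body query below would return null
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html><body></body></html>");
// Create new elements
HtmlNode newDiv = doc.CreateElement("div");
newDiv.SetAttributeValue("class", "container");
newDiv.InnerHtml = "<p>New content</p>";
// Append to existing element
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
body.AppendChild(newDiv);
// Insert a heading before the new div
HtmlNode header = doc.CreateElement("h2");
// Add the heading text as a text node (InnerText is typically read-only on HtmlNode)
header.AppendChild(doc.CreateTextNode("Section Title"));
body.InsertBefore(header, newDiv);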
Removing Elements
// Remove specific elements
HtmlNodeCollection scripts = doc.DocumentNode.SelectNodes("//script");
if (scripts != null)
{
    // ToList() (System.Linq) copies the collection so nodes can be removed while iterating
    foreach (HtmlNode script in scripts.ToList())
    {
        script.Remove();
    }
}
// Remove by condition
var emptyParagraphs = doc.DocumentNode
    .SelectNodes("//p[not(normalize-space())]");
if (emptyParagraphs != null)
{
    foreach (HtmlNode p in emptyParagraphs.ToList())
    {
        p.Remove();
    }
}
XPath and CSS Selector Support
Html Agility Pack primarily uses XPath for element selection, though CSS selector support is available through extensions.
XPath Examples
// Basic XPath selections
var titleNode = doc.DocumentNode.SelectSingleNode("//title");
var allParagraphs = doc.DocumentNode.SelectNodes("//p");
var specificClass = doc.DocumentNode.SelectNodes("//div[@class='content']");
// Advanced XPath queries
var linksWithText = doc.DocumentNode.SelectNodes("//a[text()]");
var externalLinks = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'http')]");
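// Second <p> in document order; note that "//p[2]" would instead match
// any <p> that is the second <p> child of its own parent
var secondParagraph = doc.DocumentNode.SelectSingleNode("(//p)[2]");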
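Positional predicates in XPath are easy to misread, so a small comparison may help. In the sketch below (HtmlAgilityPack NuGet package assumed, sample markup invented), "//p[2]" applies the index per parent element, whereas "(//p)[2]" applies it to the whole result set in document order.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Sample markup invented for this sketch
doc.LoadHtml("<div><p>a1</p></div><div><p>b1</p><p>b2</p></div>");

// Index applied per parent: a <p> that is the second <p> child of its parent
HtmlNode secondInParent = doc.DocumentNode.SelectSingleNode("//p[2]");

// Parentheses apply the index to all matches in document order
HtmlNode secondInDocument = doc.DocumentNode.SelectSingleNode("(//p)[2]");

Console.WriteLine(secondInParent.InnerText);
Console.WriteLine(secondInDocument.InnerText);
```

With this markup the two queries should select different paragraphs: the first picks the paragraph containing b2, the second the one containing b1.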
CSS Selector Extension (Fizzler)
// Install-Package Fizzler.Systems.HtmlAgilityPack
using Fizzler.Systems.HtmlAgilityPack;
// Use CSS selectors
var elements = doc.DocumentNode.QuerySelectorAll(".content p");
var firstMatch = doc.DocumentNode.QuerySelector("#main-content");
Error Handling and Validation
using System.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
// Check for parse errors
if (doc.ParseErrors.Any())
{
    foreach (var error in doc.ParseErrors)
    {
        Console.WriteLine($"Parse error: {error.Reason} at line {error.Line}");
    }
}
// Safe node selection: returns null for both "no match" and "invalid XPath"
HtmlNode GetSafeNode(string xpath)
{
    try
    {
        return doc.DocumentNode.SelectSingleNode(xpath);
    }
    catch (XPathException ex)
    {
        Console.WriteLine($"Invalid XPath: {ex.Message}");
        return null;
    }
}
Performance Optimization
Efficient Node Selection
// Cache frequently used nodes
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
// Use specific selectors instead of broad searches
var specificElement = body.SelectSingleNode(".//div[@id='specific-id']");
// Limit search scope
var tableRows = doc.DocumentNode
.SelectSingleNode("//table[@class='data']")
?.SelectNodes(".//tr");
Memory Management
// HtmlWeb and HtmlDocument do not implement IDisposable, so a using block
// does not apply. Keep large documents in a narrow scope instead and let
// the garbage collector reclaim them.
void ProcessLargePage(string url)
{
    var web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
    // Process nodes; doc and nodes become collectable once the method returns
}
ProcessLargePage("https://large-site.com");
Integration with Modern .NET
Html Agility Pack works well with modern .NET features and patterns such as async/await and LINQ. Keep in mind that it parses only the HTML it is given and does not execute JavaScript, so content that loads after the initial page load has to be obtained by other means (for example, a headless browser) before it can be parsed.
Async/Await Pattern
public async Task<List<string>> ExtractLinksAsync(string url)
{
    using var client = new HttpClient();
    string html = await client.GetStringAsync(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null when nothing matches, hence the fallback
    return doc.DocumentNode
        .SelectNodes("//a[@href]")
        ?.Select(node => node.GetAttributeValue("href", ""))
        .ToList() ?? new List<string>();
}
LINQ Integration
// Use LINQ with Html Agility Pack
var imageInfo = doc.DocumentNode
.Descendants("img")
.Where(img => !string.IsNullOrEmpty(img.GetAttributeValue("src", "")))
.Select(img => new
{
Src = img.GetAttributeValue("src", ""),
Alt = img.GetAttributeValue("alt", ""),
Width = img.GetAttributeValue("width", "")
})
.ToList();
Real-World Use Cases
Data Extraction from Tables
public List<Dictionary<string, string>> ExtractTableData(HtmlDocument doc)
{
    var result = new List<Dictionary<string, string>>();
    var table = doc.DocumentNode.SelectSingleNode("//table");
    if (table == null) return result; // No table in the document
    var headers = table.SelectNodes(".//th")?.Select(th => th.InnerText.Trim()).ToList();
    // Select rows that contain data cells; this also works when the
    // markup has no explicit <tbody>
    var rows = table.SelectNodes(".//tr[td]");
    if (headers != null && rows != null)
    {
        foreach (var row in rows)
        {
            var cells = row.SelectNodes(".//td");
            var rowData = new Dictionary<string, string>();
            for (int i = 0; i < Math.Min(headers.Count, cells?.Count ?? 0); i++)
            {
                rowData[headers[i]] = cells[i].InnerText.Trim();
            }
            result.Add(rowData);
        }
    }
    return result;
}
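One thing to keep in mind when selecting table rows: browsers insert a tbody element into the DOM even when the markup omits it, but Html Agility Pack keeps the tree as written. The sketch below (HtmlAgilityPack NuGet package assumed, sample markup invented) compares a tbody-based query with one that selects rows by their cells.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Sample markup invented for this sketch: a table written without <tbody>
doc.LoadHtml("<table><tr><th>Name</th></tr><tr><td>Widget</td></tr></table>");
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");

// No <tbody> exists in the parsed tree, so this query finds nothing
HtmlNodeCollection viaTbody = table.SelectNodes(".//tbody/tr");

// Selecting rows that contain <td> cells does not depend on <tbody>
HtmlNodeCollection dataRows = table.SelectNodes(".//tr[td]");

Console.WriteLine(viaTbody == null ? "tbody query: no match" : "tbody query: matched");
Console.WriteLine($"data rows: {dataRows.Count}");
```

This is why an XPath copied from browser developer tools may return nothing when run against the raw HTML.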
Form Data Extraction
public class FormExtractor
{
    public FormData ExtractFormData(HtmlDocument doc, string formSelector = "//form")
    {
        var form = doc.DocumentNode.SelectSingleNode(formSelector);
        if (form == null) return null;
        var formData = new FormData
        {
            Action = form.GetAttributeValue("action", ""),
            Method = form.GetAttributeValue("method", "GET").ToUpperInvariant(),
            Fields = new Dictionary<string, string>()
        };
        var inputs = form.SelectNodes(".//input | .//select | .//textarea");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                var name = input.GetAttributeValue("name", "");
                // Only the value attribute is read here; <select> and <textarea>
                // keep their values in child nodes, so they yield an empty string
                var value = input.GetAttributeValue("value", "");
                if (!string.IsNullOrEmpty(name))
                {
                    formData.Fields[name] = value;
                }
            }
        }
        return formData;
    }
}
public class FormData
{
    public string Action { get; set; }
    public string Method { get; set; }
    public Dictionary<string, string> Fields { get; set; }
}
Best Practices and Common Patterns
Null-Safe Operations
// Always check for null when selecting nodes
var node = doc.DocumentNode.SelectSingleNode("//div[@class='content']");
if (node != null)
{
    var text = node.InnerText;
    // Process text
}
// Use null-conditional operators
var title = doc.DocumentNode
    .SelectSingleNode("//title")
    ?.InnerText
    ?.Trim();
// Safe collection iteration
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links?.Count > 0)
{
    foreach (var link in links)
    {
        // Process each link
    }
}
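Because SelectNodes returns null rather than an empty collection when nothing matches, the null checks above tend to get repeated. One possible way to centralize them is a small wrapper; SelectNodesOrEmpty below is a hypothetical helper written for this sketch, not part of Html Agility Pack, and the sample markup is invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack:
// converts a null result into an empty sequence
IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlNode node, string xpath)
{
    HtmlNodeCollection found = node.SelectNodes(xpath);
    return found ?? Enumerable.Empty<HtmlNode>();
}

var doc = new HtmlDocument();
// Sample markup invented for this sketch: it contains no links
doc.LoadHtml("<p>No links here</p>");

// No null check needed before enumerating or counting
int linkCount = SelectNodesOrEmpty(doc.DocumentNode, "//a[@href]").Count();
Console.WriteLine($"links found: {linkCount}");
```

Callers can then use foreach or LINQ directly on the result without guarding against null.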
Efficient Bulk Operations
// Process multiple elements efficiently
// (ProcessNode is a placeholder for your own per-node logic)
var processableNodes = doc.DocumentNode
    .SelectNodes("//div[@class='item']")
    ?.Where(node => !string.IsNullOrEmpty(node.InnerText.Trim()))
    .ToList();
if (processableNodes?.Count > 0)
{
    var results = processableNodes
        .AsParallel() // Parallelize only if ProcessNode reads nodes without modifying the document
        .Select(ProcessNode)
        .Where(result => result != null)
        .ToList();
}
Conclusion
Html Agility Pack provides a robust set of classes and methods for HTML parsing and manipulation in .NET applications. The core classes (HtmlDocument, HtmlNode, HtmlNodeCollection, and HtmlAttribute) offer comprehensive functionality for most web scraping and HTML processing scenarios. Understanding these fundamental components and their methods enables developers to build efficient and reliable HTML processing solutions.
Whether you're extracting data from web pages, cleaning HTML content, or building web scraping applications, Html Agility Pack's intuitive API and powerful XPath support make it an excellent choice for .NET developers working with HTML content. The library's flexibility and performance make it suitable for everything from simple HTML parsing tasks to complex web scraping operations that require sophisticated DOM manipulation.