How do I parse HTML from a string using Html Agility Pack?
Html Agility Pack is one of the most popular and powerful HTML parsing libraries for .NET developers. Unlike browser-based automation tools, it provides a lightweight solution for parsing HTML content directly from strings, making it ideal for web scraping, data extraction, and HTML manipulation tasks.
What is Html Agility Pack?
Html Agility Pack is a .NET library that parses HTML into a DOM-like tree of HtmlNode objects. It tolerates malformed HTML and supports XPath queries as well as standard LINQ over its node collections (the object model resembles System.Xml, but it is not LINQ to XML), which makes it versatile for a range of HTML parsing scenarios.
Installation and Setup
Before parsing HTML strings, you need to install Html Agility Pack in your .NET project:
Using NuGet Package Manager
Install-Package HtmlAgilityPack
Using .NET CLI
dotnet add package HtmlAgilityPack
Using PackageReference in .csproj
<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />
Basic HTML String Parsing
Simple String Parsing
The most straightforward way to parse HTML from a string is using the HtmlDocument class:
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string htmlString = @"
            <html>
            <head><title>Sample Page</title></head>
            <body>
                <div class='container'>
                    <h1 id='main-title'>Welcome to My Site</h1>
                    <p>This is a sample paragraph.</p>
                    <ul>
                        <li>Item 1</li>
                        <li>Item 2</li>
                        <li>Item 3</li>
                    </ul>
                </div>
            </body>
            </html>";

        // Create HtmlDocument instance
        HtmlDocument doc = new HtmlDocument();

        // Load HTML from string
        doc.LoadHtml(htmlString);

        // Access the document root
        HtmlNode rootNode = doc.DocumentNode;
        Console.WriteLine("Document parsed successfully!");
    }
}
Extracting Specific Elements
Once you've loaded the HTML string, you can extract specific elements using various selection methods:
// Extract title
HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
string title = titleNode?.InnerText ?? "No title found";
Console.WriteLine($"Title: {title}");

// Extract main heading
HtmlNode h1Node = doc.DocumentNode.SelectSingleNode("//h1[@id='main-title']");
string heading = h1Node?.InnerText ?? "No heading found";
Console.WriteLine($"Heading: {heading}");

// Extract all list items
// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection listItems = doc.DocumentNode.SelectNodes("//li");
if (listItems != null)
{
    foreach (HtmlNode item in listItems)
    {
        Console.WriteLine($"List item: {item.InnerText}");
    }
}
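One detail worth knowing when reading text this way: InnerText returns the text as it is written in the markup, so character references such as &amp;amp; are not decoded. The following is a minimal sketch of decoding them with HtmlEntity.DeEntitize. The class TextExtraction and the method GetDecodedText are hypothetical names used only for illustration; they are not part of Html Agility Pack.

```csharp
using HtmlAgilityPack;

public static class TextExtraction
{
    // Hypothetical helper (not part of Html Agility Pack): returns the decoded,
    // trimmed text of the first node matching the XPath, or an empty string
    // when nothing matches.
    public static string GetDecodedText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return string.Empty;
        }

        // InnerText keeps entities such as &amp; as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

For example, TextExtraction.GetDecodedText("<p>Fish &amp; Chips</p>", "//p") returns "Fish & Chips", whereas InnerText alone would return "Fish &amp; Chips".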
Advanced Parsing Techniques
Using CSS Selectors with QuerySelector
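The core Html Agility Pack package queries with XPath and does not ship QuerySelector methods itself. CSS selector support comes from a separate extension package, such as HtmlAgilityPack.CssSelectors or Fizzler.Systems.HtmlAgilityPack, which adds QuerySelector and QuerySelectorAll extension methods to HtmlNode:
// Requires a CSS selector extension package in addition to HtmlAgilityPack.
// With Fizzler, for example, add: using Fizzler.Systems.HtmlAgilityPack;
// IEnumerable<T> also needs: using System.Collections.Generic;

// Select by class
HtmlNode containerDiv = doc.DocumentNode.QuerySelector(".container");

// Select by ID
HtmlNode mainTitle = doc.DocumentNode.QuerySelector("#main-title");

// Select multiple elements
IEnumerable<HtmlNode> paragraphs = doc.DocumentNode.QuerySelectorAll("p");

// Complex selectors
HtmlNode firstListItem = doc.DocumentNode.QuerySelector("ul li:first-child");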
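The same selections can also be written in XPath, which needs nothing beyond the core package. The sketch below shows one possible XPath counterpart for each selector above. The class XPathSelectors and its method names are hypothetical, and the helpers assume the class name and id contain no quote characters.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathSelectors
{
    // Hypothetical helper: XPath counterpart of ".className". Matches the class
    // name as a whole word inside a space-separated class attribute.
    public static List<HtmlNode> ByClass(HtmlNode root, string className)
    {
        string xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                       + className + " ')]";
        HtmlNodeCollection nodes = root.SelectNodes(xpath); // null when nothing matches
        return nodes == null ? new List<HtmlNode>() : nodes.ToList();
    }

    // Hypothetical helper: XPath counterpart of "#id". Returns null when absent.
    public static HtmlNode ById(HtmlNode root, string id)
    {
        return root.SelectSingleNode("//*[@id='" + id + "']");
    }

    // Hypothetical helper: XPath counterpart of "ul li:first-child", i.e. an <li>
    // inside a <ul> that has no preceding sibling element.
    public static HtmlNode FirstListItem(HtmlNode root)
    {
        return root.SelectSingleNode("//ul//li[not(preceding-sibling::*)]");
    }
}
```

The concat/normalize-space idiom is there so that a search for "container" does not also match a class such as "container-wide", which a plain contains(@class, 'container') would.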
Extracting Attributes
You can easily extract HTML attributes from parsed elements:
string htmlWithAttributes = @"
    <div>
        <img src='image1.jpg' alt='Sample Image' class='responsive' />
        <a href='https://example.com' target='_blank'>External Link</a>
    </div>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlWithAttributes);

// Extract image attributes
HtmlNode imgNode = doc.DocumentNode.SelectSingleNode("//img");
if (imgNode != null)
{
    string src = imgNode.GetAttributeValue("src", "");
    string alt = imgNode.GetAttributeValue("alt", "");
    string cssClass = imgNode.GetAttributeValue("class", "");
    Console.WriteLine($"Image: src={src}, alt={alt}, class={cssClass}");
}

// Extract link attributes
HtmlNode linkNode = doc.DocumentNode.SelectSingleNode("//a");
if (linkNode != null)
{
    string href = linkNode.GetAttributeValue("href", "");
    string target = linkNode.GetAttributeValue("target", "");
    Console.WriteLine($"Link: href={href}, target={target}");
}
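In scraping work the href values are often relative, so a common next step is resolving them against the page address. The following is a minimal sketch of that idea, assuming the caller knows the base URL. LinkExtraction and GetAbsoluteLinks are hypothetical names, and decoding the attribute with HtmlEntity.DeEntitize is a defensive assumption for hrefs written with &amp;amp; in their query strings.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtraction
{
    // Hypothetical helper: returns an absolute URL for every <a> element that has
    // an href attribute, resolved against baseUrl. Hrefs that cannot be turned
    // into a URI are skipped.
    public static List<string> GetAbsoluteLinks(string html, string baseUrl)
    {
        var results = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The [@href] predicate filters out anchors without an href attribute.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return results;
        }

        var baseUri = new Uri(baseUrl);
        foreach (HtmlNode anchor in anchors)
        {
            // Assumption: decode entities so "&amp;" in a query string becomes "&".
            string href = HtmlEntity.DeEntitize(anchor.GetAttributeValue("href", "")).Trim();

            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                results.Add(absolute.AbsoluteUri);
            }
        }
        return results;
    }
}
```

Called with a base of "https://example.com/docs/", an href of "/about" resolves to "https://example.com/about", while an href that is already absolute is returned unchanged.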
Handling Tables and Structured Data
Html Agility Pack excels at parsing structured data like tables:
// Requires: using System; using System.Linq; using HtmlAgilityPack;
string tableHtml = @"
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>City</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>John Doe</td>
                <td>30</td>
                <td>New York</td>
            </tr>
            <tr>
                <td>Jane Smith</td>
                <td>25</td>
                <td>London</td>
            </tr>
        </tbody>
    </table>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(tableHtml);

// Extract table headers (headers stays null if there are no <th> elements)
var headers = doc.DocumentNode
    .SelectNodes("//th")
    ?.Select(th => th.InnerText.Trim())
    .ToList();

// Extract table data
var rows = doc.DocumentNode.SelectNodes("//tbody/tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("td")
            ?.Select(td => td.InnerText.Trim())
            .ToArray();

        if (cells != null && headers != null)
        {
            // Pair each header with the cell in the same column
            for (int i = 0; i < Math.Min(headers.Count, cells.Length); i++)
            {
                Console.WriteLine($"{headers[i]}: {cells[i]}");
            }
            Console.WriteLine("---");
        }
    }
}
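Printing is fine for a demo, but most callers want the rows back as data. Below is a minimal sketch that returns each row as a dictionary keyed by column header. TableExtraction and ReadTable are hypothetical names, and the sketch assumes the markup contains explicit thead and tbody elements like the sample above.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableExtraction
{
    // Hypothetical helper: converts the first header row and all body rows of an
    // HTML table into a list of header -> cell text dictionaries.
    public static List<Dictionary<string, string>> ReadTable(string tableHtml)
    {
        var records = new List<Dictionary<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(tableHtml);

        HtmlNodeCollection headerNodes = doc.DocumentNode.SelectNodes("//thead//th");
        HtmlNodeCollection rowNodes = doc.DocumentNode.SelectNodes("//tbody/tr");
        if (headerNodes == null || rowNodes == null)
        {
            return records; // no headers or no rows: nothing to map
        }

        List<string> headers = headerNodes.Select(th => th.InnerText.Trim()).ToList();

        foreach (HtmlNode row in rowNodes)
        {
            HtmlNodeCollection cellNodes = row.SelectNodes("td");
            if (cellNodes == null)
            {
                continue; // skip rows without data cells
            }

            var record = new Dictionary<string, string>();
            // Stop at the shorter of the two so ragged rows do not throw.
            for (int i = 0; i < headers.Count && i < cellNodes.Count; i++)
            {
                record[headers[i]] = cellNodes[i].InnerText.Trim();
            }
            records.Add(record);
        }
        return records;
    }
}
```

With the sample table, ReadTable returns two records, and records[0]["Name"] is "John Doe". Tables written without a tbody element would need the row XPath adjusted (for example to "//tr[td]"), since the XPath only matches elements that are present in the markup.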
Error Handling and Robustness
Handling Malformed HTML
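One of Html Agility Pack's strengths is that it generally does not throw on malformed markup. It builds the best tree it can and records the problems it finds in ParseErrors:
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class HtmlParser
{
    public static HtmlDocument ParseHtmlString(string html)
    {
        try
        {
            var doc = new HtmlDocument();

            // Configure parser options
            doc.OptionFixNestedTags = true;
            doc.OptionAutoCloseOnEnd = true;
            // Only relevant when loading from a stream or file; a string is already decoded
            doc.OptionDefaultStreamEncoding = Encoding.UTF8;

            doc.LoadHtml(html);

            // Check for parsing errors (reported as warnings; the tree is still usable)
            if (doc.ParseErrors != null && doc.ParseErrors.Any())
            {
                foreach (var error in doc.ParseErrors)
                {
                    Console.WriteLine($"Parse warning: {error.Reason} at line {error.Line}");
                }
            }
            return doc;
        }
        catch (Exception ex)
        {
            // Reached for problems such as a null input string, not for bad markup
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
            throw;
        }
    }
}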
Safe Element Extraction
Implement safe extraction methods to prevent null reference exceptions:
// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
public static class HtmlExtensions
{
    // Extension methods can be called on a null reference, so each one guards with ?.
    public static string SafeInnerText(this HtmlNode node)
    {
        return node?.InnerText?.Trim() ?? string.Empty;
    }

    public static string SafeGetAttribute(this HtmlNode node, string attributeName, string defaultValue = "")
    {
        return node?.GetAttributeValue(attributeName, defaultValue) ?? defaultValue;
    }

    // Returns an empty list instead of null when nothing matches
    public static List<HtmlNode> SafeSelectNodes(this HtmlNode node, string xpath)
    {
        return node?.SelectNodes(xpath)?.ToList() ?? new List<HtmlNode>();
    }
}

// Usage example
string title = doc.DocumentNode.SelectSingleNode("//title").SafeInnerText();
string metaDescription = doc.DocumentNode
    .SelectSingleNode("//meta[@name='description']")
    .SafeGetAttribute("content");
Performance Optimization
Memory Management
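HtmlDocument does not implement IDisposable, so it cannot be wrapped in a using block. For large-scale parsing, create one document per input and drop every reference to it and its nodes once processed, so the garbage collector can reclaim the tree:
// Requires: using System; using System.Collections.Generic; using HtmlAgilityPack;
public class OptimizedHtmlParser
{
    public void ParseMultipleHtmlStrings(IEnumerable<string> htmlStrings)
    {
        foreach (string html in htmlStrings)
        {
            // Scoped to this iteration; eligible for collection once it ends
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Process document (copy out the values you need, not the nodes)
            ProcessDocument(doc);

            // Optional and usually unnecessary: forcing a collection is a last
            // resort, so measure before keeping this in production code.
            if (Environment.WorkingSet > 500_000_000) // 500MB threshold
            {
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
        }
    }

    private void ProcessDocument(HtmlDocument doc)
    {
        // Your processing logic here
    }
}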
Reusing HtmlDocument Instances
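When parsing many strings one after another, you can reuse a single configured HtmlDocument instance, which avoids re-applying options for every input (measure before assuming a large speedup). Each LoadHtml call replaces the previous content, so finish with the nodes from one parse before starting the next, and do not share the instance across threads:
public class ReusableHtmlParser
{
    private readonly HtmlDocument _document;

    public ReusableHtmlParser()
    {
        _document = new HtmlDocument();
        _document.OptionFixNestedTags = true;
        _document.OptionAutoCloseOnEnd = true;
    }

    // The returned node belongs to the shared document; treat it as stale
    // after the next call to ParseString.
    public HtmlNode ParseString(string html)
    {
        _document.LoadHtml(html);
        return _document.DocumentNode;
    }
}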
Comparison with Other Parsing Methods
Html Agility Pack parses markup only and does not execute JavaScript. For pages that build their content in the browser, a browser automation tool such as Selenium or Playwright, which renders the page before you read it, may be more appropriate.
However, for pure HTML parsing from strings, Html Agility Pack offers several advantages:
- Performance: Faster than browser automation for static HTML
- Memory efficiency: Lower resource usage
- Simplicity: No browser dependencies
- Reliability: Handles malformed HTML gracefully
Best Practices
- Always check for null values when working with selected nodes
- Use specific XPath or CSS selectors to improve performance
- Configure parser options based on your HTML quality expectations
- Implement proper error handling for production applications
- Consider encoding issues when dealing with international content
- Use LINQ for complex data transformations after parsing
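As an illustration of the LINQ point above, the sketch below enumerates nodes with Descendants and then filters and projects them with ordinary LINQ operators. LinqQueries and GetListItems are hypothetical names chosen for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqQueries
{
    // Hypothetical helper: returns the trimmed, non-empty text of every <li>
    // element in the HTML string.
    public static List<string> GetListItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants returns an empty sequence (never null) when nothing matches,
        // so no null check is needed before chaining LINQ operators.
        return doc.DocumentNode
            .Descendants("li")
            .Select(li => li.InnerText.Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }
}
```

Unlike SelectNodes, this style needs no null check, which can make LINQ the more convenient choice when a query is expected to come back empty some of the time.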
Conclusion
Html Agility Pack provides a robust and efficient solution for parsing HTML from strings in .NET applications. Its ability to handle malformed HTML, combined with powerful selection methods and excellent performance characteristics, makes it an ideal choice for web scraping and HTML processing tasks. Whether you're extracting data from web responses, processing HTML templates, or building content analysis tools, Html Agility Pack offers the flexibility and reliability needed for professional development.
For more complex scenarios involving dynamic content or JavaScript execution, consider complementing Html Agility Pack with browser automation tools, but for pure HTML string parsing, it remains one of the best choices available for .NET developers.