How to Handle HTML Entities with Html Agility Pack
HTML entities are escape sequences, such as &amp; for an ampersand or &copy; for ©, that stand in for reserved HTML characters and for symbols that are awkward to type or encode directly. When scraping with Html Agility Pack in C#, decoding them correctly is what keeps extracted text accurate.
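Decoding Entities in Node Text
Html Agility Pack keeps entities in the parsed text as they were written in the source. InnerText strips the markup but does not decode entities by itself, so pass its result through HtmlEntity.DeEntitize to get the actual characters:
using System;
using HtmlAgilityPack;

var html = @"
<div class='content'>
    <p>Price: $100 &amp; up</p>
    <p>Size: &lt; 5MB</p>
    <p>Rating: 4 &gt; 3 stars</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var paragraphs = doc.DocumentNode.SelectNodes("//p");
foreach (var p in paragraphs)
{
    // InnerText returns the source text; DeEntitize turns entities into characters
    Console.WriteLine(HtmlEntity.DeEntitize(p.InnerText));
}

// Output:
// Price: $100 & up
// Size: < 5MB
// Rating: 4 > 3 stars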
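Whether a given release hands back entities raw or decoded from InnerText is easy to check directly. The following is a minimal sketch, with sample markup made up for illustration, that prints the raw InnerText next to the result of HtmlEntity.DeEntitize so the two can be compared on your installed version:

```csharp
using System;
using HtmlAgilityPack;

var checkDoc = new HtmlDocument();
checkDoc.LoadHtml("<p>Fish &amp; chips &lt; &#163;10</p>");
var node = checkDoc.DocumentNode.SelectSingleNode("//p");

string raw = node.InnerText;                  // text as Html Agility Pack stores it
string decoded = HtmlEntity.DeEntitize(raw);  // entities replaced by characters

Console.WriteLine($"Raw:     {raw}");
Console.WriteLine($"Decoded: {decoded}");
Console.WriteLine($"Same:    {raw == decoded}");
```

If the last line prints False, InnerText alone is not enough and the DeEntitize call is doing real work.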
Common HTML Entities
Here are the most frequently encountered entities and their decoded values:
using System;
using HtmlAgilityPack;

var entitiesHtml = @"
<div>
    <p>&amp; - Ampersand</p>
    <p>&lt; - Less than</p>
    <p>&gt; - Greater than</p>
    <p>&quot; - Double quote</p>
    <p>&#39; - Single quote</p>
    <p>&nbsp; - Non-breaking space</p>
    <p>&copy; - Copyright</p>
    <p>&euro; - Euro symbol</p>
    <p>&mdash; - Em dash</p>
    <p>&ndash; - En dash</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(entitiesHtml);

var items = doc.DocumentNode.SelectNodes("//p");
foreach (var item in items)
{
    // InnerHtml shows the entity as written; DeEntitize shows the character it stands for
    Console.WriteLine($"Original: {item.InnerHtml}");
    Console.WriteLine($"Decoded: {HtmlEntity.DeEntitize(item.InnerText)}");
    Console.WriteLine();
}
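One entry in that list deserves care: &nbsp; decodes to the non-breaking space U+00A0, not to an ordinary space, so string comparisons and splits on ' ' can fail unexpectedly. A minimal sketch, using a made-up input string, that normalizes it after decoding:

```csharp
using System;
using HtmlAgilityPack;

string decodedNbsp = HtmlEntity.DeEntitize("Total:&nbsp;42&nbsp;items");

// U+00A0 looks like a space when printed but is a different character from U+0020
bool hasNbsp = decodedNbsp.IndexOf('\u00A0') >= 0;
string normalized = decodedNbsp.Replace('\u00A0', ' ');

Console.WriteLine(hasNbsp);
Console.WriteLine(normalized == "Total: 42 items");
```

Whether to normalize is a design choice: keep U+00A0 if the text will be rendered as HTML again, replace it if the text feeds comparisons or parsing.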
Manual Entity Encoding
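When you need to encode text for HTML output, use HtmlEntity.Entitize. The single-argument overload only encodes non-ASCII characters; to also escape &, <, > and double quotes, pass true for the entitizeQuotAmpAndLtGt parameter:
using System;
using HtmlAgilityPack;

string userInput = "Search for: cats & dogs < pets > animals";

// Arguments: text, useNames (prefer named entities), entitizeQuotAmpAndLtGt (escape " & < >)
string safeHtml = HtmlEntity.Entitize(userInput, true, true);
Console.WriteLine(safeHtml);
// Output: Search for: cats &amp; dogs &lt; pets &gt; animals

// Create safe HTML content
var template = $"<p>{safeHtml}</p>";
Console.WriteLine(template);
// Output: <p>Search for: cats &amp; dogs &lt; pets &gt; animals</p>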
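A round trip is a quick way to confirm which characters an Entitize overload actually escapes. This sketch assumes the three-argument overload, whose last flag controls whether the quote, ampersand and angle brackets are escaped; the sample string is made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

string original = "cats & dogs < pets > \"animals\" ©";

// useNames: true -> prefer named entities; entitizeQuotAmpAndLtGt: true -> escape " & < >
string encoded = HtmlEntity.Entitize(original, true, true);
string roundTripped = HtmlEntity.DeEntitize(encoded);

Console.WriteLine(encoded);
Console.WriteLine(roundTripped == original);

// Sanity check: no markup-significant characters should survive encoding
bool hasRawMarkup = encoded.IndexOf('<') >= 0 || encoded.IndexOf('>') >= 0 || encoded.IndexOf('"') >= 0;
Console.WriteLine(hasRawMarkup);
```

If you would rather not depend on overload details, System.Net.WebUtility.HtmlEncode from the base class library is a common alternative for escaping text.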
Explicit Entity Decoding
HtmlEntity.DeEntitize replaces named and numeric entities in a string with the characters they represent. Apply it to InnerText, or to any other string that may still contain entities:
using System;
using HtmlAgilityPack;

var html = @"<article>
    <h1>Caf&eacute; Menu</h1>
    <p>Prices in &euro; &amp; &pound;</p>
    <p>Copyright &copy; 2024</p>
</article>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var title = doc.DocumentNode.SelectSingleNode("//h1");
var content = doc.DocumentNode.SelectNodes("//p");

// Decode the heading text
Console.WriteLine($"Title: {HtmlEntity.DeEntitize(title.InnerText)}");

// Decode each paragraph once; decoding twice would also unescape text like &amp;lt;
foreach (var p in content)
{
    string decoded = HtmlEntity.DeEntitize(p.InnerText);
    Console.WriteLine($"Content: {decoded}");
}

// Output:
// Title: Café Menu
// Content: Prices in € & £
// Content: Copyright © 2024
Handling Custom and Numeric Entities
HtmlEntity.DeEntitize decodes decimal and hexadecimal character references, but it leaves names it does not recognize untouched. Map application-specific entities yourself before decoding the rest:
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var html = @"<div>
    <p>Custom entity: &customsymbol;</p>
    <p>Numeric entities: &#8364; &#8482; &#174;</p>
    <p>Hex entities: &#x2122; &#xAE;</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Define custom entity mappings
var customEntities = new Dictionary<string, string>
{
    {"&customsymbol;", "★"},
    {"&trademark;", "™"},
    {"&registered;", "®"}
};

foreach (var p in doc.DocumentNode.SelectNodes("//p"))
{
    string text = p.InnerText;

    // Apply custom entity replacements on the raw text first
    foreach (var entity in customEntities)
    {
        text = text.Replace(entity.Key, entity.Value);
    }

    // Then decode standard named and numeric entities
    text = HtmlEntity.DeEntitize(text);
    Console.WriteLine($"Processed: {text}");
}

// Output:
// Processed: Custom entity: ★
// Processed: Numeric entities: € ™ ®
// Processed: Hex entities: ™ ®
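When the custom map grows, a single regular-expression pass can replace the chain of Replace calls. The sketch below is one possible approach, not part of Html Agility Pack: the helper name DecodeWithCustomEntities and the entity names are hypothetical, and keys are stored without the leading & and trailing ;.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical application-specific entities, keyed by bare name
var customMap = new Dictionary<string, string>
{
    ["customsymbol"] = "★",
    ["trademark"] = "™",
};

string DecodeWithCustomEntities(string raw)
{
    // Resolve known custom names in one pass; every other &name; is left as-is
    string replaced = Regex.Replace(raw, @"&([A-Za-z][A-Za-z0-9]*);", m =>
        customMap.TryGetValue(m.Groups[1].Value, out var value) ? value : m.Value);

    // Then decode the standard named and numeric entities
    return HtmlEntity.DeEntitize(replaced);
}

string sample = DecodeWithCustomEntities("&customsymbol; 5 &lt; 10 &#8364;");
Console.WriteLine(sample);
```

Running the custom pass on the raw text, before DeEntitize, means an escaped ampersand such as &amp;customsymbol; is not mistaken for a custom entity.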
Real-World Web Scraping Example
Here's a practical example scraping product information with entity handling:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public class ProductScraper
{
    // Reuse one HttpClient for all requests instead of creating one per call
    private static readonly HttpClient Client = new HttpClient();

    public async Task<ProductInfo> ScrapeProduct(string url)
    {
        var html = await Client.GetStringAsync(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Extract product details, decoding entities in every field
        var product = new ProductInfo
        {
            Name = GetCleanText(doc, "//h1[@class='product-title']"),
            Price = GetCleanText(doc, "//span[@class='price']"),
            Description = GetCleanText(doc, "//div[@class='description']"),
            Features = GetFeatureList(doc)
        };
        return product;
    }

    private string GetCleanText(HtmlDocument doc, string xpath)
    {
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null) return string.Empty;

        // Decode first, then trim, so a leading or trailing &nbsp; is removed as well
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    private List<string> GetFeatureList(HtmlDocument doc)
    {
        var features = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches
        var nodes = doc.DocumentNode.SelectNodes("//ul[@class='features']//li");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                string feature = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (!string.IsNullOrEmpty(feature))
                {
                    features.Add(feature);
                }
            }
        }
        return features;
    }
}

public class ProductInfo
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string Description { get; set; }
    public List<string> Features { get; set; }
}
Character Encoding Considerations
Entities are plain ASCII, so they survive any encoding, but literal non-ASCII characters such as é or € do not. Specify the correct encoding when loading a document so those characters are not garbled:
using HtmlAgilityPack;
using System.IO;
using System.Text;

var doc = new HtmlDocument();

// From file with an explicit encoding
doc.Load("webpage.html", Encoding.UTF8);

// From file with encoding detection, falling back to UTF-8 if nothing is detected
var encoding = doc.DetectEncoding("webpage.html") ?? Encoding.UTF8;
var html = File.ReadAllText("webpage.html", encoding);
doc.LoadHtml(html);

// Default encoding for later file or stream loads that do not specify one
doc.OptionDefaultStreamEncoding = Encoding.UTF8;
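To see why the encoding matters for literal characters but not for entities, the following sketch uses only System.Text and made-up sample strings. It decodes the same bytes with two encodings and compares the results:

```csharp
using System;
using System.Text;

string text = "Café €5";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

// Same bytes, two interpretations: only the matching encoding restores the text
string correct = Encoding.UTF8.GetString(utf8Bytes);
string wrong = latin1.GetString(utf8Bytes);

Console.WriteLine(correct == text);   // True
Console.WriteLine(wrong == text);     // False

// The entity form is pure ASCII, so both encodings read it identically
byte[] entityBytes = Encoding.ASCII.GetBytes("Caf&eacute; &euro;5");
bool sameEitherWay = Encoding.UTF8.GetString(entityBytes) == latin1.GetString(entityBytes);
Console.WriteLine(sameEitherWay);     // True
```

This is the failure that shows up as sequences like Ã© in scraped text: the bytes were fine, but they were read with the wrong encoding before any entity decoding took place.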
Best Practices
- Decode Explicitly: InnerText strips markup but leaves entities encoded, so wrap it in HtmlEntity.DeEntitize when you need the actual characters
- Encode Output: When placing user input into HTML, call HtmlEntity.Entitize with entitizeQuotAmpAndLtGt set to true so that &, <, > and double quotes are escaped
- Decode Once: Apply HtmlEntity.DeEntitize a single time per string; decoding twice turns text such as &amp;lt; into <
- Custom Entities: Maintain a dictionary for application-specific entity mappings
- Encoding Awareness: Always consider the document's character encoding when processing international content
Together, these practices keep scraped text accurate across different character sets and encodings.