How do I handle HTML entities with Html Agility Pack?

How to Handle HTML Entities with Html Agility Pack

HTML entities are special character representations used for reserved HTML characters or symbols not easily displayable in standard encodings. When web scraping with Html Agility Pack in C#, proper entity handling ensures accurate text extraction and prevents data corruption.

Decoding Entities in Node Text

Html Agility Pack keeps entities exactly as they appear in the source: InnerText returns the raw text, so &amp; stays &amp;. To get the decoded characters, pass the text through HtmlEntity.DeEntitize:

using System;
using HtmlAgilityPack;

var html = @"
<div class='content'>
    <p>Price: $100 &amp; up</p>
    <p>Size: &lt; 5MB</p>
    <p>Rating: 4 &gt; 3 stars</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var paragraphs = doc.DocumentNode.SelectNodes("//p");
foreach (var p in paragraphs)
{
    // InnerText is still entity-encoded; DeEntitize turns it into plain characters
    Console.WriteLine(HtmlEntity.DeEntitize(p.InnerText));
}
// Output:
// Price: $100 & up
// Size: < 5MB
// Rating: 4 > 3 stars

Common HTML Entities

Here are the most frequently encountered entities and their decoded values:

using HtmlAgilityPack;

var entitiesHtml = @"
<div>
    <p>&amp; - Ampersand</p>
    <p>&lt; - Less than</p>
    <p>&gt; - Greater than</p>
    <p>&quot; - Double quote</p>
    <p>&#39; - Single quote</p>
    <p>&nbsp; - Non-breaking space</p>
    <p>&copy; - Copyright</p>
    <p>&euro; - Euro symbol</p>
    <p>&mdash; - Em dash</p>
    <p>&ndash; - En dash</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(entitiesHtml);

var items = doc.DocumentNode.SelectNodes("//p");
foreach (var item in items)
{
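    Console.WriteLine($"Original: {item.InnerHtml}");
    // InnerText alone would print the same raw text; DeEntitize performs the decoding
    Console.WriteLine($"Decoded:  {HtmlEntity.DeEntitize(item.InnerText)}");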
    Console.WriteLine();
}
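
One entity in this list deserves extra care: &nbsp; decodes to U+00A0, which is not the ordinary space character and can quietly break string comparisons or splitting. The sketch below decodes the text and then collapses all whitespace into single ASCII spaces. It uses the framework's System.Net.WebUtility.HtmlDecode as a stand-in for HtmlEntity.DeEntitize so that it runs without the NuGet package; EntityTextHelper and NormalizeText are hypothetical names, not part of Html Agility Pack.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityTextHelper
{
    // Hypothetical helper: decode entities, then collapse every run of
    // whitespace (including U+00A0 produced by &nbsp;) into one ASCII space.
    public static string NormalizeText(string rawHtmlText)
    {
        if (string.IsNullOrEmpty(rawHtmlText)) return string.Empty;

        // Stand-in for HtmlEntity.DeEntitize (assumption made for portability)
        string decoded = WebUtility.HtmlDecode(rawHtmlText);

        // In .NET, \s also matches Unicode space separators such as U+00A0
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        string raw = "Price:&nbsp;&euro;10 &amp;  up";
        Console.WriteLine(EntityTextHelper.NormalizeText(raw));
        // Prints: Price: €10 & up
    }
}
```

In a scraper, the same normalization could be applied to the result of HtmlEntity.DeEntitize(node.InnerText) before storing or comparing values.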

Manual Entity Encoding

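When you need to encode text for HTML output, use HtmlEntity.Entitize. The single-argument overload only converts non-ASCII characters; to also escape &, <, > and ", pass true as the third argument:

using System;
using HtmlAgilityPack;

string userInput = "Search for: cats & dogs < pets > animals";
// Second argument: use named entities; third argument: also escape ", &, < and >
string safeHtml = HtmlEntity.Entitize(userInput, true, true);

Console.WriteLine(safeHtml);
// Output: Search for: cats &amp; dogs &lt; pets &gt; animals

// Create safe HTML content
var template = $"<p>{safeHtml}</p>";
Console.WriteLine(template);
// Output: <p>Search for: cats &amp; dogs &lt; pets &gt; animals</p>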
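
When the text also ends up inside an attribute value, quotes need escaping as well. As an alternative sketch that relies only on the base class library rather than Html Agility Pack, System.Net.WebUtility.HtmlEncode escapes &, <, > and both quote characters. The SafeMarkup class and its BuildLink method are hypothetical and exist only for illustration.

```csharp
using System;
using System.Net;

public static class SafeMarkup
{
    // Hypothetical helper: builds an anchor tag from untrusted pieces.
    public static string BuildLink(string url, string label)
    {
        // HtmlEncode escapes &, <, >, " and ' so the values cannot
        // terminate the attribute or open a new tag.
        string safeUrl = WebUtility.HtmlEncode(url);
        string safeLabel = WebUtility.HtmlEncode(label);
        return $"<a href=\"{safeUrl}\">{safeLabel}</a>";
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(SafeMarkup.BuildLink("/search?q=cats&sort=asc", "cats & \"dogs\""));
        // Prints: <a href="/search?q=cats&amp;sort=asc">cats &amp; &quot;dogs&quot;</a>
    }
}
```

This only escapes markup characters; it does not validate the URL itself, so scheme checks still have to happen separately.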

Explicit Entity Decoding

HtmlEntity.DeEntitize converts named and numeric entities back to characters. Apply it to whatever you read from InnerText:

using System;
using HtmlAgilityPack;

var html = @"<article>
    <h1>Caf&eacute; Menu</h1>
    <p>Prices in &euro; &amp; &pound;</p>
    <p>Copyright &copy; 2024</p>
</article>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var title = doc.DocumentNode.SelectSingleNode("//h1");
var content = doc.DocumentNode.SelectNodes("//p");

// InnerText keeps the entities, so decode the title explicitly
Console.WriteLine($"Title: {HtmlEntity.DeEntitize(title.InnerText)}");

// Decode each paragraph the same way
foreach (var p in content)
{
    string decoded = HtmlEntity.DeEntitize(p.InnerText);
    Console.WriteLine($"Content: {decoded}");
}
// Output:
// Title: Café Menu
// Content: Prices in € & £
// Content: Copyright © 2024

Handling Custom and Numeric Entities

For non-standard entities or specific numeric character references:

using HtmlAgilityPack;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var html = @"<div>
    <p>Custom entity: &customsymbol;</p>
    <p>Numeric entities: &#8364; &#8482; &#174;</p>
    <p>Hex entities: &#x2122; &#x00AE;</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Define custom entity mappings
var customEntities = new Dictionary<string, string>
{
    {"&customsymbol;", "★"},
    {"&trademark;", "™"},
    {"&registered;", "®"}
};

foreach (var p in doc.DocumentNode.SelectNodes("//p"))
{
    // InnerText returns the raw text, so custom entities are still intact here
    string text = p.InnerText;

    // Apply custom entity replacements first
    foreach (var entity in customEntities)
    {
        text = text.Replace(entity.Key, entity.Value);
    }

    // Then decode the standard named, decimal, and hex entities
    text = HtmlEntity.DeEntitize(text);
    Console.WriteLine($"Processed: {text}");
}
// Output:
// Processed: Custom entity: ★
// Processed: Numeric entities: € ™ ®
// Processed: Hex entities: ™ ®

Real-World Web Scraping Example

Here's a practical example scraping product information with entity handling:

using HtmlAgilityPack;
using System;
using System.Collections.Generic; // needed for List<string>
using System.Net.Http;
using System.Threading.Tasks;

public class ProductScraper
{
    public async Task<ProductInfo> ScrapeProduct(string url)
    {
        using var client = new HttpClient();
        var html = await client.GetStringAsync(url);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Extract product details with proper entity handling
        var product = new ProductInfo
        {
            Name = GetCleanText(doc, "//h1[@class='product-title']"),
            Price = GetCleanText(doc, "//span[@class='price']"),
            Description = GetCleanText(doc, "//div[@class='description']"),
            Features = GetFeatureList(doc)
        };

        return product;
    }

    private string GetCleanText(HtmlDocument doc, string xpath)
    {
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null) return string.Empty;

        // InnerText returns raw text, so decode entities before trimming
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    private List<string> GetFeatureList(HtmlDocument doc)
    {
        var features = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//ul[@class='features']//li");

        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                // Decode first, then trim, so a leading or trailing &nbsp; is removed too
                string feature = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (!string.IsNullOrEmpty(feature))
                {
                    features.Add(feature);
                }
            }
        }

        return features;
    }
}

public class ProductInfo
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string Description { get; set; }
    public List<string> Features { get; set; }
}

Character Encoding Considerations

When loading HTML documents, make sure the bytes are read with the correct character encoding. Entities such as &eacute; are plain ASCII and survive any encoding, but literal non-ASCII characters (é, €) are corrupted if the wrong one is used:

using System.IO;
using System.Text;
using HtmlAgilityPack;

var doc = new HtmlDocument();

// Fallback encoding for Load calls that do not specify one (set before loading)
doc.OptionDefaultStreamEncoding = Encoding.UTF8;

// From file with an explicit encoding
doc.Load("webpage.html", Encoding.UTF8);

// From bytes with encoding detection
var htmlBytes = File.ReadAllBytes("webpage.html");
Encoding encoding;
using (var stream = new MemoryStream(htmlBytes))
{
    // Fall back to UTF-8 if no encoding is detected
    encoding = doc.DetectEncoding(stream) ?? Encoding.UTF8;
}
var html = encoding.GetString(htmlBytes);
doc.LoadHtml(html);

Best Practices

  1. Decode Explicitly: InnerText returns raw text, so pass it through HtmlEntity.DeEntitize before using it
  2. Escape Output: When writing untrusted text into HTML, escape &, <, > and ", for example with the three-argument HtmlEntity.Entitize overload
  3. Decode Once: Decoding the same text twice corrupts content such as &amp;lt;, which should become &lt; and not <
  4. Custom Entities: Maintain a dictionary for application-specific entity mappings
  5. Encoding Awareness: Always consider the document's character encoding when processing international content

Together, these steps keep scraped text intact: entities are decoded exactly once, output is escaped before it is written back into HTML, and documents are read with the right character encoding.
