How do I use Html Agility Pack with C# async/await patterns?

Using Html Agility Pack with C# Async/Await Patterns

Html Agility Pack works seamlessly with C#'s async/await pattern for non-blocking web scraping operations. This approach is essential for keeping applications responsive, especially in UI frameworks (WPF, WinForms), where the UI thread must stay free, and in ASP.NET applications, where blocking request threads hurts scalability.

Installation

First, install the Html Agility Pack NuGet package:

# Using Package Manager Console
Install-Package HtmlAgilityPack

# Using .NET CLI
dotnet add package HtmlAgilityPack

Key Principles

  1. Async I/O Operations: Use HttpClient for asynchronous web requests
  2. Sync HTML Parsing: Html Agility Pack's parsing methods are synchronous (CPU-bound operations)
  3. Proper Resource Management: Dispose of the HttpClient when you're finished (the scraper class below implements IDisposable for this)
  4. Exception Handling: Handle network and parsing exceptions appropriately

Basic Implementation

Here's a complete example of an async web scraper:

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class AsyncWebScraper : IDisposable
{
    private readonly HttpClient _httpClient;

    public AsyncWebScraper()
    {
        _httpClient = new HttpClient();
        // Configure default headers
        _httpClient.DefaultRequestHeaders.Add("User-Agent", 
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
    }

    public async Task<List<string>> ExtractLinksAsync(string url)
    {
        try
        {
            // Async HTTP request
            var response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();

            var content = await response.Content.ReadAsStringAsync();

            // Sync HTML parsing (fast operation)
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(content);

            // Extract links
            var links = new List<string>();
            var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");

            if (linkNodes != null)
            {
                foreach (var link in linkNodes)
                {
                    var href = link.GetAttributeValue("href", string.Empty);
                    if (!string.IsNullOrEmpty(href))
                    {
                        links.Add(href);
                    }
                }
            }

            return links;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"HTTP error: {ex.Message}");
            throw;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Parsing error: {ex.Message}");
            throw;
        }
    }

    public void Dispose()
    {
        _httpClient?.Dispose();
    }
}

Advanced Usage Examples

Multiple Pages Concurrently

// Note: this method uses LINQ (Select, ToDictionary), so add using System.Linq; to the file
public async Task<Dictionary<string, List<string>>> ScrapeMultipleUrlsAsync(IEnumerable<string> urls)
{
    var tasks = urls.Select(async url =>
    {
        var links = await ExtractLinksAsync(url);
        return new { Url = url, Links = links };
    });

    var results = await Task.WhenAll(tasks);

    return results.ToDictionary(r => r.Url, r => r.Links);
}
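
Task.WhenAll starts every request at once, which can overwhelm a target site when the URL list is large. Below is a minimal sketch that caps concurrency with SemaphoreSlim; the ScrapeMultipleUrlsThrottledAsync name and the limit of 5 are illustrative choices, and it requires using System.Linq; and using System.Threading;:

public async Task<Dictionary<string, List<string>>> ScrapeMultipleUrlsThrottledAsync(
    IEnumerable<string> urls, int maxConcurrency = 5)
{
    using var semaphore = new SemaphoreSlim(maxConcurrency);

    var tasks = urls.Select(async url =>
    {
        await semaphore.WaitAsync();
        try
        {
            // At most maxConcurrency requests run at the same time
            return new { Url = url, Links = await ExtractLinksAsync(url) };
        }
        finally
        {
            semaphore.Release();
        }
    });

    var results = await Task.WhenAll(tasks);
    return results.ToDictionary(r => r.Url, r => r.Links);
}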

Data Extraction with Models

public class ProductInfo
{
    public string Name { get; set; }
    public decimal Price { get; set; }
    public string Description { get; set; }
}

public async Task<List<ProductInfo>> ExtractProductsAsync(string url)
{
    var response = await _httpClient.GetAsync(url);
    response.EnsureSuccessStatusCode();

    var content = await response.Content.ReadAsStringAsync();
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(content);

    var products = new List<ProductInfo>();
    var productNodes = htmlDoc.DocumentNode.SelectNodes("//div[@class='product']");

    if (productNodes != null)
    {
        foreach (var node in productNodes)
        {
            var product = new ProductInfo
            {
                Name = node.SelectSingleNode(".//h2")?.InnerText?.Trim(),
                Price = decimal.TryParse(
                    node.SelectSingleNode(".//span[@class='price']")?.InnerText?.Replace("$", ""), 
                    out var price) ? price : 0,
                Description = node.SelectSingleNode(".//p[@class='description']")?.InnerText?.Trim()
            };

            products.Add(product);
        }
    }

    return products;
}

Cancellation Token Support

public async Task<List<string>> ExtractLinksAsync(string url, CancellationToken cancellationToken = default)
{
    var response = await _httpClient.GetAsync(url, cancellationToken);
    response.EnsureSuccessStatusCode();

    var content = await response.Content.ReadAsStringAsync();

    // Check cancellation before CPU-intensive parsing
    cancellationToken.ThrowIfCancellationRequested();

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(content);

    // Extract links exactly as in the basic implementation
    var links = new List<string>();
    var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (var link in linkNodes)
        {
            var href = link.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrEmpty(href))
                links.Add(href);
        }
    }

    return links;
}
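
Callers can then tie the token to a timeout or a cancel button. A small usage sketch, assuming scraper is an AsyncWebScraper instance exposing the overload above:

// Cancel automatically if the whole operation takes longer than 10 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

try
{
    var links = await scraper.ExtractLinksAsync("https://example.com", cts.Token);
    Console.WriteLine($"Found {links.Count} links");
}
catch (OperationCanceledException)
{
    Console.WriteLine("Scraping was cancelled or timed out.");
}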

Usage in Different Application Types

ASP.NET Core Controller

[ApiController]
[Route("api/[controller]")]
public class ScrapingController : ControllerBase
{
    private readonly AsyncWebScraper _scraper;

    public ScrapingController(AsyncWebScraper scraper)
    {
        _scraper = scraper;
    }

    [HttpGet("links")]
    public async Task<ActionResult<List<string>>> GetLinks(string url)
    {
        if (string.IsNullOrEmpty(url))
            return BadRequest("URL is required");

        try
        {
            var links = await _scraper.ExtractLinksAsync(url);
            return Ok(links);
        }
        catch (Exception ex)
        {
            return StatusCode(500, $"Error: {ex.Message}");
        }
    }
}
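
Constructor injection only works if AsyncWebScraper is registered with the DI container. A minimal sketch, assuming a .NET 6+ Program.cs with minimal hosting (wiring the scraper to IHttpClientFactory is another common option not shown here):

var builder = WebApplication.CreateBuilder(args);

// Register the scraper as a singleton so its HttpClient is reused across requests
builder.Services.AddSingleton<AsyncWebScraper>();
builder.Services.AddControllers();

var app = builder.Build();
app.MapControllers();
app.Run();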

Console Application

using System;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        using var scraper = new AsyncWebScraper();

        try
        {
            var links = await scraper.ExtractLinksAsync("https://example.com");

            Console.WriteLine($"Found {links.Count} links:");
            foreach (var link in links.Take(10)) // Show first 10
            {
                Console.WriteLine(link);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

Best Practices

  1. Reuse HttpClient: Create one instance and reuse it throughout your application
  2. Configure Timeouts: Set appropriate request timeouts (see the sketch after this list)
  3. Handle Rate Limiting: Implement delays between requests when scraping multiple pages
  4. Respect robots.txt: Check website scraping policies
  5. Error Handling: Implement comprehensive exception handling
  6. Resource Disposal: Always dispose of HttpClient properly
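
A minimal sketch tying points 1-3 together, assuming the AsyncWebScraper class from the basic implementation; the ScrapeSequentiallyAsync name, the 30-second timeout and the 1-second delay are illustrative choices, not requirements:

public async Task ScrapeSequentiallyAsync(IEnumerable<string> urls)
{
    // One scraper (and therefore one HttpClient) reused for every request
    using var scraper = new AsyncWebScraper();

    // Inside AsyncWebScraper's constructor you could also set a timeout:
    // _httpClient.Timeout = TimeSpan.FromSeconds(30);

    foreach (var url in urls)
    {
        var links = await scraper.ExtractLinksAsync(url);
        Console.WriteLine($"{url}: {links.Count} links");

        // Simple rate limiting: wait between requests
        await Task.Delay(TimeSpan.FromSeconds(1));
    }
}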

Common Pitfalls to Avoid

  • Don't create a new HttpClient instance for each request (use a singleton or dependency injection)
  • Don't ignore HTTP status codes (check response.IsSuccessStatusCode or call EnsureSuccessStatusCode() as in the examples above)
  • Don't parse very large HTML documents without considering memory usage
  • Don't forget to handle network timeouts and retries in production applications (a minimal retry sketch follows this list)
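
For the last point, here is a minimal, hedged retry sketch; WithRetriesAsync is a hypothetical helper name, and dedicated libraries such as Polly provide more complete retry policies:

// Retries transient failures (HttpRequestException, or TaskCanceledException from
// timeouts) with exponential backoff. Illustrative values only.
public static async Task<T> WithRetriesAsync<T>(Func<Task<T>> action, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (Exception ex) when (attempt < maxAttempts &&
                                   (ex is HttpRequestException || ex is TaskCanceledException))
        {
            // Back off 1s, 2s, 4s, ... before trying again
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
        }
    }
}

// Usage:
// var links = await WithRetriesAsync(() => scraper.ExtractLinksAsync(url));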

This async approach ensures your applications remain responsive while efficiently scraping web content with Html Agility Pack.
