Using Html Agility Pack with C# Async/Await Patterns
Html Agility Pack works seamlessly with C#'s async/await pattern for non-blocking web scraping. This approach is essential for keeping applications responsive, especially in UI frameworks (WPF, WinForms), where blocking the UI thread freezes the interface, and in ASP.NET applications, where blocking request threads limits scalability.
Installation
First, install the Html Agility Pack NuGet package:
# Using Package Manager Console
Install-Package HtmlAgilityPack
# Using .NET CLI
dotnet add package HtmlAgilityPack
Key Principles
- Async I/O Operations: Use HttpClient for asynchronous web requests
- Sync HTML Parsing: Html Agility Pack's parsing methods are synchronous (CPU-bound operations)
- Proper Resource Management: Implement IDisposable for HttpClient
- Exception Handling: Handle network and parsing exceptions appropriately
Basic Implementation
Here's a complete example of an async web scraper:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;
public class AsyncWebScraper : IDisposable
{
private readonly HttpClient _httpClient;
public AsyncWebScraper()
{
_httpClient = new HttpClient();
// Configure default headers
_httpClient.DefaultRequestHeaders.Add("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
}
public async Task<List<string>> ExtractLinksAsync(string url)
{
try
{
// Async HTTP request
var response = await _httpClient.GetAsync(url);
response.EnsureSuccessStatusCode();
var content = await response.Content.ReadAsStringAsync();
// Sync HTML parsing (fast operation)
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(content);
// Extract links
var links = new List<string>();
var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
foreach (var link in linkNodes)
{
var href = link.GetAttributeValue("href", string.Empty);
if (!string.IsNullOrEmpty(href))
{
links.Add(href);
}
}
}
return links;
}
catch (HttpRequestException ex)
{
Console.WriteLine($"HTTP error: {ex.Message}");
throw;
}
catch (Exception ex)
{
Console.WriteLine($"Parsing error: {ex.Message}");
throw;
}
}
public void Dispose()
{
_httpClient?.Dispose();
}
}
Advanced Usage Examples
Multiple Pages Concurrently
public async Task<Dictionary<string, List<string>>> ScrapeMultipleUrlsAsync(IEnumerable<string> urls)
{
var tasks = urls.Select(async url =>
{
var links = await ExtractLinksAsync(url);
return new { Url = url, Links = links };
});
var results = await Task.WhenAll(tasks);
return results.ToDictionary(r => r.Url, r => r.Links);
}
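Task.WhenAll with no limit starts every request at once, which can overwhelm the target server. One option is to cap concurrency with a SemaphoreSlim; the following sketch is only an illustration (the method name ScrapeMultipleUrlsThrottledAsync and the default limit of 3 are arbitrary choices, not part of Html Agility Pack):
public async Task<Dictionary<string, List<string>>> ScrapeMultipleUrlsThrottledAsync(
    IEnumerable<string> urls, int maxConcurrency = 3)
{
    // The semaphore limits how many requests run at the same time
    using var throttle = new SemaphoreSlim(maxConcurrency);
    var tasks = urls.Select(async url =>
    {
        await throttle.WaitAsync();
        try
        {
            var links = await ExtractLinksAsync(url);
            return new { Url = url, Links = links };
        }
        finally
        {
            throttle.Release();
        }
    });
    var results = await Task.WhenAll(tasks);
    return results.ToDictionary(r => r.Url, r => r.Links);
}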
Data Extraction with Models
public class ProductInfo
{
public string Name { get; set; }
public decimal Price { get; set; }
public string Description { get; set; }
}
public async Task<List<ProductInfo>> ExtractProductsAsync(string url)
{
var response = await _httpClient.GetAsync(url);
response.EnsureSuccessStatusCode();
var content = await response.Content.ReadAsStringAsync();
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(content);
var products = new List<ProductInfo>();
var productNodes = htmlDoc.DocumentNode.SelectNodes("//div[@class='product']");
if (productNodes != null)
{
foreach (var node in productNodes)
{
var product = new ProductInfo
{
Name = node.SelectSingleNode(".//h2")?.InnerText?.Trim(),
Price = decimal.TryParse(
node.SelectSingleNode(".//span[@class='price']")?.InnerText?.Replace("$", ""),
out var price) ? price : 0,
Description = node.SelectSingleNode(".//p[@class='description']")?.InnerText?.Trim()
};
products.Add(product);
}
}
return products;
}
Cancellation Token Support
public async Task<List<string>> ExtractLinksAsync(string url, CancellationToken cancellationToken = default)
{
var response = await _httpClient.GetAsync(url, cancellationToken);
response.EnsureSuccessStatusCode();
var content = await response.Content.ReadAsStringAsync();
// Check cancellation before CPU-intensive parsing
cancellationToken.ThrowIfCancellationRequested();
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(content);
// Process and return results...
}
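As a usage sketch, the overload above can be paired with a CancellationTokenSource so that a slow page is abandoned after a fixed time. The five-second limit here is an arbitrary example value:
using var scraper = new AsyncWebScraper();
// Cancel automatically if the operation takes longer than 5 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
try
{
    var links = await scraper.ExtractLinksAsync("https://example.com", cts.Token);
    Console.WriteLine($"Found {links.Count} links before the timeout.");
}
catch (OperationCanceledException)
{
    Console.WriteLine("The request was cancelled or timed out.");
}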
Usage in Different Application Types
ASP.NET Core Controller
[ApiController]
[Route("api/[controller]")]
public class ScrapingController : ControllerBase
{
private readonly AsyncWebScraper _scraper;
public ScrapingController(AsyncWebScraper scraper)
{
_scraper = scraper;
}
[HttpGet("links")]
public async Task<ActionResult<List<string>>> GetLinks(string url)
{
if (string.IsNullOrEmpty(url))
return BadRequest("URL is required");
try
{
var links = await _scraper.ExtractLinksAsync(url);
return Ok(links);
}
catch (Exception ex)
{
return StatusCode(500, $"Error: {ex.Message}");
}
}
}
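For the controller to receive an AsyncWebScraper, the class must be registered with the dependency injection container. A minimal sketch, assuming the .NET 6+ minimal hosting model, registers it as a singleton so the underlying HttpClient is reused across requests:
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();

// One shared scraper (and one shared HttpClient) for the whole application
builder.Services.AddSingleton<AsyncWebScraper>();

var app = builder.Build();
app.MapControllers();
app.Run();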
Console Application
using System;
using System.Linq;
using System.Threading.Tasks;

class Program
{
static async Task Main(string[] args)
{
using var scraper = new AsyncWebScraper();
try
{
var links = await scraper.ExtractLinksAsync("https://example.com");
Console.WriteLine($"Found {links.Count} links:");
foreach (var link in links.Take(10)) // Show first 10
{
Console.WriteLine(link);
}
}
catch (Exception ex)
{
Console.WriteLine($"Error: {ex.Message}");
}
}
}
Best Practices
- Reuse HttpClient: Create one instance and reuse it throughout your application
- Configure Timeouts: Set appropriate request timeouts (see the sketch after this list)
- Handle Rate Limiting: Implement delays between requests when scraping multiple pages
- Respect robots.txt: Check website scraping policies
- Error Handling: Implement comprehensive exception handling
- Resource Disposal: Always dispose of HttpClient properly
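To illustrate the timeout and rate-limiting points, the constructor from the basic example can be adjusted to set HttpClient.Timeout, and a sequential helper can pause between requests. The 30-second timeout, the 1-second delay, and the method name ScrapePolitelyAsync are arbitrary example choices:
public AsyncWebScraper()
{
    _httpClient = new HttpClient
    {
        // Fail fast instead of hanging indefinitely on an unresponsive server
        Timeout = TimeSpan.FromSeconds(30)
    };
    _httpClient.DefaultRequestHeaders.Add("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
}

public async Task<Dictionary<string, List<string>>> ScrapePolitelyAsync(IEnumerable<string> urls)
{
    var results = new Dictionary<string, List<string>>();
    foreach (var url in urls)
    {
        results[url] = await ExtractLinksAsync(url);
        // Pause between requests so the target site isn't hammered
        await Task.Delay(TimeSpan.FromSeconds(1));
    }
    return results;
}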
Common Pitfalls to Avoid
- Don't create new HttpClient instances for each request (use a singleton or dependency injection)
- Don't ignore HTTP status codes (always check response.IsSuccessStatusCode)
- Don't parse large HTML documents without considering memory usage
- Don't forget to handle network timeouts and retries in production applications (see the sketch below)
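To illustrate the last point, a simple (and deliberately naive) retry wrapper might look like the sketch below; the three attempts and two-second delay are arbitrary choices, and a production system would more likely use a dedicated resilience library such as Polly:
public async Task<List<string>> ExtractLinksWithRetryAsync(string url, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await ExtractLinksAsync(url);
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // Transient network failure: wait briefly, then try again
            await Task.Delay(TimeSpan.FromSeconds(2));
        }
    }
}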
This async approach ensures your applications remain responsive while efficiently scraping web content with Html Agility Pack.