How Do I Handle Exceptions in C# Web Scraping Applications?
Exception handling is critical in C# web scraping applications due to the unpredictable nature of web environments. Network issues, website changes, rate limiting, and unexpected HTML structures can all cause failures. Implementing robust exception handling ensures your scraper remains stable, recovers gracefully from errors, and provides meaningful diagnostics.
Common Exceptions in Web Scraping
Web scraping applications typically encounter several categories of exceptions:
1. Network-Related Exceptions
HttpRequestException occurs when HTTP requests fail due to network issues, DNS failures, or connection problems:
using System;
using System.Net.Http;
using System.Threading.Tasks;
public class WebScraper
{
private readonly HttpClient _httpClient;
public WebScraper()
{
_httpClient = new HttpClient
{
Timeout = TimeSpan.FromSeconds(30)
};
}
public async Task<string> FetchPageAsync(string url)
{
try
{
HttpResponseMessage response = await _httpClient.GetAsync(url);
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
catch (HttpRequestException ex)
{
Console.WriteLine($"HTTP request failed: {ex.Message}");
throw;
}
catch (TaskCanceledException ex)
{
Console.WriteLine($"Request timeout: {ex.Message}");
throw;
}
}
}
2. Timeout Exceptions
TaskCanceledException is thrown when operations exceed configured timeout limits:
public async Task<string> FetchWithTimeoutAsync(string url, int timeoutSeconds = 30)
{
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds));
try
{
HttpResponseMessage response = await _httpClient.GetAsync(url, cts.Token);
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
catch (OperationCanceledException)
{
throw new TimeoutException($"Request to {url} timed out after {timeoutSeconds} seconds");
}
}
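On .NET 5 and later, a timeout triggered by HttpClient.Timeout surfaces as a TaskCanceledException whose InnerException is a TimeoutException, which lets you distinguish a genuine timeout from a caller-initiated cancellation. The following is a minimal sketch of that distinction, assuming it lives in the same class and that a caller-supplied CancellationToken is available:
public async Task<string> FetchDistinguishingTimeoutAsync(string url, CancellationToken callerToken)
{
    try
    {
        HttpResponseMessage response = await _httpClient.GetAsync(url, callerToken);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync(callerToken);
    }
    catch (TaskCanceledException ex) when (ex.InnerException is TimeoutException)
    {
        // HttpClient.Timeout elapsed before the server responded
        throw new TimeoutException($"Request to {url} timed out", ex);
    }
    catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
    {
        // The caller cancelled the request deliberately; let it propagate unchanged
        throw;
    }
}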
3. Parsing and Data Extraction Exceptions
HTML parsing can fail when selectors don't match or when data formats are unexpected:
using System;
using System.Xml.XPath;
using HtmlAgilityPack;
public class DataExtractor
{
public string ExtractTitle(string html)
{
try
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var titleNode = doc.DocumentNode.SelectSingleNode("//title");
if (titleNode == null)
{
throw new InvalidOperationException("Title element not found");
}
return titleNode.InnerText.Trim();
}
catch (Exception ex) when (ex is XPathException || ex is NullReferenceException)
{
Console.WriteLine($"Parsing error: {ex.Message}");
return string.Empty;
}
}
}
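If a missing element is an expected condition rather than an error, returning a default value is often preferable to throwing. The helper below is a minimal sketch of that approach; the ExtractProductName name and its XPath selector are illustrative assumptions, not part of the extractor above:
public string ExtractProductName(string html, string defaultValue = "")
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Null-conditional access avoids NullReferenceException when the node is absent
    var name = doc.DocumentNode
        .SelectSingleNode("//h1[@class='product-title']")
        ?.InnerText
        ?.Trim();
    return string.IsNullOrEmpty(name) ? defaultValue : name;
}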
Implementing Retry Logic with Exponential Backoff
Transient errors often resolve themselves, making retry logic essential:
using Polly;
using Polly.Retry;
public class ResilientScraper
{
private readonly HttpClient _httpClient;
private readonly AsyncRetryPolicy<HttpResponseMessage> _retryPolicy;
public ResilientScraper()
{
_httpClient = new HttpClient();
// Retry up to 3 times with exponential backoff
_retryPolicy = Policy
.HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
.Or<HttpRequestException>()
.Or<TaskCanceledException>()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
onRetry: (outcome, timespan, retryCount, context) =>
{
Console.WriteLine($"Retry {retryCount} after {timespan.TotalSeconds}s");
});
}
public async Task<string> FetchWithRetryAsync(string url)
{
HttpResponseMessage response = await _retryPolicy.ExecuteAsync(
async () => await _httpClient.GetAsync(url));
return await response.Content.ReadAsStringAsync();
}
}
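The retry policy above relies on the Polly NuGet package. Assuming it is installed, calling the resilient fetch from an async method looks like any other awaitable call:
var scraper = new ResilientScraper();
string html = await scraper.FetchWithRetryAsync("https://example.com");
Console.WriteLine($"Fetched {html.Length} characters");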
Handling HTTP Status Code Errors
Different HTTP status codes require different handling strategies:
public async Task<string> FetchWithStatusHandlingAsync(string url)
{
try
{
HttpResponseMessage response = await _httpClient.GetAsync(url);
switch ((int)response.StatusCode)
{
case 200:
return await response.Content.ReadAsStringAsync();
case 404:
throw new InvalidOperationException($"Page not found: {url}");
case 429:
// Rate limited - wait and retry
var retryAfter = response.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(60);
Console.WriteLine($"Rate limited. Waiting {retryAfter.TotalSeconds}s");
await Task.Delay(retryAfter);
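// Note: this recursive retry is unbounded; a server that keeps returning 429 will keep recursing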
return await FetchWithStatusHandlingAsync(url);
case 403:
case 401:
throw new UnauthorizedAccessException($"Access denied: {response.StatusCode}");
case >= 500:
throw new HttpRequestException($"Server error: {response.StatusCode}");
default:
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
}
catch (HttpRequestException ex)
{
Console.WriteLine($"HTTP error: {ex.Message}");
throw;
}
}
Comprehensive Exception Handling Pattern
Here's a complete pattern combining multiple exception handling strategies:
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Wrap;
public class ProductionScraper
{
private readonly HttpClient _httpClient;
private readonly AsyncPolicyWrap<HttpResponseMessage> _policyWrap;
public ProductionScraper()
{
_httpClient = new HttpClient
{
Timeout = TimeSpan.FromSeconds(30)
};
// Circuit breaker to prevent overwhelming failing services
var circuitBreaker = Policy
.HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
.Or<HttpRequestException>()
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 5,
durationOfBreak: TimeSpan.FromMinutes(1),
onBreak: (result, duration) =>
{
Console.WriteLine($"Circuit breaker opened for {duration.TotalMinutes}m");
},
onReset: () => Console.WriteLine("Circuit breaker reset"));
// Retry policy with exponential backoff
var retry = Policy
.HandleResult<HttpResponseMessage>(r =>
(int)r.StatusCode >= 500 || r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
.Or<HttpRequestException>()
.Or<TaskCanceledException>()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
onRetry: (outcome, timespan, retryCount, context) =>
{
Console.WriteLine($"Retry attempt {retryCount} after {timespan.TotalSeconds}s delay");
});
// Combine policies
_policyWrap = Policy.WrapAsync(retry, circuitBreaker);
}
public async Task<ScrapingResult> ScrapePageAsync(string url)
{
var result = new ScrapingResult { Url = url };
try
{
HttpResponseMessage response = await _policyWrap.ExecuteAsync(
async () => await _httpClient.GetAsync(url));
result.StatusCode = (int)response.StatusCode;
result.Content = await response.Content.ReadAsStringAsync();
result.Success = true;
return result;
}
catch (BrokenCircuitException ex)
{
result.Error = "Circuit breaker is open - service temporarily unavailable";
result.Exception = ex;
Console.WriteLine($"Circuit breaker open: {ex.Message}");
}
catch (HttpRequestException ex)
{
result.Error = $"Network error: {ex.Message}";
result.Exception = ex;
Console.WriteLine($"HTTP request failed for {url}: {ex.Message}");
}
catch (TaskCanceledException ex)
{
result.Error = "Request timeout";
result.Exception = ex;
Console.WriteLine($"Timeout for {url}: {ex.Message}");
}
catch (Exception ex)
{
result.Error = $"Unexpected error: {ex.Message}";
result.Exception = ex;
Console.WriteLine($"Unexpected error for {url}: {ex}");
}
return result;
}
}
public class ScrapingResult
{
public string Url { get; set; }
public bool Success { get; set; }
public int StatusCode { get; set; }
public string Content { get; set; }
public string Error { get; set; }
public Exception Exception { get; set; }
}
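A typical call site, inside an async method, checks the Success flag rather than wrapping the call in another try-catch; the URLs below are placeholders:
var scraper = new ProductionScraper();
var urls = new[] { "https://example.com/page1", "https://example.com/page2" };

foreach (var url in urls)
{
    ScrapingResult result = await scraper.ScrapePageAsync(url);
    if (result.Success)
    {
        Console.WriteLine($"{result.Url}: fetched {result.Content.Length} characters");
    }
    else
    {
        Console.WriteLine($"{result.Url} failed: {result.Error}");
    }
}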
Logging and Monitoring
Proper logging helps diagnose issues and monitor scraper health:
using Microsoft.Extensions.Logging;
public class LoggingScraper
{
private readonly HttpClient _httpClient;
private readonly ILogger<LoggingScraper> _logger;
public LoggingScraper(ILogger<LoggingScraper> logger)
{
_httpClient = new HttpClient();
_logger = logger;
}
public async Task<string> FetchPageAsync(string url)
{
_logger.LogInformation("Fetching {Url}", url);
try
{
var response = await _httpClient.GetAsync(url);
response.EnsureSuccessStatusCode();
_logger.LogInformation("Successfully fetched {Url} with status {StatusCode}",
url, response.StatusCode);
return await response.Content.ReadAsStringAsync();
}
catch (HttpRequestException ex)
{
_logger.LogError(ex, "Failed to fetch {Url}", url);
throw;
}
catch (TaskCanceledException ex)
{
_logger.LogWarning(ex, "Timeout fetching {Url}", url);
throw;
}
}
}
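If the scraper is not resolved from a dependency injection container, an ILogger can be created directly with LoggerFactory. This is a minimal sketch assuming the Microsoft.Extensions.Logging.Console package is referenced:
using Microsoft.Extensions.Logging;

// Build a console logger without a DI container
using ILoggerFactory loggerFactory = LoggerFactory.Create(builder =>
{
    builder.AddConsole();
    builder.SetMinimumLevel(LogLevel.Information);
});

var logger = loggerFactory.CreateLogger<LoggingScraper>();
var scraper = new LoggingScraper(logger);
string html = await scraper.FetchPageAsync("https://example.com");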
Handling Rate Limiting and 429 Errors
Much like handling timeouts in Puppeteer, managing rate limits requires careful timing:
public class RateLimitedScraper
{
private readonly HttpClient _httpClient = new HttpClient();
private DateTime _nextAllowedRequest = DateTime.MinValue;
private readonly object _rateLimitLock = new object();
public async Task<string> FetchWithRateLimitAsync(string url)
{
await WaitForRateLimitAsync();
try
{
var response = await _httpClient.GetAsync(url);
if (response.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
{
var retryAfter = response.Headers.RetryAfter?.Delta
?? TimeSpan.FromSeconds(60);
lock (_rateLimitLock)
{
_nextAllowedRequest = DateTime.UtcNow.Add(retryAfter);
}
Console.WriteLine($"Rate limited. Waiting until {_nextAllowedRequest}");
await Task.Delay(retryAfter);
return await FetchWithRateLimitAsync(url);
}
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
catch (Exception ex)
{
Console.WriteLine($"Error: {ex.Message}");
throw;
}
}
private async Task WaitForRateLimitAsync()
{
DateTime nextAllowed;
lock (_rateLimitLock)
{
nextAllowed = _nextAllowedRequest;
}
var waitTime = nextAllowed - DateTime.UtcNow;
if (waitTime > TimeSpan.Zero)
{
await Task.Delay(waitTime);
}
}
}
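When several requests run concurrently, the lock-and-delay pattern above can let multiple callers slip through at once. A SemaphoreSlim serializes the throttling step more reliably; this is a minimal sketch assuming a fixed minimum delay between requests (the ThrottledFetcher name and the two-second gap are arbitrary):
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ThrottledFetcher
{
    private readonly HttpClient _httpClient = new HttpClient();
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly TimeSpan _minDelay = TimeSpan.FromSeconds(2);
    private DateTime _lastRequest = DateTime.MinValue;

    public async Task<string> FetchAsync(string url)
    {
        // Only one caller at a time may check and update the request timestamp
        await _gate.WaitAsync();
        try
        {
            var wait = _lastRequest + _minDelay - DateTime.UtcNow;
            if (wait > TimeSpan.Zero)
            {
                await Task.Delay(wait);
            }
            _lastRequest = DateTime.UtcNow;
        }
        finally
        {
            _gate.Release();
        }

        var response = await _httpClient.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}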
Using Try-Catch with Finally for Resource Cleanup
Ensure proper resource disposal even when exceptions occur:
public async Task<bool> DownloadFileAsync(string url, string filepath)
{
HttpClient client = null;
FileStream fileStream = null;
try
{
client = new HttpClient();
var response = await client.GetAsync(url);
response.EnsureSuccessStatusCode();
fileStream = new FileStream(filepath, FileMode.Create);
await response.Content.CopyToAsync(fileStream);
return true;
}
catch (HttpRequestException ex)
{
Console.WriteLine($"Download failed: {ex.Message}");
return false;
}
catch (IOException ex)
{
Console.WriteLine($"File write error: {ex.Message}");
return false;
}
finally
{
fileStream?.Dispose();
client?.Dispose();
}
}
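In modern C#, using declarations achieve the same cleanup with less ceremony. The sketch below is an equivalent variant of the method above:
public async Task<bool> DownloadFileWithUsingAsync(string url, string filepath)
{
    try
    {
        // Disposed automatically when the method exits, even if an exception is thrown
        using var client = new HttpClient();
        using var fileStream = new FileStream(filepath, FileMode.Create);

        var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        await response.Content.CopyToAsync(fileStream);
        return true;
    }
    catch (Exception ex) when (ex is HttpRequestException || ex is IOException)
    {
        Console.WriteLine($"Download failed: {ex.Message}");
        return false;
    }
}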
Exception Filters for Selective Handling
Use exception filters to handle only specific error conditions:
public async Task<string> FetchWithFilteredHandlingAsync(string url)
{
try
{
var response = await _httpClient.GetAsync(url);
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
catch (HttpRequestException ex) when (ex.Message.Contains("timeout"))
{
Console.WriteLine("Handling timeout specifically");
return await RetryAfterDelayAsync(url);
}
catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
{
Console.WriteLine($"Page not found: {url}");
return null;
}
catch (HttpRequestException ex)
{
Console.WriteLine($"General HTTP error: {ex.Message}");
throw;
}
}
private async Task<string> RetryAfterDelayAsync(string url)
{
await Task.Delay(5000);
return await FetchWithFilteredHandlingAsync(url);
}
Best Practices for Exception Handling
- Use specific exception types: Catch the most specific exception types first, then more general ones
- Implement retry logic: Use libraries like Polly for sophisticated retry patterns
- Log extensively: Record all errors with context for debugging
- Set appropriate timeouts: Prevent indefinite hanging with reasonable timeout values
- Use circuit breakers: Prevent cascading failures when services are down
- Handle rate limiting gracefully: Respect Retry-After headers and implement backoff
- Clean up resources: Always dispose of HttpClient, streams, and other resources properly
- Validate data: Check for null values and unexpected formats before processing
- Monitor and alert: Track error rates and set up alerts for unusual patterns
- Fail gracefully: Return partial results or default values when appropriate (see the sketch after this list)
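As an example of the validation and graceful-failure practices above, the helper below parses a scraped price string and falls back to a default when the format is unexpected. It is a sketch; the currency-stripping rules are assumptions about the target site's formatting:
using System.Globalization;

public static decimal ParsePriceOrDefault(string rawPrice, decimal defaultValue = 0m)
{
    if (string.IsNullOrWhiteSpace(rawPrice))
    {
        return defaultValue;
    }

    // Strip a leading currency symbol and thousands separators before parsing
    var cleaned = rawPrice.Replace("$", "").Replace(",", "").Trim();

    return decimal.TryParse(cleaned, NumberStyles.Number, CultureInfo.InvariantCulture, out var price)
        ? price
        : defaultValue;
}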
Using WebScraping.AI for Reliable Scraping
While exception handling is crucial for custom scrapers, using a managed service like WebScraping.AI can significantly reduce the complexity of error handling. The API takes care of retries, proxy rotation, and browser rendering, so many transient failures never reach your code:
using System;
using System.Net.Http;
using System.Threading.Tasks;
public class WebScrapingAIClient
{
private const string ApiUrl = "https://api.webscraping.ai/html";
private readonly string _apiKey;
private readonly HttpClient _httpClient;
public WebScrapingAIClient(string apiKey)
{
_apiKey = apiKey;
_httpClient = new HttpClient();
}
public async Task<string> ScrapeAsync(string url)
{
try
{
var requestUrl = $"{ApiUrl}?api_key={_apiKey}&url={Uri.EscapeDataString(url)}";
var response = await _httpClient.GetAsync(requestUrl);
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
catch (HttpRequestException ex)
{
Console.WriteLine($"API request failed: {ex.Message}");
throw;
}
}
}
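Usage is a single call; the API key below is a placeholder:
var client = new WebScrapingAIClient("YOUR_API_KEY");
string html = await client.ScrapeAsync("https://example.com");
Console.WriteLine($"Received {html.Length} characters");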
Conclusion
Effective exception handling in C# web scraping applications requires a multi-layered approach combining try-catch blocks, retry logic, circuit breakers, timeout management, and comprehensive logging. By implementing these patterns, you'll build scrapers that are resilient to network failures, website changes, and unexpected errors. Whether you're building custom scrapers or using APIs like WebScraping.AI, robust error handling ensures your data collection remains reliable and maintainable.