How Do I Handle Exceptions in C# Web Scraping Applications?

Exception handling is critical in C# web scraping applications due to the unpredictable nature of web environments. Network issues, website changes, rate limiting, and unexpected HTML structures can all cause failures. Implementing robust exception handling ensures your scraper remains stable, recovers gracefully from errors, and provides meaningful diagnostics.

Common Exceptions in Web Scraping

Web scraping applications typically encounter several categories of exceptions:

1. Network-Related Exceptions

HttpRequestException occurs when HTTP requests fail due to network issues, DNS failures, or connection problems:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class WebScraper
{
    private readonly HttpClient _httpClient;

    public WebScraper()
    {
        _httpClient = new HttpClient
        {
            Timeout = TimeSpan.FromSeconds(30)
        };
    }

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            HttpResponseMessage response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"HTTP request failed: {ex.Message}");
            throw;
        }
        catch (TaskCanceledException ex)
        {
            Console.WriteLine($"Request timeout: {ex.Message}");
            throw;
        }
    }
}
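
Because FetchPageAsync logs and then rethrows, the caller still decides how to recover. A minimal calling sketch (the URL and the skip-on-failure behavior here are illustrative assumptions, not part of the original example):

public async Task RunScraperAsync()
{
    var scraper = new WebScraper();

    try
    {
        // Placeholder URL used only for illustration
        string html = await scraper.FetchPageAsync("https://example.com");
        Console.WriteLine($"Fetched {html.Length} characters");
    }
    catch (HttpRequestException)
    {
        // The scraper already logged the details; decide here whether to
        // skip the page, queue it for later, or abort the run
        Console.WriteLine("Skipping page after network failure");
    }
    catch (TaskCanceledException)
    {
        Console.WriteLine("Skipping page after timeout");
    }
}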

2. Timeout Exceptions

TaskCanceledException is thrown when operations exceed configured timeout limits:

public async Task<string> FetchWithTimeoutAsync(string url, int timeoutSeconds = 30)
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds));

    try
    {
        HttpResponseMessage response = await _httpClient.GetAsync(url, cts.Token);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
    catch (OperationCanceledException)
    {
        throw new TimeoutException($"Request to {url} timed out after {timeoutSeconds} seconds");
    }
}

3. Parsing and Data Extraction Exceptions

HTML parsing can fail when selectors don't match or when data formats are unexpected:

using System;
using System.Xml.XPath;
using HtmlAgilityPack;

public class DataExtractor
{
    public string ExtractTitle(string html)
    {
        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var titleNode = doc.DocumentNode.SelectSingleNode("//title");

            if (titleNode == null)
            {
                throw new InvalidOperationException("Title element not found");
            }

            return titleNode.InnerText.Trim();
        }
        catch (Exception ex) when (ex is XPathException || ex is NullReferenceException)
        {
            Console.WriteLine($"Parsing error: {ex.Message}");
            return string.Empty;
        }
    }
}

Implementing Retry Logic with Exponential Backoff

Transient errors often resolve themselves, making retry logic essential:

using Polly;
using Polly.Retry;

public class ResilientScraper
{
    private readonly HttpClient _httpClient;
    private readonly AsyncRetryPolicy<HttpResponseMessage> _retryPolicy;

    public ResilientScraper()
    {
        _httpClient = new HttpClient();

        // Retry up to 3 times with exponential backoff
        _retryPolicy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .Or<HttpRequestException>()
            .Or<TaskCanceledException>()
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: retryAttempt =>
                    TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
                onRetry: (outcome, timespan, retryCount, context) =>
                {
                    Console.WriteLine($"Retry {retryCount} after {timespan.TotalSeconds}s");
                });
    }

    public async Task<string> FetchWithRetryAsync(string url)
    {
        HttpResponseMessage response = await _retryPolicy.ExecuteAsync(
            async () => await _httpClient.GetAsync(url));

        return await response.Content.ReadAsStringAsync();
    }
}

Handling HTTP Status Code Errors

Different HTTP status codes require different handling strategies:

public async Task<string> FetchWithStatusHandlingAsync(string url)
{
    try
    {
        HttpResponseMessage response = await _httpClient.GetAsync(url);

        switch ((int)response.StatusCode)
        {
            case 200:
                return await response.Content.ReadAsStringAsync();

            case 404:
                throw new InvalidOperationException($"Page not found: {url}");

            case 429:
                // Rate limited - wait and retry
                var retryAfter = response.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(60);
                Console.WriteLine($"Rate limited. Waiting {retryAfter.TotalSeconds}s");
                await Task.Delay(retryAfter);
                return await FetchWithStatusHandlingAsync(url);

            case 403:
            case 401:
                throw new UnauthorizedAccessException($"Access denied: {response.StatusCode}");

            case >= 500:
                throw new HttpRequestException($"Server error: {response.StatusCode}");

            default:
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
        }
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"HTTP error: {ex.Message}");
        throw;
    }
}

Comprehensive Exception Handling Pattern

Here's a complete pattern combining multiple exception handling strategies:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Wrap;

public class ProductionScraper
{
    private readonly HttpClient _httpClient;
    private readonly AsyncPolicyWrap<HttpResponseMessage> _policyWrap;

    public ProductionScraper()
    {
        _httpClient = new HttpClient
        {
            Timeout = TimeSpan.FromSeconds(30)
        };

        // Circuit breaker to prevent overwhelming failing services
        var circuitBreaker = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .Or<HttpRequestException>()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 5,
                durationOfBreak: TimeSpan.FromMinutes(1),
                onBreak: (result, duration) =>
                {
                    Console.WriteLine($"Circuit breaker opened for {duration.TotalMinutes}m");
                },
                onReset: () => Console.WriteLine("Circuit breaker reset"));

        // Retry policy with exponential backoff
        var retry = Policy
            .HandleResult<HttpResponseMessage>(r =>
                (int)r.StatusCode >= 500 || r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
            .Or<HttpRequestException>()
            .Or<TaskCanceledException>()
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
                onRetry: (outcome, timespan, retryCount, context) =>
                {
                    Console.WriteLine($"Retry attempt {retryCount} after {timespan.TotalSeconds}s delay");
                });

        // Combine policies
        _policyWrap = Policy.WrapAsync(retry, circuitBreaker);
    }

    public async Task<ScrapingResult> ScrapePageAsync(string url)
    {
        var result = new ScrapingResult { Url = url };

        try
        {
            HttpResponseMessage response = await _policyWrap.ExecuteAsync(
                async () => await _httpClient.GetAsync(url));

            result.StatusCode = (int)response.StatusCode;
            result.Content = await response.Content.ReadAsStringAsync();
            result.Success = true;

            return result;
        }
        catch (BrokenCircuitException ex)
        {
            result.Error = "Circuit breaker is open - service temporarily unavailable";
            result.Exception = ex;
            Console.WriteLine($"Circuit breaker open: {ex.Message}");
        }
        catch (HttpRequestException ex)
        {
            result.Error = $"Network error: {ex.Message}";
            result.Exception = ex;
            Console.WriteLine($"HTTP request failed for {url}: {ex.Message}");
        }
        catch (TaskCanceledException ex)
        {
            result.Error = "Request timeout";
            result.Exception = ex;
            Console.WriteLine($"Timeout for {url}: {ex.Message}");
        }
        catch (Exception ex)
        {
            result.Error = $"Unexpected error: {ex.Message}";
            result.Exception = ex;
            Console.WriteLine($"Unexpected error for {url}: {ex}");
        }

        return result;
    }
}

public class ScrapingResult
{
    public string Url { get; set; }
    public bool Success { get; set; }
    public int StatusCode { get; set; }
    public string Content { get; set; }
    public string Error { get; set; }
    public Exception Exception { get; set; }
}
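
A short usage sketch (assuming the ProductionScraper and ScrapingResult types above; the URLs are placeholders) showing how a caller can branch on Success instead of wrapping every call in its own try-catch:

public async Task RunProductionScraperAsync()
{
    var scraper = new ProductionScraper();
    var urls = new[] { "https://example.com/page1", "https://example.com/page2" };

    foreach (var url in urls)
    {
        ScrapingResult result = await scraper.ScrapePageAsync(url);

        if (result.Success)
        {
            Console.WriteLine($"{result.Url}: {result.Content.Length} characters (HTTP {result.StatusCode})");
        }
        else
        {
            // The original exception is preserved on the result for logging or retry decisions
            Console.WriteLine($"{result.Url} failed: {result.Error}");
        }
    }
}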

Logging and Monitoring

Proper logging helps diagnose issues and monitor scraper health:

using Microsoft.Extensions.Logging;

public class LoggingScraper
{
    private readonly HttpClient _httpClient;
    private readonly ILogger<LoggingScraper> _logger;

    public LoggingScraper(ILogger<LoggingScraper> logger)
    {
        _httpClient = new HttpClient();
        _logger = logger;
    }

    public async Task<string> FetchPageAsync(string url)
    {
        _logger.LogInformation("Fetching {Url}", url);

        try
        {
            var response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();

            _logger.LogInformation("Successfully fetched {Url} with status {StatusCode}",
                url, response.StatusCode);

            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException ex)
        {
            _logger.LogError(ex, "Failed to fetch {Url}", url);
            throw;
        }
        catch (TaskCanceledException ex)
        {
            _logger.LogWarning(ex, "Timeout fetching {Url}", url);
            throw;
        }
    }
}

Handling Rate Limiting and 429 Errors

Much like handling timeouts in Puppeteer, managing rate limits requires careful timing:

public class RateLimitedScraper
{
    private readonly HttpClient _httpClient;
    private DateTime _nextAllowedRequest = DateTime.MinValue;
    private readonly object _rateLimitLock = new object();

    public async Task<string> FetchWithRateLimitAsync(string url)
    {
        await WaitForRateLimitAsync();

        try
        {
            var response = await _httpClient.GetAsync(url);

            if (response.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
            {
                var retryAfter = response.Headers.RetryAfter?.Delta
                    ?? TimeSpan.FromSeconds(60);

                lock (_rateLimitLock)
                {
                    _nextAllowedRequest = DateTime.UtcNow.Add(retryAfter);
                }

                Console.WriteLine($"Rate limited. Waiting until {_nextAllowedRequest}");
                await Task.Delay(retryAfter);

                return await FetchWithRateLimitAsync(url);
            }

            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
            throw;
        }
    }

    private async Task WaitForRateLimitAsync()
    {
        DateTime nextAllowed;
        lock (_rateLimitLock)
        {
            nextAllowed = _nextAllowedRequest;
        }

        var waitTime = nextAllowed - DateTime.UtcNow;
        if (waitTime > TimeSpan.Zero)
        {
            await Task.Delay(waitTime);
        }
    }
}

Using Try-Catch with Finally for Resource Cleanup

Ensure proper resource disposal even when exceptions occur:

public async Task<bool> DownloadFileAsync(string url, string filepath)
{
    HttpClient client = null;
    FileStream fileStream = null;

    try
    {
        client = new HttpClient();
        var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();

        fileStream = new FileStream(filepath, FileMode.Create);
        await response.Content.CopyToAsync(fileStream);

        return true;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"Download failed: {ex.Message}");
        return false;
    }
    catch (IOException ex)
    {
        Console.WriteLine($"File write error: {ex.Message}");
        return false;
    }
    finally
    {
        fileStream?.Dispose();
        client?.Dispose();
    }
}
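
In modern C#, the same cleanup can be expressed more concisely with using declarations, which dispose the stream and client when the enclosing block exits, even if an exception is thrown. A sketch of the equivalent method (same assumed parameters as above):

public async Task<bool> DownloadFileUsingDeclarationsAsync(string url, string filepath)
{
    try
    {
        // Disposed automatically at the end of this block,
        // even when an exception propagates out of it
        using var client = new HttpClient();
        using var fileStream = new FileStream(filepath, FileMode.Create);

        var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        await response.Content.CopyToAsync(fileStream);

        return true;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"Download failed: {ex.Message}");
        return false;
    }
    catch (IOException ex)
    {
        Console.WriteLine($"File write error: {ex.Message}");
        return false;
    }
}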

Exception Filters for Selective Handling

Use exception filters to handle only specific error conditions:

public async Task<string> FetchWithFilteredHandlingAsync(string url)
{
    try
    {
        var response = await _httpClient.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("timeout"))
    {
        Console.WriteLine("Handling timeout specifically");
        return await RetryAfterDelayAsync(url);
    }
    catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
    {
        Console.WriteLine($"Page not found: {url}");
        return null;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"General HTTP error: {ex.Message}");
        throw;
    }
}

private async Task<string> RetryAfterDelayAsync(string url)
{
    await Task.Delay(5000);
    return await FetchWithFilteredHandlingAsync(url);
}

Best Practices for Exception Handling

  1. Use specific exception types: Catch the most specific exception types first, then more general ones
  2. Implement retry logic: Use libraries like Polly for sophisticated retry patterns
  3. Log extensively: Record all errors with context for debugging
  4. Set appropriate timeouts: Prevent indefinite hanging with reasonable timeout values
  5. Use circuit breakers: Prevent cascading failures when services are down
  6. Handle rate limiting gracefully: Respect Retry-After headers and implement backoff
  7. Clean up resources: Always dispose of HttpClient, streams, and other resources properly
  8. Validate data: Check for null values and unexpected formats before processing (see the sketch after this list)
  9. Monitor and alert: Track error rates and set up alerts for unusual patterns
  10. Fail gracefully: Return partial results or default values when appropriate
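
As an illustration of points 8 and 10, here is a small sketch of defensive data extraction; the selector and price format are assumptions for the example, not taken from a real site:

using HtmlAgilityPack;

public static class PriceParser
{
    // Returns null instead of throwing when the price is missing or malformed
    public static decimal? TryExtractPrice(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Hypothetical selector; adjust to the target page's markup
        var node = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
        if (node == null)
        {
            return null;
        }

        var text = node.InnerText.Trim().TrimStart('$');
        return decimal.TryParse(text, out var price) ? price : (decimal?)null;
    }
}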

Using WebScraping.AI for Reliable Scraping

While exception handling is crucial for custom scrapers, using a managed service like WebScraping.AI can significantly reduce the complexity of error handling. The API automatically manages retries, rotating proxies, and browser rendering, and handles errors transparently:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class WebScrapingAIClient
{
    private const string ApiUrl = "https://api.webscraping.ai/html";
    private readonly string _apiKey;
    private readonly HttpClient _httpClient;

    public WebScrapingAIClient(string apiKey)
    {
        _apiKey = apiKey;
        _httpClient = new HttpClient();
    }

    public async Task<string> ScrapeAsync(string url)
    {
        try
        {
            var requestUrl = $"{ApiUrl}?api_key={_apiKey}&url={Uri.EscapeDataString(url)}";
            var response = await _httpClient.GetAsync(requestUrl);
            response.EnsureSuccessStatusCode();

            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"API request failed: {ex.Message}");
            throw;
        }
    }
}

Conclusion

Effective exception handling in C# web scraping applications requires a multi-layered approach combining try-catch blocks, retry logic, circuit breakers, timeout management, and comprehensive logging. By implementing these patterns, you'll build scrapers that are resilient to network failures, website changes, and unexpected errors. Whether you're building custom scrapers or using APIs like WebScraping.AI, robust error handling ensures your data collection remains reliable and maintainable.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
