How Do I Implement Try-Catch Error Handling in C# Web Scraping?

Error handling is critical in web scraping applications because network requests, HTML parsing, and data extraction can fail in numerous ways. Implementing proper try-catch blocks in C# ensures your scraper handles exceptions gracefully, logs errors effectively, and continues operating even when encountering problematic pages.

Understanding Common Web Scraping Exceptions

Before implementing error handling, you need to understand the types of exceptions that occur during web scraping:

  • HttpRequestException: Network connectivity issues, DNS failures, or unreachable servers
  • TaskCanceledException: Request timeouts or cancelled operations
  • WebException: HTTP protocol errors and server-side issues (thrown by the legacy WebRequest/HttpWebRequest APIs rather than HttpClient)
  • NullReferenceException: Missing HTML elements or null parsing results
  • FormatException: Invalid data formats during conversion
  • ArgumentException: Invalid URLs or parameters

Basic Try-Catch Structure for HTTP Requests

When making HTTP requests with HttpClient, wrap your code in try-catch blocks to handle network-related exceptions:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class WebScraper
{
    private static readonly HttpClient client = new HttpClient();

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            HttpResponseMessage response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            string content = await response.Content.ReadAsStringAsync();
            return content;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"HTTP request failed: {ex.Message}");
            return null;
        }
        catch (TaskCanceledException ex)
        {
            Console.WriteLine($"Request timeout: {ex.Message}");
            return null;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Unexpected error: {ex.Message}");
            return null;
        }
    }
}

Handling Timeout Exceptions

Setting appropriate timeout values is crucial for web scraping. Here's how to handle timeout exceptions specifically:

public async Task<string> FetchWithTimeoutAsync(string url, int timeoutSeconds = 30)
{
    try
    {
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds)))
        {
            HttpResponseMessage response = await client.GetAsync(url, cts.Token);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
    catch (OperationCanceledException)
    {
        Console.WriteLine($"Request to {url} timed out after {timeoutSeconds} seconds");
        return null;
    }
    catch (HttpRequestException ex) when (ex.InnerException is TimeoutException)
    {
        Console.WriteLine($"Connection timeout for {url}: {ex.Message}");
        return null;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"HTTP error for {url}: {ex.Message}");
        return null;
    }
}

Exception Filters for Granular Error Handling

C# exception filters (the when clause) allow you to handle specific error scenarios without catching every exception of a type. On .NET 5 and later, HttpRequestException exposes a StatusCode property, which is more reliable than parsing the exception message:

public async Task<string> FetchWithStatusHandlingAsync(string url)
{
    try
    {
        HttpResponseMessage response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("404"))
    {
        Console.WriteLine($"Page not found: {url}");
        return null;
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("429"))
    {
        Console.WriteLine($"Rate limited on {url}. Implementing backoff...");
        await Task.Delay(5000);
        return await FetchWithStatusHandlingAsync(url); // Retry after delay
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("503"))
    {
        Console.WriteLine($"Service unavailable: {url}");
        return null;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"Other HTTP error: {ex.Message}");
        return null;
    }
}

Handling HTML Parsing Exceptions

When parsing HTML with libraries like HtmlAgilityPack, protect against missing elements and parsing errors:

using HtmlAgilityPack;
using System.Collections.Generic;

public class DataExtractor
{
    public List<string> ExtractProductTitles(string html)
    {
        var titles = new List<string>();

        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var nodes = doc.DocumentNode.SelectNodes("//h2[@class='product-title']");

            if (nodes == null)
            {
                Console.WriteLine("No product titles found on page");
                return titles;
            }

            foreach (var node in nodes)
            {
                try
                {
                    string title = node.InnerText?.Trim();
                    if (!string.IsNullOrEmpty(title))
                    {
                        titles.Add(title);
                    }
                }
                catch (NullReferenceException ex)
                {
                    Console.WriteLine($"Null reference while extracting title: {ex.Message}");
                    continue; // Skip this item and continue
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
        }

        return titles;
    }
}

Implementing Retry Logic with Exception Handling

Robust scrapers implement retry mechanisms for transient failures:

public async Task<string> FetchWithRetryAsync(string url, int maxRetries = 3)
{
    int retryCount = 0;
    TimeSpan delay = TimeSpan.FromSeconds(2);

    while (retryCount < maxRetries)
    {
        try
        {
            HttpResponseMessage response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException ex)
        {
            retryCount++;
            Console.WriteLine($"Attempt {retryCount} failed for {url}: {ex.Message}");

            if (retryCount >= maxRetries)
            {
                Console.WriteLine($"Max retries reached for {url}");
                throw; // Re-throw after max retries
            }

            Console.WriteLine($"Retrying in {delay.TotalSeconds} seconds...");
            await Task.Delay(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // Exponential backoff
        }
        catch (TaskCanceledException)
        {
            retryCount++;
            Console.WriteLine($"Timeout on attempt {retryCount} for {url}");

            if (retryCount >= maxRetries)
            {
                Console.WriteLine($"Max retries reached after timeout for {url}");
                return null;
            }

            await Task.Delay(delay);
        }
    }

    return null;
}
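
If you would rather not hand-roll the retry loop above, the same policy can be expressed with a resilience library. The following is a minimal sketch assuming the Polly NuGet package is referenced; the retry count, delays, and class name are illustrative choices, not part of the original example:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public class PollyRetryScraper
{
    private static readonly HttpClient client = new HttpClient();

    // Retry up to 3 times on network errors or timeouts, waiting 2s, 4s, then 8s
    private static readonly AsyncRetryPolicy retryPolicy = Policy
        .Handle<HttpRequestException>()
        .Or<TaskCanceledException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            return await retryPolicy.ExecuteAsync(async () =>
            {
                HttpResponseMessage response = await client.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            });
        }
        catch (Exception ex)
        {
            Console.WriteLine($"All retries failed for {url}: {ex.Message}");
            return null;
        }
    }
}

Polly also ships timeout and circuit-breaker policies that compose with the retry policy shown here.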

Handling SSL Certificate Errors

When scraping websites with SSL certificate issues, you need specialized error handling:

using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public class SecureWebScraper
{
    private HttpClientHandler CreateHandler()
    {
        var handler = new HttpClientHandler();

        // Custom certificate validation
        handler.ServerCertificateCustomValidationCallback =
            (sender, cert, chain, sslPolicyErrors) =>
            {
                if (sslPolicyErrors == SslPolicyErrors.None)
                    return true;

                Console.WriteLine($"SSL Error: {sslPolicyErrors}");

                // Decide whether to proceed (use with caution)
                return false; // Reject invalid certificates by default
            };

        return handler;
    }

    public async Task<string> FetchSecurePageAsync(string url)
    {
        try
        {
            using (var handler = CreateHandler())
            using (var client = new HttpClient(handler))
            {
                var response = await client.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException ex) when (ex.InnerException is System.Security.Authentication.AuthenticationException)
        {
            Console.WriteLine($"SSL/TLS authentication failed for {url}: {ex.Message}");
            return null;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"Request failed for {url}: {ex.Message}");
            return null;
        }
    }
}

Data Conversion Exception Handling

When converting scraped data to specific types, handle format exceptions:

public class ProductParser
{
    public decimal? ParsePrice(string priceText)
    {
        // Guard against null input; calling Replace on a null string would throw an uncaught NullReferenceException
        if (string.IsNullOrWhiteSpace(priceText))
            return null;

        try
        {
            // Remove currency symbols and whitespace
            string cleanPrice = priceText
                .Replace("$", "")
                .Replace(",", "")
                .Trim();

            return decimal.Parse(cleanPrice);
        }
        catch (FormatException ex)
        {
            Console.WriteLine($"Invalid price format: '{priceText}' - {ex.Message}");
            return null;
        }
        catch (ArgumentNullException ex)
        {
            Console.WriteLine($"Null price value: {ex.Message}");
            return null;
        }
    }

    public DateTime? ParseDate(string dateText)
    {
        try
        {
            return DateTime.Parse(dateText);
        }
        catch (FormatException ex)
        {
            Console.WriteLine($"Invalid date format: '{dateText}' - {ex.Message}");

            // Try alternative formats
            try
            {
                return DateTime.ParseExact(dateText, "MM/dd/yyyy", System.Globalization.CultureInfo.InvariantCulture);
            }
            catch
            {
                return null;
            }
        }
    }
}
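
When malformed values are expected rather than exceptional (as they often are with scraped data), the TryParse family avoids throwing exceptions entirely. Below is a minimal sketch of the same parsers using TryParse; the accepted culture and date formats are assumptions you would adjust for the sites you scrape:

using System;
using System.Globalization;

public class SafeProductParser
{
    public decimal? ParsePrice(string priceText)
    {
        if (string.IsNullOrWhiteSpace(priceText))
            return null;

        // NumberStyles.Currency tolerates currency symbols and thousands separators
        if (decimal.TryParse(priceText.Trim(), NumberStyles.Currency,
                CultureInfo.GetCultureInfo("en-US"), out decimal price))
        {
            return price;
        }

        Console.WriteLine($"Unparsable price: '{priceText}'");
        return null;
    }

    public DateTime? ParseDate(string dateText)
    {
        if (string.IsNullOrWhiteSpace(dateText))
            return null;

        // Assumed formats; extend this array for the sites you scrape
        string[] formats = { "MM/dd/yyyy", "yyyy-MM-dd", "dd MMM yyyy" };

        if (DateTime.TryParseExact(dateText.Trim(), formats,
                CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime date))
        {
            return date;
        }

        Console.WriteLine($"Unparsable date: '{dateText}'");
        return null;
    }
}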

Comprehensive Error Logging

Implement structured logging for debugging and monitoring:

using System.IO;
using System.Text;

public class ErrorLogger
{
    private readonly string logPath;

    public ErrorLogger(string logPath = "scraper_errors.log")
    {
        this.logPath = logPath;
    }

    public void LogError(string url, Exception ex)
    {
        try
        {
            var logEntry = new StringBuilder();
            logEntry.AppendLine($"[{DateTime.UtcNow:yyyy-MM-dd HH:mm:ss}]");
            logEntry.AppendLine($"URL: {url}");
            logEntry.AppendLine($"Exception Type: {ex.GetType().Name}");
            logEntry.AppendLine($"Message: {ex.Message}");
            logEntry.AppendLine($"Stack Trace: {ex.StackTrace}");
            logEntry.AppendLine(new string('-', 80));

            File.AppendAllText(logPath, logEntry.ToString());
        }
        catch (Exception logEx)
        {
            Console.WriteLine($"Failed to write log: {logEx.Message}");
        }
    }
}

// Usage in scraper
public class ScraperWithLogging
{
    private readonly ErrorLogger logger = new ErrorLogger();
    private static readonly HttpClient client = new HttpClient();

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            var response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (Exception ex)
        {
            logger.LogError(url, ex);
            throw;
        }
    }
}

Best Practices for Exception Handling in Web Scraping

  1. Be Specific: Catch specific exceptions before general ones
  2. Don't Swallow Exceptions: Always log or handle exceptions appropriately
  3. Use Finally Blocks: Ensure resources are cleaned up even when exceptions occur
  4. Implement Retries: Use exponential backoff for transient failures
  5. Validate Input: Check URLs and parameters before making requests
  6. Use Timeout Values: Always set reasonable timeouts to prevent hanging
  7. Log Contextual Information: Include URLs, timestamps, and relevant data in logs
  8. Handle Async Exceptions: Use try-catch with async/await patterns correctly
  9. Consider Circuit Breakers: Stop attempting requests to consistently failing sources (see the sketch after this list)
  10. Test Error Paths: Write tests that simulate various failure scenarios
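
Most of these practices already appear in the examples above; items 5, 6, and 9 are combined in the minimal sketch below. The failure threshold and class name are illustrative assumptions rather than a prescribed design:

using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading.Tasks;

public class CircuitBreakerScraper
{
    // Item 6: set a client-wide timeout so requests cannot hang indefinitely
    private static readonly HttpClient client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    // Item 9: track consecutive failures per host and stop calling hosts that keep failing
    private readonly ConcurrentDictionary<string, int> failuresByHost = new ConcurrentDictionary<string, int>();
    private const int FailureThreshold = 5; // illustrative threshold

    public async Task<string> FetchPageAsync(string url)
    {
        // Item 5: validate the URL before making any request
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            Console.WriteLine($"Skipping invalid URL: {url}");
            return null;
        }

        // Circuit is "open" for this host: skip the request instead of failing again
        if (failuresByHost.TryGetValue(uri.Host, out int failures) && failures >= FailureThreshold)
        {
            Console.WriteLine($"Circuit open for {uri.Host}; skipping {url}");
            return null;
        }

        try
        {
            var response = await client.GetAsync(uri);
            response.EnsureSuccessStatusCode();
            failuresByHost[uri.Host] = 0; // success resets the failure count
            return await response.Content.ReadAsStringAsync();
        }
        catch (Exception ex)
        {
            int count = failuresByHost.AddOrUpdate(uri.Host, 1, (_, current) => current + 1);
            Console.WriteLine($"Failure {count} for {uri.Host}: {ex.Message}");
            return null;
        }
    }
}

A production circuit breaker would also re-close after a cooldown period; libraries such as Polly provide this behaviour out of the box.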

Complete Example: Robust Web Scraper

Here's a complete example incorporating all error handling techniques:

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class RobustWebScraper
{
    private static readonly HttpClient client = new HttpClient();
    private readonly ErrorLogger logger = new ErrorLogger();

    public async Task<List<Product>> ScrapeProductsAsync(string url)
    {
        var products = new List<Product>();

        try
        {
            string html = await FetchWithRetryAsync(url);

            if (string.IsNullOrEmpty(html))
            {
                Console.WriteLine($"Failed to fetch content from {url}");
                return products;
            }

            products = ExtractProducts(html);
        }
        catch (Exception ex)
        {
            logger.LogError(url, ex);
            Console.WriteLine($"Fatal error scraping {url}: {ex.Message}");
        }

        return products;
    }

    private async Task<string> FetchWithRetryAsync(string url, int maxRetries = 3)
    {
        for (int i = 0; i < maxRetries; i++)
        {
            try
            {
                using (var cts = new System.Threading.CancellationTokenSource(TimeSpan.FromSeconds(30)))
                {
                    var response = await client.GetAsync(url, cts.Token);
                    response.EnsureSuccessStatusCode();
                    return await response.Content.ReadAsStringAsync();
                }
            }
            catch (Exception ex) when (i < maxRetries - 1)
            {
                Console.WriteLine($"Attempt {i + 1} failed, retrying...");
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, i)));
            }
        }

        return null;
    }

    private List<Product> ExtractProducts(string html)
    {
        var products = new List<Product>();

        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var productNodes = doc.DocumentNode.SelectNodes("//div[@class='product']");

            if (productNodes == null)
                return products;

            foreach (var node in productNodes)
            {
                try
                {
                    var product = new Product
                    {
                        Title = node.SelectSingleNode(".//h2")?.InnerText?.Trim(),
                        Price = ParsePrice(node.SelectSingleNode(".//span[@class='price']")?.InnerText)
                    };

                    if (!string.IsNullOrEmpty(product.Title))
                    {
                        products.Add(product);
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"Error extracting product: {ex.Message}");
                    // Continue processing other products
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
        }

        return products;
    }

    private decimal? ParsePrice(string priceText)
    {
        try
        {
            if (string.IsNullOrWhiteSpace(priceText))
                return null;

            string clean = priceText.Replace("$", "").Replace(",", "").Trim();
            return decimal.Parse(clean);
        }
        catch
        {
            return null;
        }
    }
}

public class Product
{
    public string Title { get; set; }
    public decimal? Price { get; set; }
}

Conclusion

Implementing comprehensive try-catch error handling in C# web scraping requires understanding the exception types you'll encounter, using specific catch blocks, implementing retry logic, and maintaining detailed logs. By following these patterns and best practices, you'll create resilient scrapers that handle failures gracefully and provide valuable debugging information when issues occur.

For complex scraping scenarios requiring JavaScript rendering and advanced error handling, consider using headless browser solutions or specialized web scraping APIs that handle many of these edge cases automatically.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
