How Do I Implement Try-Catch Error Handling in C# Web Scraping?

Error handling is critical in web scraping applications because network requests, HTML parsing, and data extraction can fail in numerous ways. Implementing proper try-catch blocks in C# ensures your scraper handles exceptions gracefully, logs errors effectively, and continues operating even when encountering problematic pages.

Understanding Common Web Scraping Exceptions

Before implementing error handling, you need to understand the types of exceptions that occur during web scraping:

  • HttpRequestException: Network connectivity issues, DNS failures, or unreachable servers
  • TaskCanceledException: Request timeouts or cancelled operations
  • WebException: HTTP protocol errors and server-side issues (thrown by the legacy WebRequest/HttpWebRequest APIs rather than HttpClient)
  • NullReferenceException: Missing HTML elements or null parsing results
  • FormatException: Invalid data formats during conversion
  • ArgumentException: Invalid URLs or parameters

Basic Try-Catch Structure for HTTP Requests

When making HTTP requests with HttpClient, wrap your code in try-catch blocks to handle network-related exceptions:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class WebScraper
{
    private static readonly HttpClient client = new HttpClient();

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            HttpResponseMessage response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            string content = await response.Content.ReadAsStringAsync();
            return content;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"HTTP request failed: {ex.Message}");
            return null;
        }
        catch (TaskCanceledException ex)
        {
            Console.WriteLine($"Request timeout: {ex.Message}");
            return null;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Unexpected error: {ex.Message}");
            return null;
        }
    }
}

Handling Timeout Exceptions

Setting appropriate timeout values is crucial for web scraping. Here's how to handle timeout exceptions specifically:

public async Task<string> FetchWithTimeoutAsync(string url, int timeoutSeconds = 30)
{
    try
    {
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(timeoutSeconds)))
        {
            HttpResponseMessage response = await client.GetAsync(url, cts.Token);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
    catch (OperationCanceledException)
    {
        Console.WriteLine($"Request to {url} timed out after {timeoutSeconds} seconds");
        return null;
    }
    catch (HttpRequestException ex) when (ex.InnerException is TimeoutException)
    {
        Console.WriteLine($"Connection timeout for {url}: {ex.Message}");
        return null;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"HTTP error for {url}: {ex.Message}");
        return null;
    }
}

Exception Filters for Granular Error Handling

C# exception filters (the when clause) allow you to handle specific error scenarios without catching every exception of a type. On .NET 5 and later, HttpRequestException exposes a StatusCode property, which is more reliable than parsing the exception message:

public async Task<string> FetchWithStatusHandlingAsync(string url)
{
    try
    {
        HttpResponseMessage response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("404"))
    {
        Console.WriteLine($"Page not found: {url}");
        return null;
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("429"))
    {
        Console.WriteLine($"Rate limited on {url}. Implementing backoff...");
        await Task.Delay(5000);
        return await FetchWithStatusHandlingAsync(url); // Retry after delay
    }
    catch (HttpRequestException ex) when (ex.Message.Contains("503"))
    {
        Console.WriteLine($"Service unavailable: {url}");
        return null;
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"Other HTTP error: {ex.Message}");
        return null;
    }
}

Handling HTML Parsing Exceptions

When parsing HTML with libraries like HtmlAgilityPack, protect against missing elements and parsing errors:

using HtmlAgilityPack;
using System.Collections.Generic;

public class DataExtractor
{
    public List<string> ExtractProductTitles(string html)
    {
        var titles = new List<string>();

        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var nodes = doc.DocumentNode.SelectNodes("//h2[@class='product-title']");

            if (nodes == null)
            {
                Console.WriteLine("No product titles found on page");
                return titles;
            }

            foreach (var node in nodes)
            {
                try
                {
                    string title = node.InnerText?.Trim();
                    if (!string.IsNullOrEmpty(title))
                    {
                        titles.Add(title);
                    }
                }
                catch (NullReferenceException ex)
                {
                    Console.WriteLine($"Null reference while extracting title: {ex.Message}");
                    continue; // Skip this item and continue
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
        }

        return titles;
    }
}

Implementing Retry Logic with Exception Handling

Robust scrapers implement retry mechanisms for transient failures:

public async Task<string> FetchWithRetryAsync(string url, int maxRetries = 3)
{
    int retryCount = 0;
    TimeSpan delay = TimeSpan.FromSeconds(2);

    while (retryCount < maxRetries)
    {
        try
        {
            HttpResponseMessage response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException ex)
        {
            retryCount++;
            Console.WriteLine($"Attempt {retryCount} failed for {url}: {ex.Message}");

            if (retryCount >= maxRetries)
            {
                Console.WriteLine($"Max retries reached for {url}");
                throw; // Re-throw after max retries
            }

            Console.WriteLine($"Retrying in {delay.TotalSeconds} seconds...");
            await Task.Delay(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // Exponential backoff
        }
        catch (TaskCanceledException)
        {
            retryCount++;
            Console.WriteLine($"Timeout on attempt {retryCount} for {url}");

            if (retryCount >= maxRetries)
            {
                Console.WriteLine($"Max retries reached after timeout for {url}");
                return null;
            }

            await Task.Delay(delay);
        }
    }

    return null;
}
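
If you would rather not hand-roll the retry loop above, the same policy can be expressed with a resilience library. The following is a minimal sketch assuming the Polly NuGet package is referenced; the retry count, delays, and class name are illustrative choices, not part of the original example:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public class PollyRetryScraper
{
    private static readonly HttpClient client = new HttpClient();

    // Retry up to 3 times on network errors or timeouts, waiting 2s, 4s, then 8s
    private static readonly AsyncRetryPolicy retryPolicy = Policy
        .Handle<HttpRequestException>()
        .Or<TaskCanceledException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            return await retryPolicy.ExecuteAsync(async () =>
            {
                HttpResponseMessage response = await client.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            });
        }
        catch (Exception ex)
        {
            Console.WriteLine($"All retries failed for {url}: {ex.Message}");
            return null;
        }
    }
}

Polly also ships timeout and circuit-breaker policies that compose with the retry policy shown here.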

Handling SSL Certificate Errors

When scraping websites with SSL certificate issues, you need specialized error handling:

using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public class SecureWebScraper
{
    private HttpClientHandler CreateHandler()
    {
        var handler = new HttpClientHandler();

        // Custom certificate validation
        handler.ServerCertificateCustomValidationCallback =
            (sender, cert, chain, sslPolicyErrors) =>
            {
                if (sslPolicyErrors == SslPolicyErrors.None)
                    return true;

                Console.WriteLine($"SSL Error: {sslPolicyErrors}");

                // Decide whether to proceed (use with caution)
                return false; // Reject invalid certificates by default
            };

        return handler;
    }

    public async Task<string> FetchSecurePageAsync(string url)
    {
        try
        {
            using (var handler = CreateHandler())
            using (var client = new HttpClient(handler))
            {
                var response = await client.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException ex) when (ex.InnerException is System.Security.Authentication.AuthenticationException)
        {
            Console.WriteLine($"SSL/TLS authentication failed for {url}: {ex.Message}");
            return null;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"Request failed for {url}: {ex.Message}");
            return null;
        }
    }
}

Data Conversion Exception Handling

When converting scraped data to specific types, handle format exceptions:

public class ProductParser
{
    public decimal? ParsePrice(string priceText)
    {
        // Guard against null input; calling Replace on a null string would throw an uncaught NullReferenceException
        if (string.IsNullOrWhiteSpace(priceText))
            return null;

        try
        {
            // Remove currency symbols and whitespace
            string cleanPrice = priceText
                .Replace("$", "")
                .Replace(",", "")
                .Trim();

            return decimal.Parse(cleanPrice);
        }
        catch (FormatException ex)
        {
            Console.WriteLine($"Invalid price format: '{priceText}' - {ex.Message}");
            return null;
        }
        catch (ArgumentNullException ex)
        {
            Console.WriteLine($"Null price value: {ex.Message}");
            return null;
        }
    }

    public DateTime? ParseDate(string dateText)
    {
        try
        {
            return DateTime.Parse(dateText);
        }
        catch (FormatException ex)
        {
            Console.WriteLine($"Invalid date format: '{dateText}' - {ex.Message}");

            // Try alternative formats
            try
            {
                return DateTime.ParseExact(dateText, "MM/dd/yyyy", System.Globalization.CultureInfo.InvariantCulture);
            }
            catch
            {
                return null;
            }
        }
    }
}
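
When malformed values are expected rather than exceptional (as they often are with scraped data), the TryParse family avoids throwing exceptions entirely. Below is a minimal sketch of the same parsers using TryParse; the accepted culture and date formats are assumptions you would adjust for the sites you scrape:

using System;
using System.Globalization;

public class SafeProductParser
{
    public decimal? ParsePrice(string priceText)
    {
        if (string.IsNullOrWhiteSpace(priceText))
            return null;

        // NumberStyles.Currency tolerates currency symbols and thousands separators
        if (decimal.TryParse(priceText.Trim(), NumberStyles.Currency,
                CultureInfo.GetCultureInfo("en-US"), out decimal price))
        {
            return price;
        }

        Console.WriteLine($"Unparsable price: '{priceText}'");
        return null;
    }

    public DateTime? ParseDate(string dateText)
    {
        if (string.IsNullOrWhiteSpace(dateText))
            return null;

        // Assumed formats; extend this array for the sites you scrape
        string[] formats = { "MM/dd/yyyy", "yyyy-MM-dd", "dd MMM yyyy" };

        if (DateTime.TryParseExact(dateText.Trim(), formats,
                CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime date))
        {
            return date;
        }

        Console.WriteLine($"Unparsable date: '{dateText}'");
        return null;
    }
}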

Comprehensive Error Logging

Implement structured logging for debugging and monitoring:

using System.IO;
using System.Text;

public class ErrorLogger
{
    private readonly string logPath;

    public ErrorLogger(string logPath = "scraper_errors.log")
    {
        this.logPath = logPath;
    }

    public void LogError(string url, Exception ex)
    {
        try
        {
            var logEntry = new StringBuilder();
            logEntry.AppendLine($"[{DateTime.UtcNow:yyyy-MM-dd HH:mm:ss}]");
            logEntry.AppendLine($"URL: {url}");
            logEntry.AppendLine($"Exception Type: {ex.GetType().Name}");
            logEntry.AppendLine($"Message: {ex.Message}");
            logEntry.AppendLine($"Stack Trace: {ex.StackTrace}");
            logEntry.AppendLine(new string('-', 80));

            File.AppendAllText(logPath, logEntry.ToString());
        }
        catch (Exception logEx)
        {
            Console.WriteLine($"Failed to write log: {logEx.Message}");
        }
    }
}

// Usage in scraper
public class ScraperWithLogging
{
    private readonly ErrorLogger logger = new ErrorLogger();
    private static readonly HttpClient client = new HttpClient();

    public async Task<string> FetchPageAsync(string url)
    {
        try
        {
            var response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (Exception ex)
        {
            logger.LogError(url, ex);
            throw;
        }
    }
}

Best Practices for Exception Handling in Web Scraping

  1. Be Specific: Catch specific exceptions before general ones
  2. Don't Swallow Exceptions: Always log or handle exceptions appropriately
  3. Use Finally Blocks: Ensure resources are cleaned up even when exceptions occur
  4. Implement Retries: Use exponential backoff for transient failures
  5. Validate Input: Check URLs and parameters before making requests
  6. Use Timeout Values: Always set reasonable timeouts to prevent hanging
  7. Log Contextual Information: Include URLs, timestamps, and relevant data in logs
  8. Handle Async Exceptions: Use try-catch with async/await patterns correctly
  9. Consider Circuit Breakers: Stop attempting requests to consistently failing sources (see the sketch after this list)
  10. Test Error Paths: Write tests that simulate various failure scenarios
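
Most of these practices already appear in the examples above; items 5, 6, and 9 are combined in the minimal sketch below. The failure threshold and class name are illustrative assumptions rather than a prescribed design:

using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading.Tasks;

public class CircuitBreakerScraper
{
    // Item 6: set a client-wide timeout so requests cannot hang indefinitely
    private static readonly HttpClient client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    // Item 9: track consecutive failures per host and stop calling hosts that keep failing
    private readonly ConcurrentDictionary<string, int> failuresByHost = new ConcurrentDictionary<string, int>();
    private const int FailureThreshold = 5; // illustrative threshold

    public async Task<string> FetchPageAsync(string url)
    {
        // Item 5: validate the URL before making any request
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            Console.WriteLine($"Skipping invalid URL: {url}");
            return null;
        }

        // Circuit is "open" for this host: skip the request instead of failing again
        if (failuresByHost.TryGetValue(uri.Host, out int failures) && failures >= FailureThreshold)
        {
            Console.WriteLine($"Circuit open for {uri.Host}; skipping {url}");
            return null;
        }

        try
        {
            var response = await client.GetAsync(uri);
            response.EnsureSuccessStatusCode();
            failuresByHost[uri.Host] = 0; // success resets the failure count
            return await response.Content.ReadAsStringAsync();
        }
        catch (Exception ex)
        {
            int count = failuresByHost.AddOrUpdate(uri.Host, 1, (_, current) => current + 1);
            Console.WriteLine($"Failure {count} for {uri.Host}: {ex.Message}");
            return null;
        }
    }
}

A production circuit breaker would also re-close after a cooldown period; libraries such as Polly provide this behaviour out of the box.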

Complete Example: Robust Web Scraper

Here's a complete example incorporating all error handling techniques:

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class RobustWebScraper
{
    private static readonly HttpClient client = new HttpClient();
    private readonly ErrorLogger logger = new ErrorLogger();

    public async Task<List<Product>> ScrapeProductsAsync(string url)
    {
        var products = new List<Product>();

        try
        {
            string html = await FetchWithRetryAsync(url);

            if (string.IsNullOrEmpty(html))
            {
                Console.WriteLine($"Failed to fetch content from {url}");
                return products;
            }

            products = ExtractProducts(html);
        }
        catch (Exception ex)
        {
            logger.LogError(url, ex);
            Console.WriteLine($"Fatal error scraping {url}: {ex.Message}");
        }

        return products;
    }

    private async Task<string> FetchWithRetryAsync(string url, int maxRetries = 3)
    {
        for (int i = 0; i < maxRetries; i++)
        {
            try
            {
                using (var cts = new System.Threading.CancellationTokenSource(TimeSpan.FromSeconds(30)))
                {
                    var response = await client.GetAsync(url, cts.Token);
                    response.EnsureSuccessStatusCode();
                    return await response.Content.ReadAsStringAsync();
                }
            }
            catch (Exception ex) when (i < maxRetries - 1)
            {
                Console.WriteLine($"Attempt {i + 1} failed, retrying...");
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, i)));
            }
        }

        return null;
    }

    private List<Product> ExtractProducts(string html)
    {
        var products = new List<Product>();

        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var productNodes = doc.DocumentNode.SelectNodes("//div[@class='product']");

            if (productNodes == null)
                return products;

            foreach (var node in productNodes)
            {
                try
                {
                    var product = new Product
                    {
                        Title = node.SelectSingleNode(".//h2")?.InnerText?.Trim(),
                        Price = ParsePrice(node.SelectSingleNode(".//span[@class='price']")?.InnerText)
                    };

                    if (!string.IsNullOrEmpty(product.Title))
                    {
                        products.Add(product);
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"Error extracting product: {ex.Message}");
                    // Continue processing other products
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error parsing HTML: {ex.Message}");
        }

        return products;
    }

    private decimal? ParsePrice(string priceText)
    {
        try
        {
            if (string.IsNullOrWhiteSpace(priceText))
                return null;

            string clean = priceText.Replace("$", "").Replace(",", "").Trim();
            return decimal.Parse(clean);
        }
        catch
        {
            return null;
        }
    }
}

public class Product
{
    public string Title { get; set; }
    public decimal? Price { get; set; }
}

Conclusion

Implementing comprehensive try-catch error handling in C# web scraping requires understanding the exception types you'll encounter, using specific catch blocks, implementing retry logic, and maintaining detailed logs. By following these patterns and best practices, you'll create resilient scrapers that handle failures gracefully and provide valuable debugging information when issues occur.

For complex scraping scenarios requiring JavaScript rendering and advanced error handling, consider using headless browser solutions or specialized web scraping APIs that handle many of these edge cases automatically.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
