Is Html Agility Pack capable of making HTTP requests?

Short Answer

No, not in any full-featured way. Html Agility Pack is a .NET library designed for parsing and manipulating HTML documents, not for network communication. It does ship a small HtmlWeb convenience class that can download a single page, but that helper offers little control over requests, so production scrapers pair the library with HttpClient.

What Html Agility Pack Does

Html Agility Pack excels at:

  • Parsing HTML documents, including malformed real-world markup
  • Navigating DOM structures
  • Extracting data with XPath queries (CSS selectors are available through extension packages such as Fizzler)
  • Loading HTML from strings, files, or streams
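
For instance, parsing never touches the network. A minimal sketch of pure parsing, loading markup from an in-memory string (the HTML literal is invented for illustration):

using System;
using HtmlAgilityPack;

class ParseOnlyDemo
{
    static void Main()
    {
        // No network involved: the HTML comes from a plain string
        var html = "<html><body><h1>Hello</h1><a href='/docs'>Docs</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath query against the in-memory DOM
        var link = doc.DocumentNode.SelectSingleNode("//a[@href]");
        Console.WriteLine(link?.GetAttributeValue("href", "")); // prints: /docs
    }
}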

The Correct Approach: HttpClient + Html Agility Pack

For web scraping in .NET, combine HttpClient (for HTTP requests) with Html Agility Pack (for HTML parsing):

Basic Example

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main(string[] args)
    {
        using var httpClient = new HttpClient();

        try
        {
            // Make HTTP request
            string url = "https://example.com";
            var response = await httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();

            // Get HTML content
            var htmlContent = await response.Content.ReadAsStringAsync();

            // Parse with Html Agility Pack
            var doc = new HtmlDocument();
            doc.LoadHtml(htmlContent);

            // Extract data
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                {
                    var href = link.GetAttributeValue("href", "");
                    var text = link.InnerText?.Trim();
                    Console.WriteLine($"Link: {text} -> {href}");
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

Production-Ready Example with Error Handling

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class WebScraper
{
    private static readonly HttpClient _httpClient = new HttpClient();

    static WebScraper()
    {
        // Configure HttpClient for web scraping
        _httpClient.DefaultRequestHeaders.Add("User-Agent", 
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        _httpClient.Timeout = TimeSpan.FromSeconds(30);
    }

    public static async Task<HtmlDocument> LoadPageAsync(string url)
    {
        try
        {
            using var response = await _httpClient.GetAsync(url);

            if (!response.IsSuccessStatusCode)
            {
                throw new HttpRequestException($"HTTP {response.StatusCode}: {response.ReasonPhrase}");
            }

            var html = await response.Content.ReadAsStringAsync();
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            return doc;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"Network error: {ex.Message}");
            throw;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Unexpected error: {ex.Message}");
            throw;
        }
    }

    public static async Task Main(string[] args)
    {
        try
        {
            var doc = await LoadPageAsync("https://example.com");

            // Extract page title
            var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
            Console.WriteLine($"Page Title: {title}");

            // Extract all headings
            var headings = doc.DocumentNode.SelectNodes("//h1 | //h2 | //h3");
            if (headings != null)
            {
                foreach (var heading in headings)
                {
                    Console.WriteLine($"{heading.Name.ToUpper()}: {heading.InnerText?.Trim()}");
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Scraping failed: {ex.Message}");
        }
    }
}

Best Practices

1. HttpClient Management

// ✅ Good: Reuse HttpClient instance
private static readonly HttpClient _httpClient = new HttpClient();

// ❌ Bad: A new instance per request can exhaust sockets under load
using var client = new HttpClient(); // Don't create one per request
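
In larger applications, IHttpClientFactory (from the Microsoft.Extensions.Http package) manages handler lifetimes and DNS refresh for you. A minimal sketch using a bare ServiceCollection rather than a full ASP.NET Core host:

using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register a named client; the factory pools and recycles the handlers
services.AddHttpClient("scraper", client =>
{
    client.Timeout = TimeSpan.FromSeconds(30);
});

using var provider = services.BuildServiceProvider();
var factory = provider.GetRequiredService<IHttpClientFactory>();

// Creating a client per request is fine here; handlers are reused internally
var httpClient = factory.CreateClient("scraper");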

2. Set Proper Headers

_httpClient.DefaultRequestHeaders.Add("User-Agent", 
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

3. Handle Different Content Types

var response = await httpClient.GetAsync(url);
var contentType = response.Content.Headers.ContentType?.MediaType;

if (contentType == "text/html" || contentType == "application/xhtml+xml")
{
    var html = await response.Content.ReadAsStringAsync();
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Process HTML
}
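
A related wrinkle is character encoding: ReadAsStringAsync trusts the charset declared in the response headers. When that header is missing or wrong, one option is to pass the raw bytes to Html Agility Pack and let it detect the encoding itself:

var bytes = await response.Content.ReadAsByteArrayAsync();
using var stream = new MemoryStream(bytes); // requires System.IO

var detectedDoc = new HtmlDocument();
// Reads the byte order mark; HAP can also honor charset declarations
// in the document (OptionReadEncoding, on by default)
detectedDoc.Load(stream, detectEncodingFromByteOrderMarks: true);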

4. Implement Retry Logic

public static async Task<string> GetHtmlWithRetryAsync(string url, int maxRetries = 3)
{
    for (int i = 0; i < maxRetries; i++)
    {
        try
        {
            var response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException) when (i < maxRetries - 1)
        {
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, i))); // Exponential backoff between attempts (1s, 2s, ...)
        }
    }
    throw new HttpRequestException($"Failed to fetch {url} after {maxRetries} attempts");
}
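
Once the HTML is fetched, parsing proceeds exactly as before:

var html = await GetHtmlWithRetryAsync("https://example.com");

var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title")?.InnerText);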

Alternative HTTP Libraries

While HttpClient is the standard choice, you might also consider:

  • RestSharp: Higher-level REST client
  • Flurl: Fluent URL building and HTTP client
  • Refit: Type-safe REST library
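
Whichever client you choose, the Html Agility Pack side stays identical. For example, with Flurl.Http the fetch collapses to a one-liner (a sketch, assuming the Flurl.Http NuGet package is installed):

using Flurl.Http;
using HtmlAgilityPack;

// Flurl's string extension performs the GET and returns the body
var html = await "https://example.com".GetStringAsync();

var doc = new HtmlDocument();
doc.LoadHtml(html);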

Summary

Html Agility Pack focuses solely on HTML parsing and manipulation. For complete web scraping solutions in .NET:

  1. Use HttpClient to fetch web pages
  2. Use Html Agility Pack to parse the HTML
  3. Follow best practices for network requests and error handling
  4. Consider using dependency injection for HttpClient in larger applications
