Short Answer
Essentially no. Html Agility Pack is a .NET library designed for parsing and manipulating HTML documents, not for network communication. It does ship a small convenience loader, HtmlWeb, that can download a page for you, but it exposes far fewer options (headers, timeouts, retries) than a dedicated HTTP client.
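If all you need is a quick fetch, that built-in helper is enough; a minimal sketch (the URL is a placeholder):

```csharp
using HtmlAgilityPack;

// HtmlWeb is the library's built-in convenience loader.
// It works for simple cases but offers far less control than HttpClient.
var doc = new HtmlWeb().Load("https://example.com");
```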
What Html Agility Pack Does
Html Agility Pack excels at:

- Parsing HTML documents
- Navigating DOM structures
- Extracting data with XPath queries (CSS selectors are available through extension packages such as Fizzler)
- Loading HTML from strings, files, or streams
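For example, parsing markup you already have in memory takes only a few lines; a minimal sketch with inline HTML:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><h1>Hello</h1><p>World</p></body></html>");

// Query the parsed DOM with XPath
var heading = doc.DocumentNode.SelectSingleNode("//h1");
Console.WriteLine(heading.InnerText); // prints: Hello
```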
The Correct Approach: HttpClient + Html Agility Pack
For web scraping in .NET, combine HttpClient (for HTTP requests) with Html Agility Pack (for HTML parsing):
Basic Example
```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main(string[] args)
    {
        using var httpClient = new HttpClient();

        try
        {
            // Make HTTP request
            string url = "https://example.com";
            var response = await httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();

            // Get HTML content
            var htmlContent = await response.Content.ReadAsStringAsync();

            // Parse with Html Agility Pack
            var doc = new HtmlDocument();
            doc.LoadHtml(htmlContent);

            // Extract data
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                {
                    var href = link.GetAttributeValue("href", "");
                    var text = link.InnerText?.Trim();
                    Console.WriteLine($"Link: {text} -> {href}");
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}
```
Production-Ready Example with Error Handling
```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class WebScraper
{
    private static readonly HttpClient _httpClient = new HttpClient();

    static WebScraper()
    {
        // Configure HttpClient for web scraping
        _httpClient.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        _httpClient.Timeout = TimeSpan.FromSeconds(30);
    }

    public static async Task<HtmlDocument> LoadPageAsync(string url)
    {
        try
        {
            using var response = await _httpClient.GetAsync(url);
            if (!response.IsSuccessStatusCode)
            {
                throw new HttpRequestException($"HTTP {response.StatusCode}: {response.ReasonPhrase}");
            }

            var html = await response.Content.ReadAsStringAsync();
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc;
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"Network error: {ex.Message}");
            throw;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Unexpected error: {ex.Message}");
            throw;
        }
    }

    public static async Task Main(string[] args)
    {
        try
        {
            var doc = await LoadPageAsync("https://example.com");

            // Extract page title
            var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
            Console.WriteLine($"Page Title: {title}");

            // Extract all headings
            var headings = doc.DocumentNode.SelectNodes("//h1 | //h2 | //h3");
            if (headings != null)
            {
                foreach (var heading in headings)
                {
                    Console.WriteLine($"{heading.Name.ToUpper()}: {heading.InnerText?.Trim()}");
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Scraping failed: {ex.Message}");
        }
    }
}
```
Best Practices
1. HttpClient Management
```csharp
// ✅ Good: Reuse a single HttpClient instance
private static readonly HttpClient _httpClient = new HttpClient();

// ❌ Bad: Creating a new instance per request can exhaust sockets
using var client = new HttpClient(); // Don't do this repeatedly
```
2. Set Proper Headers
```csharp
_httpClient.DefaultRequestHeaders.Add("User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
```
3. Handle Different Content Types
```csharp
var response = await httpClient.GetAsync(url);
var contentType = response.Content.Headers.ContentType?.MediaType;

if (contentType == "text/html" || contentType == "application/xhtml+xml")
{
    var html = await response.Content.ReadAsStringAsync();
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Process HTML
}
```
4. Implement Retry Logic
```csharp
public static async Task<string> GetHtmlWithRetryAsync(string url, int maxRetries = 3)
{
    for (int i = 0; i < maxRetries; i++)
    {
        try
        {
            var response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException) when (i < maxRetries - 1)
        {
            await Task.Delay(1000 * (i + 1)); // Linear backoff: 1s, 2s, 3s...
        }
    }

    throw new HttpRequestException($"Failed to fetch {url} after {maxRetries} attempts");
}
```
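If you would rather not hand-roll the loop, a resilience library such as Polly can express the same idea declaratively. A minimal sketch, assuming the Polly NuGet package (the policy shape and delays are illustrative):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

public static class ResilientFetcher
{
    private static readonly HttpClient _httpClient = new HttpClient();

    public static Task<string> GetHtmlAsync(string url)
    {
        // Retry up to 3 times on network failures, doubling the wait each attempt
        var retryPolicy = Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        return retryPolicy.ExecuteAsync(async () =>
        {
            var response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        });
    }
}
```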
Alternative HTTP Libraries
While HttpClient is the standard choice, you might also consider the libraries below (a short Flurl example follows the list):
- RestSharp: Higher-level REST client
- Flurl: Fluent URL building and HTTP client
- Refit: Type-safe REST library
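As a taste of the fluent style, fetching a page with Flurl.Http and handing it to Html Agility Pack might look like this (a sketch assuming the Flurl.Http package):

```csharp
using Flurl.Http;
using HtmlAgilityPack;

// Flurl adds HTTP verbs directly onto strings and URLs
var html = await "https://example.com".GetStringAsync();

var doc = new HtmlDocument();
doc.LoadHtml(html);
```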
Summary
Html Agility Pack's strength is HTML parsing and manipulation. For a complete web scraping solution in .NET:
- Use HttpClient to fetch web pages
- Use Html Agility Pack to parse the HTML
- Follow best practices for network requests and error handling
- Consider dependency injection for HttpClient in larger applications, as sketched below
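A minimal sketch of that dependency-injection route, assuming the Microsoft.Extensions.Http package (bare registration shown; real applications would typically configure named or typed clients):

```csharp
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();
services.AddHttpClient(); // registers IHttpClientFactory

using var provider = services.BuildServiceProvider();
var factory = provider.GetRequiredService<IHttpClientFactory>();

// Factory-created clients share pooled message handlers,
// avoiding socket exhaustion and stale DNS entries
var client = factory.CreateClient();
```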