HttpClient in C# is designed for making HTTP requests and receiving responses, but it has no built-in HTML parsing capabilities. However, you can easily combine HttpClient with an HTML parsing library to fetch and parse web pages effectively.
Quick Answer
While HttpClient cannot parse HTML directly, you can use it to fetch HTML content and then parse it with libraries like HtmlAgilityPack or AngleSharp.
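In its shortest form, the whole round trip is only a few lines. A minimal sketch as a top-level program, assuming the HtmlAgilityPack package is installed (see Installation below; the URL is a placeholder):

using System;
using System.Net.Http;
using HtmlAgilityPack;

// Fetch with HttpClient, parse with HtmlAgilityPack
using var client = new HttpClient();
var html = await client.GetStringAsync("https://example.com");

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Print the page title, if there is one
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title")?.InnerText);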
Complete Example with HtmlAgilityPack
Here's a comprehensive example showing how to fetch and parse HTML:
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class WebScraper : IDisposable
{
    private readonly HttpClient _httpClient;

    public WebScraper()
    {
        _httpClient = new HttpClient();
        // Set a user agent to avoid being blocked
        _httpClient.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
    }

    public async Task<List<string>> ExtractLinksAsync(string url)
    {
        var links = new List<string>();
        try
        {
            // Fetch the raw HTML
            var html = await _httpClient.GetStringAsync(url);

            // Parse it with HtmlAgilityPack
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Extract every anchor that has an href attribute
            var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
            if (linkNodes != null)
            {
                foreach (var node in linkNodes)
                {
                    var href = node.GetAttributeValue("href", "");
                    if (!string.IsNullOrEmpty(href))
                    {
                        links.Add(href);
                    }
                }
            }
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine($"HTTP Error: {ex.Message}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
        return links;
    }

    public void Dispose()
    {
        _httpClient?.Dispose();
    }
}

// Usage example
class Program
{
    static async Task Main(string[] args)
    {
        // WebScraper implements IDisposable, so "using" disposes it for us
        using var scraper = new WebScraper();
        var links = await scraper.ExtractLinksAsync("https://example.com");
        foreach (var link in links)
        {
            Console.WriteLine($"Found link: {link}");
        }
    }
}
Alternative: Using AngleSharp
AngleSharp is another excellent HTML parsing library with CSS selector support:
using System;
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Html.Dom;
public async Task ParseWithAngleSharp(string url)
{
    using var httpClient = new HttpClient();
    var html = await httpClient.GetStringAsync(url);

    // Create AngleSharp configuration
    var config = Configuration.Default;
    var context = BrowsingContext.New(config);

    // Parse the HTML
    var document = await context.OpenAsync(req => req.Content(html));

    // Use CSS selectors
    var titles = document.QuerySelectorAll("h1, h2, h3");
    foreach (var title in titles)
    {
        Console.WriteLine($"Title: {title.TextContent}");
    }
}
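AngleSharp can also download the page itself, which makes HttpClient optional for simple cases. A short sketch using AngleSharp's built-in loader (the class name and URL are placeholders):

using System;
using System.Threading.Tasks;
using AngleSharp;

public static class AngleSharpLoaderExample
{
    public static async Task RunAsync()
    {
        // WithDefaultLoader() enables AngleSharp's own HTTP requester
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);

        // OpenAsync downloads and parses the page in one step
        var document = await context.OpenAsync("https://example.com");
        Console.WriteLine(document.Title);
    }
}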
Advanced Parsing Examples
Extract Form Data
public async Task<Dictionary<string, string>> ExtractFormFields(string url)
{
    var html = await _httpClient.GetStringAsync(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Collect the current value of every named input on the page
    var formData = new Dictionary<string, string>();
    var inputs = doc.DocumentNode.SelectNodes("//input[@name]");
    if (inputs != null)
    {
        foreach (var input in inputs)
        {
            var name = input.GetAttributeValue("name", "");
            var value = input.GetAttributeValue("value", "");
            formData[name] = value;
        }
    }
    return formData;
}
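A hypothetical call site, assuming ExtractFormFields is added as a method on the WebScraper class above (the URL is a placeholder):

using var scraper = new WebScraper();
var fields = await scraper.ExtractFormFields("https://example.com/login");

foreach (var (name, value) in fields)
{
    Console.WriteLine($"{name} = {value}");
}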
Extract Table Data
public async Task<List<List<string>>> ExtractTableData(string url, string tableSelector = "//table[1]")
{
    var html = await _httpClient.GetStringAsync(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var tableData = new List<List<string>>();
    var table = doc.DocumentNode.SelectSingleNode(tableSelector);
    if (table != null)
    {
        // SelectNodes returns null (not an empty list) when nothing matches
        var rows = table.SelectNodes(".//tr");
        if (rows != null)
        {
            foreach (var row in rows)
            {
                var rowData = new List<string>();
                var cells = row.SelectNodes(".//td | .//th");
                if (cells != null)
                {
                    foreach (var cell in cells)
                    {
                        rowData.Add(cell.InnerText.Trim());
                    }
                    tableData.Add(rowData);
                }
            }
        }
    }
    return tableData;
}
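Again a hypothetical call site, assuming ExtractTableData lives on the WebScraper class (placeholder URL):

using var scraper = new WebScraper();
var table = await scraper.ExtractTableData("https://example.com/report");

// Print each row as tab-separated cells
foreach (var row in table)
{
    Console.WriteLine(string.Join("\t", row));
}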
Installation
HtmlAgilityPack
# Package Manager Console
Install-Package HtmlAgilityPack
# .NET CLI
dotnet add package HtmlAgilityPack
AngleSharp
# Package Manager Console
Install-Package AngleSharp
# .NET CLI
dotnet add package AngleSharp
Best Practices
- Reuse HttpClient: Create one instance and reuse it to avoid socket exhaustion (sketched after this list)
- Set User-Agent: Some websites block requests without proper user agents
- Handle Errors: Always wrap HTTP requests in try-catch blocks
- Respect Rate Limits: Add delays between requests to avoid being blocked
- Check Null Values: Always verify that HTML nodes exist before accessing them
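A minimal sketch combining the reuse and rate-limit points: a hypothetical helper with one shared HttpClient for the whole application and a fixed pause between requests (the one-second delay is an arbitrary example value; tune it per site):

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PoliteFetcher
{
    // A single shared instance avoids socket exhaustion
    private static readonly HttpClient Client = new HttpClient();

    public static async Task<string[]> FetchAllAsync(string[] urls)
    {
        var pages = new string[urls.Length];
        for (var i = 0; i < urls.Length; i++)
        {
            pages[i] = await Client.GetStringAsync(urls[i]);

            // Pause between requests; an arbitrary example value
            await Task.Delay(TimeSpan.FromSeconds(1));
        }
        return pages;
    }
}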
Error Handling
try
{
    var html = await httpClient.GetStringAsync(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Your parsing logic here
}
catch (HttpRequestException ex)
{
    // Handle HTTP-specific errors (404, 500, etc.)
    Console.WriteLine($"HTTP Error: {ex.Message}");
}
catch (TaskCanceledException ex)
{
    // Handle timeout
    Console.WriteLine($"Request timeout: {ex.Message}");
}
catch (Exception ex)
{
    // Handle other exceptions
    Console.WriteLine($"Unexpected error: {ex.Message}");
}
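GetStringAsync throws HttpRequestException for non-success status codes. If you need to inspect the status code yourself, GetAsync returns the response instead of throwing immediately. A short sketch:

using var response = await httpClient.GetAsync(url);
if (response.IsSuccessStatusCode)
{
    var html = await response.Content.ReadAsStringAsync();
    // Parse as shown above
}
else
{
    Console.WriteLine($"Server returned {(int)response.StatusCode} {response.ReasonPhrase}");
}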
Legal Considerations
Always ensure your web scraping activities comply with:
- Website terms of service
- robots.txt file restrictions
- Rate limiting requirements
- Copyright and data protection laws
Consider using the website's official API if available, as it's often more reliable and ethical than scraping.