How can I use NuGet packages to enhance my C# web scraping projects?
NuGet is the package manager for .NET that provides access to thousands of libraries designed to extend your C# applications. For web scraping projects, NuGet packages offer powerful capabilities ranging from HTML parsing to browser automation, making it easier to extract data from websites efficiently and reliably.
Installing NuGet Packages
Before diving into specific packages, you need to know how to install them. You can install NuGet packages through the NuGet Package Manager Console, the .NET CLI, or Visual Studio's GUI.
Using .NET CLI
# Install a specific package
dotnet add package HtmlAgilityPack
# Install a specific version
dotnet add package HtmlAgilityPack --version 1.11.54
Using Package Manager Console
# In Visual Studio's Package Manager Console
Install-Package HtmlAgilityPack
# Install specific version
Install-Package HtmlAgilityPack -Version 1.11.54
Using PackageReference in .csproj
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="HtmlAgilityPack" Version="1.11.54" />
    <PackageReference Include="PuppeteerSharp" Version="12.0.2" />
    <PackageReference Include="Newtonsoft.Json" Version="13.0.3" />
  </ItemGroup>

</Project>
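After editing the project file, run a restore so the referenced packages are downloaded and available to the build:
# Restore the packages listed in the .csproj
dotnet restore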
Essential NuGet Packages for Web Scraping
1. HtmlAgilityPack
HtmlAgilityPack is the most popular HTML parsing library for C#. It provides a robust DOM parser that works with malformed HTML and supports XPath queries.
using HtmlAgilityPack;
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class HtmlParserExample
{
    public static async Task Main()
    {
        using var httpClient = new HttpClient();
        var html = await httpClient.GetStringAsync("https://example.com");

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Extract data using XPath
        var titleNode = htmlDoc.DocumentNode.SelectSingleNode("//h1");
        Console.WriteLine($"Title: {titleNode?.InnerText}");

        // Extract all links
        var links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        foreach (var link in links ?? Enumerable.Empty<HtmlNode>())
        {
            var href = link.GetAttributeValue("href", string.Empty);
            var text = link.InnerText.Trim();
            Console.WriteLine($"Link: {text} -> {href}");
        }
    }
}
2. PuppeteerSharp
PuppeteerSharp is a .NET port of the popular Puppeteer library, enabling headless Chrome automation. This is essential for scraping JavaScript-heavy websites and single-page applications.
using PuppeteerSharp;
using System;
using System.Threading.Tasks;

public class PuppeteerExample
{
    public static async Task Main()
    {
        // Download the browser if it is not already downloaded
        var browserFetcher = new BrowserFetcher();
        await browserFetcher.DownloadAsync();

        await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = true,
            Args = new[] { "--no-sandbox" }
        });
        await using var page = await browser.NewPageAsync();

        // Navigate and wait for the network to be idle
        await page.GoToAsync("https://example.com", WaitUntilNavigation.Networkidle0);

        // Execute JavaScript to extract data
        var title = await page.EvaluateExpressionAsync<string>("document.title");
        Console.WriteLine($"Page Title: {title}");

        // Take a screenshot
        await page.ScreenshotAsync("screenshot.png");

        // Extract data from elements
        var headings = await page.EvaluateExpressionAsync<string[]>(
            "Array.from(document.querySelectorAll('h2')).map(h => h.textContent)");

        foreach (var heading in headings)
        {
            Console.WriteLine($"Heading: {heading}");
        }
    }
}
3. AngleSharp
AngleSharp is a modern HTML parser built to follow the W3C/WHATWG specifications, so it constructs a standards-compliant DOM much like a browser does, with good performance. It's particularly useful when you prefer CSS selectors over XPath.
using AngleSharp;
using AngleSharp.Dom;
using System;
using System.Linq;
using System.Threading.Tasks;

public class AngleSharpExample
{
    public static async Task Main()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);

        // Load the document from a URL
        var document = await context.OpenAsync("https://example.com");

        // Query using CSS selectors
        var title = document.QuerySelector("h1")?.TextContent;
        Console.WriteLine($"Title: {title}");

        // Extract all paragraphs
        var paragraphs = document.QuerySelectorAll("p")
            .Select(p => p.TextContent.Trim())
            .Where(text => !string.IsNullOrWhiteSpace(text));
        foreach (var para in paragraphs)
        {
            Console.WriteLine($"Paragraph: {para}");
        }

        // Extract data attributes
        var items = document.QuerySelectorAll("[data-product-id]");
        foreach (var item in items)
        {
            var productId = item.GetAttribute("data-product-id");
            var name = item.QuerySelector(".product-name")?.TextContent;
            Console.WriteLine($"Product: {name} (ID: {productId})");
        }
    }
}
4. Newtonsoft.Json (Json.NET)
When scraping APIs or parsing JSON data embedded in web pages, Newtonsoft.Json is the industry-standard JSON serialization library.
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public class JsonScrapingExample
{
    public static async Task Main()
    {
        using var httpClient = new HttpClient();
        var jsonResponse = await httpClient.GetStringAsync("https://api.example.com/data");

        // Deserialize to a dynamic object
        dynamic data = JsonConvert.DeserializeObject(jsonResponse);
        Console.WriteLine($"Status: {data.status}");

        // Parse with JObject for more control
        var jObject = JObject.Parse(jsonResponse);
        var items = jObject["items"];
        foreach (var item in items)
        {
            Console.WriteLine($"Item: {item["name"]} - ${item["price"]}");
        }

        // Deserialize to a strongly typed object
        var result = JsonConvert.DeserializeObject<ApiResponse>(jsonResponse);
        Console.WriteLine($"Found {result.Items.Count} items");
    }
}

public class ApiResponse
{
    public string Status { get; set; }
    public List<Item> Items { get; set; }
}

public class Item
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}
5. Polly
Polly is a resilience and transient-fault-handling library that's crucial for robust web scraping. It helps handle retries, circuit breakers, and timeouts.
using Polly;
using Polly.Retry;
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class PollyExample
{
    public static async Task Main()
    {
        // Define a retry policy with exponential backoff
        var retryPolicy = Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                onRetry: (exception, timeSpan, retryCount, context) =>
                {
                    Console.WriteLine($"Retry {retryCount} after {timeSpan.TotalSeconds}s due to: {exception.Message}");
                });

        using var httpClient = new HttpClient();
        httpClient.Timeout = TimeSpan.FromSeconds(30);

        // Execute the request under the retry policy
        var html = await retryPolicy.ExecuteAsync(async () =>
        {
            var response = await httpClient.GetAsync("https://example.com");
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        });

        Console.WriteLine($"Successfully fetched {html.Length} characters");
    }
}
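The retry policy above covers transient failures; the circuit breakers and timeouts mentioned earlier follow the same pattern. Below is a minimal sketch, not part of the original example, that combines the three with Policy.WrapAsync. The thresholds and URL are illustrative assumptions, not recommendations:
using Polly;
using Polly.Timeout;
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class PollyCombinedPoliciesSketch
{
    public static async Task Main()
    {
        // Retry transient HTTP failures and per-attempt timeouts with exponential backoff
        var retry = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        // After 2 consecutive failures, stop calling the site for 30 seconds (illustrative thresholds)
        var circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 2, durationOfBreak: TimeSpan.FromSeconds(30));

        // Cancel any single attempt that runs longer than 10 seconds
        var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(10));

        // Outermost policy runs first: retry wraps the circuit breaker, which wraps the timeout
        var policyWrap = Policy.WrapAsync(retry, circuitBreaker, timeout);

        using var httpClient = new HttpClient();
        var html = await policyWrap.ExecuteAsync(async ct =>
        {
            var response = await httpClient.GetAsync("https://example.com", ct);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }, CancellationToken.None);

        Console.WriteLine($"Fetched {html.Length} characters");
    }
}
Passing the cancellation token into GetAsync is what lets the timeout policy actually cut short a slow request.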
6. RestSharp
RestSharp simplifies REST API interactions, making it ideal for scraping data from web APIs.
using RestSharp;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class RestSharpExample
{
    public static async Task Main()
    {
        var client = new RestClient("https://api.example.com");

        var request = new RestRequest("/products", Method.Get);
        request.AddHeader("Accept", "application/json");
        request.AddParameter("category", "electronics");

        var response = await client.ExecuteAsync<ProductResponse>(request);
        if (response.IsSuccessful && response.Data != null)
        {
            foreach (var product in response.Data.Products)
            {
                Console.WriteLine($"{product.Name}: ${product.Price}");
            }
        }
        else
        {
            Console.WriteLine($"Error: {response.ErrorMessage}");
        }
    }
}

public class ProductResponse
{
    public List<Product> Products { get; set; }
}

public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
    public string Category { get; set; }
}
7. CsvHelper
When exporting scraped data to CSV format, CsvHelper provides a robust solution for reading and writing CSV files.
using CsvHelper;
using CsvHelper.Configuration;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public class CsvExportExample
{
    public static void ExportToCsv(List<ScrapedData> data, string filePath)
    {
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = true,
            Delimiter = ",",
        };

        using var writer = new StreamWriter(filePath);
        using var csv = new CsvWriter(writer, config);
        csv.WriteRecords(data);
    }

    public static List<ScrapedData> ReadFromCsv(string filePath)
    {
        using var reader = new StreamReader(filePath);
        using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
        return csv.GetRecords<ScrapedData>().ToList();
    }
}

public class ScrapedData
{
    public string Title { get; set; }
    public string Url { get; set; }
    public DateTime ScrapedAt { get; set; }
}
Combining Multiple Packages
The real power comes from combining multiple NuGet packages in a single project. Here's an example that uses HtmlAgilityPack for parsing, Polly for reliability, and CsvHelper for data export:
using HtmlAgilityPack;
using Polly;
using Polly.Retry;
using CsvHelper;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public class ComprehensiveScraperExample
{
    public static async Task Main()
    {
        var scraper = new WebScraper();
        var results = await scraper.ScrapeProductsAsync("https://example.com/products");

        // Export to CSV
        using var writer = new StreamWriter("products.csv");
        using var csv = new CsvWriter(writer, CultureInfo.InvariantCulture);
        csv.WriteRecords(results);

        Console.WriteLine($"Scraped {results.Count} products and saved to CSV");
    }
}

public class WebScraper
{
    private readonly HttpClient _httpClient;
    private readonly AsyncRetryPolicy _retryPolicy;

    public WebScraper()
    {
        _httpClient = new HttpClient();
        _httpClient.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        _retryPolicy = Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));
    }

    public async Task<List<ProductInfo>> ScrapeProductsAsync(string url)
    {
        var products = new List<ProductInfo>();

        // Fetch the page under the retry policy
        var html = await _retryPolicy.ExecuteAsync(async () =>
        {
            var response = await _httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        });

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var productNodes = htmlDoc.DocumentNode.SelectNodes("//div[@class='product']");
        if (productNodes != null)
        {
            foreach (var node in productNodes)
            {
                var product = new ProductInfo
                {
                    Name = node.SelectSingleNode(".//h3")?.InnerText.Trim(),
                    Price = node.SelectSingleNode(".//span[@class='price']")?.InnerText.Trim(),
                    Url = node.SelectSingleNode(".//a")?.GetAttributeValue("href", string.Empty)
                };
                products.Add(product);
            }
        }

        return products;
    }
}

public class ProductInfo
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string Url { get; set; }
}
Best Practices for Using NuGet Packages
1. Version Management
Always specify package versions in your .csproj file to ensure reproducible builds:
<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />
2. Keep Packages Updated
Regularly update packages to get bug fixes and security patches:
# List outdated packages
dotnet list package --outdated
# Update a package to its latest version (re-adding without a version pulls the newest release)
dotnet add package HtmlAgilityPack
3. Use Dependency Injection
For larger projects, use dependency injection to manage package dependencies:
using Microsoft.Extensions.DependencyInjection;
using System.Net.Http;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddHttpClient<IWebScraper, WebScraper>();
        services.AddSingleton<IHtmlParser, HtmlAgilityPackParser>();
        services.AddTransient<IDataExporter, CsvExporter>();
    }
}
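Note that IWebScraper, IHtmlParser, IDataExporter, and their implementations in the registration above are your own abstractions, not types shipped by the packages. Here is a minimal sketch of what one such pair might look like, assuming you wrap HtmlAgilityPack behind a small parsing interface:
using HtmlAgilityPack;

// Hypothetical abstraction over HTML parsing (not provided by any NuGet package)
public interface IHtmlParser
{
    string ExtractTitle(string html);
}

// One possible implementation backed by HtmlAgilityPack
public class HtmlAgilityPackParser : IHtmlParser
{
    public string ExtractTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//h1")?.InnerText.Trim();
    }
}
Consumers then depend on IHtmlParser and receive the HtmlAgilityPack-backed implementation from the container, which keeps the scraping logic easy to swap out and test.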
4. Handle Resource Disposal
Many NuGet packages expose types that implement IDisposable or IAsyncDisposable. Always use using (or await using) statements, or dispose of resources explicitly:
// Good practice: let using statements handle disposal
await using var browser = await Puppeteer.LaunchAsync(options);
using var httpClient = new HttpClient();

// Or with try-finally
var browser = await Puppeteer.LaunchAsync(options);
try
{
    // Use the browser
}
finally
{
    await browser.DisposeAsync();
}
Conclusion
NuGet packages are essential tools that dramatically enhance C# web scraping projects. By leveraging libraries like HtmlAgilityPack for HTML parsing, PuppeteerSharp for browser automation, AngleSharp for CSS selector support, and Polly for reliability, you can build robust and efficient web scrapers. The key is to choose the right combination of packages based on your specific requirements—whether you're parsing static HTML, automating JavaScript-rendered pages, or building resilient data pipelines.
Start by installing the core packages you need, experiment with their APIs, and gradually build more sophisticated scrapers by combining multiple libraries. The .NET ecosystem provides everything you need to create production-ready web scraping solutions.