What is PuppeteerSharp and how do I use it for web scraping in C#?
PuppeteerSharp is a .NET port of the popular Puppeteer library, providing a high-level API to control headless Chrome or Chromium browsers over the DevTools Protocol. It enables C# developers to perform automated browser tasks, including web scraping, testing, and generating screenshots or PDFs of web pages.
Unlike traditional HTTP clients that can only fetch static HTML, PuppeteerSharp executes JavaScript, renders dynamic content, and interacts with web pages just like a real user would. This makes it ideal for scraping modern web applications that rely heavily on JavaScript frameworks like React, Angular, or Vue.js.
Why Use PuppeteerSharp for Web Scraping?
PuppeteerSharp offers several advantages over traditional web scraping libraries:
- JavaScript Execution: Renders dynamic content generated by JavaScript frameworks
- Real Browser Environment: Bypasses many anti-scraping measures by simulating real user behavior
- Complete Page Interaction: Click buttons, fill forms, scroll pages, and navigate complex workflows
- Screenshot Capabilities: Capture visual representations of pages for debugging or archival purposes
- Network Monitoring: Intercept and monitor network requests to extract API data directly (see the sketch after this list)
- Modern Web Standards: Supports modern web technologies including WebSockets, Service Workers, and more
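To illustrate the network-monitoring point, here is a minimal sketch that listens for responses and prints anything coming from an assumed /api/ path. The URL filter and target page are placeholders, not part of any specific site, and the snippet assumes the page's Response event and the response TextAsync accessor available in current PuppeteerSharp releases:

page.Response += async (sender, e) =>
{
    // Hypothetical endpoint filter; adjust it to the API paths the target site actually calls
    if (e.Response.Url.Contains("/api/"))
    {
        var body = await e.Response.TextAsync();
        Console.WriteLine($"Captured {e.Response.Url} ({body.Length} characters)");
    }
};

await page.GoToAsync("https://example.com/products");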
Installing PuppeteerSharp
To get started with PuppeteerSharp, install the NuGet package in your C# project:
dotnet add package PuppeteerSharp
Or via the Package Manager Console in Visual Studio:
Install-Package PuppeteerSharp
Before using PuppeteerSharp, you need to download a compatible Chromium browser. This can be done programmatically:
using PuppeteerSharp;
// Download Chromium browser
var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync();
Basic Web Scraping with PuppeteerSharp
Here's a simple example that demonstrates the core workflow of scraping a webpage:
using PuppeteerSharp;
using System;
using System.Threading.Tasks;
class Program
{
    static async Task Main(string[] args)
    {
        // Launch the browser
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = true // Set to false to see the browser in action
        });

        // Create a new page
        var page = await browser.NewPageAsync();

        // Navigate to the target URL
        await page.GoToAsync("https://example.com");

        // Extract the page title
        var title = await page.GetTitleAsync();
        Console.WriteLine($"Page Title: {title}");

        // Extract text content from an element
        var heading = await page.EvaluateExpressionAsync<string>(
            "document.querySelector('h1').textContent"
        );
        Console.WriteLine($"Main Heading: {heading}");

        // Close the browser
        await browser.CloseAsync();
    }
}
Advanced Data Extraction Techniques
Extracting Multiple Elements
To scrape multiple elements from a page, use EvaluateFunctionAsync to execute JavaScript code:
var products = await page.EvaluateFunctionAsync<Product[]>(@"() => {
    const items = Array.from(document.querySelectorAll('.product-item'));
    return items.map(item => ({
        name: item.querySelector('.product-name').textContent.trim(),
        price: item.querySelector('.product-price').textContent.trim(),
        url: item.querySelector('a').href
    }));
}");

foreach (var product in products)
{
    Console.WriteLine($"{product.Name} - {product.Price}");
}

// Define the Product class
public class Product
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string Url { get; set; }
}
Waiting for Dynamic Content
When scraping pages with dynamic content, you need to wait for elements to load. PuppeteerSharp provides several waiting mechanisms, similar to the way Puppeteer handles AJAX requests:
// Wait for a specific selector to appear
await page.WaitForSelectorAsync("#dynamic-content", new WaitForSelectorOptions
{
    Timeout = 10000 // Wait up to 10 seconds
});

// Wait for network to be idle (all requests completed)
await page.GoToAsync("https://example.com", new NavigationOptions
{
    WaitUntil = new[] { WaitUntilNavigation.Networkidle0 }
});

// Wait for a custom condition
await page.WaitForFunctionAsync(@"
    () => document.querySelectorAll('.product-item').length > 10
");
Handling Pagination and Navigation
Similar to navigating pages in Puppeteer, you can automate multi-page scraping:
var allData = new List<string>();

for (int pageNum = 1; pageNum <= 5; pageNum++)
{
    await page.GoToAsync($"https://example.com/products?page={pageNum}");

    // Wait for content to load
    await page.WaitForSelectorAsync(".product-list");

    // Extract data from the current page
    var pageData = await page.EvaluateFunctionAsync<string[]>(@"() => {
        return Array.from(document.querySelectorAll('.product-name'))
            .map(el => el.textContent);
    }");

    allData.AddRange(pageData);

    // Optional: add a delay to avoid overwhelming the server
    await Task.Delay(1000);
}
Console.WriteLine($"Total products scraped: {allData.Count}");
Interacting with Web Pages
Filling Forms and Clicking Buttons
// Type into input fields
await page.TypeAsync("#username", "myusername");
await page.TypeAsync("#password", "mypassword");
// Start waiting for navigation before clicking, so a fast redirect is not missed
var navigationTask = page.WaitForNavigationAsync();

// Click the submit button
await page.ClickAsync("button[type='submit']");

// Wait for the navigation triggered by the form submission
await navigationTask;
// Select from dropdown
await page.SelectAsync("select#country", "USA");
// Check a checkbox
await page.ClickAsync("input[type='checkbox']#agree");
Handling Infinite Scroll
Many modern websites use infinite scroll instead of traditional pagination:
async Task ScrollToBottomAsync(IPage page)
{
    await page.EvaluateFunctionAsync(@"async () => {
        await new Promise((resolve) => {
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    }");
}

// Usage
await ScrollToBottomAsync(page);
await page.WaitForSelectorAsync(".all-content-loaded");
Performance Optimization
Disabling Unnecessary Resources
Speed up scraping by blocking images, fonts, and other non-essential resources:
await page.SetRequestInterceptionAsync(true);
page.Request += async (sender, e) =>
{
    if (e.Request.ResourceType == ResourceType.Image ||
        e.Request.ResourceType == ResourceType.Font ||
        e.Request.ResourceType == ResourceType.StyleSheet)
    {
        await e.Request.AbortAsync();
    }
    else
    {
        await e.Request.ContinueAsync();
    }
};
await page.GoToAsync("https://example.com");
Reusing Browser Instances
For scraping multiple pages, reuse the same browser instance:
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
});

var urls = new[] { "https://example1.com", "https://example2.com", "https://example3.com" };

foreach (var url in urls)
{
    var page = await browser.NewPageAsync();
    await page.GoToAsync(url);

    // Extract data
    var data = await page.GetTitleAsync();
    Console.WriteLine(data);

    await page.CloseAsync(); // Close the page, not the browser
}
await browser.CloseAsync(); // Close browser after all scraping is done
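If you need more throughput, the same idea extends to several pages open concurrently on one browser. The sketch below is only an illustration: it reuses the browser and urls variables from the example above (before the browser is closed), the concurrency limit of 3 is arbitrary, and it needs System.Linq and System.Threading in scope:

// Bounded concurrency over one shared browser instance (limit of 3 is an arbitrary choice)
var semaphore = new SemaphoreSlim(3);

var tasks = urls.Select(async url =>
{
    await semaphore.WaitAsync();
    try
    {
        var page = await browser.NewPageAsync();
        try
        {
            await page.GoToAsync(url);
            return await page.GetTitleAsync();
        }
        finally
        {
            await page.CloseAsync();
        }
    }
    finally
    {
        semaphore.Release();
    }
});

var titles = await Task.WhenAll(tasks);
Console.WriteLine($"Scraped {titles.Length} titles");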
Error Handling and Best Practices
Implementing Robust Error Handling
try
{
    var browser = await Puppeteer.LaunchAsync(new LaunchOptions
    {
        Headless = true,
        Args = new[] { "--no-sandbox", "--disable-setuid-sandbox" }
    });

    var page = await browser.NewPageAsync();

    // Set a default timeout
    page.DefaultTimeout = 30000;

    try
    {
        await page.GoToAsync("https://example.com", new NavigationOptions
        {
            WaitUntil = new[] { WaitUntilNavigation.Networkidle0 },
            Timeout = 30000
        });

        var content = await page.GetContentAsync();
        // Process content here
    }
    catch (NavigationException navEx)
    {
        Console.WriteLine($"Navigation failed: {navEx.Message}");
    }
    catch (WaitTaskTimeoutException timeoutEx)
    {
        Console.WriteLine($"Timeout occurred: {timeoutEx.Message}");
    }
    finally
    {
        await page.CloseAsync();
    }

    await browser.CloseAsync();
}
catch (Exception ex)
{
    Console.WriteLine($"An error occurred: {ex.Message}");
}
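Transient failures such as timeouts or flaky navigations are common in scraping, so wrapping risky calls in a retry loop often helps. The helper below is a hypothetical sketch, not part of PuppeteerSharp; its name, attempt count, and backoff are arbitrary choices:

// A hypothetical retry helper; attempt count and backoff are arbitrary choices
async Task<T> WithRetriesAsync<T>(Func<Task<T>> action, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (Exception ex) when (attempt < maxAttempts)
        {
            Console.WriteLine($"Attempt {attempt} failed: {ex.Message}, retrying");
            await Task.Delay(TimeSpan.FromSeconds(2 * attempt)); // simple linear backoff
        }
    }
}

// Usage: retry a navigation that occasionally times out
var html = await WithRetriesAsync(async () =>
{
    await page.GoToAsync("https://example.com");
    return await page.GetContentAsync();
});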
Using User Agents and Headers
Avoid detection by setting realistic user agents and headers:
await page.SetUserAgentAsync(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
);

await page.SetExtraHttpHeadersAsync(new Dictionary<string, string>
{
    { "Accept-Language", "en-US,en;q=0.9" },
    { "Accept-Encoding", "gzip, deflate, br" }
});
Setting Viewport Size
Configure the viewport to match common browser sizes:
await page.SetViewportAsync(new ViewPortOptions
{
    Width = 1920,
    Height = 1080,
    DeviceScaleFactor = 1
});
Taking Screenshots and Generating PDFs
PuppeteerSharp can capture screenshots and generate PDFs for documentation or debugging:
// Take a screenshot
await page.ScreenshotAsync("screenshot.png", new ScreenshotOptions
{
    FullPage = true
});

// Generate a PDF
await page.PdfAsync("page.pdf", new PdfOptions
{
    Format = PaperFormat.A4,
    PrintBackground = true
});
Comparison with Other C# Web Scraping Libraries
| Feature | PuppeteerSharp | HtmlAgilityPack | Selenium WebDriver |
|---------|----------------|-----------------|--------------------|
| JavaScript Execution | ✓ | ✗ | ✓ |
| Headless Mode | ✓ | N/A | ✓ |
| Speed | Medium | Fast | Slow |
| Memory Usage | Medium | Low | High |
| Learning Curve | Medium | Low | Medium |
| Dynamic Content | ✓ | ✗ | ✓ |
When to Use PuppeteerSharp
PuppeteerSharp is the best choice when:
- Scraping single-page applications (SPAs) or JavaScript-heavy websites
- You need to interact with pages (clicking, scrolling, form filling)
- Handling authentication workflows or complex user sessions
- You need to capture screenshots or generate PDFs
- Traditional HTTP requests fail due to anti-scraping measures
For simple HTML parsing tasks without JavaScript, consider using HtmlAgilityPack or AngleSharp for better performance and lower resource consumption.
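For comparison, a static page can be parsed in a few lines with HtmlAgilityPack and no browser process at all. A minimal sketch, assuming the HtmlAgilityPack NuGet package is installed and the page serves its content in the initial HTML:

// Requires the HtmlAgilityPack NuGet package; no browser process is launched
using HtmlAgilityPack;
using System;

var web = new HtmlWeb();
var doc = web.Load("https://example.com");
var heading = doc.DocumentNode.SelectSingleNode("//h1")?.InnerText;
Console.WriteLine($"Main Heading: {heading}");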
Conclusion
PuppeteerSharp brings the power of headless browser automation to C# developers, making it possible to scrape even the most complex modern web applications. By executing JavaScript, simulating user interactions, and providing a full browser environment, it overcomes the limitations of traditional HTTP-based scraping approaches.
While it requires more system resources than simpler libraries, the ability to handle dynamic content and bypass many anti-scraping measures makes PuppeteerSharp an invaluable tool for web scraping projects that demand reliability and flexibility.
For production web scraping at scale, consider using a dedicated web scraping API that handles browser management, proxy rotation, and JavaScript rendering automatically, allowing you to focus on extracting and processing data rather than managing infrastructure.