How can I use headless Chrome with C# for web scraping?

Headless Chrome is a powerful tool for web scraping that lets you run the Chrome browser without a graphical interface. This is particularly useful when scraping JavaScript-heavy websites, single-page applications (SPAs), or sites that require user interactions. In C#, you can use headless Chrome through two popular libraries: PuppeteerSharp (a .NET port of Puppeteer) and Selenium WebDriver.

Why Use Headless Chrome for Web Scraping?

Headless browsers solve several common web scraping challenges:

  • JavaScript Rendering: Many modern websites rely heavily on JavaScript to load content dynamically
  • AJAX Requests: Content loaded asynchronously after the initial page load
  • User Interactions: Simulating clicks, form submissions, and scrolling
  • Screenshot Capture: Taking screenshots of web pages for visual verification
  • SPA Navigation: Handling client-side routing in single-page applications

Using PuppeteerSharp with Headless Chrome

PuppeteerSharp is a .NET port of Google's Puppeteer library. It provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

Installation

Install PuppeteerSharp via NuGet Package Manager:

dotnet add package PuppeteerSharp

Or via the Package Manager Console:

Install-Package PuppeteerSharp

Basic Web Scraping Example

Here's a complete example of scraping a website using PuppeteerSharp:

using PuppeteerSharp;
using System;
using System.Threading.Tasks;

namespace WebScrapingExample
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // Download the Chromium browser if not already installed
            var browserFetcher = new BrowserFetcher();
            await browserFetcher.DownloadAsync();

            // Launch the browser in headless mode
            var browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = true,
                Args = new[] { "--no-sandbox", "--disable-setuid-sandbox" }
            });

            // Create a new page
            var page = await browser.NewPageAsync();

            // Navigate to the target URL
            await page.GoToAsync("https://example.com",
                WaitUntilNavigation.Networkidle0);

            // Extract page content
            var pageTitle = await page.GetTitleAsync();
            var htmlContent = await page.GetContentAsync();

            // Extract specific elements
            var headings = await page.EvaluateExpressionAsync<string[]>(
                @"Array.from(document.querySelectorAll('h1, h2, h3'))
                    .map(el => el.textContent.trim())"
            );

            Console.WriteLine($"Page Title: {pageTitle}");
            Console.WriteLine("\nHeadings:");
            foreach (var heading in headings)
            {
                Console.WriteLine($"- {heading}");
            }

            // Close the browser
            await browser.CloseAsync();
        }
    }
}

Advanced Features with PuppeteerSharp

Handling JavaScript-Heavy Pages

As with handling AJAX requests in Puppeteer, you can wait for specific elements or conditions before extracting data:

// Wait for a specific selector to appear
await page.WaitForSelectorAsync("#content", new WaitForSelectorOptions
{
    Timeout = 30000 // 30 seconds timeout
});

// Wait for network to be idle
await page.GoToAsync("https://example.com",
    WaitUntilNavigation.Networkidle2);

// Wait for a custom condition
await page.WaitForFunctionAsync(
    @"() => document.querySelectorAll('.product-item').length >= 20"
);

Taking Screenshots

// Take a full-page screenshot
await page.ScreenshotAsync("screenshot.png", new ScreenshotOptions
{
    FullPage = true
});

// Take a screenshot of a specific element
// (QuerySelectorAsync returns null when nothing matches, so check before use)
var element = await page.QuerySelectorAsync(".product-card");
await element.ScreenshotAsync("element.png");

Interacting with Elements

// Click a button
await page.ClickAsync("#load-more-button");

// Type into an input field
await page.TypeAsync("#search-input", "web scraping");

// Submit a form
await page.Keyboard.PressAsync("Enter");

// Scroll to bottom of page
await page.EvaluateExpressionAsync(
    "window.scrollTo(0, document.body.scrollHeight)"
);
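
When a click or key press triggers a page navigation, start waiting for that navigation alongside the action so the new page has loaded before you read from it. A minimal sketch using PuppeteerSharp's WaitForNavigationAsync:

// Start waiting for the navigation, then trigger it
var navigationTask = page.WaitForNavigationAsync();
await page.Keyboard.PressAsync("Enter");
await navigationTask;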

Setting Viewport and User Agent

// Set viewport size
await page.SetViewportAsync(new ViewPortOptions
{
    Width = 1920,
    Height = 1080
});

// Set user agent
await page.SetUserAgentAsync(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);

Using Selenium WebDriver with Headless Chrome

Selenium is another popular option for browser automation and web scraping. It supports multiple browsers including Chrome.

Installation

Install the required NuGet packages:

dotnet add package Selenium.WebDriver
dotnet add package Selenium.WebDriver.ChromeDriver
dotnet add package Selenium.Support

(Selenium.Support provides WebDriverWait, which the examples below use.)

Basic Selenium Example

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
using System;
using System.Linq;

namespace SeleniumWebScraping
{
    class Program
    {
        static void Main(string[] args)
        {
            // Configure Chrome options for headless mode
            var chromeOptions = new ChromeOptions();
            chromeOptions.AddArgument("--headless"); // recent Chrome versions also accept "--headless=new"
            chromeOptions.AddArgument("--no-sandbox");
            chromeOptions.AddArgument("--disable-dev-shm-usage");
            chromeOptions.AddArgument("--disable-gpu");

            // Create Chrome driver
            using (var driver = new ChromeDriver(chromeOptions))
            {
                // Navigate to URL
                driver.Navigate().GoToUrl("https://example.com");

                // Wait for page to load
                var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
                wait.Until(d => d.FindElement(By.TagName("body")));

                // Extract data
                var pageTitle = driver.Title;
                var headings = driver.FindElements(By.CssSelector("h1, h2, h3"))
                    .Select(el => el.Text.Trim())
                    .ToList();

                Console.WriteLine($"Page Title: {pageTitle}");
                Console.WriteLine("\nHeadings:");
                foreach (var heading in headings)
                {
                    Console.WriteLine($"- {heading}");
                }
            }
        }
    }
}

Advanced Selenium Features

Explicit Waits

var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));

// Wait for element to be visible
var element = wait.Until(d =>
    d.FindElement(By.Id("dynamic-content")).Displayed ?
    d.FindElement(By.Id("dynamic-content")) : null
);

// Wait for element to be clickable
// (requires the DotNetSeleniumExtras.WaitHelpers NuGet package)
var button = wait.Until(SeleniumExtras.WaitHelpers
    .ExpectedConditions.ElementToBeClickable(By.Id("submit-btn")));

Handling Multiple Windows and Tabs

// Get current window handle
var mainWindow = driver.CurrentWindowHandle;

// Click link that opens new tab
driver.FindElement(By.Id("new-tab-link")).Click();

// Switch to new tab
var allWindows = driver.WindowHandles;
foreach (var window in allWindows)
{
    if (window != mainWindow)
    {
        driver.SwitchTo().Window(window);
        break;
    }
}

// Do work in new tab...

// Switch back to main window
driver.SwitchTo().Window(mainWindow);

Executing JavaScript

// Execute JavaScript to scroll
var js = (IJavaScriptExecutor)driver;
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight)");

// Get data using JavaScript
var data = js.ExecuteScript(
    @"return Array.from(document.querySelectorAll('.product'))
        .map(p => ({
            name: p.querySelector('.name').textContent,
            price: p.querySelector('.price').textContent
        }))"
);
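
ExecuteScript returns object; in the .NET bindings a JavaScript array usually arrives as a ReadOnlyCollection<object> with each JS object as a Dictionary<string, object>, though the exact runtime types are worth verifying against your Selenium version. A sketch of consuming the data above under that assumption:

// Cast the JavaScript result into .NET collections before use
var items = (System.Collections.ObjectModel.ReadOnlyCollection<object>)data;
foreach (System.Collections.Generic.Dictionary<string, object> product in items)
{
    Console.WriteLine($"{product["name"]}: {product["price"]}");
}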

PuppeteerSharp vs Selenium: Which to Choose?

PuppeteerSharp Advantages

  • Modern API: Async/await support with cleaner syntax
  • Performance: Generally faster and more lightweight
  • Chrome DevTools Protocol: Direct access to Chrome debugging features
  • Network Interception: Easy request/response modification
  • Better for SPAs: More robust handling of single-page applications

Selenium Advantages

  • Multi-Browser Support: Works with Firefox, Safari, Edge, etc.
  • Mature Ecosystem: Extensive documentation and community support
  • Grid Support: Built-in distributed testing capabilities (see the sketch after this list)
  • Industry Standard: Widely used in enterprise environments
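
Pointing the same headless Chrome options at a Selenium Grid hub only changes how the driver is constructed. A minimal sketch, assuming a Grid hub running locally on the default port:

using System;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Remote;

var chromeOptions = new ChromeOptions();
chromeOptions.AddArgument("--headless");

// RemoteWebDriver sends commands to the Grid hub instead of a local chromedriver
using var driver = new RemoteWebDriver(
    new Uri("http://localhost:4444/wd/hub"), chromeOptions);
driver.Navigate().GoToUrl("https://example.com");
Console.WriteLine(driver.Title);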

Best Practices for Headless Chrome Web Scraping

1. Add Delays and Rate Limiting

// Add a random delay between actions (Random.Next's upper bound is exclusive)
await Task.Delay(TimeSpan.FromSeconds(new Random().Next(1, 4))); // 1-3 seconds

// Use rate limiting for multiple pages
var rateLimiter = new SemaphoreSlim(3); // Max 3 concurrent requests
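
The semaphore above bounds concurrency like this (urls and ScrapePageAsync are hypothetical placeholders for your own URL list and scraping logic):

var tasks = urls.Select(async url =>
{
    await rateLimiter.WaitAsync(); // at most 3 pages in flight at once
    try
    {
        await ScrapePageAsync(url); // hypothetical scraping helper
    }
    finally
    {
        rateLimiter.Release();
    }
});
await Task.WhenAll(tasks);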

2. Handle Errors Gracefully

try
{
    await page.GoToAsync(url, new NavigationOptions
    {
        Timeout = 30000,
        WaitUntil = new[] { WaitUntilNavigation.Networkidle0 }
    });
}
catch (NavigationException ex)
{
    Console.WriteLine($"Navigation failed: {ex.Message}");
    // Implement retry logic or fallback
}

3. Use Headless Mode Correctly

// Enable headless mode for production
var launchOptions = new LaunchOptions
{
    Headless = !Debugger.IsAttached, // Show browser when debugging
    Args = new[]
    {
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-dev-shm-usage",
        "--disable-accelerated-2d-canvas",
        "--disable-gpu"
    }
};

4. Clean Up Resources

// Always dispose of browser resources
await using var browser = await Puppeteer.LaunchAsync(launchOptions);
await using var page = await browser.NewPageAsync();

// Or use try-finally
IBrowser browser = null;
try
{
    browser = await Puppeteer.LaunchAsync(launchOptions);
    // ... scraping logic
}
finally
{
    // Avoid "await browser?.CloseAsync()": awaiting null throws if launch failed
    if (browser != null)
    {
        await browser.CloseAsync();
    }
}

5. Monitor Network Activity

As with monitoring network requests in Puppeteer, you can listen for requests and responses:

page.Request += (sender, e) =>
{
    Console.WriteLine($"Request: {e.Request.Method} {e.Request.Url}");
};

page.Response += (sender, e) =>
{
    Console.WriteLine($"Response: {e.Response.Status} {e.Response.Url}");
};

Performance Optimization Tips

1. Block Unnecessary Resources

// Block images, stylesheets, and fonts to improve speed
await page.SetRequestInterceptionAsync(true);
page.Request += async (sender, e) =>
{
    var blockedTypes = new[] { ResourceType.Image, ResourceType.StyleSheet, ResourceType.Font };
    if (blockedTypes.Contains(e.Request.ResourceType))
    {
        await e.Request.AbortAsync();
    }
    else
    {
        await e.Request.ContinueAsync();
    }
};

2. Reuse Browser Instances

// Reuse browser for multiple pages
var browser = await Puppeteer.LaunchAsync(launchOptions);

foreach (var url in urls)
{
    var page = await browser.NewPageAsync();
    await page.GoToAsync(url);
    // ... scrape page
    await page.CloseAsync();
}

await browser.CloseAsync();

3. Connect to an Existing Browser Instance

// Connect to existing browser instance
var browserWSEndpoint = browser.WebSocketEndpoint;
var browser2 = await Puppeteer.ConnectAsync(new ConnectOptions
{
    BrowserWSEndpoint = browserWSEndpoint
});

Conclusion

Headless Chrome is an essential tool for modern web scraping in C#. PuppeteerSharp offers a modern, performant API ideal for JavaScript-heavy sites and SPAs, while Selenium provides broader browser support and enterprise-grade features. Choose PuppeteerSharp for most C# web scraping projects requiring Chrome automation, and consider Selenium when you need multi-browser compatibility or integration with existing test infrastructure.

For complex scraping scenarios that require managing multiple concurrent sessions, handling dynamic content, or dealing with anti-bot measures, consider using a specialized web scraping API that handles these challenges automatically while providing simple access to rendered HTML and structured data.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
