How do I use Playwright with C# for Modern Web Scraping?
Playwright is a powerful browser automation framework developed by Microsoft that provides excellent support for C# developers. Unlike traditional HTTP-based scraping tools, Playwright allows you to control real browsers (Chromium, Firefox, and WebKit), making it ideal for scraping modern JavaScript-heavy websites, single-page applications, and sites that require user interaction.
Why Choose Playwright for C# Web Scraping?
Playwright offers several advantages over other scraping tools:
- Cross-browser support: Test and scrape across Chromium, Firefox, and WebKit with a single API
- Modern web support: Handles JavaScript rendering, AJAX requests, and dynamic content seamlessly
- Auto-wait mechanism: Automatically waits for elements to be ready before interacting
- Network interception: Monitor, modify, or block network requests
- Native C# support: First-class .NET integration with async/await patterns
- Headless and headed modes: Run browsers invisibly or with UI for debugging
Installing Playwright for C#
First, create a new C# project and install the Playwright package:
```bash
# Create a new console application
dotnet new console -n PlaywrightScraper
cd PlaywrightScraper

# Install the Playwright package
dotnet add package Microsoft.Playwright

# Build the project
dotnet build

# Install browser binaries (Chromium, Firefox, WebKit)
pwsh bin/Debug/net8.0/playwright.ps1 install
```
The install script is PowerShell, so on Linux or macOS install PowerShell (`pwsh`) first and run the same command. Alternatively, you can install the browsers programmatically from C# with `Microsoft.Playwright.Program.Main(new[] { "install" })`.
Basic Playwright Web Scraping Example
Here's a simple example that demonstrates launching a browser, navigating to a page, and extracting data:
```csharp
using Microsoft.Playwright;
using System;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        // Install Playwright browsers (only needed once)
        var exitCode = Microsoft.Playwright.Program.Main(new[] { "install" });
        if (exitCode != 0)
        {
            Console.WriteLine("Failed to install browsers");
            return;
        }

        // Create playwright instance
        using var playwright = await Playwright.CreateAsync();

        // Launch browser (headless by default)
        await using var browser = await playwright.Chromium.LaunchAsync(new BrowserTypeLaunchOptions
        {
            Headless = true
        });

        // Create a new page
        var page = await browser.NewPageAsync();

        // Navigate to URL
        await page.GotoAsync("https://example.com");

        // Extract page title
        var title = await page.TitleAsync();
        Console.WriteLine($"Page title: {title}");

        // Extract text content
        var heading = await page.Locator("h1").TextContentAsync();
        Console.WriteLine($"Heading: {heading}");

        // Close browser
        await browser.CloseAsync();
    }
}
```
Extracting Data from Multiple Elements
When scraping lists of items, you'll often need to extract data from multiple elements. Playwright provides powerful selectors and iteration methods:
```csharp
using Microsoft.Playwright;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class ProductScraper
{
    public async Task<List<Product>> ScrapeProducts(string url)
    {
        using var playwright = await Playwright.CreateAsync();
        await using var browser = await playwright.Chromium.LaunchAsync();
        var page = await browser.NewPageAsync();
        await page.GotoAsync(url);

        // Wait for product listings to load
        await page.WaitForSelectorAsync(".product-item");

        // Get all product elements
        var productElements = await page.Locator(".product-item").AllAsync();
        var products = new List<Product>();

        foreach (var element in productElements)
        {
            var product = new Product
            {
                Name = await element.Locator(".product-name").TextContentAsync(),
                Price = await element.Locator(".product-price").TextContentAsync(),
                ImageUrl = await element.Locator("img").GetAttributeAsync("src")
            };
            products.Add(product);
        }

        return products;
    }
}

public class Product
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string ImageUrl { get; set; }
}
```
Handling Dynamic Content and AJAX Requests
Modern websites often load content dynamically through AJAX. Playwright's auto-wait functionality makes this seamless, but you can also explicitly wait for specific conditions:
```csharp
// Wait for a specific selector
await page.WaitForSelectorAsync(".dynamic-content", new PageWaitForSelectorOptions
{
    State = WaitForSelectorState.Visible,
    Timeout = 10000 // 10 seconds
});

// Wait for the network to be idle
await page.WaitForLoadStateAsync(LoadState.NetworkIdle);

// Wait for a specific URL pattern
await page.WaitForURLAsync("**/search?query=*");

// Wait for a custom condition
await page.WaitForFunctionAsync("() => document.querySelectorAll('.item').length > 10");
```
As with any browser automation tool that deals with AJAX requests, these explicit waits give you fine-grained control over when asynchronous content is considered loaded.
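When the data you want arrives via an XHR/fetch call, you can also wait for the network response itself and read the payload directly instead of scraping the rendered DOM. A sketch using Playwright's `RunAndWaitForResponseAsync`, assuming a hypothetical `button#load-more` control and `/api/items` endpoint:

```csharp
// Click and wait for the matching API response in one step
var response = await page.RunAndWaitForResponseAsync(
    async () => await page.ClickAsync("button#load-more"),
    resp => resp.Url.Contains("/api/items") && resp.Status == 200);

// Read the JSON payload directly
var json = await response.JsonAsync();
Console.WriteLine(json);
```

This is often faster and more stable than waiting for the DOM to update, because the API response usually carries the same data in a cleaner shape.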
Interacting with Pages
Playwright allows you to interact with pages just like a real user would:
```csharp
using Microsoft.Playwright;

public async Task PerformSearch(string searchTerm)
{
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync();
    var page = await browser.NewPageAsync();
    await page.GotoAsync("https://example.com");

    // Fill input field
    await page.FillAsync("input[name='search']", searchTerm);

    // Click search button
    await page.ClickAsync("button[type='submit']");

    // Wait for results
    await page.WaitForSelectorAsync(".search-results");

    // Select dropdown option
    await page.SelectOptionAsync("select#category", "books");

    // Check checkbox
    await page.CheckAsync("input#filter-available");

    // Take screenshot
    await page.ScreenshotAsync(new PageScreenshotOptions
    {
        Path = "results.png",
        FullPage = true
    });
}
```
Handling Authentication and Sessions
Many scraping tasks require authentication. Playwright makes it easy to handle login flows and maintain sessions:
```csharp
public async Task LoginAndScrape(string username, string password)
{
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync();
    var context = await browser.NewContextAsync();
    var page = await context.NewPageAsync();

    // Navigate to login page
    await page.GotoAsync("https://example.com/login");

    // Fill credentials
    await page.FillAsync("#username", username);
    await page.FillAsync("#password", password);

    // Submit form
    await page.ClickAsync("button[type='submit']");

    // Wait for navigation
    await page.WaitForURLAsync("**/dashboard");

    // Now you're logged in and can scrape protected content
    await page.GotoAsync("https://example.com/protected/data");

    // Extract data
    var data = await page.Locator(".user-data").TextContentAsync();
    Console.WriteLine(data);

    // Save cookies for later use
    var cookies = await context.CookiesAsync();
    var cookiesJson = System.Text.Json.JsonSerializer.Serialize(cookies);
    await File.WriteAllTextAsync("cookies.json", cookiesJson);
}
```
For more advanced authentication scenarios, you can reuse saved cookies:
```csharp
public async Task ScrapeWithSavedCookies()
{
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync();

    // Load cookies from file
    var cookiesJson = await File.ReadAllTextAsync("cookies.json");
    var cookies = System.Text.Json.JsonSerializer.Deserialize<List<Cookie>>(cookiesJson);

    var context = await browser.NewContextAsync();
    await context.AddCookiesAsync(cookies);

    var page = await context.NewPageAsync();
    await page.GotoAsync("https://example.com/protected/data");
    // You're already authenticated
}
```
Reusing saved cookies lets your scraper skip the login flow on subsequent runs while still accessing protected resources.
Handling Iframes and Shadow DOM
Some websites use iframes or shadow DOM elements. Playwright provides methods to work with these:
```csharp
// Working with iframes
var frame = page.FrameLocator("iframe#content-frame");
var iframeContent = await frame.Locator(".inner-content").TextContentAsync();

// Alternative iframe approach
var frameElement = await page.WaitForSelectorAsync("iframe#content-frame");
var contentFrame = await frameElement.ContentFrameAsync();
var text = await contentFrame.Locator(".inner-content").TextContentAsync();

// Working with Shadow DOM: Playwright's CSS selectors pierce open
// shadow roots automatically, so no special syntax is needed
var shadowContent = await page.Locator("custom-element .shadow-content").TextContentAsync();
```
Monitoring Network Requests
Playwright allows you to intercept and monitor network traffic, which is useful for API-based scraping:
```csharp
public async Task MonitorNetworkRequests()
{
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync();
    var page = await browser.NewPageAsync();

    // Listen to all requests
    page.Request += (_, request) =>
    {
        Console.WriteLine($"Request: {request.Method} {request.Url}");
    };

    // Listen to all responses
    page.Response += (_, response) =>
    {
        Console.WriteLine($"Response: {response.Status} {response.Url}");
    };

    // Intercept specific API calls
    await page.RouteAsync("**/api/data", async route =>
    {
        var response = await route.FetchAsync();
        var body = await response.TextAsync();
        Console.WriteLine($"API Response: {body}");
        // Fulfill with the fetched response; calling ContinueAsync here
        // would send the request to the server a second time
        await route.FulfillAsync(new RouteFulfillOptions { Response = response });
    });

    await page.GotoAsync("https://example.com");
}
```
Error Handling and Timeouts
Robust error handling is essential for production web scrapers:
```csharp
public async Task<string> ScrapeWithErrorHandling(string url)
{
    try
    {
        using var playwright = await Playwright.CreateAsync();
        await using var browser = await playwright.Chromium.LaunchAsync(new BrowserTypeLaunchOptions
        {
            Timeout = 30000 // 30 seconds launch timeout
        });
        var page = await browser.NewPageAsync();

        // Set page timeout
        page.SetDefaultTimeout(15000); // 15 seconds

        try
        {
            await page.GotoAsync(url, new PageGotoOptions
            {
                WaitUntil = WaitUntilState.NetworkIdle,
                Timeout = 20000
            });
        }
        catch (TimeoutException)
        {
            Console.WriteLine("Page load timeout, continuing anyway...");
        }

        // Use try-catch for element selection
        string content;
        try
        {
            content = await page.Locator(".main-content").TextContentAsync();
        }
        catch (TimeoutException)
        {
            Console.WriteLine("Element not found, using fallback selector");
            content = await page.Locator("body").TextContentAsync();
        }

        return content;
    }
    catch (PlaywrightException ex)
    {
        Console.WriteLine($"Playwright error: {ex.Message}");
        throw;
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Unexpected error: {ex.Message}");
        throw;
    }
}
```
Scraping Multiple Pages in Parallel
For large-scale scraping, you can use multiple browser contexts or pages in parallel:
```csharp
public async Task<List<string>> ScrapeMultipleUrls(List<string> urls)
{
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync();

    var tasks = urls.Select(async url =>
    {
        // Each page runs independently
        var page = await browser.NewPageAsync();
        try
        {
            await page.GotoAsync(url);
            return await page.TitleAsync();
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error scraping {url}: {ex.Message}");
            return null;
        }
        finally
        {
            // Close the page whether scraping succeeded or failed
            await page.CloseAsync();
        }
    });

    var results = await Task.WhenAll(tasks);
    return results.Where(r => r != null).ToList();
}
```
This pattern is useful when you need to scrape multiple pages efficiently, allowing you to maximize throughput while managing resources effectively.
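Unbounded parallelism can exhaust memory or trip a site's rate limits, so a common refinement is to cap concurrency with a `SemaphoreSlim`. A minimal, Playwright-independent sketch of the throttling pattern (the jobs here are placeholder delays standing in for real page loads):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Run many async jobs, but never more than maxConcurrency at once
async Task<List<T>> RunThrottledAsync<T>(IEnumerable<Func<Task<T>>> jobs, int maxConcurrency)
{
    using var gate = new SemaphoreSlim(maxConcurrency);
    var tasks = jobs.Select(async job =>
    {
        await gate.WaitAsync();
        try { return await job(); }
        finally { gate.Release(); }
    });
    return (await Task.WhenAll(tasks)).ToList();
}

// Demo: 10 placeholder "scrape" jobs, at most 3 in flight at a time
var urls = Enumerable.Range(1, 10).Select(i => $"https://example.com/page/{i}");
var titles = await RunThrottledAsync(
    urls.Select(u => (Func<Task<string>>)(async () =>
    {
        await Task.Delay(10); // stand-in for page.GotoAsync + TitleAsync
        return $"Title of {u}";
    })).ToList(),
    maxConcurrency: 3);
Console.WriteLine($"Scraped {titles.Count} pages");
```

In a real scraper, each job would create a page (or context), navigate, extract, and close, so the semaphore also bounds the number of open pages.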
Best Practices for Playwright Web Scraping
- Respect robots.txt: Always check if scraping is allowed
- Implement rate limiting: Don't overwhelm target servers
- Use appropriate timeouts: Balance between reliability and speed
- Handle errors gracefully: Implement retry logic for transient failures
- Run headless in production: Use `Headless = true` to save resources
- Dispose resources properly: Use `using` statements for proper cleanup
- Rotate user agents: Avoid detection by varying browser signatures
- Use stealth techniques: Disable automation flags when necessary
```csharp
// Example with best practices
var context = await browser.NewContextAsync(new BrowserNewContextOptions
{
    UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    Viewport = new ViewportSize { Width = 1920, Height = 1080 },
    Locale = "en-US",
    TimezoneId = "America/New_York"
});
```
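The "retry logic for transient failures" practice can be sketched as a small generic helper. This is Playwright-agnostic and the names are illustrative, not part of any library API:

```csharp
using System;
using System.Threading.Tasks;

// Retry an async operation with exponential backoff between attempts
async Task<T> RetryAsync<T>(Func<Task<T>> action, int maxAttempts = 3, int baseDelayMs = 500)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (Exception ex) when (attempt < maxAttempts)
        {
            Console.WriteLine($"Attempt {attempt} failed ({ex.Message}), retrying...");
            // Back off exponentially: baseDelayMs, 2x, 4x, ...
            await Task.Delay(baseDelayMs * (1 << (attempt - 1)));
        }
    }
}

// Demo: an operation that fails twice before succeeding
var calls = 0;
var value = await RetryAsync(async () =>
{
    calls++;
    await Task.Yield();
    if (calls < 3) throw new InvalidOperationException("transient failure");
    return "scraped content";
}, maxAttempts: 5, baseDelayMs: 1);
Console.WriteLine($"Succeeded after {calls} attempts: {value}");
```

Wrapping a `page.GotoAsync` call in such a helper makes the scraper resilient to flaky networks; the final attempt rethrows, so permanent failures still surface.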
When to Use Playwright vs. HTTP Libraries
Use Playwright when:
- The website heavily uses JavaScript to render content
- You need to interact with the page (click buttons, fill forms)
- You need to handle authentication flows
- The site uses anti-scraping measures that detect headless browsers
Use HTTP libraries (like HttpClient) when:
- The content is server-rendered HTML
- You only need to make API calls
- Performance and resource usage are critical concerns
- The scraping task is simple and doesn't require browser features
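For the simple server-rendered case, a plain `HttpClient` fetch plus lightweight parsing is often enough. A minimal sketch (the title-extraction regex is illustrative and only suitable for this narrow case; use a real HTML parser for anything more complex):

```csharp
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Extract the <title> from raw HTML; a regex is acceptable for this
// one well-defined tag, but not for general HTML parsing
string ExtractTitle(string html)
{
    var match = Regex.Match(html, @"<title[^>]*>\s*(.*?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    return match.Success ? match.Groups[1].Value : null;
}

// For a live server-rendered page you would fetch with:
//   using var http = new HttpClient();
//   var html = await http.GetStringAsync("https://example.com");
// Demo on an inline HTML snippet instead:
var html = "<html><head><title>Example Domain</title></head><body></body></html>";
Console.WriteLine($"Page title: {ExtractTitle(html)}");
```

The whole round trip costs a single HTTP request and a few megabytes of memory, versus a full browser process for Playwright, which is why HTTP-first is the right default when the content doesn't need JavaScript.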
Conclusion
Playwright provides a robust, modern solution for web scraping in C# with excellent support for JavaScript-heavy websites and complex user interactions. Its async/await support integrates seamlessly with C# best practices, making it an excellent choice for developers building scalable scraping solutions. Whether you're scraping dynamic content, handling authentication, or building a large-scale data extraction pipeline, Playwright offers the tools and flexibility you need.