How can I extract text content from specific elements using Puppeteer-Sharp?
Extracting text content from specific HTML elements is one of the most common tasks in web scraping with Puppeteer-Sharp. This comprehensive guide will show you various methods to target and extract text from elements using CSS selectors, XPath expressions, and different text extraction techniques.
Understanding Text Extraction Methods
Puppeteer-Sharp provides several methods for extracting text content from elements:
GetPropertyAsync("textContent") - Gets all text, including text in hidden elements
GetPropertyAsync("innerText") - Gets visible text only, respecting CSS styling
GetPropertyAsync("innerHTML") - Gets HTML content including tags
EvaluateFunctionAsync() - Executes custom JavaScript for complex extractions
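To make the difference between textContent and innerText concrete, here is a minimal, self-contained sketch. It assumes a locally available browser and uses an inline page (set via SetContentAsync) containing a hidden span, so it does not depend on any real site:

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

class TextContentDemo
{
    static async Task Main()
    {
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        var page = await browser.NewPageAsync();

        // Inline page with a visually hidden span to illustrate the difference
        await page.SetContentAsync(
            "<div id='demo'>Visible text<span style='display:none'> hidden text</span></div>");

        var element = await page.QuerySelectorAsync("#demo");

        // textContent includes text from the hidden span
        var textContent = await (await element.GetPropertyAsync("textContent")).JsonValueAsync<string>();
        // innerText reflects rendering, so the hidden span is omitted
        var innerText = await (await element.GetPropertyAsync("innerText")).JsonValueAsync<string>();

        Console.WriteLine($"textContent: {textContent}"); // "Visible text hidden text"
        Console.WriteLine($"innerText: {innerText}");     // "Visible text"

        await browser.CloseAsync();
    }
}
```

In practice, prefer innerText when you want what a user actually sees, and textContent when you need everything in the DOM regardless of styling.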
Basic Text Extraction with CSS Selectors
Single Element Text Extraction
using System;
using System.Threading.Tasks;
using PuppeteerSharp;
class Program
{
static async Task Main(string[] args)
{
// Launch browser
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
Headless = true,
Args = new[] { "--no-sandbox", "--disable-setuid-sandbox" }
});
var page = await browser.NewPageAsync();
await page.GoToAsync("https://example.com");
// Extract text from a single element
var titleElement = await page.QuerySelectorAsync("h1.main-title");
if (titleElement != null)
{
var titleText = await titleElement.GetPropertyAsync("textContent");
var title = await titleText.JsonValueAsync<string>();
Console.WriteLine($"Title: {title}");
}
await browser.CloseAsync();
}
}
Multiple Elements Text Extraction
// Extract text from multiple elements
var articleElements = await page.QuerySelectorAllAsync("article .content p");
var paragraphTexts = new List<string>();
foreach (var element in articleElements)
{
var textProperty = await element.GetPropertyAsync("textContent");
var text = await textProperty.JsonValueAsync<string>();
paragraphTexts.Add(text?.Trim());
}
Console.WriteLine($"Found {paragraphTexts.Count} paragraphs");
paragraphTexts.ForEach(text => Console.WriteLine($"- {text}"));
Advanced Text Extraction Techniques
Using XPath Selectors
// Extract text using XPath expressions
// Note: in recent Puppeteer-Sharp versions XPathAsync is marked obsolete;
// prefer QuerySelectorAllAsync with an "xpath/" prefixed selector instead
var xpathElements = await page.XPathAsync("//div[@class='product-info']//span[contains(@class, 'price')]");
foreach (var element in xpathElements)
{
var priceText = await element.GetPropertyAsync("textContent");
var price = await priceText.JsonValueAsync<string>();
Console.WriteLine($"Price: {price}");
}
Extracting Specific Attributes and Text
// Extract both text content and attributes
var linkElements = await page.QuerySelectorAllAsync("a.product-link");
var productLinks = new List<(string text, string href)>();
foreach (var link in linkElements)
{
var textProperty = await link.GetPropertyAsync("textContent");
var hrefProperty = await link.GetPropertyAsync("href");
var text = await textProperty.JsonValueAsync<string>();
var href = await hrefProperty.JsonValueAsync<string>();
productLinks.Add((text?.Trim(), href));
}
productLinks.ForEach(link => Console.WriteLine($"Link: {link.text} -> {link.href}"));
Handling Dynamic Content
Waiting for Elements Before Extraction
// Wait for dynamic content to load before extracting text
await page.WaitForSelectorAsync(".dynamic-content", new WaitForSelectorOptions
{
Timeout = 10000 // 10 seconds timeout
});
var dynamicElement = await page.QuerySelectorAsync(".dynamic-content");
var dynamicText = await dynamicElement.GetPropertyAsync("innerText");
var content = await dynamicText.JsonValueAsync<string>();
Console.WriteLine($"Dynamic content: {content}");
Similar to how you might handle AJAX requests using Puppeteer, waiting for dynamic content ensures you capture text that loads asynchronously.
Extracting Text from Loaded Content
// Wait for network activity to complete before extraction
await page.WaitForNetworkIdleAsync();
// Extract text from elements that may have been populated by JavaScript
var resultsElements = await page.QuerySelectorAllAsync(".search-result .result-title");
var results = new List<string>();
foreach (var result in resultsElements)
{
var textContent = await result.EvaluateFunctionAsync<string>("el => el.textContent.trim()");
if (!string.IsNullOrEmpty(textContent))
{
results.Add(textContent);
}
}
Complex Text Extraction with JavaScript Evaluation
Custom Text Processing
// Use EvaluateFunctionAsync for complex text extraction logic
var extractedData = await page.EvaluateFunctionAsync<List<Dictionary<string, string>>>(
@"() => {
const items = [];
const elements = document.querySelectorAll('.product-card');
elements.forEach(el => {
const title = el.querySelector('.product-title')?.textContent?.trim();
const price = el.querySelector('.product-price')?.textContent?.trim();
const description = el.querySelector('.product-desc')?.textContent?.trim();
if (title && price) {
items.push({
title: title,
price: price,
description: description || 'No description'
});
}
});
return items;
}"
);
foreach (var item in extractedData)
{
Console.WriteLine($"Product: {item["title"]} - {item["price"]}");
Console.WriteLine($"Description: {item["description"]}\n");
}
Extracting Formatted Text
// Extract text while preserving some formatting
var formattedText = await page.EvaluateFunctionAsync<string>(
@"(selector) => {
const element = document.querySelector(selector);
if (!element) return null;
// Replace <br> tags with newlines and strip other HTML
let text = element.innerHTML
.replace(/<br\s*\/?>/gi, '\n')
.replace(/<[^>]*>/g, '')
.replace(/\s+/g, ' ')
.trim();
return text;
}",
".article-content"
);
Console.WriteLine($"Formatted content:\n{formattedText}");
Error Handling and Best Practices
Robust Text Extraction
public static async Task<string> SafeExtractText(IPage page, string selector, string fallback = "")
{
try
{
var element = await page.QuerySelectorAsync(selector);
if (element == null)
{
Console.WriteLine($"Element not found: {selector}");
return fallback;
}
var textProperty = await element.GetPropertyAsync("textContent");
var text = await textProperty.JsonValueAsync<string>();
return text?.Trim() ?? fallback;
}
catch (Exception ex)
{
Console.WriteLine($"Error extracting text from {selector}: {ex.Message}");
return fallback;
}
}
// Usage
var title = await SafeExtractText(page, "h1.title", "No title found");
var description = await SafeExtractText(page, ".description", "No description available");
Batch Text Extraction
public static async Task<Dictionary<string, string>> ExtractMultipleTexts(IPage page, Dictionary<string, string> selectors)
{
var results = new Dictionary<string, string>();
foreach (var kvp in selectors)
{
var key = kvp.Key;
var selector = kvp.Value;
try
{
var text = await page.EvaluateFunctionAsync<string>(
@"(sel) => {
const el = document.querySelector(sel);
return el ? el.textContent.trim() : null;
}",
selector
);
results[key] = text ?? "Not found";
}
catch (Exception ex)
{
Console.WriteLine($"Error extracting {key}: {ex.Message}");
results[key] = "Error occurred";
}
}
return results;
}
// Usage
var selectorsToExtract = new Dictionary<string, string>
{
{"title", "h1.main-title"},
{"price", ".price-current"},
{"availability", ".stock-status"},
{"rating", ".rating-value"}
};
var extractedTexts = await ExtractMultipleTexts(page, selectorsToExtract);
foreach (var result in extractedTexts)
{
Console.WriteLine($"{result.Key}: {result.Value}");
}
Working with Tables and Structured Data
Extracting Table Data
// Extract data from HTML tables
var tableData = await page.EvaluateFunctionAsync<List<List<string>>>(
@"(tableSelector) => {
const table = document.querySelector(tableSelector);
if (!table) return [];
const rows = Array.from(table.querySelectorAll('tr'));
return rows.map(row => {
const cells = Array.from(row.querySelectorAll('td, th'));
return cells.map(cell => cell.textContent.trim());
});
}",
"table.data-table"
);
// Process table data
for (int i = 0; i < tableData.Count; i++)
{
var row = tableData[i];
Console.WriteLine($"Row {i + 1}: {string.Join(" | ", row)}");
}
Performance Optimization
Efficient Text Extraction
// Batch extract multiple text elements in a single JavaScript execution
var batchResults = await page.EvaluateFunctionAsync<Dictionary<string, object>>(
@"() => {
const results = {};
// Extract multiple types of elements at once
results.headings = Array.from(document.querySelectorAll('h1, h2, h3'))
.map(el => el.textContent.trim());
results.paragraphs = Array.from(document.querySelectorAll('p'))
.map(el => el.textContent.trim())
.filter(text => text.length > 0);
results.links = Array.from(document.querySelectorAll('a[href]'))
.map(el => ({
text: el.textContent.trim(),
href: el.href
}));
return results;
}"
);
// JArray requires "using Newtonsoft.Json.Linq;" and applies to Puppeteer-Sharp
// versions built on Newtonsoft.Json; newer versions that use System.Text.Json
// return JsonElement values instead
var headings = batchResults["headings"] as JArray;
var paragraphs = batchResults["paragraphs"] as JArray;
Console.WriteLine($"Found {headings?.Count} headings");
Console.WriteLine($"Found {paragraphs?.Count} paragraphs");
Integration with Navigation
When extracting text from multiple pages, you can combine text extraction with navigation techniques using Puppeteer. This allows you to scrape text content across entire websites systematically.
// Navigate through pages and extract text
var urls = new[] { "page1.html", "page2.html", "page3.html" };
var allExtractedTexts = new List<Dictionary<string, string>>();
foreach (var url in urls)
{
await page.GoToAsync($"https://example.com/{url}");
await page.WaitForNetworkIdleAsync();
var pageTexts = await ExtractMultipleTexts(page, selectorsToExtract);
pageTexts["source_url"] = url;
allExtractedTexts.Add(pageTexts);
}
Handling Special Text Scenarios
Extracting Text from Shadow DOM
// Extract text from elements within Shadow DOM
var shadowText = await page.EvaluateFunctionAsync<string>(
@"() => {
const hostElement = document.querySelector('#shadow-host');
if (!hostElement || !hostElement.shadowRoot) return null;
const shadowElement = hostElement.shadowRoot.querySelector('.shadow-content');
return shadowElement ? shadowElement.textContent.trim() : null;
}"
);
if (!string.IsNullOrEmpty(shadowText))
{
Console.WriteLine($"Shadow DOM text: {shadowText}");
}
Text Extraction with Custom Filters
// Extract and filter text based on specific criteria
var filteredTexts = await page.EvaluateFunctionAsync<List<string>>(
@"(minLength) => {
const elements = document.querySelectorAll('p, div, span');
const texts = [];
elements.forEach(el => {
const text = el.textContent.trim();
if (text.length >= minLength && !text.match(/^\d+$/)) {
texts.push(text);
}
});
return texts;
}",
50 // minimum text length
);
filteredTexts.ForEach(text => Console.WriteLine($"Filtered text: {text}"));
Working with Wait Conditions
For pages with dynamic content, you may need wait conditions, similar to using Puppeteer's waitFor functions:
// Wait for specific text to appear before extraction
await page.WaitForFunctionAsync(
@"() => {
const element = document.querySelector('.loading-content');
return element && element.textContent.includes('Data loaded');
}",
new WaitForFunctionOptions { Timeout = 30000 }
);
// Now extract the loaded content
var loadedContent = await SafeExtractText(page, ".content-container", "Content not found");
Console.WriteLine($"Loaded content: {loadedContent}");
Conclusion
Puppeteer-Sharp provides powerful and flexible methods for extracting text content from web pages. Whether you need simple text extraction from individual elements or complex batch processing of structured data, the techniques shown in this guide will help you build robust web scraping solutions. Remember to handle errors gracefully, wait for dynamic content to load, and optimize your extraction logic for better performance.
The key to successful text extraction is choosing the right method for your specific use case: use textContent for all text including hidden elements, innerText for visible text only, and custom JavaScript evaluation for complex extraction logic.