How do I use Claude AI with web scraping tools like Selenium or Puppeteer?

Combining Claude AI with browser automation tools like Selenium and Puppeteer creates a powerful hybrid approach to web scraping. This integration allows you to leverage Puppeteer or Selenium for browser control and dynamic content rendering, while using Claude AI for intelligent data extraction and interpretation of complex HTML structures.

Why Combine Claude AI with Browser Automation?

Browser automation tools like Selenium and Puppeteer excel at rendering and navigating pages, but traditional selector-based extraction struggles with:

  • Complex or inconsistent HTML structures that change frequently
  • Unstructured data that requires contextual understanding
  • Dynamic content where selectors are unreliable
  • Natural language processing of extracted content

Claude AI complements these tools by providing:

  • Intelligent parsing of HTML without brittle selectors
  • Context-aware extraction that understands page semantics
  • Flexible data interpretation that adapts to layout changes
  • Natural language understanding for content analysis

Using Claude AI with Puppeteer

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium browsers. Here's how to integrate it with Claude AI:

Basic Puppeteer + Claude Integration

const puppeteer = require('puppeteer');
const Anthropic = require('@anthropic-ai/sdk');

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function scrapeWithClaudeAndPuppeteer(url) {
  // Launch browser and navigate
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for dynamic content to load
  await page.waitForSelector('body');

  // Extract the full HTML content
  const htmlContent = await page.content();

  // Close browser
  await browser.close();

  // Send HTML to Claude for intelligent extraction
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `Extract product information from this HTML page. Return the data as JSON with fields: title, price, description, availability.

HTML:
${htmlContent}`
    }]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
scrapeWithClaudeAndPuppeteer('https://example.com/product')
  .then(data => console.log(data));

Advanced Example: Handling Pagination

When dealing with paginated content, navigate to different pages using Puppeteer and use Claude to extract data from each page:

async function scrapeMultiplePages(baseUrl, maxPages = 5) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const allProducts = [];

  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    const url = `${baseUrl}?page=${pageNum}`;
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Wait for content to load
    await page.waitForSelector('.product-list', { timeout: 5000 });

    const htmlContent = await page.content();

    // Use Claude to extract structured data
    const message = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 4096,
      messages: [{
        role: 'user',
        content: `Extract all products from this page. Return as JSON array with fields: name, price, rating, url.

HTML:
${htmlContent}`
      }]
    });

    const pageProducts = JSON.parse(message.content[0].text);
    allProducts.push(...pageProducts);
  }

  await browser.close();
  return allProducts;
}

Handling Dynamic Content with waitFor

For pages with AJAX-loaded content, use Puppeteer to wait for the relevant elements to appear before sending the rendered HTML to Claude:

async function scrapeAjaxContent(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url);

  // Wait for AJAX content to load
  await page.waitForSelector('.ajax-content', { timeout: 10000 });

  // Additional wait for animations (page.waitForTimeout has been removed in
  // recent Puppeteer versions, so a plain timeout is used instead)
  await new Promise(resolve => setTimeout(resolve, 2000));

  const htmlContent = await page.content();
  await browser.close();

  // Claude extracts the data
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    messages: [{
      role: 'user',
      content: `Extract the main article content including title, author, date, and body text. Return as JSON.

${htmlContent}`
    }]
  });

  return JSON.parse(message.content[0].text);
}

Using Claude AI with Selenium

Selenium is a popular browser automation framework available in multiple languages. Here's how to use it with Claude in Python:

Basic Selenium + Claude Integration (Python)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from anthropic import Anthropic
import json
import os

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def scrape_with_selenium_and_claude(url):
    # Initialize Selenium WebDriver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    try:
        # Navigate to URL
        driver.get(url)

        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Get page source
        html_content = driver.page_source

        # Send to Claude for extraction
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": f"""Extract product details from this HTML.
                Return JSON with: title, price, description, images (array), specifications (object).

                HTML:
                {html_content}"""
            }]
        )

        return json.loads(message.content[0].text)

    finally:
        driver.quit()

# Usage
product_data = scrape_with_selenium_and_claude('https://example.com/product')
print(json.dumps(product_data, indent=2))

Handling Authentication with Selenium + Claude

def scrape_authenticated_content(url, username, password):
    driver = webdriver.Chrome()

    try:
        # Navigate to login page
        driver.get('https://example.com/login')

        # Fill in credentials
        driver.find_element(By.ID, 'username').send_keys(username)
        driver.find_element(By.ID, 'password').send_keys(password)
        driver.find_element(By.ID, 'login-button').click()

        # Wait for login to complete
        WebDriverWait(driver, 10).until(
            EC.url_changes('https://example.com/login')
        )

        # Navigate to protected page
        driver.get(url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "content"))
        )

        html_content = driver.page_source

        # Use Claude to extract data
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"Extract user dashboard data including account balance, recent transactions, and notifications. Return as JSON.\n\n{html_content}"
            }]
        )

        return json.loads(message.content[0].text)

    finally:
        driver.quit()

Extracting Data from Specific Elements

Instead of sending the entire page HTML, you can use Selenium to isolate specific sections before sending to Claude:

def scrape_specific_sections(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Wait for specific element
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "product-details"))
        )

        # Extract only relevant section
        product_section = driver.find_element(By.CLASS_NAME, "product-details")
        section_html = product_section.get_attribute('innerHTML')

        # Send focused HTML to Claude
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Extract product name, SKU, price, and stock status from this HTML:\n\n{section_html}"
            }]
        )

        return json.loads(message.content[0].text)

    finally:
        driver.quit()

Best Practices

1. Minimize HTML Size

Claude has token limits, so extract only necessary content:

// Instead of full page, extract specific sections
const relevantContent = await page.evaluate(() => {
  const main = document.querySelector('main');
  return main ? main.innerHTML : document.body.innerHTML;
});
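
The same idea applies on the Python/Selenium side. The sketch below (the helper name and the 100,000-character cap are illustrative assumptions; exact limits depend on the model's context window) prefers the <main> element and truncates whatever it finds:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

MAX_CHARS = 100_000  # rough character budget; tune to stay within the model's context window

def trim_html(driver):
    """Return the <main> element's HTML if present, else <body>, capped at MAX_CHARS."""
    try:
        html = driver.find_element(By.TAG_NAME, "main").get_attribute("innerHTML")
    except NoSuchElementException:
        html = driver.find_element(By.TAG_NAME, "body").get_attribute("innerHTML")
    return html[:MAX_CHARS]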

2. Handle Errors Gracefully

Implement proper error handling for both browser automation and API calls:

from selenium.common.exceptions import TimeoutException, NoSuchElementException

def safe_scrape(url):
    driver = webdriver.Chrome()

    try:
        driver.get(url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        html = driver.page_source

        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=2048,
                messages=[{"role": "user", "content": f"Extract data:\n{html}"}]
            )
            return json.loads(message.content[0].text)
        except Exception as e:
            print(f"Claude API error: {e}")
            return None

    except TimeoutException:
        print("Page load timeout")
        return None
    finally:
        driver.quit()

3. Use Structured Output

Guide Claude to return consistent JSON structures:

const prompt = `Extract data and return ONLY valid JSON in this exact format:
{
  "title": "string",
  "price": "number",
  "inStock": "boolean",
  "attributes": ["array", "of", "strings"]
}

HTML:
${htmlContent}`;

4. Implement Rate Limiting

Respect both website rate limits and Claude API rate limits:

import time

def scrape_multiple_urls(urls, delay=2):
    results = []

    for url in urls:
        data = scrape_with_selenium_and_claude(url)
        results.append(data)

        # Delay between requests
        time.sleep(delay)

    return results
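
A fixed delay protects the target website; on the Claude API side you may also want to retry when you hit a rate limit. A minimal sketch, assuming the anthropic Python SDK's RateLimitError (the helper name and retry counts are illustrative):

import time
import anthropic

def create_with_backoff(client, max_retries=5, **request):
    """Call client.messages.create, retrying with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**request)
        except anthropic.RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited by the Claude API, retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("Exhausted retries against the Claude API")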

Performance Considerations

  • Use headless mode for faster execution
  • Cache browser instances when scraping multiple pages
  • Extract minimal HTML to reduce token usage
  • Batch similar requests to optimize API calls
  • Use Claude's tool use (also called function calling) for more reliable structured output, as shown in the sketch after this list
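
Tool use lets you define a JSON schema and have Claude return data that conforms to it, instead of parsing free-text responses. A minimal sketch with the anthropic Python SDK, assuming a hypothetical record_product tool and a schema similar to the structured-output example above:

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Hypothetical tool whose input_schema defines the structure we want back
product_tool = {
    "name": "record_product",
    "description": "Record structured product data extracted from an HTML page",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["title", "price"],
    },
}

def extract_product(html_content):
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[product_tool],
        # Force Claude to answer via the tool so the output always matches the schema
        tool_choice={"type": "tool", "name": "record_product"},
        messages=[{
            "role": "user",
            "content": f"Extract the product from this HTML:\n\n{html_content}"
        }],
    )
    # The tool input arrives as an already-parsed dict, so no JSON parsing step is needed
    for block in message.content:
        if block.type == "tool_use":
            return block.input
    return None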

When to Use This Approach

The Selenium/Puppeteer + Claude combination works best for:

  • JavaScript-heavy sites requiring browser rendering
  • Complex layouts where CSS selectors are unreliable
  • Sites with frequent design changes
  • Data requiring interpretation beyond simple extraction
  • Multi-step workflows involving browser sessions

Conclusion

Integrating Claude AI with browser automation tools like Selenium and Puppeteer combines the best of both worlds: reliable browser control with intelligent, flexible data extraction. This approach is particularly valuable for scraping modern web applications where traditional parsing methods fall short.

The key is to use browser automation for navigation, interaction, and rendering, then leverage Claude's understanding of HTML structure and content for extraction—creating a robust, maintainable web scraping solution.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
