How do I use Claude Sonnet for web scraping?
Claude Sonnet is Anthropic's flagship large language model that offers an exceptional balance of intelligence, speed, and cost-effectiveness for web scraping tasks. Claude 3.5 Sonnet, the latest version, excels at understanding HTML structure, extracting structured data from unstructured content, and adapting to dynamic website layouts without requiring brittle CSS selectors or XPath expressions. This makes it an ideal choice for modern web scraping workflows where websites frequently change their structure.
Understanding Claude Sonnet for Web Scraping
Claude Sonnet sits in the middle of Anthropic's model family, offering better performance than Claude Haiku (the fastest model) while being more cost-effective than Claude Opus (the most powerful model). For web scraping specifically, Claude 3.5 Sonnet provides:
- Large Context Window: 200,000 tokens, allowing processing of entire web pages
- Intelligent Data Extraction: Semantic understanding of HTML content
- Structured Output: Reliable JSON output (via careful prompting or tool use) for easy integration with data pipelines
- Vision Capabilities: Ability to analyze screenshots for visual scraping
- Fast Response Times: Often just a few seconds for extraction tasks
- Cost Efficiency: $3 per million input tokens, $15 per million output tokens
Unlike traditional web scraping that breaks when websites update their HTML structure, Claude Sonnet understands content contextually, making your scrapers more resilient to changes.
Getting Started with Claude Sonnet API
Installation and Setup
First, install the Anthropic SDK for your preferred language:
Python:
pip install anthropic
JavaScript/Node.js:
npm install @anthropic-ai/sdk
Set up your API key:
export ANTHROPIC_API_KEY='your-api-key-here'
Basic Web Scraping with Claude Sonnet
Here's a simple example of using Claude Sonnet to extract product information from an e-commerce page:
Python Example:
import anthropic
import requests
import json

# Initialize the Claude client
client = anthropic.Anthropic(api_key="your-api-key")

# Fetch the web page
url = "https://example.com/products/laptop"
response = requests.get(url)
html_content = response.text

# Use Claude Sonnet to extract structured data
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": f"""Extract product information from this HTML and return as JSON.

HTML:
{html_content}

Extract the following fields:
- product_name (string)
- price (number)
- currency (string)
- in_stock (boolean)
- rating (number, 0-5)
- review_count (number)
- description (string)
- specifications (object)

Return ONLY valid JSON, no additional text."""
        }
    ]
)

# Parse the extracted data
product_data = json.loads(message.content[0].text)
print(json.dumps(product_data, indent=2))
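In practice, Claude occasionally wraps its answer in markdown code fences even when told to return raw JSON. A small defensive parser keeps the pipeline from failing on those responses (a sketch; parse_json_response is a hypothetical helper, not part of the SDK):

import json
import re

def parse_json_response(text):
    """Strip optional ```json ... ``` fences before parsing (hypothetical helper)."""
    cleaned = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip())
    return json.loads(cleaned)

# Usage with the message returned above:
# product_data = parse_json_response(message.content[0].text)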
JavaScript Example:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function scrapeProductData(url) {
  // Fetch the HTML content
  const response = await axios.get(url);
  const html = response.data;

  // Extract data using Claude Sonnet
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [
      {
        role: 'user',
        content: `Extract product information from this HTML:

${html}

Return JSON with: name, price, availability, rating, features (array), and images (array of URLs).
Return ONLY valid JSON.`
      }
    ]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
scrapeProductData('https://example.com/product/123')
  .then(data => console.log(data))
  .catch(error => console.error('Scraping error:', error));
Advanced Claude Sonnet Web Scraping Techniques
1. Multi-Page Scraping with Intelligent Navigation
Claude Sonnet can analyze page structure to identify pagination links and navigation elements, making it easier to navigate to different pages programmatically:
Python Example with Pagination:
import anthropic
import requests
import json

def scrape_all_pages(start_url):
    client = anthropic.Anthropic(api_key="your-api-key")
    all_products = []
    current_url = start_url

    while current_url:
        # Fetch page
        response = requests.get(current_url)
        html = response.text

        # Extract products from current page
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            messages=[
                {
                    "role": "user",
                    "content": f"""Analyze this e-commerce listing page:

{html}

1. Extract all products as a JSON array with: name, price, url
2. Find the "Next Page" URL if it exists

Return JSON: {{"products": [...], "next_page_url": "url or null"}}"""
                }
            ]
        )

        result = json.loads(message.content[0].text)
        all_products.extend(result['products'])

        # Move to next page
        current_url = result.get('next_page_url')
        if current_url:
            print(f"Moving to next page: {current_url}")

    return all_products
# Scrape all pages
products = scrape_all_pages('https://example.com/products?page=1')
print(f"Total products scraped: {len(products)}")
2. Combining Claude Sonnet with Browser Automation
For JavaScript-heavy websites, combine Claude Sonnet with a headless browser such as Puppeteer (or Pyppeteer in Python) so the page is fully rendered before extraction. This is especially useful for content loaded via AJAX:
Python Example with Pyppeteer:
import asyncio
from pyppeteer import launch
import anthropic
import json

async def scrape_dynamic_content(url):
    # Launch headless browser
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, {'waitUntil': 'networkidle0'})

    # Wait for dynamic content to load
    await page.waitForSelector('.product-list')

    # Get rendered HTML
    html = await page.content()
    await browser.close()

    # Use Claude Sonnet to extract data
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": f"""Extract all product listings from this rendered HTML:

{html}

Return as JSON array with: title, price, image_url, product_url, discount_percentage"""
            }
        ]
    )

    return json.loads(message.content[0].text)

# Run the scraper
products = asyncio.get_event_loop().run_until_complete(
    scrape_dynamic_content('https://example.com/sale')
)
print(json.dumps(products, indent=2))
JavaScript Example with Puppeteer:
const puppeteer = require('puppeteer');
const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function scrapeDynamicPage(url) {
  // Launch browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Get rendered HTML after JavaScript execution
  const html = await page.content();
  await browser.close();

  // Use Claude to extract structured data
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 8192,
    messages: [
      {
        role: 'user',
        content: `Extract job listings from this HTML:

${html}

Return JSON array with: job_title, company, location, salary_range, posted_date, job_type`
      }
    ]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
scrapeDynamicPage('https://example.com/jobs')
  .then(jobs => console.log(jobs));
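Headless browsers also make it easy to capture screenshots, which pairs well with the vision capability listed at the top of this article: instead of sending rendered HTML, you can send an image of the page and let Claude read it visually. A minimal Python sketch, assuming a screenshot has already been saved (screenshot.png is a placeholder path):

import base64
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Assumes a screenshot was already captured (e.g. with Puppeteer's page.screenshot);
# "screenshot.png" is a placeholder path.
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Extract the product name, price, and availability visible in this screenshot. Return ONLY valid JSON.",
            },
        ],
    }],
)

print(message.content[0].text)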
3. Table and List Extraction
Claude Sonnet excels at parsing complex tables and nested lists that would require extensive XPath or CSS selector logic:
Python Example - Complex Table Extraction:
import anthropic
import requests
import json

def extract_comparison_table(url):
    html = requests.get(url).text
    client = anthropic.Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": f"""Extract the pricing comparison table from this HTML:

{html}

Convert to JSON with this structure:
{{
  "plans": [
    {{
      "name": "plan name",
      "price": {{
        "monthly": number,
        "annual": number,
        "currency": "USD"
      }},
      "features": [
        {{
          "name": "feature name",
          "included": boolean,
          "limit": "string or null"
        }}
      ],
      "highlighted": boolean
    }}
  ]
}}

Return ONLY valid JSON."""
            }
        ]
    )

    return json.loads(message.content[0].text)
# Extract pricing data
pricing = extract_comparison_table('https://example.com/pricing')
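The nested structure Claude returns is convenient for programs but less so for spreadsheets. A short sketch that flattens the plans into one row per plan/feature pair, assuming pandas is installed (pricing_to_rows is a hypothetical helper):

import pandas as pd

def pricing_to_rows(pricing):
    """Flatten the nested plan/feature structure into one row per (plan, feature)."""
    rows = []
    for plan in pricing.get("plans", []):
        for feature in plan.get("features", []):
            rows.append({
                "plan": plan.get("name"),
                "monthly_price": plan.get("price", {}).get("monthly"),
                "feature": feature.get("name"),
                "included": feature.get("included"),
                "limit": feature.get("limit"),
            })
    return pd.DataFrame(rows)

# df = pricing_to_rows(pricing)
# df.to_csv("pricing_comparison.csv", index=False)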
4. Handling Authentication and Protected Content
When scraping content behind a login, combine session management with Claude Sonnet for intelligent extraction:
Python Example with Session Authentication:
import anthropic
import requests
import json

def scrape_authenticated_content(login_url, target_url, credentials):
    # Create session and login
    session = requests.Session()
    session.post(login_url, data={
        'username': credentials['username'],
        'password': credentials['password']
    })

    # Fetch protected content
    response = session.get(target_url)
    html = response.text

    # Extract data with Claude
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract user account information from this dashboard HTML:

{html}

Return JSON with: account_balance, recent_transactions (array), account_status, subscription_tier"""
            }
        ]
    )

    return json.loads(message.content[0].text)
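Logging in on every run is wasteful and can trip rate limits or bot detection. One mitigation is to persist the session's cookies between runs; a minimal sketch, assuming a simple cookie-based login (session_cookies.pkl is a placeholder path):

import os
import pickle
import requests

COOKIE_FILE = "session_cookies.pkl"  # placeholder path

def load_session():
    """Reuse saved cookies if present, so repeated runs can skip the login step."""
    session = requests.Session()
    if os.path.exists(COOKIE_FILE):
        with open(COOKIE_FILE, "rb") as f:
            session.cookies.update(pickle.load(f))
    return session

def save_session(session):
    """Persist the session's cookies after a successful login."""
    with open(COOKIE_FILE, "wb") as f:
        pickle.dump(session.cookies, f)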
Optimizing Claude Sonnet for Web Scraping
1. Reduce Token Usage and Costs
Claude Sonnet pricing is based on tokens, so optimizing HTML input is crucial:
HTML Optimization Techniques:
from bs4 import BeautifulSoup, Comment
import re

def optimize_html_for_claude(html):
    """
    Remove unnecessary elements to reduce token count
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and other non-content elements
    for tag in soup(['script', 'style', 'svg', 'noscript', 'iframe']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove unnecessary attributes
    for tag in soup.find_all(True):
        # Keep only class and id attributes
        attrs_to_keep = ['class', 'id']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}

    # Get cleaned HTML
    cleaned_html = str(soup)

    # Remove excessive whitespace
    cleaned_html = re.sub(r'\s+', ' ', cleaned_html)

    return cleaned_html
# Usage
optimized_html = optimize_html_for_claude(raw_html)
# This can reduce token usage by 50-70%
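To measure what the cleanup actually saves, recent versions of the Anthropic Python SDK expose a token-counting endpoint; a sketch, assuming client.messages.count_tokens is available in your SDK version:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def count_input_tokens(text):
    # Token Counting API; available in recent SDK versions (treat as an assumption)
    result = client.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

# print(count_input_tokens(raw_html), "->", count_input_tokens(optimized_html))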
2. Smart Caching Strategy
Implement caching to avoid re-processing identical pages:
import hashlib
import json
import os
from datetime import datetime, timedelta

import anthropic
import requests

class ClaudeScrapeCache:
    def __init__(self, cache_dir='./scrape_cache', ttl_hours=24):
        self.cache_dir = cache_dir
        self.ttl = timedelta(hours=ttl_hours)
        os.makedirs(cache_dir, exist_ok=True)

    def _get_cache_key(self, html, prompt):
        content = f"{html}{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, html, prompt):
        cache_key = self._get_cache_key(html, prompt)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")

        if os.path.exists(cache_file):
            # Check if cache is still valid
            file_time = datetime.fromtimestamp(os.path.getmtime(cache_file))
            if datetime.now() - file_time < self.ttl:
                with open(cache_file, 'r') as f:
                    return json.load(f)
        return None

    def set(self, html, prompt, data):
        cache_key = self._get_cache_key(html, prompt)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")
        with open(cache_file, 'w') as f:
            json.dump(data, f)

# Usage
cache = ClaudeScrapeCache(ttl_hours=12)

def scrape_with_cache(url, prompt):
    html = requests.get(url).text

    # Check cache first
    cached_result = cache.get(html, prompt)
    if cached_result:
        print("Using cached result")
        return cached_result

    # Call Claude if not cached
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt}\n\n{html}"}]
    )

    result = json.loads(message.content[0].text)
    cache.set(html, prompt, result)
    return result
3. Batch Processing for Efficiency
Process multiple pages efficiently using concurrent requests:
import asyncio
import json

import aiohttp
import anthropic

async def fetch_html(session, url):
    async with session.get(url) as response:
        return await response.text()

async def extract_with_claude(client, html, prompt):
    message = await client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt}\n\n{html}"}]
    )
    return json.loads(message.content[0].text)

async def scrape_multiple_urls(urls, extraction_prompt):
    client = anthropic.AsyncAnthropic(api_key="your-api-key")

    async with aiohttp.ClientSession() as session:
        # Fetch all HTML concurrently
        html_tasks = [fetch_html(session, url) for url in urls]
        html_contents = await asyncio.gather(*html_tasks)

        # Extract data concurrently
        extract_tasks = [
            extract_with_claude(client, html, extraction_prompt)
            for html in html_contents
        ]
        results = await asyncio.gather(*extract_tasks)

    return results
# Scrape 10 pages concurrently
urls = [f"https://example.com/products?page={i}" for i in range(1, 11)]
results = asyncio.run(scrape_multiple_urls(urls, "Extract all products as JSON array"))
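Gathering one Claude call per page all at once can run into API rate limits as the URL list grows. A common guard is to bound concurrency with an asyncio.Semaphore; a sketch of a bounded wrapper around extract_with_claude above (the limit of 5 is an arbitrary assumption):

import asyncio

semaphore = asyncio.Semaphore(5)  # at most 5 Claude requests in flight

async def extract_with_limit(client, html, prompt):
    async with semaphore:
        return await extract_with_claude(client, html, prompt)

# In scrape_multiple_urls, build extract_tasks with extract_with_limit
# instead of extract_with_claude to keep concurrency bounded.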
Best Practices for Claude Sonnet Web Scraping
1. Crafting Effective Prompts
The quality of your extraction depends heavily on prompt engineering:
# ❌ Poor prompt
prompt = "Get the data from this page"
# ✅ Excellent prompt
prompt = """Extract product information from this e-commerce page.
Required fields:
- product_name: The main product title (string)
- price: Current price in cents (number)
- original_price: Original price if on sale, null otherwise (number or null)
- availability: "in_stock", "out_of_stock", or "preorder" (string)
- images: Array of product image URLs (array of strings)
- specifications: Object with technical specs (object)
Rules:
- Convert all prices to cents (multiply by 100)
- Extract only high-resolution image URLs
- If a field is not found, use null
- Return ONLY valid JSON, no additional text
Example output:
{
"product_name": "Example Product",
"price": 2999,
"original_price": 3999,
"availability": "in_stock",
"images": ["https://example.com/img1.jpg"],
"specifications": {"color": "blue", "size": "large"}
}"""
2. Error Handling and Validation
Always implement robust error handling around API calls and response parsing:
import json
import time

import anthropic

def safe_claude_extraction(html, prompt, retries=3):
    client = anthropic.Anthropic(api_key="your-api-key")

    for attempt in range(retries):
        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=4096,
                messages=[{"role": "user", "content": f"{prompt}\n\n{html}"}]
            )

            # Attempt to parse JSON
            result = json.loads(message.content[0].text)

            # Validate required fields
            required_fields = ['name', 'price']
            missing = [field for field in required_fields if field not in result]
            if not missing:
                return result
            else:
                raise ValueError(f"Missing required fields: {missing}")

        except json.JSONDecodeError as e:
            print(f"Attempt {attempt + 1}: Invalid JSON - {e}")
            if attempt == retries - 1:
                raise
        except anthropic.APIError as e:
            print(f"Attempt {attempt + 1}: API Error - {e}")
            if attempt < retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

    return None
3. Rate Limiting and Respectful Scraping
Implement rate limiting to avoid overloading servers:
import time
from functools import wraps

def rate_limit(calls_per_minute=20):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            left_to_wait = min_interval - elapsed
            if left_to_wait > 0:
                time.sleep(left_to_wait)
            ret = func(*args, **kwargs)
            last_called[0] = time.time()
            return ret
        return wrapper
    return decorator

@rate_limit(calls_per_minute=10)
def scrape_with_rate_limit(url):
    # Your scraping logic here
    pass
Comparing Claude Sonnet to Other Models
When to Use Claude Sonnet vs. Haiku vs. Opus
- Claude 3.5 Sonnet: Best for most web scraping tasks. Balanced intelligence and cost.
- Claude 3 Haiku: Use for simple, high-volume extraction where speed and cost are priorities.
- Claude 3 Opus: Reserve for complex, multi-step reasoning or when maximum accuracy is critical.
Cost Comparison Example:
# Approximate costs for processing a 50KB HTML page (≈12,500 tokens)
# and generating 1KB structured output (≈250 tokens)
models_cost = {
    "claude-3-haiku-20240307": {
        "input": 12500 * 0.25 / 1_000_000,   # $0.003125
        "output": 250 * 1.25 / 1_000_000,    # $0.0003125
        "total": "$0.0034"
    },
    "claude-3-5-sonnet-20241022": {
        "input": 12500 * 3 / 1_000_000,      # $0.0375
        "output": 250 * 15 / 1_000_000,      # $0.00375
        "total": "$0.041"
    },
    "claude-3-opus-20240229": {
        "input": 12500 * 15 / 1_000_000,     # $0.1875
        "output": 250 * 75 / 1_000_000,      # $0.01875
        "total": "$0.206"
    }
}
# For 1,000 pages:
# Haiku: ~$3.40
# Sonnet: ~$41
# Opus: ~$206
Real-World Use Cases
E-commerce Price Monitoring
import anthropic
import requests
import json
from datetime import datetime

def monitor_competitor_prices(product_urls):
    client = anthropic.Anthropic(api_key="your-api-key")
    price_data = []

    for url in product_urls:
        html = requests.get(url).text

        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Extract pricing info from this product page:

{html}

Return JSON: {{"product_name": "...", "current_price": number, "currency": "...", "in_stock": boolean}}"""
            }]
        )

        data = json.loads(message.content[0].text)
        data['url'] = url
        data['timestamp'] = datetime.now().isoformat()
        price_data.append(data)

    return price_data
News Article Extraction
import anthropic
import requests
import json

def extract_article_content(article_url):
    html = requests.get(article_url).text
    client = anthropic.Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[{
            "role": "user",
            "content": f"""Extract article content from this news page:

{html}

Return JSON with:
- headline (string)
- author (string or null)
- published_date (ISO 8601 string)
- article_text (string, full article body)
- tags (array of strings)
- image_url (string or null)"""
        }]
    )

    return json.loads(message.content[0].text)
Conclusion
Claude 3.5 Sonnet provides a powerful, intelligent approach to web scraping that significantly reduces maintenance overhead compared to traditional selector-based methods. Its ability to understand context, adapt to layout changes, and extract structured data from complex HTML makes it ideal for modern web scraping workflows.
While Claude Sonnet adds API costs to your scraping operations, the benefits often outweigh the expenses:
- Reduced Development Time: Write extraction logic in minutes, not hours
- Lower Maintenance: Far less likely to break when websites update their HTML
- Better Accuracy: Semantic understanding reduces extraction errors
- Flexibility: Handles edge cases and variations automatically
For the best results, combine Claude Sonnet with traditional scraping tools: use browser automation for JavaScript-heavy sites, implement caching to control costs, and optimize your HTML input to reduce token usage. This hybrid approach gives you the reliability of conventional scraping with the intelligence and adaptability of AI-powered extraction.
Whether you're building a price monitoring system, aggregating news content, or extracting product data at scale, Claude Sonnet offers a modern solution that adapts to the ever-changing landscape of web scraping.