What is the Claude API documentation for web scraping?
The Claude API documentation for web scraping is Anthropic's official developer resource that provides comprehensive guidance on using Claude's AI capabilities for extracting, parsing, and structuring data from web content. The documentation is available at https://docs.anthropic.com/ and includes detailed information about API endpoints, authentication, request/response formats, and best practices for implementing AI-powered web scraping workflows.
Unlike traditional web scraping tools that rely on rigid XPath or CSS selectors, the Claude API enables intelligent data extraction using natural language prompts. This approach lets developers describe what data they need rather than how to extract it, making scrapers more resilient to website changes.
Understanding the Claude API Structure
The Claude API documentation is organized into several key sections that are particularly relevant for web scraping:
Messages API
The core endpoint for web scraping is the Messages API (/v1/messages), which accepts text input (including HTML content) and returns structured responses based on your instructions.
Basic Request Structure:
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Extract product name, price, and rating from this HTML: <html>...</html>"
        }
    ]
)

print(message.content)
JavaScript/Node.js Example:
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: 'Extract product details from this HTML: <html>...</html>'
    }
  ]
});

console.log(message.content);
Available Models for Web Scraping
According to the documentation, Claude offers several models optimized for different use cases (a small selection helper is sketched after the list):
- Claude 3.5 Sonnet: Best balance of intelligence and speed for most scraping tasks
- Claude 3 Opus: Highest accuracy for complex data extraction
- Claude 3 Haiku: Fastest and most cost-effective for simple extraction tasks
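Model IDs change as new versions ship, so it can help to centralize the choice in one place. Below is a hypothetical helper, not part of the documentation; the IDs were current when this was written, so check the models page for the latest:

# Hypothetical helper: pick a model ID by task complexity.
# IDs current at the time of writing; check the docs for newer versions.
MODELS = {
    "simple": "claude-3-haiku-20240307",       # fast and cheap: clean HTML, few fields
    "standard": "claude-3-5-sonnet-20241022",  # sensible default for most scraping
    "complex": "claude-3-opus-20240229",       # messy markup, nuanced extraction
}

def pick_model(task_complexity: str) -> str:
    """Return a model ID for the given complexity tier."""
    return MODELS.get(task_complexity, MODELS["standard"])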
Key Features for Web Scraping
1. Structured Output with Tool Use
The Claude API documentation describes "tool use" (function calling) as a powerful feature for getting structured JSON output from unstructured web content. This is essential for web scraping workflows.
import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

html_content = "<html>...</html>"  # page HTML fetched elsewhere

# Define extraction schema
tools = [
    {
        "name": "extract_product_data",
        "description": "Extract structured product information from HTML",
        "input_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "number"},
                            "rating": {"type": "number"},
                            "availability": {"type": "string"}
                        },
                        "required": ["name", "price"]
                    }
                }
            },
            "required": ["products"]
        }
    }
]

# Send HTML content with tool definition
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    tools=tools,
    messages=[
        {
            "role": "user",
            "content": f"Extract all products from this e-commerce page: {html_content}"
        }
    ]
)

# Parse structured output
for content in response.content:
    if content.type == "tool_use":
        extracted_data = content.input
        print(json.dumps(extracted_data, indent=2))
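The tool use documentation also covers the tool_choice parameter, which forces Claude to respond through a named tool instead of free-form text, useful when a scraping pipeline must always receive structured data. Building on the request above:

# Force the response to come through the extraction tool
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_product_data"},
    messages=[
        {"role": "user", "content": f"Extract all products from this page: {html_content}"}
    ]
)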
2. Vision Capabilities for Screenshot Analysis
The documentation highlights Claude's vision capabilities, which allow processing screenshots alongside HTML. This is particularly useful when dealing with dynamically rendered content, similar to handling AJAX requests using Puppeteer.
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Read screenshot as base64
const imageData = fs.readFileSync('screenshot.png').toString('base64');

const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 2048,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'base64',
            media_type: 'image/png',
            data: imageData,
          },
        },
        {
          type: 'text',
          text: 'Extract all visible product information from this e-commerce page screenshot.'
        }
      ],
    },
  ],
});

console.log(message.content);
3. Context Windows and Token Limits
The documentation specifies context windows for each model, which is crucial for processing large HTML documents:
- Claude 3.5 Sonnet: 200,000 tokens (~150,000 words)
- Claude 3 Opus: 200,000 tokens
- Claude 3 Haiku: 200,000 tokens
For very large web pages, you may need to chunk the HTML content or extract only the relevant sections before sending to the API.
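The documentation does not prescribe a particular chunking strategy. A minimal sketch, assuming the rough heuristic of about four characters per token (use the API's token counting endpoint when you need exact figures):

def chunk_html(html: str, max_tokens: int = 150_000, chars_per_token: int = 4) -> list[str]:
    """Split HTML into pieces that fit a rough token budget.

    Relies on the ~4 characters-per-token heuristic, which is an
    approximation; real token counts depend on the tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    return [html[i:i + max_chars] for i in range(0, len(html), max_chars)]

# Process each chunk independently and merge the extracted records afterwards
raw_html = "<html>...</html>"  # large page HTML fetched elsewhere
for chunk in chunk_html(raw_html):
    pass  # send chunk to the Messages API as in the examples above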
Authentication and API Keys
According to the official documentation, authentication requires an API key obtained from the Anthropic Console:
# Set environment variable
export ANTHROPIC_API_KEY='your-api-key-here'

# Python: Using environment variable
import os
from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY")
)

// JavaScript: Using environment variable
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});
Rate Limits and Pricing
The Claude API documentation outlines tier-based rate limits that grow with your usage history. Exact limits vary by model and change over time, so consult the documentation for current values; representative requests-per-minute figures are:
- Tier 1: 5 requests per minute
- Tier 2: 50 requests per minute
- Tier 3: 1,000 requests per minute
- Tier 4: 4,000 requests per minute
For web scraping at scale, you'll need to implement rate limiting (see the sketch after the price list) and potentially upgrade your tier. Pricing is based on tokens processed:
- Input tokens: $3 per million tokens (Sonnet)
- Output tokens: $15 per million tokens (Sonnet)
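Client-side throttling is left to you. A minimal sketch of a requests-per-minute limiter (the rpm value here is an assumption; set it from your tier's documented limit):

import time

class RateLimiter:
    """Naive client-side throttle: at most `rpm` requests per minute."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough to keep the average rate under the cap
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter(rpm=50)  # assumed value; use your tier's actual limit
limiter.wait()                 # call before each client.messages.create(...)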
Best Practices from the Documentation
1. Preprocessing HTML
The documentation recommends cleaning HTML before sending it to the API to reduce token usage:
from bs4 import BeautifulSoup, Comment
import anthropic

def clean_html(html_content):
    """Remove scripts, styles, and comments that waste tokens"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script and style tags
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()
    # Remove HTML comments (BeautifulSoup represents them as Comment nodes)
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)

# Use cleaned HTML
raw_html = "<html>...</html>"  # page HTML fetched elsewhere
cleaned_html = clean_html(raw_html)

client = anthropic.Anthropic(api_key="your-api-key")
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"Extract product data from: {cleaned_html}"
        }
    ]
)
2. Using System Prompts
The documentation describes system prompts as a way to set consistent behavior across multiple scraping requests:
const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 2048,
  system: 'You are a web scraping assistant. Always return data in valid JSON format. Extract only factual information present in the HTML.',
  messages: [
    {
      role: 'user',
      content: `Extract all article titles and publication dates from: ${htmlContent}`
    }
  ]
});
3. Handling Pagination and Multiple Pages
When scraping multiple pages, the documentation suggests maintaining conversation context for consistency:
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

pages = ["<html>...</html>", "<html>...</html>"]  # HTML for each page, fetched elsewhere

# Maintain conversation history
conversation = []

for page_num, html_content in enumerate(pages, 1):
    conversation.append({
        "role": "user",
        "content": f"Extract products from page {page_num}: {html_content}"
    })

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=conversation
    )

    # Add assistant response to conversation
    conversation.append({
        "role": "assistant",
        "content": response.content
    })

    print(f"Page {page_num} extracted")
Integrating with Traditional Scraping Tools
The Claude API works well in combination with traditional scraping libraries. For example, you can use browser automation for navigating to different pages using Puppeteer, then pass the rendered HTML to Claude for intelligent data extraction:
import puppeteer from 'puppeteer';
import Anthropic from '@anthropic-ai/sdk';

const anthropicClient = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function scrapeWithClaude(url) {
  // Use Puppeteer to fetch rendered HTML
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const htmlContent = await page.content();
  await browser.close();

  // Use Claude to extract data
  const message = await anthropicClient.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    messages: [
      {
        role: 'user',
        content: `Extract all product information from this HTML: ${htmlContent}`
      }
    ]
  });

  return message.content;
}

scrapeWithClaude('https://example.com/products').then(console.log);
Error Handling and Retry Logic
The Claude API documentation recommends implementing exponential backoff for rate limit errors:
import time
import anthropic
from anthropic import APIError, RateLimitError

def extract_with_retry(html_content, max_retries=3):
    client = anthropic.Anthropic(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[
                    {
                        "role": "user",
                        "content": f"Extract data from: {html_content}"
                    }
                ]
            )
            return response.content
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
SDK Documentation
The official documentation provides SDKs for multiple languages:
- Python: pip install anthropic
- JavaScript/TypeScript: npm install @anthropic-ai/sdk
- Java: available via Maven/Gradle
- Go: go get github.com/anthropics/anthropic-sdk-go
Each SDK includes usage examples and type definitions that you can apply directly to web scraping workflows.
Additional Resources
Beyond the core API documentation, Anthropic provides:
- API Reference: Complete endpoint specifications and parameters
- Cookbook: Practical examples including web scraping patterns
- Rate Limit Headers: Real-time information about your usage
- Streaming Responses: For processing large extractions progressively (see the sketch below)
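As an example of the last point, the Python SDK exposes streaming through a context manager; a minimal sketch (the prompt is illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Print extracted text as it arrives instead of waiting for the full response
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Extract article titles from: <html>...</html>"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)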
Conclusion
The Claude API documentation provides a comprehensive foundation for building intelligent web scraping solutions. By combining Claude's natural language understanding with traditional scraping tools, developers can create robust, maintainable scrapers that adapt to website changes. The official documentation at docs.anthropic.com should be your primary reference, supplemented by the SDK-specific documentation for your chosen programming language.
For production web scraping, consider using Claude's structured output capabilities through tool use, implement proper error handling and rate limiting, and preprocess HTML to optimize token usage and costs. When dealing with complex single-page applications, similar to crawling SPAs using Puppeteer, combining browser automation with Claude's extraction capabilities provides the most robust solution.