What is the Claude API documentation for web scraping?
The Claude API documentation for web scraping is Anthropic's official developer resource that provides comprehensive guidance on using Claude's AI capabilities for extracting, parsing, and structuring data from web content. The documentation is available at https://docs.anthropic.com/ and includes detailed information about API endpoints, authentication, request/response formats, and best practices for implementing AI-powered web scraping workflows.
Unlike traditional web scraping tools that rely on rigid XPath or CSS selectors, the Claude API enables intelligent data extraction using natural language prompts. This approach lets developers describe what data they need rather than how to extract it, making scrapers more resilient to website changes.
Understanding the Claude API Structure
The Claude API documentation is organized into several key sections that are particularly relevant for web scraping:
Messages API
The core endpoint for web scraping is the Messages API (/v1/messages), which accepts text input (including HTML content) and returns structured responses based on your instructions.
Basic Request Structure:
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Extract product name, price, and rating from this HTML: <html>...</html>"
        }
    ]
)

print(message.content)
JavaScript/Node.js Example:
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: 'Extract product details from this HTML: <html>...</html>'
    }
  ]
});

console.log(message.content);
Available Models for Web Scraping
According to the documentation, Claude offers several models optimized for different use cases (a small selection helper is sketched after the list):
- Claude 3.5 Sonnet: Best balance of intelligence and speed for most scraping tasks
- Claude 3 Opus: Highest accuracy for complex data extraction
- Claude 3 Haiku: Fastest and most cost-effective for simple extraction tasks
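Model IDs change as new versions ship, so it can help to centralize the choice in one place. Below is a hypothetical helper, not part of the documentation; the IDs were current when this was written, so check the models page for the latest:

# Hypothetical helper: pick a model ID by task complexity.
# IDs current at the time of writing; check the docs for newer versions.
MODELS = {
    "simple": "claude-3-haiku-20240307",       # fast and cheap: clean HTML, few fields
    "standard": "claude-3-5-sonnet-20241022",  # sensible default for most scraping
    "complex": "claude-3-opus-20240229",       # messy markup, nuanced extraction
}

def pick_model(task_complexity: str) -> str:
    """Return a model ID for the given complexity tier."""
    return MODELS.get(task_complexity, MODELS["standard"])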
Key Features for Web Scraping
1. Structured Output with Tool Use
The Claude API documentation describes "tool use" (function calling) as a powerful feature for getting structured JSON output from unstructured web content. This is essential for web scraping workflows.
import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

html_content = "<html>...</html>"  # page HTML fetched elsewhere

# Define extraction schema
tools = [
    {
        "name": "extract_product_data",
        "description": "Extract structured product information from HTML",
        "input_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "number"},
                            "rating": {"type": "number"},
                            "availability": {"type": "string"}
                        },
                        "required": ["name", "price"]
                    }
                }
            },
            "required": ["products"]
        }
    }
]

# Send HTML content with tool definition
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    tools=tools,
    messages=[
        {
            "role": "user",
            "content": f"Extract all products from this e-commerce page: {html_content}"
        }
    ]
)

# Parse structured output
for content in response.content:
    if content.type == "tool_use":
        extracted_data = content.input
        print(json.dumps(extracted_data, indent=2))
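The tool use documentation also covers the tool_choice parameter, which forces Claude to respond through a named tool instead of free-form text, useful when a scraping pipeline must always receive structured data. Building on the request above:

# Force the response to come through the extraction tool
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_product_data"},
    messages=[
        {"role": "user", "content": f"Extract all products from this page: {html_content}"}
    ]
)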
2. Vision Capabilities for Screenshot Analysis
The documentation highlights Claude's vision capabilities, which allow processing screenshots alongside HTML. This is particularly useful when dealing with dynamically rendered content, similar to handling AJAX requests using Puppeteer.
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Read screenshot as base64
const imageData = fs.readFileSync('screenshot.png').toString('base64');

const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 2048,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'base64',
            media_type: 'image/png',
            data: imageData,
          },
        },
        {
          type: 'text',
          text: 'Extract all visible product information from this e-commerce page screenshot.'
        }
      ],
    },
  ],
});

console.log(message.content);
3. Context Windows and Token Limits
The documentation specifies context windows for each model, which is crucial for processing large HTML documents:
- Claude 3.5 Sonnet: 200,000 tokens (~150,000 words)
- Claude 3 Opus: 200,000 tokens
- Claude 3 Haiku: 200,000 tokens
For very large web pages, you may need to chunk the HTML content or extract only the relevant sections before sending to the API.
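The documentation does not prescribe a particular chunking strategy. A minimal sketch, assuming the rough heuristic of about four characters per token (use the API's token counting endpoint when you need exact figures):

def chunk_html(html: str, max_tokens: int = 150_000, chars_per_token: int = 4) -> list[str]:
    """Split HTML into pieces that fit a rough token budget.

    Relies on the ~4 characters-per-token heuristic, which is an
    approximation; real token counts depend on the tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    return [html[i:i + max_chars] for i in range(0, len(html), max_chars)]

# Process each chunk independently and merge the extracted records afterwards
raw_html = "<html>...</html>"  # large page HTML fetched elsewhere
for chunk in chunk_html(raw_html):
    pass  # send chunk to the Messages API as in the examples above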
Authentication and API Keys
According to the official documentation, authentication requires an API key obtained from the Anthropic Console:
# Set environment variable
export ANTHROPIC_API_KEY='your-api-key-here'

# Python: Using environment variable
import os
from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY")
)

// JavaScript: Using environment variable
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});
Rate Limits and Pricing
The Claude API documentation outlines tier-based rate limits that grow with your usage history. Exact limits vary by model and change over time, so consult the documentation for current values; representative requests-per-minute figures are:
- Tier 1: 5 requests per minute
- Tier 2: 50 requests per minute
- Tier 3: 1,000 requests per minute
- Tier 4: 4,000 requests per minute
For web scraping at scale, you'll need to implement rate limiting (see the sketch after the price list) and potentially upgrade your tier. Pricing is based on tokens processed:
- Input tokens: $3 per million tokens (Sonnet)
- Output tokens: $15 per million tokens (Sonnet)
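Client-side throttling is left to you. A minimal sketch of a requests-per-minute limiter (the rpm value here is an assumption; set it from your tier's documented limit):

import time

class RateLimiter:
    """Naive client-side throttle: at most `rpm` requests per minute."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough to keep the average rate under the cap
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter(rpm=50)  # assumed value; use your tier's actual limit
limiter.wait()                 # call before each client.messages.create(...)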
Best Practices from the Documentation
1. Preprocessing HTML
The documentation recommends cleaning HTML before sending it to the API to reduce token usage:
from bs4 import BeautifulSoup, Comment
import anthropic

def clean_html(html_content):
    """Remove scripts, styles, and comments that waste tokens"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script and style tags
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()
    # Remove HTML comments (BeautifulSoup represents them as Comment nodes)
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)

# Use cleaned HTML
raw_html = "<html>...</html>"  # page HTML fetched elsewhere
cleaned_html = clean_html(raw_html)

client = anthropic.Anthropic(api_key="your-api-key")
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": f"Extract product data from: {cleaned_html}"
        }
    ]
)
2. Using System Prompts
The documentation describes system prompts as a way to set consistent behavior across multiple scraping requests:
const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 2048,
  system: 'You are a web scraping assistant. Always return data in valid JSON format. Extract only factual information present in the HTML.',
  messages: [
    {
      role: 'user',
      content: `Extract all article titles and publication dates from: ${htmlContent}`
    }
  ]
});
3. Handling Pagination and Multiple Pages
When scraping multiple pages, the documentation suggests maintaining conversation context for consistency:
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

pages = ["<html>...</html>", "<html>...</html>"]  # HTML for each page, fetched elsewhere

# Maintain conversation history
conversation = []

for page_num, html_content in enumerate(pages, 1):
    conversation.append({
        "role": "user",
        "content": f"Extract products from page {page_num}: {html_content}"
    })

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=conversation
    )

    # Add assistant response to conversation
    conversation.append({
        "role": "assistant",
        "content": response.content
    })

    print(f"Page {page_num} extracted")
Integrating with Traditional Scraping Tools
The Claude API works well in combination with traditional scraping libraries. For example, you can use browser automation for navigating to different pages using Puppeteer, then pass the rendered HTML to Claude for intelligent data extraction:
import puppeteer from 'puppeteer';
import Anthropic from '@anthropic-ai/sdk';

const anthropicClient = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function scrapeWithClaude(url) {
  // Use Puppeteer to fetch rendered HTML
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const htmlContent = await page.content();
  await browser.close();

  // Use Claude to extract data
  const message = await anthropicClient.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    messages: [
      {
        role: 'user',
        content: `Extract all product information from this HTML: ${htmlContent}`
      }
    ]
  });

  return message.content;
}

scrapeWithClaude('https://example.com/products').then(console.log);
Error Handling and Retry Logic
The Claude API documentation recommends implementing exponential backoff for rate limit errors:
import time
import anthropic
from anthropic import APIError, RateLimitError

def extract_with_retry(html_content, max_retries=3):
    client = anthropic.Anthropic(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[
                    {
                        "role": "user",
                        "content": f"Extract data from: {html_content}"
                    }
                ]
            )
            return response.content
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
SDK Documentation
The official documentation provides SDKs for multiple languages:
- Python: pip install anthropic
- JavaScript/TypeScript: npm install @anthropic-ai/sdk
- Java: available via Maven/Gradle
- Go: go get github.com/anthropics/anthropic-sdk-go
Each SDK includes usage examples and type definitions that you can apply directly to web scraping workflows.
Additional Resources
Beyond the core API documentation, Anthropic provides:
- API Reference: Complete endpoint specifications and parameters
- Cookbook: Practical examples including web scraping patterns
- Rate Limit Headers: Real-time information about your usage
- Streaming Responses: For processing large extractions progressively (see the sketch below)
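As an example of the last point, the Python SDK exposes streaming through a context manager; a minimal sketch (the prompt is illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Print extracted text as it arrives instead of waiting for the full response
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Extract article titles from: <html>...</html>"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)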
Conclusion
The Claude API documentation provides a comprehensive foundation for building intelligent web scraping solutions. By combining Claude's natural language understanding with traditional scraping tools, developers can create robust, maintainable scrapers that adapt to website changes. The official documentation at docs.anthropic.com should be your primary reference, supplemented by the SDK-specific documentation for your chosen programming language.
For production web scraping, consider using Claude's structured output capabilities through tool use, implement proper error handling and rate limiting, and preprocess HTML to optimize token usage and costs. When dealing with complex single-page applications, similar to crawling SPAs using Puppeteer, combining browser automation with Claude's extraction capabilities provides the most robust solution.