What are the token limits for Claude API in web scraping?
Understanding token limits is crucial when using the Claude API for web scraping tasks. Claude's token limits determine how much text you can process in a single API call, which directly impacts your ability to extract data from web pages efficiently.
Claude API Token Limits by Model
Different Claude models have varying token limits, also known as context windows. Here's a breakdown of the current limits:
Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)
- Context Window: 200,000 tokens
- Maximum Output: 8,192 tokens
- Best For: Complex web scraping tasks requiring deep analysis of large pages
Claude 3 Opus (claude-3-opus-20240229)
- Context Window: 200,000 tokens
- Maximum Output: 4,096 tokens
- Best For: High-accuracy extraction from extensive HTML documents
Claude 3 Sonnet (claude-3-sonnet-20240229)
- Context Window: 200,000 tokens
- Maximum Output: 4,096 tokens
- Best For: Balanced performance for medium-sized web pages
Claude 3 Haiku (claude-3-haiku-20240307)
- Context Window: 200,000 tokens
- Maximum Output: 4,096 tokens
- Best For: Fast, cost-effective scraping of simpler pages
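All four models share the same 200,000-token context window; what differs is the output cap. If you want these limits available programmatically, the sketch below encodes them as a lookup table (MODEL_LIMITS and fits_in_context are illustrative helpers, not part of the Anthropic SDK, and they assume the input plus the requested output must fit inside the context window for these models):

# Illustrative only: limits copied from the list above, not provided by the SDK
MODEL_LIMITS = {
    "claude-3-5-sonnet-20241022": {"context": 200_000, "max_output": 8_192},
    "claude-3-opus-20240229": {"context": 200_000, "max_output": 4_096},
    "claude-3-sonnet-20240229": {"context": 200_000, "max_output": 4_096},
    "claude-3-haiku-20240307": {"context": 200_000, "max_output": 4_096},
}

def fits_in_context(model, estimated_input_tokens, max_tokens):
    """Check that estimated input plus requested output stays inside the window."""
    limits = MODEL_LIMITS[model]
    if max_tokens > limits["max_output"]:
        return False  # the model cannot emit that many output tokens
    return estimated_input_tokens + max_tokens <= limits["context"]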
Understanding Tokens
A token is approximately 3-4 characters in English text. For web scraping:
- 1 token ≈ 0.75 words
- 100 tokens ≈ 75 words
- 1,000 tokens ≈ 750 words
- 10,000 tokens ≈ 7,500 words
HTML markup significantly increases token count compared to plain text, as tags, attributes, and whitespace all consume tokens.
Token Consumption in Web Scraping
When scraping with Claude, tokens are consumed by:
- System Instructions: Your prompts and instructions (typically 100-500 tokens)
- HTML Content: The web page content you're analyzing (varies widely)
- Examples: Few-shot examples you provide (if any)
- Response: Claude's extracted data output
Example Token Calculation
import requests

def estimate_tokens(text):
    """Rough estimation: 1 token ≈ 4 characters"""
    return len(text) // 4

# Fetch a web page
url = "https://example.com/products"
response = requests.get(url)
html_content = response.text

# Estimate tokens for the prompt plus the page content
prompt = "Extract all product names and prices from this HTML"
total_input_tokens = estimate_tokens(prompt + html_content)

print(f"Estimated input tokens: {total_input_tokens:,}")
print(f"Remaining capacity: {200000 - total_input_tokens:,} tokens")
Optimizing Token Usage for Web Scraping
1. HTML Preprocessing
Remove unnecessary content before sending to Claude:
import requests
import anthropic
from bs4 import BeautifulSoup, Comment

def clean_html_for_claude(html_content):
    """Remove scripts, styles, and other non-content elements"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script, style, SVG, and noscript elements
    for element in soup(["script", "style", "svg", "noscript"]):
        element.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Return the simplified HTML
    return str(soup)

# Use with Claude API
client = anthropic.Anthropic(api_key="your-api-key")

html = requests.get("https://example.com/article").text
cleaned_html = clean_html_for_claude(html)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": f"Extract the article title, author, and publication date:\n\n{cleaned_html}"
        }
    ]
)

print(message.content[0].text)
2. Chunking Large Pages
For pages exceeding token limits, split content into chunks:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeInChunks(url) {
  const client = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
  });

  // Fetch and parse HTML
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Split content into sections
  const sections = [];
  $('article section').each((i, section) => {
    sections.push($(section).html());
  });

  // Process each section sequentially
  const results = [];
  for (const section of sections) {
    const message = await client.messages.create({
      model: 'claude-3-haiku-20240307',
      max_tokens: 2048,
      messages: [{
        role: 'user',
        content: `Extract key information from this section:\n\n${section}`
      }]
    });
    results.push(message.content[0].text);
  }

  return results;
}

scrapeInChunks('https://example.com/long-article')
  .then(data => console.log(data));
3. Use Selective Extraction
Target specific elements instead of sending entire pages:
import requests
import anthropic
from bs4 import BeautifulSoup

def extract_product_sections(html):
    """Extract only product-related sections"""
    soup = BeautifulSoup(html, 'html.parser')

    # Find product containers
    products = soup.find_all('div', class_='product-card')

    # Combine into compact HTML, limited to the first 50 products
    return '\n'.join([str(p) for p in products[:50]])

client = anthropic.Anthropic(api_key="your-api-key")

html = requests.get("https://example.com/products").text
product_html = extract_product_sections(html)

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=4096,  # Claude 3 Haiku caps output at 4,096 tokens
    messages=[{
        "role": "user",
        "content": f"Extract product names, prices, and ratings as JSON:\n\n{product_html}"
    }]
)

print(response.content[0].text)
Handling Token Limit Errors
When your request exceeds the context window, the API rejects it with an error. Here's one way to handle it:
import anthropic
from anthropic import APIError

def scrape_with_fallback(html_content, prompt):
    client = anthropic.Anthropic(api_key="your-api-key")

    models = [
        "claude-3-5-sonnet-20241022",
        "claude-3-haiku-20240307"
    ]

    for model in models:
        try:
            message = client.messages.create(
                model=model,
                max_tokens=4096,
                messages=[{
                    "role": "user",
                    "content": f"{prompt}\n\n{html_content}"
                }]
            )
            return message.content[0].text
        except APIError as e:
            # Heuristic: treat token/length errors as "prompt too large"
            if "token" in str(e).lower() or "too long" in str(e).lower():
                # Halve the content and move on to the next attempt
                html_content = html_content[:len(html_content) // 2]
                print("Reducing content size and retrying...")
            else:
                raise

    return None
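A quick usage sketch (the URL is a placeholder, and clean_html_for_claude is the preprocessing helper defined earlier):

import requests

html = requests.get("https://example.com/products").text
result = scrape_with_fallback(
    clean_html_for_claude(html),
    "Extract all product names and prices as JSON"
)
print(result)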
Cost Optimization Strategies
Token usage directly impacts costs. Here are strategies to optimize:
1. Use Markdown Instead of HTML
Converting HTML to Markdown strips tags and attributes, typically reducing token count by 40-60%:
from markdownify import markdownify as md
html = "<div><h1>Product Title</h1><p>Description here</p></div>"
markdown = md(html)
# Markdown uses fewer tokens than HTML
print(f"HTML length: {len(html)}")
print(f"Markdown length: {len(markdown)}")
2. Cache Common Prompts
Define a compact system prompt once and reuse it across requests:
def create_scraper_with_cache(client):
    """Create a scraper function with a reusable system prompt"""
    system_prompt = """You are a web scraping assistant.
Extract structured data from HTML and return as JSON.
Focus on accuracy and completeness."""

    def scrape(html_content, fields):
        return client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=2048,
            system=system_prompt,
            messages=[{
                "role": "user",
                "content": f"Extract these fields: {fields}\n\nHTML:\n{html_content}"
            }]
        )

    return scrape
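Note that reusing the same system prompt string does not by itself reduce input tokens, since every request still sends it. If your account has access to Anthropic's prompt caching feature, you can mark the system prompt with cache_control so repeated requests read the cached prefix at a reduced input-token rate. A minimal sketch, assuming prompt caching is available on your plan and SDK version (the HTML below is a placeholder):

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

html_content = "<div class='product-card'>...</div>"  # placeholder HTML

system_prompt = """You are a web scraping assistant.
Extract structured data from HTML and return as JSON.
Focus on accuracy and completeness."""

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": system_prompt,
        # Mark the prefix as cacheable; very short prompts may fall below the
        # minimum cacheable length and will simply not be cached
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": f"Extract these fields: name, price\n\nHTML:\n{html_content}"
    }]
)

# When caching applies, usage reports cache creation/read token counts
print(message.usage)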
3. Batch Similar Requests
Process multiple similar pages in one request:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function batchScrapeProducts(urls) {
  const client = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
  });

  // Fetch all pages in parallel
  const pages = await Promise.all(
    urls.map(url => axios.get(url))
  );

  // Combine into one prompt, separating pages with a delimiter
  const combined = pages.map((page, i) =>
    `PAGE ${i + 1}:\n${page.data}`
  ).join('\n\n---\n\n');

  const message = await client.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `Extract product data from each page:\n\n${combined}`
    }]
  });

  return message.content[0].text;
}
Monitoring Token Usage
Track token consumption to optimize your scraping pipeline:
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{
"role": "user",
"content": "Extract data from this HTML: <html>...</html>"
}]
)
# Check token usage
usage = message.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Output tokens: {usage.output_tokens}")
print(f"Total tokens: {usage.input_tokens + usage.output_tokens}")
# Calculate cost (example rates)
input_cost = usage.input_tokens * 0.003 / 1000 # $0.003 per 1K tokens
output_cost = usage.output_tokens * 0.015 / 1000 # $0.015 per 1K tokens
print(f"Estimated cost: ${input_cost + output_cost:.6f}")
Best Practices for Token Management
- Preprocess HTML: Remove scripts, styles, and unnecessary attributes before sending to Claude
- Use Selective Selectors: Extract only relevant sections using CSS selectors or XPath
- Choose the Right Model: Use Claude Haiku for simple extractions to save tokens and costs
- Implement Chunking: Split large pages into manageable sections
- Monitor Usage: Track token consumption per request to identify optimization opportunities
- Cache Results: Store extracted data to avoid re-processing the same content
- Use Streaming: For large responses, use streaming to get partial results faster (see the sketch after this list)
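For the last point, the Python SDK supports streaming out of the box; a minimal sketch:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Stream the response so extracted data arrives as it is generated
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Extract data from this HTML: <html>...</html>"
    }]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)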
Comparing with Traditional Scraping
While Claude offers powerful AI-based extraction, traditional approaches such as handling AJAX requests with Puppeteer or targeting elements with CSS selectors can be more token-efficient for structured data. Consider using Claude when:
- Page structure varies significantly
- You need semantic understanding of content
- Traditional selectors are fragile or complex
- You're extracting data from dynamic single-page applications
Conclusion
Claude API's 200,000 token context window provides ample capacity for most web scraping tasks. By understanding token consumption, preprocessing HTML content, and implementing chunking strategies, you can efficiently extract data from even the largest web pages while managing costs effectively.
Remember that token limits affect both input (your prompts and HTML) and output (Claude's responses). Always monitor usage, optimize your preprocessing pipeline, and choose the appropriate Claude model based on your accuracy and cost requirements. When dealing with complex browser automation scenarios, consider combining Claude with Puppeteer's browser session handling to create a robust scraping solution.