How do I use Claude Sonnet for web scraping?
Claude Sonnet is Anthropic's flagship large language model that offers an exceptional balance of intelligence, speed, and cost-effectiveness for web scraping tasks. Claude 3.5 Sonnet, the latest version, excels at understanding HTML structure, extracting structured data from unstructured content, and adapting to dynamic website layouts without requiring brittle CSS selectors or XPath expressions. This makes it an ideal choice for modern web scraping workflows where websites frequently change their structure.
Understanding Claude Sonnet for Web Scraping
Claude Sonnet sits in the middle of Anthropic's model family, offering better performance than Claude Haiku (the fastest model) while being more cost-effective than Claude Opus (the most powerful model). For web scraping specifically, Claude 3.5 Sonnet provides:
- Large Context Window: 200,000 tokens, allowing processing of entire web pages
- Intelligent Data Extraction: Semantic understanding of HTML content
- Structured Output: Reliable JSON output (via careful prompting or tool use) for easy integration with data pipelines
- Vision Capabilities: Ability to analyze screenshots for visual scraping
- Fast Response Times: Often just a few seconds for extraction tasks
- Cost Efficiency: $3 per million input tokens, $15 per million output tokens
Unlike traditional web scraping that breaks when websites update their HTML structure, Claude Sonnet understands content contextually, making your scrapers more resilient to changes.
Getting Started with Claude Sonnet API
Installation and Setup
First, install the Anthropic SDK for your preferred language:
Python:
pip install anthropic
JavaScript/Node.js:
npm install @anthropic-ai/sdk
Set up your API key:
export ANTHROPIC_API_KEY='your-api-key-here'
Basic Web Scraping with Claude Sonnet
Here's a simple example of using Claude Sonnet to extract product information from an e-commerce page:
Python Example:
import anthropic
import requests
import json

# Initialize the Claude client
client = anthropic.Anthropic(api_key="your-api-key")

# Fetch the web page
url = "https://example.com/products/laptop"
response = requests.get(url)
html_content = response.text

# Use Claude Sonnet to extract structured data
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": f"""Extract product information from this HTML and return as JSON.

HTML:
{html_content}

Extract the following fields:
- product_name (string)
- price (number)
- currency (string)
- in_stock (boolean)
- rating (number, 0-5)
- review_count (number)
- description (string)
- specifications (object)

Return ONLY valid JSON, no additional text."""
        }
    ]
)

# Parse the extracted data
product_data = json.loads(message.content[0].text)
print(json.dumps(product_data, indent=2))
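In practice, Claude occasionally wraps its answer in markdown code fences even when told to return raw JSON. A small defensive parser keeps the pipeline from failing on those responses (a sketch; parse_json_response is a hypothetical helper, not part of the SDK):

import json
import re

def parse_json_response(text):
    """Strip optional ```json ... ``` fences before parsing (hypothetical helper)."""
    cleaned = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip())
    return json.loads(cleaned)

# Usage with the message returned above:
# product_data = parse_json_response(message.content[0].text)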
JavaScript Example:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function scrapeProductData(url) {
  // Fetch the HTML content
  const response = await axios.get(url);
  const html = response.data;

  // Extract data using Claude Sonnet
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [
      {
        role: 'user',
        content: `Extract product information from this HTML:

${html}

Return JSON with: name, price, availability, rating, features (array), and images (array of URLs).
Return ONLY valid JSON.`
      }
    ]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
scrapeProductData('https://example.com/product/123')
  .then(data => console.log(data))
  .catch(error => console.error('Scraping error:', error));
Advanced Claude Sonnet Web Scraping Techniques
1. Multi-Page Scraping with Intelligent Navigation
Claude Sonnet can analyze page structure to identify pagination links and navigation elements, making it easier to navigate to different pages programmatically:
Python Example with Pagination:
import anthropic
import requests
import json

def scrape_all_pages(start_url):
    client = anthropic.Anthropic(api_key="your-api-key")
    all_products = []
    current_url = start_url

    while current_url:
        # Fetch page
        response = requests.get(current_url)
        html = response.text

        # Extract products from current page
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            messages=[
                {
                    "role": "user",
                    "content": f"""Analyze this e-commerce listing page:

{html}

1. Extract all products as a JSON array with: name, price, url
2. Find the "Next Page" URL if it exists

Return JSON: {{"products": [...], "next_page_url": "url or null"}}"""
                }
            ]
        )

        result = json.loads(message.content[0].text)
        all_products.extend(result['products'])

        # Move to next page
        current_url = result.get('next_page_url')
        if current_url:
            print(f"Moving to next page: {current_url}")

    return all_products
# Scrape all pages
products = scrape_all_pages('https://example.com/products?page=1')
print(f"Total products scraped: {len(products)}")
2. Combining Claude Sonnet with Browser Automation
For JavaScript-heavy websites, combine Claude Sonnet with a headless browser such as Puppeteer (or Pyppeteer in Python) so the page is fully rendered before extraction. This is especially useful for content loaded via AJAX:
Python Example with Pyppeteer:
import asyncio
from pyppeteer import launch
import anthropic
import json

async def scrape_dynamic_content(url):
    # Launch headless browser
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, {'waitUntil': 'networkidle0'})

    # Wait for dynamic content to load
    await page.waitForSelector('.product-list')

    # Get rendered HTML
    html = await page.content()
    await browser.close()

    # Use Claude Sonnet to extract data
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": f"""Extract all product listings from this rendered HTML:

{html}

Return as JSON array with: title, price, image_url, product_url, discount_percentage"""
            }
        ]
    )

    return json.loads(message.content[0].text)

# Run the scraper
products = asyncio.get_event_loop().run_until_complete(
    scrape_dynamic_content('https://example.com/sale')
)
print(json.dumps(products, indent=2))
JavaScript Example with Puppeteer:
const puppeteer = require('puppeteer');
const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function scrapeDynamicPage(url) {
  // Launch browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Get rendered HTML after JavaScript execution
  const html = await page.content();
  await browser.close();

  // Use Claude to extract structured data
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 8192,
    messages: [
      {
        role: 'user',
        content: `Extract job listings from this HTML:

${html}

Return JSON array with: job_title, company, location, salary_range, posted_date, job_type`
      }
    ]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
scrapeDynamicPage('https://example.com/jobs')
  .then(jobs => console.log(jobs));
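Headless browsers also make it easy to capture screenshots, which pairs well with the vision capability listed at the top of this article: instead of sending rendered HTML, you can send an image of the page and let Claude read it visually. A minimal Python sketch, assuming a screenshot has already been saved (screenshot.png is a placeholder path):

import base64
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Assumes a screenshot was already captured (e.g. with Puppeteer's page.screenshot);
# "screenshot.png" is a placeholder path.
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Extract the product name, price, and availability visible in this screenshot. Return ONLY valid JSON.",
            },
        ],
    }],
)

print(message.content[0].text)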
3. Table and List Extraction
Claude Sonnet excels at parsing complex tables and nested lists that would require extensive XPath or CSS selector logic:
Python Example - Complex Table Extraction:
import anthropic
import requests
import json

def extract_comparison_table(url):
    html = requests.get(url).text
    client = anthropic.Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": f"""Extract the pricing comparison table from this HTML:

{html}

Convert to JSON with this structure:
{{
  "plans": [
    {{
      "name": "plan name",
      "price": {{
        "monthly": number,
        "annual": number,
        "currency": "USD"
      }},
      "features": [
        {{
          "name": "feature name",
          "included": boolean,
          "limit": "string or null"
        }}
      ],
      "highlighted": boolean
    }}
  ]
}}

Return ONLY valid JSON."""
            }
        ]
    )

    return json.loads(message.content[0].text)
# Extract pricing data
pricing = extract_comparison_table('https://example.com/pricing')
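The nested structure Claude returns is convenient for programs but less so for spreadsheets. A short sketch that flattens the plans into one row per plan/feature pair, assuming pandas is installed (pricing_to_rows is a hypothetical helper):

import pandas as pd

def pricing_to_rows(pricing):
    """Flatten the nested plan/feature structure into one row per (plan, feature)."""
    rows = []
    for plan in pricing.get("plans", []):
        for feature in plan.get("features", []):
            rows.append({
                "plan": plan.get("name"),
                "monthly_price": plan.get("price", {}).get("monthly"),
                "feature": feature.get("name"),
                "included": feature.get("included"),
                "limit": feature.get("limit"),
            })
    return pd.DataFrame(rows)

# df = pricing_to_rows(pricing)
# df.to_csv("pricing_comparison.csv", index=False)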
4. Handling Authentication and Protected Content
When scraping content behind a login, combine session management with Claude Sonnet for intelligent extraction:
Python Example with Session Authentication:
import anthropic
import requests
import json

def scrape_authenticated_content(login_url, target_url, credentials):
    # Create session and login
    session = requests.Session()
    session.post(login_url, data={
        'username': credentials['username'],
        'password': credentials['password']
    })

    # Fetch protected content
    response = session.get(target_url)
    html = response.text

    # Extract data with Claude
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract user account information from this dashboard HTML:

{html}

Return JSON with: account_balance, recent_transactions (array), account_status, subscription_tier"""
            }
        ]
    )

    return json.loads(message.content[0].text)
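Logging in on every run is wasteful and can trip rate limits or bot detection. One mitigation is to persist the session's cookies between runs; a minimal sketch, assuming a simple cookie-based login (session_cookies.pkl is a placeholder path):

import os
import pickle
import requests

COOKIE_FILE = "session_cookies.pkl"  # placeholder path

def load_session():
    """Reuse saved cookies if present, so repeated runs can skip the login step."""
    session = requests.Session()
    if os.path.exists(COOKIE_FILE):
        with open(COOKIE_FILE, "rb") as f:
            session.cookies.update(pickle.load(f))
    return session

def save_session(session):
    """Persist the session's cookies after a successful login."""
    with open(COOKIE_FILE, "wb") as f:
        pickle.dump(session.cookies, f)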
Optimizing Claude Sonnet for Web Scraping
1. Reduce Token Usage and Costs
Claude Sonnet pricing is based on tokens, so optimizing HTML input is crucial:
HTML Optimization Techniques:
from bs4 import BeautifulSoup, Comment
import re

def optimize_html_for_claude(html):
    """
    Remove unnecessary elements to reduce token count
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and other non-content elements
    for tag in soup(['script', 'style', 'svg', 'noscript', 'iframe']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove unnecessary attributes
    for tag in soup.find_all(True):
        # Keep only class and id attributes
        attrs_to_keep = ['class', 'id']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}

    # Get cleaned HTML
    cleaned_html = str(soup)

    # Remove excessive whitespace
    cleaned_html = re.sub(r'\s+', ' ', cleaned_html)

    return cleaned_html
# Usage
optimized_html = optimize_html_for_claude(raw_html)
# This can reduce token usage by 50-70%
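To measure what the cleanup actually saves, recent versions of the Anthropic Python SDK expose a token-counting endpoint; a sketch, assuming client.messages.count_tokens is available in your SDK version:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def count_input_tokens(text):
    # Token Counting API; available in recent SDK versions (treat as an assumption)
    result = client.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

# print(count_input_tokens(raw_html), "->", count_input_tokens(optimized_html))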
2. Smart Caching Strategy
Implement caching to avoid re-processing identical pages:
import hashlib
import json
import os
from datetime import datetime, timedelta

import anthropic
import requests

class ClaudeScrapeCache:
    def __init__(self, cache_dir='./scrape_cache', ttl_hours=24):
        self.cache_dir = cache_dir
        self.ttl = timedelta(hours=ttl_hours)
        os.makedirs(cache_dir, exist_ok=True)

    def _get_cache_key(self, html, prompt):
        content = f"{html}{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, html, prompt):
        cache_key = self._get_cache_key(html, prompt)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")

        if os.path.exists(cache_file):
            # Check if cache is still valid
            file_time = datetime.fromtimestamp(os.path.getmtime(cache_file))
            if datetime.now() - file_time < self.ttl:
                with open(cache_file, 'r') as f:
                    return json.load(f)
        return None

    def set(self, html, prompt, data):
        cache_key = self._get_cache_key(html, prompt)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")
        with open(cache_file, 'w') as f:
            json.dump(data, f)

# Usage
cache = ClaudeScrapeCache(ttl_hours=12)

def scrape_with_cache(url, prompt):
    html = requests.get(url).text

    # Check cache first
    cached_result = cache.get(html, prompt)
    if cached_result:
        print("Using cached result")
        return cached_result

    # Call Claude if not cached
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt}\n\n{html}"}]
    )

    result = json.loads(message.content[0].text)
    cache.set(html, prompt, result)
    return result
3. Batch Processing for Efficiency
Process multiple pages efficiently using concurrent requests:
import asyncio
import json

import aiohttp
import anthropic

async def fetch_html(session, url):
    async with session.get(url) as response:
        return await response.text()

async def extract_with_claude(client, html, prompt):
    message = await client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt}\n\n{html}"}]
    )
    return json.loads(message.content[0].text)

async def scrape_multiple_urls(urls, extraction_prompt):
    client = anthropic.AsyncAnthropic(api_key="your-api-key")

    async with aiohttp.ClientSession() as session:
        # Fetch all HTML concurrently
        html_tasks = [fetch_html(session, url) for url in urls]
        html_contents = await asyncio.gather(*html_tasks)

        # Extract data concurrently
        extract_tasks = [
            extract_with_claude(client, html, extraction_prompt)
            for html in html_contents
        ]
        results = await asyncio.gather(*extract_tasks)

    return results
# Scrape 10 pages concurrently
urls = [f"https://example.com/products?page={i}" for i in range(1, 11)]
results = asyncio.run(scrape_multiple_urls(urls, "Extract all products as JSON array"))
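Gathering one Claude call per page all at once can run into API rate limits as the URL list grows. A common guard is to bound concurrency with an asyncio.Semaphore; a sketch of a bounded wrapper around extract_with_claude above (the limit of 5 is an arbitrary assumption):

import asyncio

semaphore = asyncio.Semaphore(5)  # at most 5 Claude requests in flight

async def extract_with_limit(client, html, prompt):
    async with semaphore:
        return await extract_with_claude(client, html, prompt)

# In scrape_multiple_urls, build extract_tasks with extract_with_limit
# instead of extract_with_claude to keep concurrency bounded.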
Best Practices for Claude Sonnet Web Scraping
1. Crafting Effective Prompts
The quality of your extraction depends heavily on prompt engineering:
# ❌ Poor prompt
prompt = "Get the data from this page"
# ✅ Excellent prompt
prompt = """Extract product information from this e-commerce page.
Required fields:
- product_name: The main product title (string)
- price: Current price in cents (number)
- original_price: Original price if on sale, null otherwise (number or null)
- availability: "in_stock", "out_of_stock", or "preorder" (string)
- images: Array of product image URLs (array of strings)
- specifications: Object with technical specs (object)
Rules:
- Convert all prices to cents (multiply by 100)
- Extract only high-resolution image URLs
- If a field is not found, use null
- Return ONLY valid JSON, no additional text
Example output:
{
"product_name": "Example Product",
"price": 2999,
"original_price": 3999,
"availability": "in_stock",
"images": ["https://example.com/img1.jpg"],
"specifications": {"color": "blue", "size": "large"}
}"""
2. Error Handling and Validation
Always implement robust error handling around API calls and response parsing:
import json
import time

import anthropic

def safe_claude_extraction(html, prompt, retries=3):
    client = anthropic.Anthropic(api_key="your-api-key")

    for attempt in range(retries):
        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=4096,
                messages=[{"role": "user", "content": f"{prompt}\n\n{html}"}]
            )

            # Attempt to parse JSON
            result = json.loads(message.content[0].text)

            # Validate required fields
            required_fields = ['name', 'price']
            missing = [field for field in required_fields if field not in result]
            if not missing:
                return result
            else:
                raise ValueError(f"Missing required fields: {missing}")

        except json.JSONDecodeError as e:
            print(f"Attempt {attempt + 1}: Invalid JSON - {e}")
            if attempt == retries - 1:
                raise
        except anthropic.APIError as e:
            print(f"Attempt {attempt + 1}: API Error - {e}")
            if attempt < retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

    return None
3. Rate Limiting and Respectful Scraping
Implement rate limiting to avoid overloading servers:
import time
from functools import wraps

def rate_limit(calls_per_minute=20):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            left_to_wait = min_interval - elapsed
            if left_to_wait > 0:
                time.sleep(left_to_wait)
            ret = func(*args, **kwargs)
            last_called[0] = time.time()
            return ret
        return wrapper
    return decorator

@rate_limit(calls_per_minute=10)
def scrape_with_rate_limit(url):
    # Your scraping logic here
    pass
Comparing Claude Sonnet to Other Models
When to Use Claude Sonnet vs. Haiku vs. Opus
- Claude 3.5 Sonnet: Best for most web scraping tasks. Balanced intelligence and cost.
- Claude 3 Haiku: Use for simple, high-volume extraction where speed and cost are priorities.
- Claude 3 Opus: Reserve for complex, multi-step reasoning or when maximum accuracy is critical.
Cost Comparison Example:
# Approximate costs for processing a 50KB HTML page (≈12,500 tokens)
# and generating 1KB structured output (≈250 tokens)
models_cost = {
    "claude-3-haiku-20240307": {
        "input": 12500 * 0.25 / 1_000_000,   # $0.003125
        "output": 250 * 1.25 / 1_000_000,    # $0.0003125
        "total": "$0.0034"
    },
    "claude-3-5-sonnet-20241022": {
        "input": 12500 * 3 / 1_000_000,      # $0.0375
        "output": 250 * 15 / 1_000_000,      # $0.00375
        "total": "$0.041"
    },
    "claude-3-opus-20240229": {
        "input": 12500 * 15 / 1_000_000,     # $0.1875
        "output": 250 * 75 / 1_000_000,      # $0.01875
        "total": "$0.206"
    }
}
# For 1,000 pages:
# Haiku: ~$3.40
# Sonnet: ~$41
# Opus: ~$206
Real-World Use Cases
E-commerce Price Monitoring
import anthropic
import requests
import json
from datetime import datetime

def monitor_competitor_prices(product_urls):
    client = anthropic.Anthropic(api_key="your-api-key")
    price_data = []

    for url in product_urls:
        html = requests.get(url).text

        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Extract pricing info from this product page:

{html}

Return JSON: {{"product_name": "...", "current_price": number, "currency": "...", "in_stock": boolean}}"""
            }]
        )

        data = json.loads(message.content[0].text)
        data['url'] = url
        data['timestamp'] = datetime.now().isoformat()
        price_data.append(data)

    return price_data
News Article Extraction
import anthropic
import requests
import json

def extract_article_content(article_url):
    html = requests.get(article_url).text
    client = anthropic.Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        messages=[{
            "role": "user",
            "content": f"""Extract article content from this news page:

{html}

Return JSON with:
- headline (string)
- author (string or null)
- published_date (ISO 8601 string)
- article_text (string, full article body)
- tags (array of strings)
- image_url (string or null)"""
        }]
    )

    return json.loads(message.content[0].text)
Conclusion
Claude 3.5 Sonnet provides a powerful, intelligent approach to web scraping that significantly reduces maintenance overhead compared to traditional selector-based methods. Its ability to understand context, adapt to layout changes, and extract structured data from complex HTML makes it ideal for modern web scraping workflows.
While Claude Sonnet adds API costs to your scraping operations, the benefits often outweigh the expenses:
- Reduced Development Time: Write extraction logic in minutes, not hours
- Lower Maintenance: Far less likely to break when websites update their HTML
- Better Accuracy: Semantic understanding reduces extraction errors
- Flexibility: Handles edge cases and variations automatically
For the best results, combine Claude Sonnet with traditional scraping tools: use browser automation for JavaScript-heavy sites, implement caching to control costs, and optimize your HTML input to reduce token usage. This hybrid approach gives you the reliability of conventional scraping with the intelligence and adaptability of AI-powered extraction.
Whether you're building a price monitoring system, aggregating news content, or extracting product data at scale, Claude Sonnet offers a modern solution that adapts to the ever-changing landscape of web scraping.