What is the Best Way to Structure Prompts for Claude AI When Scraping?
Structuring effective prompts for Claude AI is crucial for successful web scraping projects. A well-crafted prompt can dramatically improve extraction accuracy, reduce token usage, and ensure consistent, structured output. This guide covers proven strategies for prompt engineering specifically designed for web scraping tasks with Claude AI.
Core Principles of Prompt Structuring for Web Scraping
When using Claude AI for web scraping, your prompt should follow a clear three-part structure:
- Task Definition: Clearly state what data you need to extract
- Context Provision: Supply the HTML or text content to analyze
- Output Format: Specify exactly how results should be structured
This separation ensures Claude understands both what to do and how to return the results, minimizing ambiguity and improving accuracy.
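A minimal sketch of this three-part structure in Python (the helper and field names here are illustrative, not from any particular library):

```python
def build_extraction_prompt(task, html_content, output_schema):
    """Assemble a three-part scraping prompt: task, context, output format."""
    return (
        f"Task:\n{task}\n\n"
        f"HTML Content:\n{html_content}\n\n"
        f"Output Format:\n{output_schema}\n\n"
        "Return ONLY valid JSON."
    )

prompt = build_extraction_prompt(
    task="Extract the product name and price.",
    html_content="<div class='product'><h1>Widget</h1><span>$9.99</span></div>",
    output_schema='{"name": "string", "price": number}',
)
```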
Essential Prompt Components
1. Clear Task Instructions
Begin your prompt with explicit instructions about the extraction task:
```python
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

prompt = """Extract product information from the following HTML.

Your task:
- Extract the product name, price, availability, and description
- If a field is not found, return null
- Ensure prices are extracted as numbers without currency symbols
- Extract availability as a boolean (true if in stock, false otherwise)

HTML Content:
{html_content}

Return the data as valid JSON."""
```
The key here is specificity. Rather than saying "extract product data," break down exactly which fields you need and how they should be interpreted.
2. Structured Output Schema
Define a clear schema for the output. Claude works exceptionally well with JSON schemas:
```javascript
const Anthropic = require('@anthropic-ai/sdk');

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const prompt = `Extract the following fields from the HTML and return ONLY valid JSON:

{
  "title": "string",
  "price": number,
  "currency": "string (ISO code)",
  "inStock": boolean,
  "rating": number,
  "reviewCount": number,
  "images": ["array of image URLs"],
  "specifications": {
    "key": "value pairs of product specs"
  }
}

HTML:
${htmlContent}

Return only the JSON object, no additional text.`;

const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    { role: "user", content: prompt }
  ]
});
```
This approach ensures consistent output structure across multiple scraping requests.
3. HTML Context Optimization
How you provide HTML context significantly impacts performance and cost:
Minimize HTML Size: Strip unnecessary elements before sending to Claude:
```python
from bs4 import BeautifulSoup, Comment

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style tags
    for tag in soup(['script', 'style', 'noscript', 'svg']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Keep only the attributes useful for extraction
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items()
                     if k in ['class', 'id', 'href', 'src', 'alt', 'title']}

    return str(soup)

# Use cleaned HTML in the prompt
cleaned = clean_html(raw_html)
prompt = f"""Extract product details from this HTML:

{cleaned}

Return as JSON with fields: name, price, description."""
```
Focus on Relevant Sections: When dealing with large pages, isolate the relevant section:
```python
def extract_main_content(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Try to find the main content area
    main_content = (
        soup.find('main') or
        soup.find('article') or
        soup.find(id='content') or
        soup.find(class_='product-details')
    )

    return str(main_content) if main_content else str(soup)

focused_html = extract_main_content(raw_html)
```
Advanced Prompt Patterns for Web Scraping
Pattern 1: Multi-Field Extraction with Fallbacks
For complex pages where data might appear in different formats:
prompt = """Extract product information with the following fallback rules:
PRICE EXTRACTION:
1. First, look for elements with class 'price' or 'product-price'
2. If not found, search for currency symbols ($, €, £) followed by numbers
3. If multiple prices exist, extract the largest one (usually the original price)
TITLE EXTRACTION:
1. Check <h1> tags first
2. Fall back to og:title meta tag
3. Finally, use the <title> tag content
IMAGES:
1. Look for product image galleries
2. Extract all img src attributes within product containers
3. Filter out icons and thumbnails (images smaller than 200x200px in filename)
HTML:
{html_content}
Output Format:
{{
"title": "string",
"price": {{
"current": number,
"original": number,
"discount_percentage": number
}},
"images": ["array of full-size image URLs"]
}}"""
Pattern 2: List Extraction with Pagination Context
When scraping multiple items from a listing page:
```javascript
const prompt = `Extract all product listings from this search results page.

For EACH product card, extract:
- Product name
- Price
- Rating (as a number from 0-5)
- Number of reviews
- Product URL
- Primary image URL

Return as an array of products. If no products are found, return an empty array.

Example output format:
[
  {
    "name": "Product Name",
    "price": 29.99,
    "rating": 4.5,
    "reviewCount": 128,
    "url": "/product/abc",
    "imageUrl": "https://example.com/image.jpg"
  }
]

HTML:
${listingPageHtml}

Return ONLY the JSON array.`;
```
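Listing pages often produce partially filled cards, so it pays to filter the parsed array before storing it. A minimal Python sketch, assuming the example schema above (which keys count as required is a judgment call):

```python
def validate_listings(products):
    """Keep only items carrying the fields downstream code depends on."""
    required = {"name", "price", "url"}
    return [
        item for item in products
        if isinstance(item, dict) and required.issubset(item) and item["name"]
    ]

# Items missing a price are dropped rather than stored half-empty
print(validate_listings([
    {"name": "Widget", "price": 29.99, "url": "/product/abc"},
    {"name": "Gadget", "url": "/product/def"},
]))
```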
Pattern 3: Conditional Extraction Based on Page Type
For scraping different types of pages with a single prompt:
prompt = """Analyze this HTML and determine the page type, then extract relevant data.
PAGE TYPES:
1. Product Page: Extract name, price, description, specifications
2. Category Page: Extract list of product links and names
3. Article Page: Extract title, author, date, content
4. Other: Return page_type as 'unknown'
Always include a 'page_type' field in your response.
HTML:
{html_content}
Output schema:
{{
"page_type": "product|category|article|unknown",
"data": {{
// Type-specific fields here
}}
}}"""
Token Optimization Strategies
Claude's API charges based on input and output tokens, so optimization is crucial for cost-effective scraping:
1. Use Markdown Instead of HTML
When possible, convert HTML to markdown before sending to Claude. This can reduce token count by 40-60%:
```python
from markdownify import markdownify as md

# Convert HTML to markdown
markdown_content = md(html_content)

prompt = f"""Extract product information from this content:

{markdown_content}

Return as JSON: {{"name": "", "price": 0, "description": ""}}"""
```
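To confirm the conversion is actually paying off, compare rough token estimates before and after. The four-characters-per-token figure is only a common rule of thumb for English text, not an exact tokenizer:

```python
def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

html_tokens = approx_tokens(html_content)
md_tokens = approx_tokens(markdown_content)
savings = 100 * (html_tokens - md_tokens) / max(html_tokens, 1)
print(f"HTML ~{html_tokens} tokens, markdown ~{md_tokens} tokens ({savings:.0f}% saved)")
```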
2. Implement Intelligent Chunking
For very large pages, extract data in chunks:
```python
def chunk_and_extract(html, chunk_size=4000):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all(class_='product-item')

    results = []
    chunk = []

    for item in items:
        chunk.append(str(item))
        if len(' '.join(chunk)) > chunk_size:
            # Process chunk (claude_api_call is a placeholder for your
            # own wrapper around the Messages API)
            prompt = f"Extract products from: {' '.join(chunk)}"
            response = claude_api_call(prompt)
            results.extend(response)
            chunk = []

    # Process any remaining items
    if chunk:
        prompt = f"Extract products from: {' '.join(chunk)}"
        response = claude_api_call(prompt)
        results.extend(response)

    return results
```
Handling Dynamic Content and JavaScript-Rendered Pages
When scraping pages that rely heavily on JavaScript, you'll need to obtain the fully rendered HTML first. This is where tools like Puppeteer, which can wait for AJAX requests to settle before you capture the DOM, become essential:
```javascript
const puppeteer = require('puppeteer');
const Anthropic = require('@anthropic-ai/sdk');

async function scrapeWithClaude(url) {
  // First, render the page with Puppeteer
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const html = await page.content();
  await browser.close();

  // Then, extract data with Claude
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
  });

  const message = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: `Extract product data from this HTML:

${html}

Return JSON: {"name": "", "price": 0, "features": []}`
      }
    ]
  });

  return JSON.parse(message.content[0].text);
}
```
For single-page applications, you may also want to explore techniques for crawling SPAs using Puppeteer before passing the content to Claude.
Error Handling and Validation
Always include validation instructions in your prompts:
prompt = """Extract data from the HTML below.
VALIDATION RULES:
- If a required field is missing, set it to null
- If price cannot be extracted as a valid number, return null
- If date format is ambiguous, use ISO 8601 format (YYYY-MM-DD)
- Ensure all URLs are absolute, not relative
- Validate email addresses match standard email format
If the HTML doesn't contain product information, return:
{{"error": "No product data found", "data": null}}
HTML:
{html_content}
Return valid JSON only."""
Then validate the response:
```python
import json

def safe_extract(html_content):
    # `prompt` is the validation template defined above
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt.format(html_content=html_content)}]
    )

    try:
        result = json.loads(response.content[0].text)

        if "error" in result:
            return None

        # Check presence explicitly so a legitimate price of 0 still passes
        required_fields = ["name", "price"]
        if all(result.get(field) is not None for field in required_fields):
            return result
        else:
            return None
    except json.JSONDecodeError:
        print("Invalid JSON response from Claude")
        return None
```
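Even with "return valid JSON only" instructions, a response occasionally arrives wrapped in markdown fences or prefixed with a sentence. Before discarding it on a JSONDecodeError, you can attempt a best-effort recovery (a fallback sketch, not a guarantee):

```python
import json

def extract_json(text):
    """Best-effort recovery of a JSON object embedded in model output."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost brace-delimited span
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None
```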
Best Practices Summary
- Be Specific: Clearly define what data to extract and in what format
- Provide Examples: Show Claude the expected output structure
- Minimize Input: Clean and reduce HTML before sending
- Use JSON Schema: Define exact output structure
- Handle Edge Cases: Include instructions for missing data
- Validate Output: Always parse and validate responses
- Optimize Tokens: Use markdown, chunk large pages, remove unnecessary HTML
- Test Iteratively: Refine prompts based on actual results
Complete Working Example
Here's a full implementation combining all best practices:
```python
import anthropic
import json
from markdownify import markdownify as md
from bs4 import BeautifulSoup

class ClaudeWebScraper:
    def __init__(self, api_key):
        self.client = anthropic.Anthropic(api_key=api_key)

    def clean_html(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup(['script', 'style', 'noscript']):
            tag.decompose()
        return str(soup)

    def extract_product(self, html):
        cleaned = self.clean_html(html)
        markdown = md(cleaned)

        prompt = f"""Extract product information from the following content.

Required fields:
- name: Product name (string)
- price: Current price (number, no currency symbol)
- currency: Currency code (string, e.g., USD, EUR)
- description: Product description (string)
- inStock: Availability (boolean)
- images: Array of image URLs (array of strings)

If any field cannot be found, use null.

Content:
{markdown}

Return ONLY valid JSON matching this structure:
{{
    "name": null,
    "price": null,
    "currency": null,
    "description": null,
    "inStock": null,
    "images": []
}}"""

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        try:
            return json.loads(response.content[0].text)
        except json.JSONDecodeError:
            return None

# Usage
scraper = ClaudeWebScraper(api_key="your-api-key")
result = scraper.extract_product(html_content)
print(json.dumps(result, indent=2))
```
Conclusion
Structuring prompts effectively for Claude AI in web scraping requires a balance between clarity, specificity, and token efficiency. By following these patterns—clear task definition, structured output schemas, HTML optimization, and robust error handling—you can build reliable, cost-effective scraping systems that leverage Claude's powerful language understanding capabilities.
Remember to always test your prompts with various edge cases and refine them based on real-world results. The investment in prompt engineering pays dividends in extraction accuracy and reduced API costs.