How do I provide examples to an LLM for better web scraping results?
Providing examples to Large Language Models (LLMs) is one of the most effective techniques for improving web scraping accuracy and consistency. This approach, known as few-shot learning or in-context learning, helps LLMs understand the exact structure, format, and type of data you want to extract from web pages. By showing the model concrete examples of input-output pairs, you can dramatically reduce hallucinations and improve extraction quality.
Why Examples Matter in LLM-Based Web Scraping
LLMs are powerful pattern recognition systems. When you provide examples, you're essentially teaching the model what patterns to look for and how to format the output. This is particularly valuable for web scraping because:
- Reduces ambiguity: Examples clarify exactly what data you want
- Improves consistency: The model follows the demonstrated format
- Handles edge cases: You can show how to handle missing or unusual data
- Minimizes hallucinations: Clear examples reduce made-up data
- Speeds up development: Less trial-and-error with prompt engineering
Few-Shot Prompting for Web Scraping
Few-shot prompting involves providing 2-5 examples of the task you want the LLM to perform. Here's a practical example for extracting product information:
Python Example with OpenAI API
```python
from openai import OpenAI
import json

client = OpenAI(api_key="your-api-key")  # or set the OPENAI_API_KEY env var

# HTML content from a product page
html_content = """
<div class="product">
  <h1>Wireless Headphones Pro</h1>
  <span class="price">$199.99</span>
  <div class="rating">4.5 stars</div>
  <p class="description">Premium noise-cancelling headphones</p>
</div>
"""

# Create a prompt with examples
prompt = f"""Extract product information from HTML and return as JSON.

Example 1:
Input: <div class="product"><h1>Smart Watch</h1><span class="price">$299</span><div class="rating">4.2 stars</div></div>
Output: {{"name": "Smart Watch", "price": 299, "rating": 4.2}}

Example 2:
Input: <div class="product"><h1>Laptop Stand</h1><span class="price">$49.99</span><div class="rating">4.8 stars</div></div>
Output: {{"name": "Laptop Stand", "price": 49.99, "rating": 4.8}}

Example 3:
Input: <div class="product"><h1>USB Cable</h1><span class="price">$12</span></div>
Output: {{"name": "USB Cable", "price": 12, "rating": null}}

Now extract from this HTML:
{html_content}
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt},
    ],
    temperature=0,  # lower temperature for consistent output
)

result = json.loads(response.choices[0].message.content)
print(result)
```
JavaScript Example with OpenAI API
```javascript
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function extractWithExamples(html) {
  const prompt = `Extract product information from HTML and return as JSON.

Example 1:
Input: <div class="item"><h2>Coffee Maker</h2><span class="cost">$89.99</span></div>
Output: {"product": "Coffee Maker", "price": 89.99}

Example 2:
Input: <div class="item"><h2>Blender</h2><span class="cost">$59.50</span></div>
Output: {"product": "Blender", "price": 59.50}

Now extract from:
${html}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'system',
        content: 'You extract structured data from HTML. Always return valid JSON.',
      },
      {
        role: 'user',
        content: prompt,
      },
    ],
    temperature: 0,
  });

  return JSON.parse(response.choices[0].message.content);
}

// Usage
const html = '<div class="item"><h2>Toaster</h2><span class="cost">$34.99</span></div>';
extractWithExamples(html).then((result) => console.log(result));
```
Best Practices for Example Selection
1. Use Representative Examples
Choose examples that reflect the variety of data you'll encounter:
prompt = """Extract article metadata from HTML.
Example 1 - Standard article:
Input: <article><h1>AI Revolution</h1><span class="author">Jane Smith</span><time>2024-01-15</time></article>
Output: {"title": "AI Revolution", "author": "Jane Smith", "date": "2024-01-15"}
Example 2 - Missing author:
Input: <article><h1>Breaking News</h1><time>2024-01-16</time></article>
Output: {"title": "Breaking News", "author": null, "date": "2024-01-16"}
Example 3 - Multiple authors:
Input: <article><h1>Research Paper</h1><span class="author">John Doe, Mary Johnson</span><time>2024-01-14</time></article>
Output: {"title": "Research Paper", "author": "John Doe, Mary Johnson", "date": "2024-01-14"}
"""
2. Show Edge Cases
Include examples of missing data, null values, and unusual formats:
few_shot_examples = """
Example with all fields present:
Input: <div class="listing"><h3>3BR Apartment</h3><p>$2,500/month</p><span>Available Now</span></div>
Output: {"title": "3BR Apartment", "price": 2500, "currency": "USD", "period": "month", "status": "available"}
Example with missing price:
Input: <div class="listing"><h3>Studio</h3><span>Contact for price</span></div>
Output: {"title": "Studio", "price": null, "currency": null, "period": null, "status": "contact"}
Example with different currency:
Input: <div class="listing"><h3>2BR House</h3><p>€1,800/month</p></div>
Output: {"title": "2BR House", "price": 1800, "currency": "EUR", "period": "month", "status": "available"}
"""
3. Demonstrate Output Format Consistency
Maintain consistent JSON structure across all examples:
```javascript
const systemPrompt = `You extract job listings from HTML. Always use this exact JSON structure:
{
  "title": string,
  "company": string,
  "location": string or null,
  "salary": {"min": number, "max": number, "currency": string} or null,
  "remote": boolean
}`;
```
const examples = `
Example 1:
<div class="job"><h2>Senior Developer</h2><span>TechCorp</span><p>New York</p><p>$120k-150k</p></div>
→ {"title": "Senior Developer", "company": "TechCorp", "location": "New York", "salary": {"min": 120000, "max": 150000, "currency": "USD"}, "remote": false}
Example 2:
<div class="job"><h2>Remote Designer</h2><span>DesignCo</span><p>Remote</p></div>
→ {"title": "Remote Designer", "company": "DesignCo", "location": null, "salary": null, "remote": true}
`;
Using Function Calling with Examples
Modern LLM APIs support function calling (exposed as "tools" in OpenAI's current API), which lets you combine few-shot examples with a structured output schema:
```python
from openai import OpenAI
import json

client = OpenAI(api_key="your-api-key")

# Define the extraction schema as a tool (the current form of function calling)
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_product",
            "description": "Extract product information from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Product name"},
                    "price": {"type": "number", "description": "Price in USD"},
                    "rating": {"type": "number", "description": "Rating out of 5"},
                    "in_stock": {"type": "boolean", "description": "Availability status"},
                },
                "required": ["name", "price"],
            },
        },
    }
]

# Provide examples in the system message
system_message = """You extract product data from HTML. Examples:

Input: <div><h1>Keyboard</h1><span>$79.99</span><div>4.3★</div><p>In Stock</p></div>
Output: {"name": "Keyboard", "price": 79.99, "rating": 4.3, "in_stock": true}

Input: <div><h1>Mouse Pad</h1><span>$15</span><div>Out of Stock</div></div>
Output: {"name": "Mouse Pad", "price": 15, "rating": null, "in_stock": false}"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": html_to_scrape},  # the HTML you want to extract from
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_product"}},
)

# Extract the structured data from the tool call
result = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```
Chain-of-Thought Prompting with Examples
For complex scraping tasks, show the LLM how to reason through the extraction:
prompt = """Extract event information from HTML with step-by-step reasoning.
Example 1:
HTML: <div class="event"><h2>Tech Conference 2024</h2><p>March 15-17, 2024</p><p>San Francisco, CA</p><span>$299</span></div>
Reasoning:
1. Event name is in the <h2> tag: "Tech Conference 2024"
2. Date range in first <p>: "March 15-17, 2024" → start: 2024-03-15, end: 2024-03-17
3. Location in second <p>: "San Francisco, CA"
4. Price in <span>: $299 → 299
Output: {"name": "Tech Conference 2024", "start_date": "2024-03-15", "end_date": "2024-03-17", "location": "San Francisco, CA", "price": 299}
Now extract from this HTML using the same reasoning approach:
{your_html}
"""
Handling Multiple Item Extraction
When scraping lists, provide examples of how to handle multiple items:
const prompt = `Extract all products from the HTML as a JSON array.
Example:
Input:
<div class="products">
<div class="item"><h3>Laptop</h3><p>$999</p></div>
<div class="item"><h3>Monitor</h3><p>$299</p></div>
<div class="item"><h3>Keyboard</h3><p>$79</p></div>
</div>
Output:
[
{"name": "Laptop", "price": 999},
{"name": "Monitor", "price": 299},
{"name": "Keyboard", "price": 79}
]
Now extract all products from:
${html}`;
Optimizing Example Count
The optimal number of examples depends on task complexity:
- Simple extraction (1-3 fields): 2-3 examples
- Moderate complexity (4-8 fields): 3-5 examples
- Complex extraction (8+ fields, nested data): 5-7 examples
More examples aren't always better—they consume tokens and may not improve results beyond a certain point.
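The guideline above can be sketched as a small helper. The exact thresholds and return values here are illustrative heuristics, not from any library; tune them against your own extraction tasks:

```python
def suggest_example_count(num_fields: int, nested: bool = False) -> int:
    """Heuristic: more output fields (or nested structures) warrant more few-shot examples."""
    if nested or num_fields >= 8:
        return 6   # complex extraction: 5-7 examples
    if num_fields >= 4:
        return 4   # moderate complexity: 3-5 examples
    return 3       # simple extraction: 2-3 examples

print(suggest_example_count(2))               # → 3
print(suggest_example_count(6))               # → 4
print(suggest_example_count(10))              # → 6
print(suggest_example_count(3, nested=True))  # → 6
```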
Using Examples with AI Web Scraping APIs
Services like WebScraping.AI allow you to provide examples directly in API calls:
```python
import requests

# Use WebScraping.AI with example-based extraction
response = requests.post(
    'https://api.webscraping.ai/ai/question',
    params={
        'api_key': 'your_api_key',
        'url': 'https://example.com/products',
    },
    json={
        'question': '''Extract product info as JSON.
Examples:
1. Input: <div class="prod"><h1>Shirt</h1><span>$29.99</span></div>
   Output: {"product": "Shirt", "price": 29.99}
2. Input: <div class="prod"><h1>Pants</h1><span>$49.99</span></div>
   Output: {"product": "Pants", "price": 49.99}
Extract from the current page.'''
    },
)

print(response.json())
```
Common Pitfalls to Avoid
1. Inconsistent Example Formats
# Bad: Inconsistent formatting
examples = """
Example 1: {"name": "Product A", "price": 100}
Example 2: {'name': 'Product B', 'cost': 200} # Different key name, single quotes
"""
# Good: Consistent formatting
examples = """
Example 1: {"name": "Product A", "price": 100}
Example 2: {"name": "Product B", "price": 200}
"""
2. Overly Complex Examples
Keep examples focused on the task at hand. Don't include irrelevant HTML or data.
3. Not Showing Null Handling
Always include examples with missing data to teach the model how to handle nulls.
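One way to guard against this pitfall (an illustrative sketch, not from any particular library) is to generate your few-shot example strings from sample records against a fixed schema, so any absent field is rendered as `null` and the model always sees the complete key set:

```python
import json

FIELDS = ["name", "price", "rating"]  # the full output schema

def format_example(html_snippet: str, record: dict) -> str:
    """Render one few-shot example, emitting null for any field absent from the record."""
    complete = {field: record.get(field) for field in FIELDS}  # missing keys -> None -> null
    return f"Input: {html_snippet}\nOutput: {json.dumps(complete)}"

print(format_example(
    '<div class="product"><h1>USB Cable</h1><span class="price">$12</span></div>',
    {"name": "USB Cable", "price": 12},  # no rating on the page
))
# → Input: <div class="product">...</div>
#   Output: {"name": "USB Cable", "price": 12, "rating": null}
```

Generating examples this way also keeps key names and ordering consistent across all examples, addressing pitfall 1 at the same time.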
Conclusion
Providing well-crafted examples is essential for accurate LLM-based web scraping. By using few-shot learning, function calling, and chain-of-thought prompting with clear examples, you can significantly improve extraction quality while reducing hallucinations and inconsistencies.
Remember to:
- Use 2-5 representative examples
- Show edge cases and null handling
- Maintain consistent output formats
- Keep examples simple and focused
- Test with real-world data variations
With these techniques, you can build robust LLM-based scrapers that reliably extract structured data from even the most complex web pages.