How do I provide examples to an LLM for better web scraping results?
Providing examples to Large Language Models (LLMs) is one of the most effective techniques for improving web scraping accuracy and consistency. This approach, known as few-shot learning or in-context learning, helps LLMs understand the exact structure, format, and type of data you want to extract from web pages. By showing the model concrete examples of input-output pairs, you can dramatically reduce hallucinations and improve extraction quality.
Why Examples Matter in LLM-Based Web Scraping
LLMs are powerful pattern recognition systems. When you provide examples, you're essentially teaching the model what patterns to look for and how to format the output. This is particularly valuable for web scraping because:
- Reduces ambiguity: Examples clarify exactly what data you want
- Improves consistency: The model follows the demonstrated format
- Handles edge cases: You can show how to handle missing or unusual data
- Minimizes hallucinations: Clear examples reduce made-up data
- Speeds up development: Less trial-and-error with prompt engineering
Few-Shot Prompting for Web Scraping
Few-shot prompting involves providing 2-5 examples of the task you want the LLM to perform. Here's a practical example for extracting product information:
Python Example with OpenAI API
```python
from openai import OpenAI
import json

client = OpenAI(api_key="your-api-key")  # or set the OPENAI_API_KEY env var

# HTML content from a product page
html_content = """
<div class="product">
  <h1>Wireless Headphones Pro</h1>
  <span class="price">$199.99</span>
  <div class="rating">4.5 stars</div>
  <p class="description">Premium noise-cancelling headphones</p>
</div>
"""

# Create a prompt with examples
prompt = f"""Extract product information from HTML and return as JSON.

Example 1:
Input: <div class="product"><h1>Smart Watch</h1><span class="price">$299</span><div class="rating">4.2 stars</div></div>
Output: {{"name": "Smart Watch", "price": 299, "rating": 4.2}}

Example 2:
Input: <div class="product"><h1>Laptop Stand</h1><span class="price">$49.99</span><div class="rating">4.8 stars</div></div>
Output: {{"name": "Laptop Stand", "price": 49.99, "rating": 4.8}}

Example 3:
Input: <div class="product"><h1>USB Cable</h1><span class="price">$12</span></div>
Output: {{"name": "USB Cable", "price": 12, "rating": null}}

Now extract from this HTML:
{html_content}
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt},
    ],
    temperature=0,  # lower temperature for consistent output
)

result = json.loads(response.choices[0].message.content)
print(result)
```
JavaScript Example with OpenAI API
```javascript
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function extractWithExamples(html) {
  const prompt = `Extract product information from HTML and return as JSON.

Example 1:
Input: <div class="item"><h2>Coffee Maker</h2><span class="cost">$89.99</span></div>
Output: {"product": "Coffee Maker", "price": 89.99}

Example 2:
Input: <div class="item"><h2>Blender</h2><span class="cost">$59.50</span></div>
Output: {"product": "Blender", "price": 59.50}

Now extract from:
${html}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'system',
        content: 'You extract structured data from HTML. Always return valid JSON.',
      },
      {
        role: 'user',
        content: prompt,
      },
    ],
    temperature: 0,
  });

  return JSON.parse(response.choices[0].message.content);
}

// Usage
const html = '<div class="item"><h2>Toaster</h2><span class="cost">$34.99</span></div>';
extractWithExamples(html).then((result) => console.log(result));
```
Best Practices for Example Selection
1. Use Representative Examples
Choose examples that reflect the variety of data you'll encounter:
prompt = """Extract article metadata from HTML.
Example 1 - Standard article:
Input: <article><h1>AI Revolution</h1><span class="author">Jane Smith</span><time>2024-01-15</time></article>
Output: {"title": "AI Revolution", "author": "Jane Smith", "date": "2024-01-15"}
Example 2 - Missing author:
Input: <article><h1>Breaking News</h1><time>2024-01-16</time></article>
Output: {"title": "Breaking News", "author": null, "date": "2024-01-16"}
Example 3 - Multiple authors:
Input: <article><h1>Research Paper</h1><span class="author">John Doe, Mary Johnson</span><time>2024-01-14</time></article>
Output: {"title": "Research Paper", "author": "John Doe, Mary Johnson", "date": "2024-01-14"}
"""
2. Show Edge Cases
Include examples of missing data, null values, and unusual formats:
few_shot_examples = """
Example with all fields present:
Input: <div class="listing"><h3>3BR Apartment</h3><p>$2,500/month</p><span>Available Now</span></div>
Output: {"title": "3BR Apartment", "price": 2500, "currency": "USD", "period": "month", "status": "available"}
Example with missing price:
Input: <div class="listing"><h3>Studio</h3><span>Contact for price</span></div>
Output: {"title": "Studio", "price": null, "currency": null, "period": null, "status": "contact"}
Example with different currency:
Input: <div class="listing"><h3>2BR House</h3><p>€1,800/month</p></div>
Output: {"title": "2BR House", "price": 1800, "currency": "EUR", "period": "month", "status": "available"}
"""
3. Demonstrate Output Format Consistency
Maintain consistent JSON structure across all examples:
```javascript
const systemPrompt = `You extract job listings from HTML. Always use this exact JSON structure:
{
  "title": string,
  "company": string,
  "location": string or null,
  "salary": {"min": number, "max": number, "currency": string} or null,
  "remote": boolean
}`;
```
const examples = `
Example 1:
<div class="job"><h2>Senior Developer</h2><span>TechCorp</span><p>New York</p><p>$120k-150k</p></div>
→ {"title": "Senior Developer", "company": "TechCorp", "location": "New York", "salary": {"min": 120000, "max": 150000, "currency": "USD"}, "remote": false}
Example 2:
<div class="job"><h2>Remote Designer</h2><span>DesignCo</span><p>Remote</p></div>
→ {"title": "Remote Designer", "company": "DesignCo", "location": null, "salary": null, "remote": true}
`;
Using Function Calling with Examples
Modern LLM APIs support function calling (exposed as "tools" in OpenAI's current API), which lets you combine few-shot examples with a structured output schema:
```python
from openai import OpenAI
import json

client = OpenAI(api_key="your-api-key")

# Define the extraction schema as a tool (the current form of function calling)
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_product",
            "description": "Extract product information from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Product name"},
                    "price": {"type": "number", "description": "Price in USD"},
                    "rating": {"type": "number", "description": "Rating out of 5"},
                    "in_stock": {"type": "boolean", "description": "Availability status"},
                },
                "required": ["name", "price"],
            },
        },
    }
]

# Provide examples in the system message
system_message = """You extract product data from HTML. Examples:

Input: <div><h1>Keyboard</h1><span>$79.99</span><div>4.3★</div><p>In Stock</p></div>
Output: {"name": "Keyboard", "price": 79.99, "rating": 4.3, "in_stock": true}

Input: <div><h1>Mouse Pad</h1><span>$15</span><div>Out of Stock</div></div>
Output: {"name": "Mouse Pad", "price": 15, "rating": null, "in_stock": false}"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": html_to_scrape},  # the HTML you want to extract from
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_product"}},
)

# Extract the structured data from the tool call
result = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```
Chain-of-Thought Prompting with Examples
For complex scraping tasks, show the LLM how to reason through the extraction:
prompt = """Extract event information from HTML with step-by-step reasoning.
Example 1:
HTML: <div class="event"><h2>Tech Conference 2024</h2><p>March 15-17, 2024</p><p>San Francisco, CA</p><span>$299</span></div>
Reasoning:
1. Event name is in the <h2> tag: "Tech Conference 2024"
2. Date range in first <p>: "March 15-17, 2024" → start: 2024-03-15, end: 2024-03-17
3. Location in second <p>: "San Francisco, CA"
4. Price in <span>: $299 → 299
Output: {"name": "Tech Conference 2024", "start_date": "2024-03-15", "end_date": "2024-03-17", "location": "San Francisco, CA", "price": 299}
Now extract from this HTML using the same reasoning approach:
{your_html}
"""
Handling Multiple Item Extraction
When scraping lists, provide examples of how to handle multiple items:
const prompt = `Extract all products from the HTML as a JSON array.
Example:
Input:
<div class="products">
<div class="item"><h3>Laptop</h3><p>$999</p></div>
<div class="item"><h3>Monitor</h3><p>$299</p></div>
<div class="item"><h3>Keyboard</h3><p>$79</p></div>
</div>
Output:
[
{"name": "Laptop", "price": 999},
{"name": "Monitor", "price": 299},
{"name": "Keyboard", "price": 79}
]
Now extract all products from:
${html}`;
Optimizing Example Count
The optimal number of examples depends on task complexity:
- Simple extraction (1-3 fields): 2-3 examples
- Moderate complexity (4-8 fields): 3-5 examples
- Complex extraction (8+ fields, nested data): 5-7 examples
More examples aren't always better—they consume tokens and may not improve results beyond a certain point.
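The guideline above can be sketched as a small helper. The exact thresholds and return values here are illustrative heuristics, not from any library; tune them against your own extraction tasks:

```python
def suggest_example_count(num_fields: int, nested: bool = False) -> int:
    """Heuristic: more output fields (or nested structures) warrant more few-shot examples."""
    if nested or num_fields >= 8:
        return 6   # complex extraction: 5-7 examples
    if num_fields >= 4:
        return 4   # moderate complexity: 3-5 examples
    return 3       # simple extraction: 2-3 examples

print(suggest_example_count(2))               # → 3
print(suggest_example_count(6))               # → 4
print(suggest_example_count(10))              # → 6
print(suggest_example_count(3, nested=True))  # → 6
```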
Using Examples with AI Web Scraping APIs
Services like WebScraping.AI allow you to provide examples directly in API calls:
```python
import requests

# Use WebScraping.AI with example-based extraction
response = requests.post(
    'https://api.webscraping.ai/ai/question',
    params={
        'api_key': 'your_api_key',
        'url': 'https://example.com/products',
    },
    json={
        'question': '''Extract product info as JSON.
Examples:
1. Input: <div class="prod"><h1>Shirt</h1><span>$29.99</span></div>
   Output: {"product": "Shirt", "price": 29.99}
2. Input: <div class="prod"><h1>Pants</h1><span>$49.99</span></div>
   Output: {"product": "Pants", "price": 49.99}
Extract from the current page.'''
    },
)

print(response.json())
```
Common Pitfalls to Avoid
1. Inconsistent Example Formats
# Bad: Inconsistent formatting
examples = """
Example 1: {"name": "Product A", "price": 100}
Example 2: {'name': 'Product B', 'cost': 200} # Different key name, single quotes
"""
# Good: Consistent formatting
examples = """
Example 1: {"name": "Product A", "price": 100}
Example 2: {"name": "Product B", "price": 200}
"""
2. Overly Complex Examples
Keep examples focused on the task at hand. Don't include irrelevant HTML or data.
3. Not Showing Null Handling
Always include examples with missing data to teach the model how to handle nulls.
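One way to guard against this pitfall (an illustrative sketch, not from any particular library) is to generate your few-shot example strings from sample records against a fixed schema, so any absent field is rendered as `null` and the model always sees the complete key set:

```python
import json

FIELDS = ["name", "price", "rating"]  # the full output schema

def format_example(html_snippet: str, record: dict) -> str:
    """Render one few-shot example, emitting null for any field absent from the record."""
    complete = {field: record.get(field) for field in FIELDS}  # missing keys -> None -> null
    return f"Input: {html_snippet}\nOutput: {json.dumps(complete)}"

print(format_example(
    '<div class="product"><h1>USB Cable</h1><span class="price">$12</span></div>',
    {"name": "USB Cable", "price": 12},  # no rating on the page
))
# → Input: <div class="product">...</div>
#   Output: {"name": "USB Cable", "price": 12, "rating": null}
```

Generating examples this way also keeps key names and ordering consistent across all examples, addressing pitfall 1 at the same time.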
Conclusion
Providing well-crafted examples is essential for accurate LLM-based web scraping. By using few-shot learning, function calling, and chain-of-thought prompting with clear examples, you can significantly improve extraction quality while reducing hallucinations and inconsistencies.
Remember to:
- Use 2-5 representative examples
- Show edge cases and null handling
- Maintain consistent output formats
- Keep examples simple and focused
- Test with real-world data variations
With these techniques, you can build robust LLM-based scrapers that reliably extract structured data from even the most complex web pages.