How do I provide examples to an LLM for better web scraping results?

Providing examples to Large Language Models (LLMs) is one of the most effective techniques for improving web scraping accuracy and consistency. This approach, known as few-shot learning or in-context learning, helps LLMs understand the exact structure, format, and type of data you want to extract from web pages. By showing the model concrete examples of input-output pairs, you can dramatically reduce hallucinations and improve extraction quality.

Why Examples Matter in LLM-Based Web Scraping

LLMs are powerful pattern recognition systems. When you provide examples, you're essentially teaching the model what patterns to look for and how to format the output. This is particularly valuable for web scraping because:

  • Reduces ambiguity: Examples clarify exactly what data you want
  • Improves consistency: The model follows the demonstrated format
  • Handles edge cases: You can show how to handle missing or unusual data
  • Minimizes hallucinations: Clear examples reduce made-up data
  • Speeds up development: Less trial-and-error with prompt engineering

Few-Shot Prompting for Web Scraping

Few-shot prompting involves providing 2-5 examples of the task you want the LLM to perform. Here's a practical example for extracting product information:

Python Example with OpenAI API

import openai
import json

openai.api_key = "your-api-key"

# HTML content from a product page
html_content = """
<div class="product">
    <h1>Wireless Headphones Pro</h1>
    <span class="price">$199.99</span>
    <div class="rating">4.5 stars</div>
    <p class="description">Premium noise-cancelling headphones</p>
</div>
"""

# Create a prompt with examples
prompt = f"""Extract product information from HTML and return as JSON.

Example 1:
Input: <div class="product"><h1>Smart Watch</h1><span class="price">$299</span><div class="rating">4.2 stars</div></div>
Output: {{"name": "Smart Watch", "price": 299, "rating": 4.2}}

Example 2:
Input: <div class="product"><h1>Laptop Stand</h1><span class="price">$49.99</span><div class="rating">4.8 stars</div></div>
Output: {{"name": "Laptop Stand", "price": 49.99, "rating": 4.8}}

Example 3:
Input: <div class="product"><h1>USB Cable</h1><span class="price">$12</span></div>
Output: {{"name": "USB Cable", "price": 12, "rating": null}}

Now extract from this HTML:
{html_content}
"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt}
    ],
    temperature=0  # Lower temperature for consistent output
)

result = json.loads(response.choices[0].message.content)
print(result)
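Even at temperature 0, models occasionally wrap their JSON in Markdown fences or add a short preamble, which makes a bare `json.loads` raise. A defensive parser avoids that. This helper is our own sketch, not part of the OpenAI SDK:

```python
import json
import re

def parse_llm_json(text):
    """Parse JSON from an LLM reply, tolerating Markdown fences and preambles."""
    # Strip ```json ... ``` fences if the model added them
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the first {...} or [...] span in the reply
        match = re.search(r"[\[{].*[\]}]", text, re.DOTALL)
        if not match:
            raise
        return json.loads(match.group(0))

# A reply wrapped in a fence still parses cleanly
print(parse_llm_json('```json\n{"name": "USB Cable", "price": 12}\n```'))
```

Use it in place of the bare `json.loads(response.choices[0].message.content)` call above when you cannot guarantee the model returns raw JSON.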

JavaScript Example with OpenAI API

const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractWithExamples(html) {
  const prompt = `Extract product information from HTML and return as JSON.

Example 1:
Input: <div class="item"><h2>Coffee Maker</h2><span class="cost">$89.99</span></div>
Output: {"product": "Coffee Maker", "price": 89.99}

Example 2:
Input: <div class="item"><h2>Blender</h2><span class="cost">$59.50</span></div>
Output: {"product": "Blender", "price": 59.50}

Now extract from:
${html}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'system',
        content: 'You extract structured data from HTML. Always return valid JSON.'
      },
      {
        role: 'user',
        content: prompt
      }
    ],
    temperature: 0
  });

  return JSON.parse(response.choices[0].message.content);
}

// Usage
const html = '<div class="item"><h2>Toaster</h2><span class="cost">$34.99</span></div>';
extractWithExamples(html).then(result => console.log(result));

Best Practices for Example Selection

1. Use Representative Examples

Choose examples that reflect the variety of data you'll encounter:

prompt = """Extract article metadata from HTML.

Example 1 - Standard article:
Input: <article><h1>AI Revolution</h1><span class="author">Jane Smith</span><time>2024-01-15</time></article>
Output: {"title": "AI Revolution", "author": "Jane Smith", "date": "2024-01-15"}

Example 2 - Missing author:
Input: <article><h1>Breaking News</h1><time>2024-01-16</time></article>
Output: {"title": "Breaking News", "author": null, "date": "2024-01-16"}

Example 3 - Multiple authors:
Input: <article><h1>Research Paper</h1><span class="author">John Doe, Mary Johnson</span><time>2024-01-14</time></article>
Output: {"title": "Research Paper", "author": "John Doe, Mary Johnson", "date": "2024-01-14"}
"""

2. Show Edge Cases

Include examples of missing data, null values, and unusual formats:

few_shot_examples = """
Example with all fields present:
Input: <div class="listing"><h3>3BR Apartment</h3><p>$2,500/month</p><span>Available Now</span></div>
Output: {"title": "3BR Apartment", "price": 2500, "currency": "USD", "period": "month", "status": "available"}

Example with missing price:
Input: <div class="listing"><h3>Studio</h3><span>Contact for price</span></div>
Output: {"title": "Studio", "price": null, "currency": null, "period": null, "status": "contact"}

Example with different currency:
Input: <div class="listing"><h3>2BR House</h3><p>€1,800/month</p></div>
Output: {"title": "2BR House", "price": 1800, "currency": "EUR", "period": "month", "status": "available"}
"""

3. Demonstrate Output Format Consistency

Maintain consistent JSON structure across all examples:

const systemPrompt = `You extract job listings from HTML. Always use this exact JSON structure:
{
  "title": string,
  "company": string,
  "location": string or null,
  "salary": {"min": number, "max": number, "currency": string} or null,
  "remote": boolean
}`;

const examples = `
Example 1:
<div class="job"><h2>Senior Developer</h2><span>TechCorp</span><p>New York</p><p>$120k-150k</p></div>
→ {"title": "Senior Developer", "company": "TechCorp", "location": "New York", "salary": {"min": 120000, "max": 150000, "currency": "USD"}, "remote": false}

Example 2:
<div class="job"><h2>Remote Designer</h2><span>DesignCo</span><p>Remote</p></div>
→ {"title": "Remote Designer", "company": "DesignCo", "location": null, "salary": null, "remote": true}
`;
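Declaring the structure in the system prompt helps, but it is worth verifying the model actually followed it. A lightweight post-check in Python (field names mirror the job-listing schema above; the function itself is our illustrative sketch) catches drift before bad records reach your pipeline:

```python
def validate_job(record):
    """Check a parsed job listing against the expected structure.
    Returns a list of problems (empty list means the record is valid)."""
    problems = []
    expected = {
        "title": str,
        "company": str,
        "location": (str, type(None)),
        "salary": (dict, type(None)),
        "remote": bool,
    }
    for field, types in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], types):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # If a salary object is present, it must carry all three keys
    if isinstance(record.get("salary"), dict):
        for key in ("min", "max", "currency"):
            if key not in record["salary"]:
                problems.append(f"salary missing key: {key}")
    return problems

good = {"title": "Remote Designer", "company": "DesignCo",
        "location": None, "salary": None, "remote": True}
print(validate_job(good))  # []
```

Rejected records can be retried with the validation errors appended to the prompt, which often fixes the output on the second attempt.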

Using Function Calling with Examples

Modern LLM APIs like OpenAI's function calling combine examples with structured schemas:

import openai
import json

# Define the extraction schema
functions = [
    {
        "name": "extract_product",
        "description": "Extract product information from HTML",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Product name"},
                "price": {"type": "number", "description": "Price in USD"},
                "rating": {"type": "number", "description": "Rating out of 5"},
                "in_stock": {"type": "boolean", "description": "Availability status"}
            },
            "required": ["name", "price"]
        }
    }
]

# Provide examples in the system message
system_message = """You extract product data from HTML. Examples:

Input: <div><h1>Keyboard</h1><span>$79.99</span><div>4.3★</div><p>In Stock</p></div>
Output: {"name": "Keyboard", "price": 79.99, "rating": 4.3, "in_stock": true}

Input: <div><h1>Mouse Pad</h1><span>$15</span><div>Out of Stock</div></div>
Output: {"name": "Mouse Pad", "price": 15, "rating": null, "in_stock": false}"""

# HTML to extract from (illustrative)
html_to_scrape = '<div><h1>Webcam</h1><span>$45.99</span><div>4.7★</div><p>In Stock</p></div>'

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": html_to_scrape}
    ],
    functions=functions,
    function_call={"name": "extract_product"}
)

# Extract the structured data
result = json.loads(response.choices[0].message.function_call.arguments)

Chain-of-Thought Prompting with Examples

For complex scraping tasks, show the LLM how to reason through the extraction:

prompt = """Extract event information from HTML with step-by-step reasoning.

Example 1:
HTML: <div class="event"><h2>Tech Conference 2024</h2><p>March 15-17, 2024</p><p>San Francisco, CA</p><span>$299</span></div>

Reasoning:
1. Event name is in the <h2> tag: "Tech Conference 2024"
2. Date range in first <p>: "March 15-17, 2024" → start: 2024-03-15, end: 2024-03-17
3. Location in second <p>: "San Francisco, CA"
4. Price in <span>: $299 → 299

Output: {"name": "Tech Conference 2024", "start_date": "2024-03-15", "end_date": "2024-03-17", "location": "San Francisco, CA", "price": 299}

Now extract from this HTML using the same reasoning approach:
{your_html}
"""

Handling Multiple Item Extraction

When scraping lists, provide examples of how to handle multiple items:

const prompt = `Extract all products from the HTML as a JSON array.

Example:
Input:
<div class="products">
  <div class="item"><h3>Laptop</h3><p>$999</p></div>
  <div class="item"><h3>Monitor</h3><p>$299</p></div>
  <div class="item"><h3>Keyboard</h3><p>$79</p></div>
</div>

Output:
[
  {"name": "Laptop", "price": 999},
  {"name": "Monitor", "price": 299},
  {"name": "Keyboard", "price": 79}
]

Now extract all products from:
${html}`;

Optimizing Example Count

The optimal number of examples depends on task complexity:

  • Simple extraction (1-3 fields): 2-3 examples
  • Moderate complexity (4-8 fields): 3-5 examples
  • Complex extraction (8+ fields, nested data): 5-7 examples

More examples aren't always better—they consume tokens and may not improve results beyond a certain point.
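To see the cost trade-off concretely, you can estimate how many prompt tokens each additional example adds. The sketch below uses the rough rule of thumb of about 4 characters per token for GPT-style models; for exact counts, use a real tokenizer such as tiktoken:

```python
def estimate_tokens(text):
    """Rough token estimate (~4 characters per token for GPT-style models)."""
    return max(1, len(text) // 4)

# One input/output example pair from the product prompt above
example = ('Input: <div class="product"><h1>Smart Watch</h1>'
           '<span class="price">$299</span></div>\n'
           'Output: {"name": "Smart Watch", "price": 299}\n\n')

per_example = estimate_tokens(example)
for n in (2, 3, 5, 7):
    print(f"{n} examples ≈ {n * per_example} prompt tokens")
```

If doubling the example count does not measurably improve accuracy on a held-out set of pages, the extra tokens are pure cost.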

Using Examples with AI Web Scraping APIs

Services like WebScraping.AI allow you to provide examples directly in API calls:

import requests

# Use WebScraping.AI with example-based extraction.
# The /ai/question endpoint takes the question as a query parameter
# (matching the curl examples further down) and returns the answer as text.
question = '''Extract product info as JSON.

Examples:
1. Input: <div class="prod"><h1>Shirt</h1><span>$29.99</span></div>
   Output: {"product": "Shirt", "price": 29.99}

2. Input: <div class="prod"><h1>Pants</h1><span>$49.99</span></div>
   Output: {"product": "Pants", "price": 49.99}

Extract from the current page.'''

response = requests.get(
    'https://api.webscraping.ai/ai/question',
    params={
        'api_key': 'your_api_key',
        'url': 'https://example.com/products',
        'question': question,
    },
)

print(response.text)

Common Pitfalls to Avoid

1. Inconsistent Example Formats

# Bad: Inconsistent formatting
examples = """
Example 1: {"name": "Product A", "price": 100}
Example 2: {'name': 'Product B', 'cost': 200}  # Different key name, single quotes
"""

# Good: Consistent formatting
examples = """
Example 1: {"name": "Product A", "price": 100}
Example 2: {"name": "Product B", "price": 200}
"""

2. Overly Complex Examples

Keep examples focused on the task at hand. Don't include irrelevant HTML or data.
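The same advice applies to the HTML you send alongside the examples: stripping scripts, styles, and comments before prompting keeps the input focused and saves tokens. A rough pre-clean with the standard library (a sketch, not a full HTML parser) looks like this:

```python
import re

def strip_noise(html):
    """Remove script/style blocks and collapse whitespace so the prompt
    carries only content-bearing markup. A rough pre-clean, not a full parser."""
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)  # drop HTML comments
    return re.sub(r"\s+", " ", html).strip()

page = '<div class="product"><script>track();</script><h1>Mouse</h1> <span>$25</span></div>'
print(strip_noise(page))  # <div class="product"><h1>Mouse</h1> <span>$25</span></div>
```

For production scrapers, a real parser such as BeautifulSoup is more robust, but even this crude filter can cut prompt size substantially on script-heavy pages.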

3. Not Showing Null Handling

Always include examples with missing data to teach the model how to handle nulls.
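One way to guarantee the null case is present, and formatted identically to every other example, is to generate the few-shot block from Python data so all outputs serialize through the same `json.dumps` call. The helper below is our own illustrative sketch:

```python
import json

def build_examples(pairs):
    """Render (html, record) pairs into a few-shot example block.
    Serializing through json.dumps keeps keys, quoting, and nulls uniform."""
    lines = []
    for i, (html, record) in enumerate(pairs, 1):
        lines.append(f"Example {i}:")
        lines.append(f"Input: {html}")
        lines.append(f"Output: {json.dumps(record)}")
        lines.append("")
    return "\n".join(lines).strip()

pairs = [
    ('<div><h1>Lamp</h1><span>$20</span></div>', {"name": "Lamp", "price": 20}),
    ('<div><h1>Rug</h1></div>', {"name": "Rug", "price": None}),  # null case included
]
print(build_examples(pairs))
```

Python's `None` becomes JSON `null` automatically, so the missing-data example can never drift out of sync with the others.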

Conclusion

Providing well-crafted examples is essential for accurate LLM-based web scraping. By using few-shot learning, function calling, and chain-of-thought prompting with clear examples, you can significantly improve extraction quality while reducing hallucinations and inconsistencies.

Remember to:

  • Use 2-5 representative examples
  • Show edge cases and null handling
  • Maintain consistent output formats
  • Keep examples simple and focused
  • Test with real-world data variations

With these techniques, you can build robust LLM-based scrapers that reliably extract structured data from even the most complex web pages.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

