How Many Examples Should I Include in My LLM Scraping Prompt?

When using Large Language Models (LLMs) for web scraping, the number of examples you include in your prompt significantly impacts the quality of extracted data, processing costs, and response times. The optimal range is typically 2-5 examples, with 3 examples being the sweet spot for most use cases.

The Golden Rule: 2-5 Examples

Based on extensive testing and real-world applications, here's the recommended approach:

  • 2 examples: Minimum for establishing a pattern
  • 3 examples: Optimal for most scenarios (recommended)
  • 5 examples: Maximum before diminishing returns
  • 6+ examples: Usually unnecessary and wasteful

Why 3 Examples Is Often Ideal

Three examples strike the right balance among four factors:

  1. Pattern Recognition: Enough variation for the LLM to understand the extraction pattern
  2. Cost Efficiency: Minimizes token usage while maintaining accuracy
  3. Context Window: Leaves room for the actual content to be scraped (see the token-count sketch after this list)
  4. Processing Time: Keeps response times reasonable
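
To see how points 2 and 3 play out, you can measure what each example actually costs in tokens. Here is a rough sketch using the tiktoken library; the 8K context window matches classic GPT-4, and the 6,000-token page budget is an illustrative assumption:

# pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

example = (
    'HTML: <div class="product"><h2>Laptop Pro</h2>'
    '<span class="price">$1299</span></div>\n'
    'Output: {"name": "Laptop Pro", "price": 1299, "currency": "USD"}'
)
tokens_per_example = len(enc.encode(example))

context_window = 8192    # classic GPT-4; turbo models are larger
page_budget = 6000       # assumed allowance for the page being scraped

for n in (2, 3, 5, 7):
    overhead = n * tokens_per_example
    remaining = context_window - page_budget - overhead
    print(f"{n} examples ~ {overhead} tokens, ~{remaining} left for instructions")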

Practical Example: Product Data Extraction

Let's compare different approaches when scraping product information:

One Example (Insufficient)

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

html_content = "<div>...your HTML here...</div>"  # the page HTML to scrape

prompt = """
Extract product information from the HTML below and return it as JSON.

Example:
HTML: <div class="product"><h2>Laptop Pro</h2><span>$1299</span></div>
Output: {"name": "Laptop Pro", "price": 1299}

Now extract from this HTML:
{html_content}
"""

# str.replace() avoids clashing with the literal JSON braces in the prompt
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt.replace("{html_content}", html_content)}]
)

Problem: With only one example, the LLM might struggle with variations in HTML structure or edge cases.

Three Examples (Optimal)

prompt = """
Extract product information from the HTML below and return it as JSON.

Examples:

1. HTML: <div class="product"><h2>Laptop Pro</h2><span class="price">$1299</span></div>
   Output: {"name": "Laptop Pro", "price": 1299, "currency": "USD"}

2. HTML: <div class="item"><h3>Wireless Mouse</h3><p class="cost">€29.99</p></div>
   Output: {"name": "Wireless Mouse", "price": 29.99, "currency": "EUR"}

3. HTML: <article><div class="title">USB-C Cable</div><div class="pricing">¥1200</div></article>
   Output: {"name": "USB-C Cable", "price": 1200, "currency": "JPY"}

Now extract from this HTML:
{html_content}
"""

Advantages: Shows different HTML structures, currencies, and element types while remaining concise.
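
One practical caveat: when a prompt like this is sent without a structured-output setting, models occasionally wrap the JSON in markdown code fences. A small defensive parser helps; the regex-based approach below is a common workaround, not an official API:

import json
import re

def parse_json_reply(text: str) -> dict:
    """Parse a model reply, tolerating markdown fences around the JSON object."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError(f"No JSON object found in reply: {text[:80]!r}")
    return json.loads(match.group(0))

print(parse_json_reply('```json\n{"name": "Wireless Mouse", "price": 29.99}\n```'))
# -> {'name': 'Wireless Mouse', 'price': 29.99}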

Seven Examples (Excessive)

# Including 7+ examples in your prompt
prompt = """
Extract product information from the HTML below...

Example 1: ...
Example 2: ...
Example 3: ...
Example 4: ...
Example 5: ...
Example 6: ...
Example 7: ...

Now extract from: {html_content}
"""

Problems:

  • Wastes tokens (and money)
  • Increases processing time
  • May hit context limits sooner
  • Marginal accuracy improvement

JavaScript Implementation

Here's a practical Node.js example using OpenAI's API with the optimal example count:

const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function scrapeWithLLM(htmlContent) {
    const prompt = `
Extract article metadata from HTML and return as JSON with fields: title, author, date, category.

Examples:

1. HTML: <article><h1>AI Trends 2024</h1><span class="author">John Doe</span><time>2024-01-15</time><span class="cat">Technology</span></article>
   Output: {"title": "AI Trends 2024", "author": "John Doe", "date": "2024-01-15", "category": "Technology"}

2. HTML: <div class="post"><h2>Climate Change</h2><p class="byline">By Jane Smith</p><p class="published">March 3, 2024</p><div class="topic">Science</div></div>
   Output: {"title": "Climate Change", "author": "Jane Smith", "date": "2024-03-03", "category": "Science"}

3. HTML: <main><header><h1>Cooking Tips</h1></header><div class="meta"><span>Author: Bob Chef</span><span>Date: 2024-02-20</span></div><aside>Food</aside></main>
   Output: {"title": "Cooking Tips", "author": "Bob Chef", "date": "2024-02-20", "category": "Food"}

Extract from:
${htmlContent}
`;

    const response = await openai.chat.completions.create({
        model: "gpt-4-turbo-preview",
        messages: [
            {
                role: "system",
                content: "You are a web scraping assistant. Always return valid JSON."
            },
            {
                role: "user",
                content: prompt
            }
        ],
        temperature: 0.1, // Low temperature for consistent extraction
        response_format: { type: "json_object" }
    });

    return JSON.parse(response.choices[0].message.content);
}

// Usage
const html = '<article>...your HTML here...</article>';
scrapeWithLLM(html).then(data => console.log(data));

When to Adjust the Example Count

Use 2 Examples When:

  • The HTML structure is very consistent
  • You're extracting simple, single-field data
  • Token costs are a primary concern
  • The pattern is straightforward (e.g., always the same tags)

Use 4-5 Examples When:

  • HTML structures vary significantly across pages
  • You're dealing with complex nested data
  • Edge cases are common (missing fields, different formats)
  • High accuracy is critical and worth the extra cost
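
If you want a starting point in code, these rules of thumb can be folded into a toy helper. The signals and thresholds below are assumptions to tune against your own data, not measured values:

def suggested_example_count(structure_varies: bool,
                            nested_data: bool,
                            frequent_edge_cases: bool) -> int:
    """Map the rough guidance above to a starting example count."""
    signals = sum([structure_varies, nested_data, frequent_edge_cases])
    if signals == 0:
        return 2   # consistent structure, simple fields
    if signals == 1:
        return 3   # the recommended baseline
    return 5       # highly variable or edge-case-heavy sources

print(suggested_example_count(True, True, False))  # -> 5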

Cost-Benefit Analysis

Let's look at typical token consumption and costs (using GPT-4 pricing as a reference):

| Examples | Avg Tokens | Cost per 1K Requests | Accuracy |
|----------|------------|----------------------|----------|
| 1        | ~200       | $6                   | 75%      |
| 2        | ~350       | $10.50               | 85%      |
| 3        | ~500       | $15                  | 93%      |
| 5        | ~800       | $24                  | 95%      |
| 10       | ~1500      | $45                  | 96%      |

Note: These are approximate values based on typical scraping scenarios.

The 3-example approach delivers 93% accuracy at $15 per 1,000 requests, while 10 examples only marginally improve accuracy to 96% but triple the cost.
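
Another way to read the table is marginal cost per accuracy point gained, which makes the diminishing returns explicit. A quick calculation over the table's approximate numbers:

# (examples, cost per 1K requests in $, accuracy) from the table above
rows = [
    (1, 6.00, 0.75),
    (2, 10.50, 0.85),
    (3, 15.00, 0.93),
    (5, 24.00, 0.95),
    (10, 45.00, 0.96),
]

for (n0, c0, a0), (n1, c1, a1) in zip(rows, rows[1:]):
    dollars_per_point = (c1 - c0) / ((a1 - a0) * 100)
    print(f"{n0} -> {n1} examples: ${dollars_per_point:.2f} per accuracy point")
# 2 -> 3 costs ~$0.56 per point; 5 -> 10 jumps to ~$21.00 per point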

Advanced Technique: Few-Shot Learning with Variation

When crafting your examples, ensure they demonstrate meaningful variation:

# Good: Shows different structures and edge cases
examples = [
    {
        "html": "<div><h1>Product A</h1><span>$50</span></div>",
        "output": {"name": "Product A", "price": 50}
    },
    {
        "html": "<article><h2>Product B</h2><p class='sale'>Was $100, now $75</p></article>",
        "output": {"name": "Product B", "price": 75, "original_price": 100}
    },
    {
        "html": "<section><div class='name'>Product C</div><div>Price: Contact Us</div></section>",
        "output": {"name": "Product C", "price": null}
    }
]

# Bad: Repetitive examples with minimal variation
examples = [
    {"html": "<div><h1>Product A</h1><span>$50</span></div>", "output": {...}},
    {"html": "<div><h1>Product B</h1><span>$60</span></div>", "output": {...}},
    {"html": "<div><h1>Product C</h1><span>$70</span></div>", "output": {...}}
]

Integration with Traditional Scraping

For optimal results, combine LLM-based extraction with traditional scraping methods. Use headless browsers to render JavaScript and navigate pages, then apply LLM extraction to the complex parts:

const puppeteer = require('puppeteer');

async function hybridScraping(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Get rendered HTML
    const htmlContent = await page.content();

    // Extract with LLM (3 examples in prompt)
    const structuredData = await scrapeWithLLM(htmlContent);

    await browser.close();
    return structuredData;
}

Testing Different Example Counts

Here's a systematic approach to determine the optimal number for your specific use case:

import openai
from typing import List, Dict

def test_example_counts(html_samples: List[str], ground_truth: List[Dict]):
    """Test different numbers of examples to find optimal count"""
    results = {}

    for num_examples in [1, 2, 3, 4, 5, 7, 10]:
        correct = 0
        total_tokens = 0

        for html, expected in zip(html_samples, ground_truth):
            response = extract_with_n_examples(html, num_examples)
            total_tokens += response.usage.total_tokens

            if response.data == expected:
                correct += 1

        accuracy = correct / len(html_samples)
        avg_tokens = total_tokens / len(html_samples)

        results[num_examples] = {
            "accuracy": accuracy,
            "avg_tokens": avg_tokens,
            "cost_per_1k": (avg_tokens / 1000) * 0.03  # GPT-4 pricing
        }

    return results

# Analyze results to find the best balance
test_results = test_example_counts(my_html_samples, my_ground_truth)
for count, metrics in test_results.items():
    print(f"{count} examples: {metrics['accuracy']:.1%} accuracy, "
          f"${metrics['cost_per_1k']:.2f} per 1K requests")

Best Practices Summary

  1. Start with 3 examples as your baseline
  2. Ensure diversity in your examples (different structures, edge cases)
  3. Monitor accuracy and adjust if needed
  4. Track token usage to manage costs
  5. Include edge cases in your examples (null values, missing fields, unusual formats)
  6. Keep examples concise - remove unnecessary HTML attributes (a sketch follows this list)
  7. Test systematically before scaling to production
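
Point 6 is easy to automate. Below is a minimal sketch using BeautifulSoup (pip install beautifulsoup4); the attribute whitelist is an assumption you would adjust per site:

from bs4 import BeautifulSoup

KEEP_ATTRS = {"class", "id"}  # attributes that often carry extraction hints

def shrink_html(html: str) -> str:
    """Drop scripts, styles, and most attributes to cut prompt tokens."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
    return str(soup)

print(shrink_html('<div class="product" data-id="42" style="color:red">'
                  '<h2 onclick="track()">Laptop Pro</h2></div>'))
# -> <div class="product"><h2>Laptop Pro</h2></div>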

When LLM Scraping Makes Sense

Understanding when to use an LLM for web scraping helps you decide whether few-shot prompting is the right approach. LLMs excel when:

  • HTML structures vary significantly between pages
  • You need to extract semantic meaning, not just text
  • The data requires interpretation or normalization
  • Traditional selectors would be too brittle (illustrated below)
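
The last point deserves a concrete illustration: with CSS selectors, every new layout needs its own rule, whereas a few-shot prompt generalizes across them. A small sketch of the selector-maintenance problem, reusing layouts from the earlier product examples:

from bs4 import BeautifulSoup

pages = [
    '<div class="product"><h2>Laptop Pro</h2></div>',
    '<div class="item"><h3>Wireless Mouse</h3></div>',
    '<article><div class="title">USB-C Cable</div></article>',
]

# One selector per layout; each new page template forces another entry.
selectors = [".product h2", ".item h3", "article .title"]

for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    name = next((el.get_text() for sel in selectors
                 for el in soup.select(sel)), None)
    print(name)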

Conclusion

For most web scraping scenarios, 3 well-crafted examples provide the optimal balance of accuracy, cost, and performance. Start with three diverse examples that cover common patterns and edge cases, then adjust based on your specific accuracy requirements and budget constraints.

Remember that quality matters more than quantity - three highly relevant examples that demonstrate structural variation and edge cases will outperform ten repetitive examples every time.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
