What are examples of AI prompt optimization for web scraping?

AI prompt optimization is crucial for effective web scraping with Large Language Models (LLMs) like GPT-4, Claude, or Gemini. Well-crafted prompts can dramatically improve extraction accuracy, reduce API costs, and minimize hallucinations. This guide explores proven prompt optimization techniques with practical examples.

Understanding Prompt Optimization Basics

Prompt optimization for web scraping involves structuring your instructions to help the AI model understand exactly what data to extract, in what format, and with what level of precision. Unlike traditional selectors, AI models interpret natural language, making prompt quality the primary determinant of success.

Key Principles

  1. Be specific and explicit about what you want
  2. Provide clear output format specifications
  3. Include examples when possible
  4. Set constraints to prevent hallucinations
  5. Use structured output formats like JSON

Example 1: Basic Product Information Extraction

Unoptimized Prompt

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

html_content = """
<div class="product">
  <h1>Premium Wireless Headphones</h1>
  <span class="price">$299.99</span>
  <p class="description">Noise-canceling over-ear headphones with 30-hour battery life</p>
</div>
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Extract product info from: {html_content}"
    }]
)

This vague prompt may return inconsistent results or miss important fields.

Optimized Prompt

import json

from openai import OpenAI

client = OpenAI()

prompt = f"""Extract product information from the following HTML and return ONLY a valid JSON object with these exact fields:

Required fields:
- name (string): The product name
- price (number): Price as a decimal number without currency symbols
- description (string): Product description
- currency (string): Currency code (e.g., "USD")

Rules:
- Return ONLY the JSON object, no other text
- If a field is not found, use null
- For price, extract only the numeric value

HTML:
{html_content}"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0  # Reduce randomness for consistent extraction
)

data = json.loads(response.choices[0].message.content)
print(data)
# Output: {"name": "Premium Wireless Headphones", "price": 299.99, "description": "Noise-canceling over-ear headphones with 30-hour battery life", "currency": "USD"}

Example 2: Few-Shot Learning for Complex Structures

Few-shot prompting provides examples to guide the AI model's understanding of your extraction requirements.

const { OpenAI } = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function extractArticleData(html) {
    const prompt = `Extract article metadata from HTML and return JSON.

Example 1:
Input: <article><h1>Getting Started with AI</h1><span class="author">John Doe</span><time>2024-01-15</time></article>
Output: {"title": "Getting Started with AI", "author": "John Doe", "date": "2024-01-15", "tags": []}

Example 2:
Input: <article><h2>Web Scraping Best Practices</h2><p class="by">Jane Smith</p><time datetime="2024-02-20">Feb 20</time><div class="tags"><span>scraping</span><span>tutorial</span></div></article>
Output: {"title": "Web Scraping Best Practices", "author": "Jane Smith", "date": "2024-02-20", "tags": ["scraping", "tutorial"]}

Now extract from this HTML:
${html}

Return ONLY the JSON object.`;

    const response = await openai.chat.completions.create({
        model: "gpt-4",
        messages: [{ role: "user", content: prompt }],
        temperature: 0
    });

    return JSON.parse(response.choices[0].message.content);
}

Few-shot learning is particularly effective when dealing with AI-powered data extraction from websites with varying HTML structures.
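The same pattern is easy to reuse in Python. Here is a sketch of a helper that assembles a few-shot prompt from labeled (HTML, expected output) pairs; the example pair shown is illustrative:

import json

def build_few_shot_prompt(examples, html):
    """Build a few-shot extraction prompt from (html, expected_dict) pairs."""
    parts = ["Extract article metadata from HTML and return JSON.\n"]
    for i, (example_html, expected) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Input: {example_html}")
        parts.append(f"Output: {json.dumps(expected)}\n")
    parts.append("Now extract from this HTML:")
    parts.append(html)
    parts.append("\nReturn ONLY the JSON object.")
    return "\n".join(parts)

examples = [
    ('<article><h1>Getting Started with AI</h1><span class="author">John Doe</span></article>',
     {"title": "Getting Started with AI", "author": "John Doe", "tags": []}),
]
prompt = build_few_shot_prompt(examples, "<article>...</article>")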

Example 3: Constraint-Based Prompting to Prevent Hallucinations

AI models can sometimes generate plausible but incorrect data. Use constraints to minimize this:

import anthropic

def extract_product_reviews(html_content):
    prompt = f"""Extract customer reviews from this HTML. Follow these strict rules:

CONSTRAINTS:
1. ONLY extract text that is explicitly present in the HTML
2. Do NOT infer, guess, or generate any content
3. If you cannot find a field, use null - NEVER make up data
4. Extract exactly as written - do not paraphrase or summarize
5. Return ONLY valid JSON array format

Required fields per review:
- reviewer_name (string or null): Exact name as shown
- rating (number or null): Numeric rating only (e.g., 5, 4.5)
- review_text (string or null): Exact review text
- date (string or null): Date in ISO format if parseable

HTML:
{html_content}

Return format: [{{"reviewer_name": "...", "rating": 5, "review_text": "...", "date": "..."}}]
"""

    client = anthropic.Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )

    return message.content[0].text
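Prompt constraints reduce hallucinations but cannot eliminate them entirely, so it is worth adding a programmatic backstop: verify that extracted strings literally occur in the source HTML. A minimal sketch against the review schema above:

import json

def verify_extracted_text(reviews_json, html_content):
    """Flag extracted string values that do not appear verbatim in the source HTML."""
    suspicious = []
    for review in json.loads(reviews_json):
        for field in ("reviewer_name", "review_text"):
            value = review.get(field)
            # Exact substring check; normalize whitespace first if your HTML is minified
            if value is not None and value not in html_content:
                suspicious.append((field, value))
    return suspicious  # non-empty means possible hallucination: re-prompt or discard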

Example 4: Chain-of-Thought Prompting for Complex Extraction

For complex data extraction tasks, chain-of-thought prompting encourages the AI to reason through the problem:

import json

from openai import OpenAI

client = OpenAI()

def extract_pricing_tiers(html):
    prompt = f"""Extract pricing tier information from this HTML. Think step-by-step:

Step 1: Identify all pricing tier sections
Step 2: For each tier, extract the name
Step 3: Extract the price and billing period
Step 4: List all features for that tier
Step 5: Format as JSON

HTML:
{html}

Provide your reasoning, then output the final JSON in this format:
[{{
  "tier_name": "string",
  "price": number,
  "billing_period": "monthly|yearly",
  "features": ["feature1", "feature2"]
}}]

Begin with "Reasoning:" followed by your analysis, then "Output:" followed by JSON.
"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    # Skip the reasoning and parse only the JSON after the "Output:" marker
    content = response.choices[0].message.content
    json_start = content.find('[', max(content.find('Output:'), 0))
    json_end = content.rfind(']') + 1

    return json.loads(content[json_start:json_end])

Example 5: Role-Based Prompting

Assigning a specific role can improve extraction quality:

async function extractTechnicalSpecs(html) {
    const prompt = `You are a data extraction specialist with expertise in technical specifications.

Your task: Extract ALL technical specifications from this product page HTML with perfect accuracy.

Requirements:
- Extract specification names and values as key-value pairs
- Preserve exact units (GB, MHz, inches, etc.)
- Convert measurements to standard formats when obvious
- Group related specs logically

HTML:
${html}

Return JSON format:
{
  "processor": "...",
  "ram": "...",
  "storage": "...",
  "display": "...",
  "dimensions": "...",
  "weight": "...",
  "other_specs": {"key": "value"}
}`;

    const response = await openai.chat.completions.create({
        model: "gpt-4-turbo-preview",
        messages: [
            { role: "system", content: "You are a precise data extraction specialist." },
            { role: "user", content: prompt }
        ],
        temperature: 0
    });

    return JSON.parse(response.choices[0].message.content);
}

Example 6: Function Calling for Structured Output

Modern LLMs support function calling, which enforces strict output schemas:

import json

from openai import OpenAI

client = OpenAI()

def scrape_with_function_calling(html_content):
    tools = [{
        "type": "function",
        "function": {
            "name": "extract_product_data",
            "description": "Extract structured product data from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "price": {
                        "type": "number",
                        "description": "Numeric price value"
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["USD", "EUR", "GBP"],
                        "description": "Currency code"
                    },
                    "availability": {
                        "type": "string",
                        "enum": ["in_stock", "out_of_stock", "preorder"],
                        "description": "Stock status"
                    },
                    "features": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "List of product features"
                    }
                },
                "required": ["name", "price", "currency"]
            }
        }
    }]

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{
            "role": "user",
            "content": f"Extract product data from this HTML:\n{html_content}"
        }],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
    )

    # Extract function arguments (the structured data)
    function_args = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(function_args)

This approach is highly effective when using the ChatGPT API for web scraping because it constrains the model's output to your declared schema far more reliably than prose instructions.
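When you do not need a full tool schema, OpenAI's JSON mode is a lighter-weight alternative: it guarantees syntactically valid JSON (though not a particular schema), so you still describe the fields in the prompt. A sketch, reusing the client from the example above:

def scrape_with_json_mode(html_content):
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={"type": "json_object"},  # output is guaranteed to parse as JSON
        messages=[{
            "role": "user",
            "content": f"Extract product data from this HTML as JSON with keys name, price, currency:\n{html_content}"
        }],
        temperature=0
    )
    return json.loads(response.choices[0].message.content)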

Example 7: Context Window Optimization

For large HTML documents, optimize what you send to the AI:

from bs4 import BeautifulSoup

def optimize_html_for_llm(html_content, target_selectors):
    """Extract only relevant portions of HTML to reduce token usage"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove unnecessary elements
    for element in soup(['script', 'style', 'svg', 'path']):
        element.decompose()

    # Extract only target sections
    relevant_sections = []
    for selector in target_selectors:
        elements = soup.select(selector)
        relevant_sections.extend([str(el) for el in elements])

    optimized_html = '\n'.join(relevant_sections)

    prompt = f"""Extract data from these relevant HTML sections:

{optimized_html}

Return JSON with: {{"products": [{{"name": "...", "price": ..., "rating": ...}}]}}
"""

    return prompt

# Usage
import requests

html = requests.get("https://example.com/products").text  # fetch the page (illustrative URL)
optimized_prompt = optimize_html_for_llm(html, ['.product-card', '.product-info'])
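To confirm the trimming actually pays off, measure token counts before and after with tiktoken (assuming the package is installed):

import tiktoken

def count_tokens(text, model="gpt-4"):
    """Count tokens using the target model's tokenizer."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print(f"Raw HTML tokens: {count_tokens(html)}")
print(f"Optimized prompt tokens: {count_tokens(optimized_prompt)}")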

Example 8: Template-Based Extraction

Create reusable prompt templates for consistent results:

import json
import openai

class LLMScraperTemplate:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def create_extraction_prompt(self, html, schema, constraints=None):
        """Generate optimized prompts from schema definition"""
        schema_description = "\n".join([
            f"- {field['name']} ({field['type']}): {field['description']}"
            for field in schema
        ])

        required_fields = [f['name'] for f in schema if f.get('required', False)]

        constraint_text = ""
        if constraints:
            constraint_text = "\nCONSTRAINTS:\n" + "\n".join(
                f"{i+1}. {c}" for i, c in enumerate(constraints)
            )

        prompt = f"""Extract the following fields from the HTML:

{schema_description}

Required fields: {', '.join(required_fields)}
{constraint_text}

HTML:
{html}

Return ONLY valid JSON matching this structure.
"""
        return prompt

    def extract(self, html, schema, constraints=None):
        prompt = self.create_extraction_prompt(html, schema, constraints)

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        return json.loads(response.choices[0].message.content)

# Usage
scraper = LLMScraperTemplate(api_key="your-key")

schema = [
    {"name": "title", "type": "string", "description": "Article title", "required": True},
    {"name": "author", "type": "string", "description": "Author name", "required": True},
    {"name": "published_date", "type": "string", "description": "ISO date format", "required": False},
    {"name": "tags", "type": "array", "description": "Article tags", "required": False}
]

constraints = [
    "Extract only explicitly visible text",
    "Do not infer missing information",
    "Use null for missing optional fields"
]

result = scraper.extract(html_content, schema, constraints)
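Since the schema definition is already machine-readable, you can reuse it to validate the model's output. A minimal sketch that checks required fields and rough types:

def validate_against_schema(result, schema):
    """Check that required fields are present and types roughly match the schema."""
    type_map = {"string": str, "number": (int, float), "array": list, "boolean": bool}
    errors = []
    for field in schema:
        name, value = field["name"], result.get(field["name"])
        if field.get("required") and value is None:
            errors.append(f"missing required field: {name}")
        elif value is not None:
            expected = type_map.get(field["type"])
            if expected and not isinstance(value, expected):
                errors.append(f"{name}: expected {field['type']}")
    return errors

errors = validate_against_schema(result, schema)
if errors:
    print("Validation failed:", errors)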

Best Practices for Prompt Optimization

1. Use Low Temperature Settings

Set temperature=0 or very low values for deterministic extraction:

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # Maximum consistency
    top_p=1.0
)

2. Specify Output Format Explicitly

Always tell the AI exactly how to format responses:

Return ONLY a valid JSON object with no markdown formatting, no explanations, no additional text.

3. Include Validation Rules

prompt = f"""Extract email addresses from HTML.

Validation rules:
- Must match standard email format (user@domain.com)
- Ignore mailto: links and emails rendered as images
- Return unique emails only
- Exclude generic placeholders like info@example.com

HTML: {html}

Return: {{"emails": ["email1@domain.com", "email2@domain.com"]}}
"""

4. Test and Iterate

When implementing GPT-based web scraping, always test prompts with diverse HTML samples:

def test_prompt_variations(html_samples, prompt_templates):
    """Compare different prompt strategies"""
    results = {}

    for template_name, template in prompt_templates.items():
        accuracy_scores = []

        for sample in html_samples:
            prompt = template.format(html=sample['html'])
            # call_llm wraps your LLM client; evaluate_accuracy is sketched after this example
            response = call_llm(prompt)
            score = evaluate_accuracy(response, sample['expected'])
            accuracy_scores.append(score)

        results[template_name] = {
            'avg_accuracy': sum(accuracy_scores) / len(accuracy_scores),
            'scores': accuracy_scores
        }

    return results
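The call_llm and evaluate_accuracy helpers above are placeholders for your own client wrapper and metric. One simple metric is field-level exact match against hand-labeled expected output; a sketch:

import json

def evaluate_accuracy(response_text, expected):
    """Fraction of expected fields the extraction got exactly right."""
    try:
        actual = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # unparseable output scores zero
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matches / len(expected) if expected else 1.0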

Conclusion

AI prompt optimization for web scraping requires careful consideration of specificity, constraints, output format, and validation. By using techniques like few-shot learning, function calling, and structured templates, you can achieve reliable extraction while minimizing costs and hallucinations.

The key is to be explicit, provide examples, set clear constraints, and iterate based on real-world results. As AI models continue to evolve, prompt engineering remains the critical skill for effective LLM-based web scraping.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
