How do I create a prompt template for web scraping with LLMs?
Creating effective prompt templates is crucial for successful LLM-powered web scraping. A well-designed prompt template helps the language model understand exactly what data to extract from HTML or text content, ensuring consistent and accurate results across different pages.
Understanding Prompt Templates for Web Scraping
A prompt template for web scraping typically consists of three key components:
- Instructions: Clear directions on what task the LLM should perform
- Context: The HTML or text content to extract data from
- Output specification: The desired format and structure of extracted data
The goal is to create reusable templates that can be populated with different HTML content while maintaining consistent extraction quality.
Basic Prompt Template Structure
Here's a fundamental template structure for web scraping with LLMs:
prompt_template = """
You are a web scraping assistant. Extract the following information from the HTML below:
{extraction_instructions}
HTML Content:
{html_content}
Return the extracted data in the following JSON format:
{output_schema}
Only return valid JSON without any additional text or explanations.
"""
Python Implementation with OpenAI
Here's a complete example using Python with the OpenAI API (the snippets below use the ChatCompletion interface from the pre-1.0 openai SDK):
import openai
from typing import Dict, Any
import json

class LLMScraperTemplate:
    def __init__(self, api_key: str):
        openai.api_key = api_key

    def create_product_extraction_template(self) -> str:
        """Template for extracting product information"""
        return """
You are a precise data extraction assistant. Extract product information from the HTML below.
Extract these fields:
- product_name: The full product title
- price: The current price (numeric value only)
- currency: The currency symbol or code
- availability: Whether the product is in stock (true/false)
- rating: Average customer rating (numeric value)
- description: Brief product description (max 200 characters)
HTML Content:
{html_content}
Return ONLY a valid JSON object with these exact field names. If a field is not found, use null.
Example output:
{{
  "product_name": "Example Product",
  "price": 29.99,
  "currency": "USD",
  "availability": true,
  "rating": 4.5,
  "description": "This is a great product..."
}}
"""

    def extract_with_template(self, template: str, html_content: str,
                              model: str = "gpt-4") -> Dict[str, Any]:
        """Execute the template with given HTML content"""
        prompt = template.format(html_content=html_content)
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a data extraction expert."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,  # Lower temperature for more consistent outputs
            max_tokens=1000
        )
        extracted_text = response.choices[0].message.content
        return json.loads(extracted_text)
# Usage example
scraper = LLMScraperTemplate("your-api-key")
template = scraper.create_product_extraction_template()
html = """
<div class="product">
<h1>Premium Wireless Headphones</h1>
<span class="price">$149.99</span>
<p class="stock">In Stock</p>
<div class="rating">4.7 out of 5 stars</div>
<p class="desc">High-quality wireless headphones with noise cancellation</p>
</div>
"""
result = scraper.extract_with_template(template, html)
print(json.dumps(result, indent=2))
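One practical caveat: json.loads will raise an error if the model wraps its reply in a Markdown code fence, which can happen even when the prompt forbids extra text. A small defensive sketch (parse_llm_json is an illustrative helper, not part of any SDK) that strips fences before parsing:

import json
import re

def parse_llm_json(raw: str) -> dict:
    """Strip any Markdown code fence the model may have added, then parse."""
    cleaned = raw.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading fence
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing fence
    return json.loads(cleaned)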
JavaScript Implementation with OpenAI
Here's the equivalent implementation in JavaScript:
const OpenAI = require('openai');

class LLMScraperTemplate {
  constructor(apiKey) {
    this.client = new OpenAI({ apiKey });
  }

  createProductExtractionTemplate() {
    return `
You are a precise data extraction assistant. Extract product information from the HTML below.
Extract these fields:
- product_name: The full product title
- price: The current price (numeric value only)
- currency: The currency symbol or code
- availability: Whether the product is in stock (true/false)
- rating: Average customer rating (numeric value)
- description: Brief product description (max 200 characters)
HTML Content:
{html_content}
Return ONLY a valid JSON object with these exact field names. If a field is not found, use null.
`;
  }

  async extractWithTemplate(template, htmlContent, model = 'gpt-4') {
    const prompt = template.replace('{html_content}', htmlContent);
    const response = await this.client.chat.completions.create({
      model: model,
      messages: [
        { role: 'system', content: 'You are a data extraction expert.' },
        { role: 'user', content: prompt }
      ],
      temperature: 0,
      max_tokens: 1000
    });
    const extractedText = response.choices[0].message.content;
    return JSON.parse(extractedText);
  }
}
// Usage example
async function main() {
  const scraper = new LLMScraperTemplate('your-api-key');
  const template = scraper.createProductExtractionTemplate();
  const html = `
<div class="product">
  <h1>Premium Wireless Headphones</h1>
  <span class="price">$149.99</span>
  <p class="stock">In Stock</p>
  <div class="rating">4.7 out of 5 stars</div>
</div>
`;
  const result = await scraper.extractWithTemplate(template, html);
  console.log(JSON.stringify(result, null, 2));
}

main();
Advanced Template Patterns
Multi-Item Extraction Template
For extracting multiple items (like search results or product listings):
multi_item_template = """
Extract all product listings from the HTML below. Each product should include:
- title: Product name
- price: Numeric price value
- url: Product link (href attribute)
HTML Content:
{html_content}
Return a JSON array of products. Example:
[
{{"title": "Product 1", "price": 29.99, "url": "/product-1"}},
{{"title": "Product 2", "price": 39.99, "url": "/product-2"}}
]
Return ONLY the JSON array without any additional text.
"""
Context-Aware Template
Include an example of the expected output to guide the LLM's extraction:
context_aware_template = """
Extract article metadata from the HTML content.
Fields to extract:
- title: Article headline
- author: Author name
- publish_date: Publication date in ISO 8601 format
- tags: Array of topic tags
- excerpt: First paragraph or summary
HTML Content:
{html_content}
Example of valid output:
{{
"title": "Understanding Machine Learning",
"author": "Jane Smith",
"publish_date": "2024-01-15T10:30:00Z",
"tags": ["AI", "Technology", "Education"],
"excerpt": "Machine learning is transforming industries..."
}}
Return ONLY valid JSON matching this structure.
"""
Template Best Practices
1. Be Specific About Output Format
Always specify the exact JSON structure you expect:
# Good - Specific structure
"""Return JSON: {"name": str, "price": float, "in_stock": bool}"""
# Bad - Vague instruction
"""Return the data as JSON"""
2. Use Zero Temperature for Consistency
Set temperature=0 to get more deterministic outputs:
response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # Maximum consistency
    messages=[...]
)
3. Handle Missing Data Gracefully
Instruct the LLM on how to handle missing fields:
template = """
Extract data from HTML. If a field is not found:
- Use null for missing optional fields
- Use empty string "" for missing text fields
- Use 0 for missing numeric fields
- Use empty array [] for missing list fields
HTML: {html_content}
"""
4. Validate LLM Output
Always validate and sanitize LLM responses:
def safe_extract(template: str, html: str) -> Dict[str, Any]:
    try:
        result = extract_with_template(template, html)
        # Validate required fields
        required_fields = ['title', 'price']
        for field in required_fields:
            if field not in result:
                raise ValueError(f"Missing required field: {field}")
        # Type checking
        if not isinstance(result['price'], (int, float)):
            result['price'] = float(result['price'])
        return result
    except json.JSONDecodeError:
        print("Invalid JSON response from LLM")
        return None
    except Exception as e:
        print(f"Extraction error: {e}")
        return None
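If you prefer declarative validation over hand-written checks, a schema library can enforce field names and types for you. A sketch using pydantic (assuming pydantic v2 is installed; the fields mirror the product template from earlier):

from typing import Optional
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    product_name: str
    price: float
    currency: Optional[str] = None
    availability: Optional[bool] = None
    rating: Optional[float] = None
    description: Optional[str] = None

def validate_product(data: dict) -> Optional[Product]:
    try:
        return Product(**data)  # raises ValidationError on missing fields or wrong types
    except ValidationError as e:
        print(f"LLM output failed validation: {e}")
        return None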
Combining LLM Templates with Traditional Scraping
LLMs work best when combined with traditional web scraping tools to pre-process HTML:
from bs4 import BeautifulSoup
import requests
def hybrid_scraping_approach(url: str, template: str) -> Dict[str, Any]:
    # Step 1: Fetch HTML
    response = requests.get(url)
    # Step 2: Pre-process with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract relevant section only (reduces token usage)
    main_content = soup.find('main', class_='product-details')
    if main_content:
        # Step 3: Feed cleaned HTML to LLM
        cleaned_html = str(main_content)
        return extract_with_template(template, cleaned_html)
    return None
This approach is similar to how you might handle AJAX requests using Puppeteer to fetch dynamic content before processing it with an LLM.
Cost Optimization Strategies
LLM API calls can be expensive. Optimize your templates:
1. Minimize HTML Input
Strip unnecessary tags and attributes:
from bs4 import BeautifulSoup, Comment

def minimize_html(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script, style, and navigation tags
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)
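To measure how much a cleanup step actually saves, count tokens before and after. A sketch using the tiktoken library (an assumption: tiktoken is installed; cl100k_base is the encoding commonly used for GPT-4-era models):

import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate token count for estimating prompt size and cost."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

raw_tokens = count_tokens(html)                   # before cleanup
clean_tokens = count_tokens(minimize_html(html))  # after cleanup
print(f"Saved roughly {raw_tokens - clean_tokens} tokens")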
2. Use Smaller Models When Possible
Start with GPT-3.5 and upgrade to GPT-4 only if needed:
def adaptive_extraction(html: str, template: str) -> Dict[str, Any]:
    # Try with the cheaper model first
    try:
        result = extract_with_template(template, html, model="gpt-3.5-turbo")
        if validate_result(result):
            return result
    except Exception:
        pass
    # Fall back to the more capable model
    return extract_with_template(template, html, model="gpt-4")
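The validate_result helper referenced above is not defined here; what counts as a good-enough result is up to you. A minimal illustrative version that only checks the required product fields:

def validate_result(result: Dict[str, Any]) -> bool:
    """Illustrative check: require a non-empty name and a plausible numeric price."""
    if not result or not result.get("product_name"):
        return False
    price = result.get("price")
    return isinstance(price, (int, float)) and price >= 0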
3. Cache Results
Avoid re-processing identical pages:
import hashlib

# Simple in-memory cache keyed by a hash of the template and HTML
_extraction_cache: Dict[str, Dict[str, Any]] = {}

def extract_with_cache(html: str, template: str) -> Dict[str, Any]:
    cache_key = hashlib.md5((template + html).encode()).hexdigest()
    if cache_key not in _extraction_cache:
        # The LLM is only called once per unique HTML + template combination
        _extraction_cache[cache_key] = extract_with_template(template, html)
    return _extraction_cache[cache_key]
Using Function Calling for Structured Output
Modern LLM APIs support function calling, which constrains the model's output to a declared schema instead of relying on it to emit well-formed JSON in free text:
functions = [
    {
        "name": "save_product_data",
        "description": "Save extracted product information",
        "parameters": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "currency": {"type": "string"},
                "availability": {"type": "boolean"},
                "rating": {"type": "number"}
            },
            "required": ["product_name", "price"]
        }
    }
]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Extract product data from: {html}"}],
    functions=functions,
    function_call={"name": "save_product_data"}
)

# Extract structured data from the function call
function_args = json.loads(
    response.choices[0].message.function_call.arguments
)
To learn more about this technique, check out our guide on function calling in LLMs.
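Note that the functions and function_call parameters shown above belong to the pre-1.0 Chat Completions interface. In the current OpenAI Python SDK (1.x) the same idea is expressed with tools and tool_choice; here is a sketch of the equivalent call, reusing the schema defined above:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Extract product data from: {html}"}],
    tools=[{"type": "function", "function": functions[0]}],
    tool_choice={"type": "function", "function": {"name": "save_product_data"}},
)

# The arguments still arrive as a JSON string and need to be parsed
function_args = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)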
Testing Your Templates
Always test templates with various HTML structures:
def test_template_robustness():
    test_cases = [
        # Complete data
        '<div><h1>Product A</h1><span class="price">$50</span></div>',
        # Missing price
        '<div><h1>Product B</h1></div>',
        # Different structure
        '<article><h2>Product C</h2><p>Price: 30 USD</p></article>',
        # Malformed HTML
        '<div><h1>Product D<span>$40</div>',
    ]
    template = create_product_extraction_template()
    for i, html in enumerate(test_cases):
        result = extract_with_template(template, html)
        print(f"Test {i+1}: {result}")
        assert result is not None, f"Test {i+1} failed"
Conclusion
Creating effective prompt templates for LLM-powered web scraping requires careful attention to instruction clarity, output specification, and error handling. By following the patterns and best practices outlined in this guide, you can build robust, reusable templates that extract structured data reliably from web pages.
Remember to combine LLM capabilities with traditional parsing when appropriate, optimize for cost by minimizing input size, and always validate outputs. When dealing with dynamic content, you might also want to explore how to monitor network requests in Puppeteer to ensure you're capturing all the data your LLM template needs to process.
The key to success is iteration—start with simple templates, test them against real-world data, and refine based on the results you observe.