How Can I Use LLM Prompts for Web Scraping?
LLM (Large Language Model) prompts are the instructions you give to AI models like GPT-4, Claude, or Gemini to extract structured data from web pages. Unlike traditional web scraping that relies on CSS selectors or XPath, prompt-based scraping uses natural language instructions to tell the AI exactly what data to extract and how to format it. Mastering prompt engineering is crucial for successful AI-powered web scraping.
Understanding LLM Prompts for Web Scraping
An LLM prompt for web scraping consists of several key components:
- Context: Setting the AI's role and task
- Instructions: Specific directions for data extraction
- Schema definition: The structure of the output you want
- HTML/text content: The webpage content to process
- Examples (optional): Sample outputs to guide the AI
The quality of your prompts directly impacts extraction accuracy, consistency, and cost-efficiency.
Basic Prompt Structure
Here's the fundamental structure of an effective web scraping prompt:
import openai
client = openai.OpenAI(api_key="your-api-key")
# Basic prompt structure
system_prompt = """You are a web scraping assistant specialized in extracting
structured data from HTML content. Always return valid JSON and follow the
schema provided exactly."""
user_prompt = f"""
Extract product information from the following HTML.
Required fields for each product:
- name (string): The product name
- price (number): The price as a number without currency symbols
- availability (boolean): Whether the product is in stock
Return the data as JSON with this structure:
{{
"products": [
{{"name": "...", "price": 0.00, "availability": true}}
]
}}
HTML Content:
{html_content}
"""
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0, # Deterministic output
response_format={"type": "json_object"}
)
extracted_data = response.choices[0].message.content
Writing Effective System Prompts
The system prompt sets the AI's behavior and role. Here are examples for different scraping scenarios:
General-Purpose Web Scraping
system_prompt = """You are an expert web scraping assistant. Your task is to:
1. Extract data exactly as specified in the user's instructions
2. Return well-formed JSON that matches the requested schema
3. Handle missing data by using null values, never invent or guess information
4. Preserve the original text formatting (capitalization, spacing) unless instructed otherwise
5. If you cannot find requested information, set the field to null"""
E-commerce Data Extraction
system_prompt = """You are a specialized e-commerce data extraction assistant.
When extracting product data:
- Parse prices as numbers, removing currency symbols and thousands separators
- Identify availability status from various text formats ("In Stock", "Available", etc.)
- Extract product ratings as numbers (e.g., "4.5 stars" becomes 4.5)
- Normalize product variants (sizes, colors) into structured arrays
- Return valid JSON matching the exact schema provided"""
Article/Content Scraping
system_prompt = """You are a content extraction specialist. Your role is to:
- Extract article metadata (title, author, date, tags)
- Identify the main content body, excluding navigation, ads, and sidebars
- Parse publication dates into ISO 8601 format (YYYY-MM-DD)
- Extract author information even when presented in various formats
- Return structured JSON output"""
Crafting User Prompts: Best Practices
1. Be Explicit About Data Types
# Weak prompt
prompt = "Extract product info"
# Strong prompt
prompt = """
Extract the following fields for each product:
- name (string): The full product title
- price (number): Numeric price value, remove currency symbols
- original_price (number or null): Original price if item is on sale, null otherwise
- discount_percentage (integer or null): Discount as whole number (e.g., 20 for 20% off)
- rating (number): Rating from 0-5, parse from star displays or text
- review_count (integer): Number of reviews as integer
- in_stock (boolean): true if available for purchase, false otherwise
- image_url (string): URL of the main product image
"""
2. Provide Output Schema Examples
prompt = """
Extract restaurant listings from the HTML.
Example output format:
{
"restaurants": [
{
"name": "Luigi's Pizzeria",
"cuisine": "Italian",
"price_range": "$$",
"rating": 4.5,
"review_count": 230,
"address": "123 Main St, New York, NY 10001",
"phone": "+1-555-0123",
"is_open_now": true
}
]
}
Extract all restaurants from the provided HTML following this exact structure.
If any field is not available, use null.
HTML Content:
{html_content}
"""
3. Use Few-Shot Learning for Complex Extractions
Few-shot learning provides example input-output pairs to guide the AI:
prompt = """
Extract job listings from HTML. Here are examples of the expected extraction:
Example 1:
HTML: "<div class='job'><h2>Senior Python Developer</h2><span>TechCorp - Remote - $120k-$160k</span></div>"
Output: {
"title": "Senior Python Developer",
"company": "TechCorp",
"location": "Remote",
"salary_min": 120000,
"salary_max": 160000,
"salary_currency": "USD"
}
Example 2:
HTML: "<div class='job'><h3>Marketing Manager</h3><p>Acme Inc | New York, NY</p></div>"
Output: {
"title": "Marketing Manager",
"company": "Acme Inc",
"location": "New York, NY",
"salary_min": null,
"salary_max": null,
"salary_currency": null
}
Now extract all jobs from this HTML:
{html_content}
"""
4. Handle Edge Cases Explicitly
prompt = """
Extract article data with the following rules:
Fields to extract:
- title (string, required)
- author (string or null)
- publish_date (string in YYYY-MM-DD format or null)
- content (string, main article text only)
- tags (array of strings, empty array if none)
Important rules:
1. If publish_date appears only in a relative format ("2 days ago"), set it to null
2. For multiple authors, join with commas: "John Doe, Jane Smith"
3. Exclude advertisements, navigation menus, and footer text from content
4. If author is listed as "Staff" or "Editorial Team", use that exact text
5. Tags should be lowercase and without # symbols
HTML Content:
{html_content}
"""
Advanced Prompt Techniques
Chain-of-Thought Prompting
For complex extractions, guide the AI through reasoning steps:
prompt = """
Extract product specifications from this technical product page.
Follow these steps:
1. First, identify the main product specifications table or section
2. Parse each specification row, extracting both the label and value
3. Normalize specification names (e.g., "RAM" and "Memory" both become "ram")
4. Convert values to appropriate types (numbers for measurements, booleans for yes/no)
5. Return as a structured object
Example reasoning:
- If you see "Weight: 2.5 lbs", extract: {"weight_lbs": 2.5}
- If you see "Warranty: Yes (2 years)", extract: {"has_warranty": true, "warranty_years": 2}
- If you see "Color: Available in Red, Blue", extract: {"colors": ["Red", "Blue"]}
Now extract specifications from:
{html_content}
"""
Multi-Step Extraction
Break complex tasks into stages:
import json
def multi_step_extraction(html_content):
    # Step 1: Identify where the relevant data lives on the page
    step1_prompt = f"""
    Analyze this HTML and identify:
    1. The CSS selector or description of where product listings are located
    2. The CSS selector for pagination elements
    3. The total number of products visible on the page
    Return as JSON.
    HTML: {html_content}
    """
    step1 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": step1_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    page_structure = json.loads(step1.choices[0].message.content)
    # Step 2: Extract data from the section identified above
    step2_prompt = f"""
    The product listings are located at: {json.dumps(page_structure)}
    Extract all product data with fields: name, price, rating, availability.
    Return as JSON.
    HTML: {html_content}
    """
    step2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": step2_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(step2.choices[0].message.content)
Validation and Correction Prompts
Add a validation layer to ensure data quality:
validation_prompt = """
Review this extracted data and validate:
Extracted data:
{extracted_json}
Validation checks:
1. All prices should be positive numbers
2. Ratings should be between 0 and 5
3. Email addresses should be valid format
4. Phone numbers should include country code
5. URLs should be complete and valid
If you find issues, return corrected JSON with a "validation_notes" field explaining changes.
If data is valid, return it unchanged with "validation_notes": "All valid".
"""
Optimizing Prompts for Different LLM Models
GPT-4 Prompts
GPT-4 excels at complex instructions and structured output:
# GPT-4 prompt utilizing function calling
functions = [
{
"name": "extract_products",
"description": "Extract product information from HTML",
"parameters": {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Product name"},
"price": {"type": "number", "description": "Price as number"},
"currency": {"type": "string", "description": "Currency code (USD, EUR, etc.)"},
"rating": {"type": "number", "minimum": 0, "maximum": 5}
},
"required": ["name", "price"]
}
}
}
}
}
]
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": f"Extract products from: {html}"}],
functions=functions,
function_call={"name": "extract_products"}
)
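With function calling, the structured output arrives in the function call arguments rather than in the message text. A minimal way to read it from the response above:
import json

# The arguments come back as a JSON string on the function_call object
arguments = response.choices[0].message.function_call.arguments
products = json.loads(arguments)["products"]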
Claude Prompts
Claude (Anthropic) works well with detailed, structured instructions:
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
prompt = """
I need you to extract event information from HTML content.
<instructions>
Extract all events with these fields:
- event_name: The full event title
- start_date: ISO 8601 format (YYYY-MM-DD)
- start_time: 24-hour format (HH:MM) or null
- venue_name: Name of the venue
- venue_address: Full address
- ticket_price: Lowest available price as number, null if free
- is_sold_out: Boolean
Return as JSON array.
</instructions>
<html>
{html_content}
</html>
Provide only the JSON output, no additional commentary.
"""
response = client.messages.create(
model="claude-3-opus-20240229",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}]
)
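Claude returns the result as text content blocks; since the prompt asks for JSON output only, it can be parsed directly (this assumes the model followed the "JSON only" instruction):
import json

# The first content block holds the text; parse it as JSON
events = json.loads(response.content[0].text)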
Open Source LLMs (Llama, Mistral)
Smaller models benefit from simpler, more direct prompts:
prompt = """
### Task: Extract product data
### Input HTML:
{html_content}
### Required Output Format:
{{
"products": [
{{"name": "string", "price": number}}
]
}}
### Rules:
- Extract only name and price
- Price must be a number
- Return valid JSON only
### Output:
"""
Reducing Token Usage and Costs
Pre-process HTML Before Sending
from bs4 import BeautifulSoup, Comment
import re
def clean_html_for_llm(html_content, target_selector=None):
"""
Clean and minimize HTML before sending to LLM
"""
soup = BeautifulSoup(html_content, 'html.parser')
# Remove unnecessary elements
for element in soup(['script', 'style', 'svg', 'path', 'noscript']):
element.decompose()
# Remove comments
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
comment.extract()
# Extract only target section if specified
if target_selector:
target = soup.select_one(target_selector)
if target:
soup = target
# Remove excessive attributes
for tag in soup.find_all():
# Keep only essential attributes
attrs_to_keep = ['class', 'id', 'href', 'src', 'alt', 'title']
tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}
# Minimize whitespace
html_str = str(soup)
html_str = re.sub(r'\s+', ' ', html_str)
html_str = re.sub(r'>\s+<', '><', html_str)
return html_str
# Usage
cleaned_html = clean_html_for_llm(html_content, '.product-list')
# Now use cleaned_html in your prompt
Convert HTML to Simplified Markdown
import html2text
def html_to_markdown_for_llm(html_content):
"""
Convert HTML to markdown to reduce tokens
"""
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
h.ignore_emphasis = False
markdown = h.handle(html_content)
return markdown
# Use in prompt
markdown_content = html_to_markdown_for_llm(html_content)
prompt = f"""
Extract product data from this markdown-formatted page content:
{markdown_content}
"""
Combining Prompts with Browser Automation
When scraping dynamic websites, combine LLM prompts with browser automation. A headless browser can render JavaScript and wait for AJAX-driven content to finish loading before you hand the HTML to the LLM:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function scrapeWithPrompts(url, extractionPrompt) {
// Launch browser
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Navigate and wait for dynamic content
await page.goto(url, { waitUntil: 'networkidle0' });
// Get rendered HTML
const html = await page.content();
await browser.close();
// Use LLM to extract data
const response = await openai.chat.completions.create({
model: 'gpt-4o', // JSON mode requires a model that supports response_format
messages: [
{
role: 'system',
content: 'Extract structured data from HTML. Return valid JSON only.'
},
{
role: 'user',
content: `${extractionPrompt}\n\nHTML:\n${html.substring(0, 10000)}`
}
],
temperature: 0,
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
// Example usage
const prompt = `
Extract all article headlines with:
- title (string)
- url (string)
- published_date (YYYY-MM-DD or null)
Return as {"articles": [...]}
`;
scrapeWithPrompts('https://news.example.com', prompt)
.then(data => console.log(JSON.stringify(data, null, 2)));
Testing and Iterating on Prompts
Create a Prompt Testing Framework
import json

def test_prompt(html_samples, prompt_template, expected_fields):
"""
Test a prompt against multiple HTML samples
"""
results = {
'successful': 0,
'failed': 0,
'errors': []
}
for idx, html in enumerate(html_samples):
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract data as JSON."},
{"role": "user", "content": prompt_template.format(html=html)}
],
temperature=0,
response_format={"type": "json_object"}
)
data = json.loads(response.choices[0].message.content)
# Validate expected fields
if all(field in data for field in expected_fields):
results['successful'] += 1
else:
results['failed'] += 1
results['errors'].append(f"Sample {idx}: Missing fields")
except Exception as e:
results['failed'] += 1
results['errors'].append(f"Sample {idx}: {str(e)}")
return results
# Test your prompt
html_samples = [sample1_html, sample2_html, sample3_html]
expected_fields = ['products', 'total_count']
test_results = test_prompt(html_samples, my_prompt_template, expected_fields)
print(f"Success rate: {test_results['successful']}/{len(html_samples)}")
A/B Test Different Prompts
import time

def compare_prompts(html_content, prompts_dict):
"""
Compare multiple prompt variations
"""
results = {}
for prompt_name, prompt in prompts_dict.items():
start_time = time.time()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt.format(html=html_content)}],
temperature=0
)
execution_time = time.time() - start_time
tokens_used = response.usage.total_tokens
results[prompt_name] = {
'output': response.choices[0].message.content,
'tokens': tokens_used,
'time': execution_time,
'cost': (tokens_used / 1000) * 0.03  # rough estimate; actual pricing differs for input vs. output tokens
}
return results
# Compare different approaches
prompts = {
'detailed': "Extract products with detailed schema...",
'simple': "Extract: name, price for each product",
'few_shot': "Examples: ... Now extract products"
}
comparison = compare_prompts(html_content, prompts)
Common Prompt Patterns for Web Scraping
Pattern 1: List Extraction
prompt = """
Extract all items from this list.
For each item, extract:
- text (string): The visible text
- link (string or null): URL if item is a link
- position (integer): Position in the list (1-indexed)
Return as: {"items": [...]}
HTML:
{html}
"""
Pattern 2: Table Extraction
prompt = """
Extract data from the table in this HTML.
Rules:
1. First row contains headers
2. Each subsequent row is a data record
3. Convert headers to snake_case keys
4. Parse numeric columns as numbers
5. Parse date columns to YYYY-MM-DD format
Return as: {"headers": [...], "rows": [...]}
HTML:
{html}
"""
Pattern 3: Nested Data Extraction
prompt = """
Extract category hierarchy with products.
Structure:
{{
"categories": [
{{
"name": "Category Name",
"subcategories": ["Sub1", "Sub2"],
"products": [
{{"name": "...", "price": 0}}
]
}}
]
}}
HTML:
{html}
"""
Error Handling in Prompt-Based Scraping
import json
from jsonschema import validate, ValidationError
def extract_with_validation(html, prompt, schema):
"""
Extract data and validate against JSON schema
"""
max_retries = 3
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Return valid JSON only."},
{"role": "user", "content": f"{prompt}\n\nHTML:\n{html}"}
],
temperature=0,
response_format={"type": "json_object"}
)
data = json.loads(response.choices[0].message.content)
# Validate against schema
validate(instance=data, schema=schema)
return data
except json.JSONDecodeError:
if attempt == max_retries - 1:
raise
# Retry with more explicit JSON instruction
prompt += "\n\nIMPORTANT: Return ONLY valid JSON, no other text."
except ValidationError as e:
if attempt == max_retries - 1:
raise
# Add schema to prompt for next attempt
prompt += f"\n\nSchema requirements: {json.dumps(schema)}"
raise Exception("Failed to extract valid data after retries")
# Define schema
product_schema = {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
},
"required": ["name", "price"]
}
}
},
"required": ["products"]
}
# Use with validation
result = extract_with_validation(html, prompt, product_schema)
Best Practices Summary
- Start simple: Begin with basic prompts and add complexity as needed
- Be specific: Define exact field types, formats, and requirements
- Provide examples: Use few-shot learning for complex extractions
- Validate output: Always validate extracted JSON against expected schema
- Handle missing data: Instruct the AI to use null for missing values
- Optimize for tokens: Clean HTML and extract relevant sections only
- Test thoroughly: Use multiple HTML samples to ensure consistency
- Monitor costs: Track token usage and API costs
- Iterate: Continuously refine prompts based on results
- Combine approaches: Use traditional selectors for navigation, LLMs for extraction (see the sketch below)
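As a rough illustration of that last point, the sketch below uses plain CSS selectors to walk pagination and isolate the listing area, and calls the LLM (via extract_with_validation from the error-handling section) only for the extraction step. The URL handling, selectors, and helper name are assumptions:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def scrape_catalog(start_url, extraction_prompt, schema, max_pages=3):
    results, url = [], start_url
    for _ in range(max_pages):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Traditional selector work: isolate the listing area and find the next page
        listing = soup.select_one(".product-list")   # placeholder selector
        next_link = soup.select_one("a.next-page")   # placeholder selector
        # LLM work: extract structured data from the isolated fragment only
        data = extract_with_validation(str(listing), extraction_prompt, schema)
        results.extend(data.get("products", []))
        if not next_link:
            break
        url = urljoin(url, next_link["href"])
    return results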
Conclusion
Mastering LLM prompts for web scraping requires understanding both prompt engineering principles and the specific challenges of data extraction. By writing clear, structured prompts with explicit schemas and validation rules, you can achieve high-quality data extraction with minimal code maintenance.
The key to success is iteration: start with simple prompts, test against real-world HTML samples, and gradually refine your instructions based on the results. Whether you're using GPT-4, Claude, or open-source models, the principles of effective prompt design remain consistent.
For dynamic websites that require interaction before extraction, consider combining LLM-based extraction with browser automation tools so the page is fully rendered and you're capturing all the data you need.
Remember that while LLM-based scraping offers flexibility and reduces maintenance, it comes with costs and latency. Use it strategically where its strengths—handling inconsistent layouts and semantic understanding—provide the most value.