What Prompt Engineering Techniques Work Best for Web Scraping with LLMs?
Prompt engineering is crucial for successful LLM-based web scraping. Well-crafted prompts can dramatically improve extraction accuracy, reduce hallucinations, and ensure consistent structured output. This guide covers the most effective techniques for engineering prompts that extract reliable data from web pages.
Core Prompt Engineering Principles for Web Scraping
1. Be Explicit and Specific
The most fundamental rule of prompt engineering for web scraping is to be extremely specific about what you want to extract and how you want it formatted.
Poor prompt:
Extract product information from this page.
Better prompt: ``` Extract the following product information from this HTML: - Product name (string) - Price (number, without currency symbol) - Availability (boolean: true if in stock, false otherwise) - Rating (number between 0 and 5)
Return the data as a JSON object. If a field is not found, return null. ```
2. Use Few-Shot Learning Examples
Few-shot learning is one of the most powerful techniques for web scraping. Provide 2-3 examples of input HTML and the expected output format.
import openai
prompt = """
Extract product data from HTML snippets. Here are examples:
Example 1:
Input HTML: <div class="product"><h2>Laptop Pro</h2><span class="price">$1299</span><p class="stock">In Stock</p></div>
Output: {"name": "Laptop Pro", "price": 1299, "in_stock": true}
Example 2:
Input HTML: <div class="product"><h2>Wireless Mouse</h2><span class="price">$29.99</span><p class="stock">Out of Stock</p></div>
Output: {"name": "Wireless Mouse", "price": 29.99, "in_stock": false}
Now extract from this HTML:
{html_content}
"""
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
3. Define the Output Schema Explicitly
Always specify the exact structure you expect. For JSON output, define the schema with field names, types, and constraints.
const extractionPrompt = `
Extract customer review data and return it as JSON with this exact schema:
{
"reviewer_name": string,
"rating": number (1-5),
"review_date": string (ISO 8601 format),
"verified_purchase": boolean,
"review_text": string,
"helpful_votes": number
}
Rules:
- If a field is missing, use null
- Convert all dates to ISO 8601 format (YYYY-MM-DD)
- Extract only numeric rating values
- Do not include any text outside the JSON object
HTML content:
${htmlContent}
`;
4. Use Chain-of-Thought Prompting
For complex extractions, ask the LLM to explain its reasoning before providing the final output. This reduces errors and hallucinations.
prompt = f"""
You are extracting structured data from an e-commerce product page.
Step 1: Identify where the product name is located in the HTML
Step 2: Identify where the price is located and extract the numeric value
Step 3: Determine if the product is in stock based on availability indicators
Step 4: Find the product rating
After completing these steps, provide the final JSON output.
HTML:
{html_content}
"""
5. Implement Validation Instructions
Include validation rules directly in your prompt to ensure data quality.
validation_prompt = """
Extract product information and validate the following:
Required fields (reject extraction if missing):
- product_name: must be non-empty string
- price: must be positive number
Optional fields:
- discount_percentage: number between 0 and 100
- shipping_cost: number >= 0
Validation rules:
1. If original_price exists and is less than current price, flag as ERROR
2. If discount_percentage > 0 but prices are equal, flag as ERROR
3. If product_name contains only numbers or special characters, flag as SUSPICIOUS
Return format:
{
"data": { extracted fields },
"validation_status": "VALID|ERROR|SUSPICIOUS",
"validation_errors": [list of error messages if any]
}
HTML content:
{html}
"""
Advanced Prompt Engineering Techniques
6. Use Role-Based Prompting
Assign the LLM a specific role to improve extraction quality.
const roleBasedPrompt = `
You are a professional data extraction specialist with expertise in e-commerce data.
Your task is to extract product specifications from technical product pages.
Guidelines:
- Extract only factual information present in the HTML
- Do not infer or guess missing values
- Maintain original units of measurement
- Preserve technical terminology exactly as written
Extract the following from this product specification page:
${htmlContent}
`;
7. Leverage Structured Output Formats
Use modern LLM features like function calling or structured outputs for more reliable extraction.
from openai import OpenAI
client = OpenAI()
# Define the extraction schema using function calling
functions = [
{
"name": "extract_product_data",
"description": "Extract structured product information from HTML",
"parameters": {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
"in_stock": {"type": "boolean"},
"rating": {"type": "number", "minimum": 0, "maximum": 5},
"review_count": {"type": "integer"}
},
"required": ["product_name", "price"]
}
}
]
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "user", "content": f"Extract product data from: {html_content}"}
],
functions=functions,
function_call={"name": "extract_product_data"}
)
8. Handle Dynamic Content with Context
When scraping pages with dynamic content loaded via AJAX, provide context about how to handle AJAX requests before sending HTML to the LLM.
dynamic_content_prompt = """
This HTML is from a dynamically loaded page. Some content may be:
- Loaded via JavaScript after initial page load
- Present in JSON-LD structured data
- Embedded in data attributes (data-*)
- Contained in script tags as JavaScript objects
Priority for extraction:
1. Check data-* attributes first
2. Look for JSON-LD structured data in <script type="application/ld+json">
3. Search standard HTML elements
4. Parse JavaScript variable assignments if necessary
Extract product information following this priority order:
{html_content}
"""
9. Implement Iterative Refinement
For complex pages, use a multi-step extraction approach.
// Step 1: Identify relevant sections
const sectionPrompt = `
Analyze this HTML and identify which sections contain:
1. Product information
2. Pricing details
3. Customer reviews
4. Shipping information
Return JSON with section identifiers:
{ "product_section": "CSS selector or description",
"pricing_section": "CSS selector or description", ... }
`;
// Step 2: Extract from specific sections
const extractionPrompt = `
Focus only on this section of HTML: ${relevantSection}
Extract the following fields: ${fieldList}
`;
10. Use Negative Examples
Tell the LLM what NOT to extract to avoid common mistakes.
negative_example_prompt = """
Extract product reviews from this page.
DO extract:
- Verified customer reviews
- Star ratings
- Review dates
DO NOT extract:
- Product descriptions written by sellers
- Q&A sections
- Recommended product listings
- Advertisement content
- Navigation menu items
Example of what NOT to extract:
<div class="product-description">This amazing product...</div> ❌ Not a review
Example of what TO extract:
<div class="customer-review">Great product! Highly recommend.</div> ✓ Extract this
HTML content:
{html}
"""
Handling LLM-Specific Challenges
Preventing Hallucinations
anti_hallucination_prompt = """
CRITICAL INSTRUCTIONS:
- Extract ONLY information explicitly present in the HTML
- If a field is not found, return null - DO NOT guess or infer
- If you're uncertain about a value, mark it as null
- Do not use information from your training data
- Do not make assumptions about missing data
Verification checklist:
□ All extracted values exist in the provided HTML
□ No inferred or assumed values are included
□ Null is used for missing fields
HTML to extract from:
{html_content}
"""
Optimizing for Token Usage
When working with large HTML pages, optimize your prompts to minimize costs.
def create_optimized_prompt(html_content, fields_to_extract):
"""Create a token-efficient extraction prompt"""
# Strip unnecessary whitespace and comments
cleaned_html = clean_html(html_content)
prompt = f"""
Extract these fields: {', '.join(fields_to_extract)}
HTML (whitespace normalized):
{cleaned_html}
Return only valid JSON with extracted fields. No explanation needed.
"""
return prompt
Handling Multiple Items
For extracting lists of items (e.g., search results, product listings):
const listExtractionPrompt = `
Extract ALL product items from this search results page.
Return an array of objects, each with:
{
"position": number (1-based position on page),
"title": string,
"price": number,
"url": string,
"image_url": string
}
Important:
- Extract every product item, even if some fields are missing
- Maintain the order they appear on the page
- Return empty array if no products found
- Do not skip items due to missing fields
HTML:
${htmlContent}
`;
Best Practices for Production Use
1. Implement Prompt Versioning
PROMPT_VERSION = "v2.1"
prompt_template = f"""
[Prompt Version: {PROMPT_VERSION}]
Extract product data according to schema v2.1:
{{schema}}
HTML:
{{html}}
"""
# Track which prompt version was used for each extraction
extraction_metadata = {
"prompt_version": PROMPT_VERSION,
"model": "gpt-4",
"timestamp": datetime.now()
}
2. A/B Test Your Prompts
import random
prompt_variants = {
"variant_a": "Extract product information from this HTML: {html}",
"variant_b": "You are a data extraction expert. Extract product info from: {html}",
"variant_c": "Analyze this HTML and extract structured product data: {html}"
}
def get_prompt(html_content):
variant = random.choice(list(prompt_variants.keys()))
return prompt_variants[variant].format(html=html_content), variant
# Track performance by variant
3. Combine with Traditional Parsing
For best results, use LLMs in combination with traditional parsing methods:
def hybrid_extraction(html_content):
"""Combine regex/CSS selectors with LLM extraction"""
# First, try traditional extraction
traditional_data = extract_with_css_selectors(html_content)
# Use LLM to fill in missing fields or validate
llm_prompt = f"""
I've extracted this data using CSS selectors:
{traditional_data}
Please:
1. Fill in any null fields if the data exists in the HTML
2. Validate the extracted values are correct
3. Return the corrected/completed data
HTML:
{html_content}
"""
return llm_extract(llm_prompt)
Measuring Prompt Effectiveness
Track these metrics to evaluate your prompts:
def evaluate_prompt_performance(extractions):
metrics = {
"completeness": sum(1 for e in extractions if all(e.values())) / len(extractions),
"null_rate": sum(1 for e in extractions for v in e.values() if v is None) / (len(extractions) * len(extractions[0])),
"format_errors": sum(1 for e in extractions if not validate_schema(e)),
"avg_confidence": sum(e.get("confidence", 0) for e in extractions) / len(extractions)
}
return metrics
Conclusion
Effective prompt engineering for LLM-based web scraping requires a combination of specific instructions, examples, validation rules, and structured output formats. Start with clear, explicit prompts and gradually add techniques like few-shot learning, chain-of-thought reasoning, and schema enforcement as needed.
The key to success is iterative refinement: test your prompts on representative samples, measure extraction quality, and continuously optimize based on real-world results. When handling dynamic websites or complex browser interactions, combining LLMs with traditional scraping tools often yields the best results.
Remember that while LLMs are powerful, they work best as part of a comprehensive scraping strategy that includes proper HTML fetching, error handling, and data validation.