How do I create a prompt template for web scraping with LLMs?
Creating effective prompt templates is crucial for successful LLM-powered web scraping. A well-designed prompt template helps the language model understand exactly what data to extract from HTML or text content, ensuring consistent and accurate results across different pages.
Understanding Prompt Templates for Web Scraping
A prompt template for web scraping typically consists of three key components:
- Instructions: Clear directions on what task the LLM should perform
- Context: The HTML or text content to extract data from
- Output specification: The desired format and structure of extracted data
The goal is to create reusable templates that can be populated with different HTML content while maintaining consistent extraction quality.
Basic Prompt Template Structure
Here's a fundamental template structure for web scraping with LLMs:
prompt_template = """
You are a web scraping assistant. Extract the following information from the HTML below:
{extraction_instructions}
HTML Content:
{html_content}
Return the extracted data in the following JSON format:
{output_schema}
Only return valid JSON without any additional text or explanations.
"""
Python Implementation with OpenAI
Here's a complete example using Python with the OpenAI API (the snippets below use the ChatCompletion interface from the pre-1.0 openai SDK):
import openai
from typing import Dict, Any
import json

class LLMScraperTemplate:
    def __init__(self, api_key: str):
        openai.api_key = api_key

    def create_product_extraction_template(self) -> str:
        """Template for extracting product information"""
        return """
You are a precise data extraction assistant. Extract product information from the HTML below.
Extract these fields:
- product_name: The full product title
- price: The current price (numeric value only)
- currency: The currency symbol or code
- availability: Whether the product is in stock (true/false)
- rating: Average customer rating (numeric value)
- description: Brief product description (max 200 characters)
HTML Content:
{html_content}
Return ONLY a valid JSON object with these exact field names. If a field is not found, use null.
Example output:
{{
  "product_name": "Example Product",
  "price": 29.99,
  "currency": "USD",
  "availability": true,
  "rating": 4.5,
  "description": "This is a great product..."
}}
"""

    def extract_with_template(self, template: str, html_content: str,
                              model: str = "gpt-4") -> Dict[str, Any]:
        """Execute the template with given HTML content"""
        prompt = template.format(html_content=html_content)
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a data extraction expert."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,  # Lower temperature for more consistent outputs
            max_tokens=1000
        )
        extracted_text = response.choices[0].message.content
        return json.loads(extracted_text)
# Usage example
scraper = LLMScraperTemplate("your-api-key")
template = scraper.create_product_extraction_template()
html = """
<div class="product">
<h1>Premium Wireless Headphones</h1>
<span class="price">$149.99</span>
<p class="stock">In Stock</p>
<div class="rating">4.7 out of 5 stars</div>
<p class="desc">High-quality wireless headphones with noise cancellation</p>
</div>
"""
result = scraper.extract_with_template(template, html)
print(json.dumps(result, indent=2))
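One practical caveat: json.loads will raise an error if the model wraps its reply in a Markdown code fence, which can happen even when the prompt forbids extra text. A small defensive sketch (parse_llm_json is an illustrative helper, not part of any SDK) that strips fences before parsing:

import json
import re

def parse_llm_json(raw: str) -> dict:
    """Strip any Markdown code fence the model may have added, then parse."""
    cleaned = raw.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading fence
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing fence
    return json.loads(cleaned)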
JavaScript Implementation with OpenAI
Here's the equivalent implementation in JavaScript:
const OpenAI = require('openai');

class LLMScraperTemplate {
  constructor(apiKey) {
    this.client = new OpenAI({ apiKey });
  }

  createProductExtractionTemplate() {
    return `
You are a precise data extraction assistant. Extract product information from the HTML below.
Extract these fields:
- product_name: The full product title
- price: The current price (numeric value only)
- currency: The currency symbol or code
- availability: Whether the product is in stock (true/false)
- rating: Average customer rating (numeric value)
- description: Brief product description (max 200 characters)
HTML Content:
{html_content}
Return ONLY a valid JSON object with these exact field names. If a field is not found, use null.
`;
  }

  async extractWithTemplate(template, htmlContent, model = 'gpt-4') {
    const prompt = template.replace('{html_content}', htmlContent);
    const response = await this.client.chat.completions.create({
      model: model,
      messages: [
        { role: 'system', content: 'You are a data extraction expert.' },
        { role: 'user', content: prompt }
      ],
      temperature: 0,
      max_tokens: 1000
    });
    const extractedText = response.choices[0].message.content;
    return JSON.parse(extractedText);
  }
}
// Usage example
async function main() {
  const scraper = new LLMScraperTemplate('your-api-key');
  const template = scraper.createProductExtractionTemplate();
  const html = `
<div class="product">
  <h1>Premium Wireless Headphones</h1>
  <span class="price">$149.99</span>
  <p class="stock">In Stock</p>
  <div class="rating">4.7 out of 5 stars</div>
</div>
`;
  const result = await scraper.extractWithTemplate(template, html);
  console.log(JSON.stringify(result, null, 2));
}

main();
Advanced Template Patterns
Multi-Item Extraction Template
For extracting multiple items (like search results or product listings):
multi_item_template = """
Extract all product listings from the HTML below. Each product should include:
- title: Product name
- price: Numeric price value
- url: Product link (href attribute)
HTML Content:
{html_content}
Return a JSON array of products. Example:
[
{{"title": "Product 1", "price": 29.99, "url": "/product-1"}},
{{"title": "Product 2", "price": 39.99, "url": "/product-2"}}
]
Return ONLY the JSON array without any additional text.
"""
Context-Aware Template
Include an example of the expected output to guide the LLM's extraction:
context_aware_template = """
Extract article metadata from the HTML content.
Fields to extract:
- title: Article headline
- author: Author name
- publish_date: Publication date in ISO 8601 format
- tags: Array of topic tags
- excerpt: First paragraph or summary
HTML Content:
{html_content}
Example of valid output:
{{
"title": "Understanding Machine Learning",
"author": "Jane Smith",
"publish_date": "2024-01-15T10:30:00Z",
"tags": ["AI", "Technology", "Education"],
"excerpt": "Machine learning is transforming industries..."
}}
Return ONLY valid JSON matching this structure.
"""
Template Best Practices
1. Be Specific About Output Format
Always specify the exact JSON structure you expect:
# Good - Specific structure
"""Return JSON: {"name": str, "price": float, "in_stock": bool}"""
# Bad - Vague instruction
"""Return the data as JSON"""
2. Use Zero Temperature for Consistency
Set temperature=0 to get more deterministic outputs:
response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # Maximum consistency
    messages=[...]
)
3. Handle Missing Data Gracefully
Instruct the LLM on how to handle missing fields:
template = """
Extract data from HTML. If a field is not found:
- Use null for missing optional fields
- Use empty string "" for missing text fields
- Use 0 for missing numeric fields
- Use empty array [] for missing list fields
HTML: {html_content}
"""
4. Validate LLM Output
Always validate and sanitize LLM responses:
def safe_extract(template: str, html: str) -> Dict[str, Any]:
    try:
        result = extract_with_template(template, html)
        # Validate required fields
        required_fields = ['title', 'price']
        for field in required_fields:
            if field not in result:
                raise ValueError(f"Missing required field: {field}")
        # Type checking
        if not isinstance(result['price'], (int, float)):
            result['price'] = float(result['price'])
        return result
    except json.JSONDecodeError:
        print("Invalid JSON response from LLM")
        return None
    except Exception as e:
        print(f"Extraction error: {e}")
        return None
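If you prefer declarative validation over hand-written checks, a schema library can enforce field names and types for you. A sketch using pydantic (assuming pydantic v2 is installed; the fields mirror the product template from earlier):

from typing import Optional
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    product_name: str
    price: float
    currency: Optional[str] = None
    availability: Optional[bool] = None
    rating: Optional[float] = None
    description: Optional[str] = None

def validate_product(data: dict) -> Optional[Product]:
    try:
        return Product(**data)  # raises ValidationError on missing fields or wrong types
    except ValidationError as e:
        print(f"LLM output failed validation: {e}")
        return None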
Combining LLM Templates with Traditional Scraping
LLMs work best when combined with traditional web scraping tools to pre-process HTML:
from bs4 import BeautifulSoup
import requests
def hybrid_scraping_approach(url: str, template: str) -> Dict[str, Any]:
    # Step 1: Fetch HTML
    response = requests.get(url)
    # Step 2: Pre-process with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract relevant section only (reduces token usage)
    main_content = soup.find('main', class_='product-details')
    if main_content:
        # Step 3: Feed cleaned HTML to LLM
        cleaned_html = str(main_content)
        return extract_with_template(template, cleaned_html)
    return None
This approach is similar to how you might handle AJAX requests using Puppeteer to fetch dynamic content before processing it with an LLM.
Cost Optimization Strategies
LLM API calls can be expensive. Optimize your templates:
1. Minimize HTML Input
Strip unnecessary tags and attributes:
from bs4 import BeautifulSoup, Comment

def minimize_html(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script, style, and navigation tags
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)
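To measure how much a cleanup step actually saves, count tokens before and after. A sketch using the tiktoken library (an assumption: tiktoken is installed; cl100k_base is the encoding commonly used for GPT-4-era models):

import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate token count for estimating prompt size and cost."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

raw_tokens = count_tokens(html)                   # before cleanup
clean_tokens = count_tokens(minimize_html(html))  # after cleanup
print(f"Saved roughly {raw_tokens - clean_tokens} tokens")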
2. Use Smaller Models When Possible
Start with GPT-3.5 and upgrade to GPT-4 only if needed:
def adaptive_extraction(html: str, template: str) -> Dict[str, Any]:
    # Try with the cheaper model first
    try:
        result = extract_with_template(template, html, model="gpt-3.5-turbo")
        if validate_result(result):
            return result
    except Exception:
        pass
    # Fall back to the more capable model
    return extract_with_template(template, html, model="gpt-4")
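The validate_result helper referenced above is not defined here; what counts as a good-enough result is up to you. A minimal illustrative version that only checks the required product fields:

def validate_result(result: Dict[str, Any]) -> bool:
    """Illustrative check: require a non-empty name and a plausible numeric price."""
    if not result or not result.get("product_name"):
        return False
    price = result.get("price")
    return isinstance(price, (int, float)) and price >= 0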
3. Cache Results
Avoid re-processing identical pages:
import hashlib

# Simple in-memory cache keyed by a hash of the template and HTML
_extraction_cache: Dict[str, Dict[str, Any]] = {}

def extract_with_cache(html: str, template: str) -> Dict[str, Any]:
    cache_key = hashlib.md5((template + html).encode()).hexdigest()
    if cache_key not in _extraction_cache:
        # The LLM is only called once per unique HTML + template combination
        _extraction_cache[cache_key] = extract_with_template(template, html)
    return _extraction_cache[cache_key]
Using Function Calling for Structured Output
Modern LLM APIs support function calling, which constrains the model's output to a declared schema instead of relying on it to emit well-formed JSON in free text:
functions = [
    {
        "name": "save_product_data",
        "description": "Save extracted product information",
        "parameters": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "currency": {"type": "string"},
                "availability": {"type": "boolean"},
                "rating": {"type": "number"}
            },
            "required": ["product_name", "price"]
        }
    }
]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Extract product data from: {html}"}],
    functions=functions,
    function_call={"name": "save_product_data"}
)

# Extract structured data from the function call
function_args = json.loads(
    response.choices[0].message.function_call.arguments
)
To learn more about this technique, check out our guide on function calling in LLMs.
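Note that the functions and function_call parameters shown above belong to the pre-1.0 Chat Completions interface. In the current OpenAI Python SDK (1.x) the same idea is expressed with tools and tool_choice; here is a sketch of the equivalent call, reusing the schema defined above:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Extract product data from: {html}"}],
    tools=[{"type": "function", "function": functions[0]}],
    tool_choice={"type": "function", "function": {"name": "save_product_data"}},
)

# The arguments still arrive as a JSON string and need to be parsed
function_args = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)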
Testing Your Templates
Always test templates with various HTML structures:
def test_template_robustness():
    test_cases = [
        # Complete data
        '<div><h1>Product A</h1><span class="price">$50</span></div>',
        # Missing price
        '<div><h1>Product B</h1></div>',
        # Different structure
        '<article><h2>Product C</h2><p>Price: 30 USD</p></article>',
        # Malformed HTML
        '<div><h1>Product D<span>$40</div>',
    ]
    template = create_product_extraction_template()
    for i, html in enumerate(test_cases):
        result = extract_with_template(template, html)
        print(f"Test {i+1}: {result}")
        assert result is not None, f"Test {i+1} failed"
Conclusion
Creating effective prompt templates for LLM-powered web scraping requires careful attention to instruction clarity, output specification, and error handling. By following the patterns and best practices outlined in this guide, you can build robust, reusable templates that extract structured data reliably from web pages.
Remember to combine LLM capabilities with traditional parsing when appropriate, optimize for cost by minimizing input size, and always validate outputs. When dealing with dynamic content, you might also want to explore how to monitor network requests in Puppeteer to ensure you're capturing all the data your LLM template needs to process.
The key to success is iteration—start with simple templates, test them against real-world data, and refine based on the results you observe.