How Can I Use LLM Prompts for Web Scraping?
LLM (Large Language Model) prompts are the instructions you give to AI models like GPT-4, Claude, or Gemini to extract structured data from web pages. Unlike traditional web scraping that relies on CSS selectors or XPath, prompt-based scraping uses natural language instructions to tell the AI exactly what data to extract and how to format it. Mastering prompt engineering is crucial for successful AI-powered web scraping.
Understanding LLM Prompts for Web Scraping
An LLM prompt for web scraping consists of several key components:
- Context: Setting the AI's role and task
- Instructions: Specific directions for data extraction
- Schema definition: The structure of the output you want
- HTML/text content: The webpage content to process
- Examples (optional): Sample outputs to guide the AI
The quality of your prompts directly impacts extraction accuracy, consistency, and cost-efficiency.
Basic Prompt Structure
Here's the fundamental structure of an effective web scraping prompt:
import openai
client = openai.OpenAI(api_key="your-api-key")
# Basic prompt structure
system_prompt = """You are a web scraping assistant specialized in extracting
structured data from HTML content. Always return valid JSON and follow the
schema provided exactly."""
user_prompt = f"""
Extract product information from the following HTML.
Required fields for each product:
- name (string): The product name
- price (number): The price as a number without currency symbols
- availability (boolean): Whether the product is in stock
Return the data as JSON with this structure:
{{
"products": [
{{"name": "...", "price": 0.00, "availability": true}}
]
}}
HTML Content:
{html_content}
"""
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0, # Deterministic output
response_format={"type": "json_object"}
)
extracted_data = response.choices[0].message.content
Writing Effective System Prompts
The system prompt sets the AI's behavior and role. Here are examples for different scraping scenarios:
General-Purpose Web Scraping
system_prompt = """You are an expert web scraping assistant. Your task is to:
1. Extract data exactly as specified in the user's instructions
2. Return well-formed JSON that matches the requested schema
3. Handle missing data by using null values, never invent or guess information
4. Preserve the original text formatting (capitalization, spacing) unless instructed otherwise
5. If you cannot find requested information, set the field to null"""
E-commerce Data Extraction
system_prompt = """You are a specialized e-commerce data extraction assistant.
When extracting product data:
- Parse prices as numbers, removing currency symbols and thousands separators
- Identify availability status from various text formats ("In Stock", "Available", etc.)
- Extract product ratings as numbers (e.g., "4.5 stars" becomes 4.5)
- Normalize product variants (sizes, colors) into structured arrays
- Return valid JSON matching the exact schema provided"""
Article/Content Scraping
system_prompt = """You are a content extraction specialist. Your role is to:
- Extract article metadata (title, author, date, tags)
- Identify the main content body, excluding navigation, ads, and sidebars
- Parse publication dates into ISO 8601 format (YYYY-MM-DD)
- Extract author information even when presented in various formats
- Return structured JSON output"""
Crafting User Prompts: Best Practices
1. Be Explicit About Data Types
# Weak prompt
prompt = "Extract product info"
# Strong prompt
prompt = """
Extract the following fields for each product:
- name (string): The full product title
- price (number): Numeric price value, remove currency symbols
- original_price (number or null): Original price if item is on sale, null otherwise
- discount_percentage (integer or null): Discount as whole number (e.g., 20 for 20% off)
- rating (number): Rating from 0-5, parse from star displays or text
- review_count (integer): Number of reviews as integer
- in_stock (boolean): true if available for purchase, false otherwise
- image_url (string): URL of the main product image
"""
2. Provide Output Schema Examples
prompt = """
Extract restaurant listings from the HTML.
Example output format:
{
"restaurants": [
{
"name": "Luigi's Pizzeria",
"cuisine": "Italian",
"price_range": "$$",
"rating": 4.5,
"review_count": 230,
"address": "123 Main St, New York, NY 10001",
"phone": "+1-555-0123",
"is_open_now": true
}
]
}
Extract all restaurants from the provided HTML following this exact structure.
If any field is not available, use null.
HTML Content:
{html_content}
"""
3. Use Few-Shot Learning for Complex Extractions
Few-shot learning provides example input-output pairs to guide the AI:
prompt = """
Extract job listings from HTML. Here are examples of the expected extraction:
Example 1:
HTML: "<div class='job'><h2>Senior Python Developer</h2><span>TechCorp - Remote - $120k-$160k</span></div>"
Output: {
"title": "Senior Python Developer",
"company": "TechCorp",
"location": "Remote",
"salary_min": 120000,
"salary_max": 160000,
"salary_currency": "USD"
}
Example 2:
HTML: "<div class='job'><h3>Marketing Manager</h3><p>Acme Inc | New York, NY</p></div>"
Output: {
"title": "Marketing Manager",
"company": "Acme Inc",
"location": "New York, NY",
"salary_min": null,
"salary_max": null,
"salary_currency": null
}
Now extract all jobs from this HTML:
{html_content}
"""
4. Handle Edge Cases Explicitly
prompt = """
Extract article data with the following rules:
Fields to extract:
- title (string, required)
- author (string or null)
- publish_date (string in YYYY-MM-DD format or null)
- content (string, main article text only)
- tags (array of strings, empty array if none)
Important rules:
1. If publish_date appears only in a relative format ("2 days ago"), set it to null
2. For multiple authors, join with commas: "John Doe, Jane Smith"
3. Exclude advertisements, navigation menus, and footer text from content
4. If author is listed as "Staff" or "Editorial Team", use that exact text
5. Tags should be lowercase and without # symbols
HTML Content:
{html_content}
"""
Advanced Prompt Techniques
Chain-of-Thought Prompting
For complex extractions, guide the AI through reasoning steps:
prompt = """
Extract product specifications from this technical product page.
Follow these steps:
1. First, identify the main product specifications table or section
2. Parse each specification row, extracting both the label and value
3. Normalize specification names (e.g., "RAM" and "Memory" both become "ram")
4. Convert values to appropriate types (numbers for measurements, booleans for yes/no)
5. Return as a structured object
Example reasoning:
- If you see "Weight: 2.5 lbs", extract: {"weight_lbs": 2.5}
- If you see "Warranty: Yes (2 years)", extract: {"has_warranty": true, "warranty_years": 2}
- If you see "Color: Available in Red, Blue", extract: {"colors": ["Red", "Blue"]}
Now extract specifications from:
{html_content}
"""
Multi-Step Extraction
Break complex tasks into stages:
import json
def multi_step_extraction(html_content):
    # Step 1: Identify where the relevant data lives on the page
    step1_prompt = f"""
    Analyze this HTML and identify:
    1. The CSS selector or description of where product listings are located
    2. The CSS selector for pagination elements
    3. The total number of products visible on the page
    Return as JSON.
    HTML: {html_content}
    """
    step1 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": step1_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    page_structure = json.loads(step1.choices[0].message.content)
    # Step 2: Extract data from the section identified above
    step2_prompt = f"""
    The product listings are located at: {json.dumps(page_structure)}
    Extract all product data with fields: name, price, rating, availability.
    Return as JSON.
    HTML: {html_content}
    """
    step2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": step2_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(step2.choices[0].message.content)
Validation and Correction Prompts
Add a validation layer to ensure data quality:
validation_prompt = """
Review this extracted data and validate:
Extracted data:
{extracted_json}
Validation checks:
1. All prices should be positive numbers
2. Ratings should be between 0 and 5
3. Email addresses should be valid format
4. Phone numbers should include country code
5. URLs should be complete and valid
If you find issues, return corrected JSON with a "validation_notes" field explaining changes.
If data is valid, return it unchanged with "validation_notes": "All valid".
"""
Optimizing Prompts for Different LLM Models
GPT-4 Prompts
GPT-4 excels at complex instructions and structured output:
# GPT-4 prompt utilizing function calling
functions = [
{
"name": "extract_products",
"description": "Extract product information from HTML",
"parameters": {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Product name"},
"price": {"type": "number", "description": "Price as number"},
"currency": {"type": "string", "description": "Currency code (USD, EUR, etc.)"},
"rating": {"type": "number", "minimum": 0, "maximum": 5}
},
"required": ["name", "price"]
}
}
}
}
}
]
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": f"Extract products from: {html}"}],
functions=functions,
function_call={"name": "extract_products"}
)
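With function calling, the structured output arrives in the function call arguments rather than in the message text. A minimal way to read it from the response above:
import json

# The arguments come back as a JSON string on the function_call object
arguments = response.choices[0].message.function_call.arguments
products = json.loads(arguments)["products"]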
Claude Prompts
Claude (Anthropic) works well with detailed, structured instructions:
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
prompt = """
I need you to extract event information from HTML content.
<instructions>
Extract all events with these fields:
- event_name: The full event title
- start_date: ISO 8601 format (YYYY-MM-DD)
- start_time: 24-hour format (HH:MM) or null
- venue_name: Name of the venue
- venue_address: Full address
- ticket_price: Lowest available price as number, null if free
- is_sold_out: Boolean
Return as JSON array.
</instructions>
<html>
{html_content}
</html>
Provide only the JSON output, no additional commentary.
"""
response = client.messages.create(
model="claude-3-opus-20240229",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}]
)
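Claude returns the result as text content blocks; since the prompt asks for JSON output only, it can be parsed directly (this assumes the model followed the "JSON only" instruction):
import json

# The first content block holds the text; parse it as JSON
events = json.loads(response.content[0].text)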
Open Source LLMs (Llama, Mistral)
Smaller models benefit from simpler, more direct prompts:
prompt = """
### Task: Extract product data
### Input HTML:
{html_content}
### Required Output Format:
{{
"products": [
{{"name": "string", "price": number}}
]
}}
### Rules:
- Extract only name and price
- Price must be a number
- Return valid JSON only
### Output:
"""
Reducing Token Usage and Costs
Pre-process HTML Before Sending
from bs4 import BeautifulSoup, Comment
import re
def clean_html_for_llm(html_content, target_selector=None):
"""
Clean and minimize HTML before sending to LLM
"""
soup = BeautifulSoup(html_content, 'html.parser')
# Remove unnecessary elements
for element in soup(['script', 'style', 'svg', 'path', 'noscript']):
element.decompose()
# Remove comments
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
comment.extract()
# Extract only target section if specified
if target_selector:
target = soup.select_one(target_selector)
if target:
soup = target
# Remove excessive attributes
for tag in soup.find_all():
# Keep only essential attributes
attrs_to_keep = ['class', 'id', 'href', 'src', 'alt', 'title']
tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}
# Minimize whitespace
html_str = str(soup)
html_str = re.sub(r'\s+', ' ', html_str)
html_str = re.sub(r'>\s+<', '><', html_str)
return html_str
# Usage
cleaned_html = clean_html_for_llm(html_content, '.product-list')
# Now use cleaned_html in your prompt
Convert HTML to Simplified Markdown
import html2text
def html_to_markdown_for_llm(html_content):
"""
Convert HTML to markdown to reduce tokens
"""
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
h.ignore_emphasis = False
markdown = h.handle(html_content)
return markdown
# Use in prompt
markdown_content = html_to_markdown_for_llm(html_content)
prompt = f"""
Extract product data from this markdown-formatted page content:
{markdown_content}
"""
Combining Prompts with Browser Automation
When scraping dynamic websites, combine LLM prompts with browser automation. A headless browser can render JavaScript and wait for AJAX-driven content to finish loading before you hand the HTML to the LLM:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function scrapeWithPrompts(url, extractionPrompt) {
// Launch browser
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Navigate and wait for dynamic content
await page.goto(url, { waitUntil: 'networkidle0' });
// Get rendered HTML
const html = await page.content();
await browser.close();
// Use LLM to extract data
const response = await openai.chat.completions.create({
model: 'gpt-4o', // JSON mode requires a model that supports response_format
messages: [
{
role: 'system',
content: 'Extract structured data from HTML. Return valid JSON only.'
},
{
role: 'user',
content: `${extractionPrompt}\n\nHTML:\n${html.substring(0, 10000)}`
}
],
temperature: 0,
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
// Example usage
const prompt = `
Extract all article headlines with:
- title (string)
- url (string)
- published_date (YYYY-MM-DD or null)
Return as {"articles": [...]}
`;
scrapeWithPrompts('https://news.example.com', prompt)
.then(data => console.log(JSON.stringify(data, null, 2)));
Testing and Iterating on Prompts
Create a Prompt Testing Framework
import json

def test_prompt(html_samples, prompt_template, expected_fields):
"""
Test a prompt against multiple HTML samples
"""
results = {
'successful': 0,
'failed': 0,
'errors': []
}
for idx, html in enumerate(html_samples):
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract data as JSON."},
{"role": "user", "content": prompt_template.format(html=html)}
],
temperature=0,
response_format={"type": "json_object"}
)
data = json.loads(response.choices[0].message.content)
# Validate expected fields
if all(field in data for field in expected_fields):
results['successful'] += 1
else:
results['failed'] += 1
results['errors'].append(f"Sample {idx}: Missing fields")
except Exception as e:
results['failed'] += 1
results['errors'].append(f"Sample {idx}: {str(e)}")
return results
# Test your prompt
html_samples = [sample1_html, sample2_html, sample3_html]
expected_fields = ['products', 'total_count']
test_results = test_prompt(html_samples, my_prompt_template, expected_fields)
print(f"Success rate: {test_results['successful']}/{len(html_samples)}")
A/B Test Different Prompts
import time

def compare_prompts(html_content, prompts_dict):
"""
Compare multiple prompt variations
"""
results = {}
for prompt_name, prompt in prompts_dict.items():
start_time = time.time()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt.format(html=html_content)}],
temperature=0
)
execution_time = time.time() - start_time
tokens_used = response.usage.total_tokens
results[prompt_name] = {
'output': response.choices[0].message.content,
'tokens': tokens_used,
'time': execution_time,
'cost': (tokens_used / 1000) * 0.03  # rough estimate; actual pricing differs for input vs. output tokens
}
return results
# Compare different approaches
prompts = {
'detailed': "Extract products with detailed schema...",
'simple': "Extract: name, price for each product",
'few_shot': "Examples: ... Now extract products"
}
comparison = compare_prompts(html_content, prompts)
Common Prompt Patterns for Web Scraping
Pattern 1: List Extraction
prompt = """
Extract all items from this list.
For each item, extract:
- text (string): The visible text
- link (string or null): URL if item is a link
- position (integer): Position in the list (1-indexed)
Return as: {"items": [...]}
HTML:
{html}
"""
Pattern 2: Table Extraction
prompt = """
Extract data from the table in this HTML.
Rules:
1. First row contains headers
2. Each subsequent row is a data record
3. Convert headers to snake_case keys
4. Parse numeric columns as numbers
5. Parse date columns to YYYY-MM-DD format
Return as: {"headers": [...], "rows": [...]}
HTML:
{html}
"""
Pattern 3: Nested Data Extraction
prompt = """
Extract category hierarchy with products.
Structure:
{{
"categories": [
{{
"name": "Category Name",
"subcategories": ["Sub1", "Sub2"],
"products": [
{{"name": "...", "price": 0}}
]
}}
]
}}
HTML:
{html}
"""
Error Handling in Prompt-Based Scraping
import json
from jsonschema import validate, ValidationError
def extract_with_validation(html, prompt, schema):
"""
Extract data and validate against JSON schema
"""
max_retries = 3
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Return valid JSON only."},
{"role": "user", "content": f"{prompt}\n\nHTML:\n{html}"}
],
temperature=0,
response_format={"type": "json_object"}
)
data = json.loads(response.choices[0].message.content)
# Validate against schema
validate(instance=data, schema=schema)
return data
except json.JSONDecodeError:
if attempt == max_retries - 1:
raise
# Retry with more explicit JSON instruction
prompt += "\n\nIMPORTANT: Return ONLY valid JSON, no other text."
except ValidationError as e:
if attempt == max_retries - 1:
raise
# Add schema to prompt for next attempt
prompt += f"\n\nSchema requirements: {json.dumps(schema)}"
raise Exception("Failed to extract valid data after retries")
# Define schema
product_schema = {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
},
"required": ["name", "price"]
}
}
},
"required": ["products"]
}
# Use with validation
result = extract_with_validation(html, prompt, product_schema)
Best Practices Summary
- Start simple: Begin with basic prompts and add complexity as needed
- Be specific: Define exact field types, formats, and requirements
- Provide examples: Use few-shot learning for complex extractions
- Validate output: Always validate extracted JSON against expected schema
- Handle missing data: Instruct the AI to use null for missing values
- Optimize for tokens: Clean HTML and extract relevant sections only
- Test thoroughly: Use multiple HTML samples to ensure consistency
- Monitor costs: Track token usage and API costs
- Iterate: Continuously refine prompts based on results
- Combine approaches: Use traditional selectors for navigation, LLMs for extraction (see the sketch below)
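As a rough illustration of that last point, the sketch below uses plain CSS selectors to walk pagination and isolate the listing area, and calls the LLM (via extract_with_validation from the error-handling section) only for the extraction step. The URL handling, selectors, and helper name are assumptions:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def scrape_catalog(start_url, extraction_prompt, schema, max_pages=3):
    results, url = [], start_url
    for _ in range(max_pages):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Traditional selector work: isolate the listing area and find the next page
        listing = soup.select_one(".product-list")   # placeholder selector
        next_link = soup.select_one("a.next-page")   # placeholder selector
        # LLM work: extract structured data from the isolated fragment only
        data = extract_with_validation(str(listing), extraction_prompt, schema)
        results.extend(data.get("products", []))
        if not next_link:
            break
        url = urljoin(url, next_link["href"])
    return results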
Conclusion
Mastering LLM prompts for web scraping requires understanding both prompt engineering principles and the specific challenges of data extraction. By writing clear, structured prompts with explicit schemas and validation rules, you can achieve high-quality data extraction with minimal code maintenance.
The key to success is iteration: start with simple prompts, test against real-world HTML samples, and gradually refine your instructions based on the results. Whether you're using GPT-4, Claude, or open-source models, the principles of effective prompt design remain consistent.
For dynamic websites that require interaction before extraction, consider combining LLM-based extraction with browser automation tools so the page is fully rendered and you're capturing all the data you need.
Remember that while LLM-based scraping offers flexibility and reduces maintenance, it comes with costs and latency. Use it strategically where its strengths—handling inconsistent layouts and semantic understanding—provide the most value.