What are prompt engineering examples for web scraping?
Prompt engineering is the art of crafting effective instructions for Large Language Models (LLMs) like GPT to extract structured data from web pages. Unlike traditional web scraping that relies on CSS selectors or XPath, prompt-based scraping uses natural language instructions to tell AI models what data to extract and how to format it.
Understanding Prompt Engineering for Web Scraping
Prompt engineering for web scraping involves creating clear, specific instructions that guide an LLM to:
- Identify relevant data on a web page
- Extract information accurately
- Format the output in a structured way (JSON, CSV, etc.)
- Handle edge cases and missing data
The key advantage is that well-crafted prompts can adapt to varying HTML structures without needing to update selectors when websites change their layouts.
Basic Prompt Structure for Data Extraction
A good web scraping prompt typically includes:
1. Context: What type of page you're scraping
2. Task: What data to extract
3. Format: How to structure the output
4. Constraints: Rules for handling edge cases
Here's a basic example:
from openai import OpenAI

client = OpenAI()
html_content = """
<div class="product">
<h1>Wireless Headphones</h1>
<span class="price">$79.99</span>
<p class="description">High-quality Bluetooth headphones</p>
</div>
"""
prompt = f"""
Extract product information from the following HTML.
HTML:
{html_content}
Return the data as JSON with these fields:
- name: product name
- price: numeric price value
- description: product description
If a field is missing, use null.
"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt}
    ]
)
print(response.choices[0].message.content)
Example 1: E-commerce Product Scraping
For scraping product listings, use specific prompts that handle common variations:
prompt = """
You are extracting product data from an e-commerce website.
Extract ALL products from this HTML and return as a JSON array.
For each product, extract:
- title (string): Product name
- price (number): Price in USD, extract just the number
- currency (string): Currency code
- rating (number): Star rating if present, null otherwise
- reviewCount (number): Number of reviews if present, null otherwise
- inStock (boolean): true if in stock, false if out of stock, null if unknown
- imageUrl (string): Primary product image URL
Rules:
- If price has a discount, use the discounted price
- Convert all prices to numbers (remove $ and commas)
- Extract full absolute URLs for images
- Return empty array [] if no products found
HTML:
{html_content}
Return only valid JSON, no explanations.
"""
Example 2: Article Metadata Extraction
When scraping blog posts or news articles, focus on semantic content:
const prompt = `
Extract article metadata from this HTML page.
Return JSON with:
{
"title": "Article title",
"author": "Author name or null",
"publishDate": "ISO 8601 date or null",
"tags": ["tag1", "tag2"] or [],
"content": "Main article text, cleaned",
"readTime": "Estimated read time or null",
"category": "Article category or null"
}
Instructions:
- Extract the main article content, removing ads and navigation
- Clean up extra whitespace in content
- Parse dates to ISO 8601 format (YYYY-MM-DD)
- Tags should be lowercase
- If content is too long, include first 500 words only
HTML:
${htmlContent}
Return only valid JSON.
`;
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    messages: [
      {role: 'system', content: 'You extract structured data from HTML accurately.'},
      {role: 'user', content: prompt}
    ]
  })
});
const data = await response.json();
const article = JSON.parse(data.choices[0].message.content);
Example 3: Table Data Extraction
For structured data in HTML tables, provide clear schema definitions:
prompt = """
Extract data from the HTML table below.
Expected columns: Name, Email, Phone, Department, Status
Return as JSON array where each object represents a row:
[
{
"name": "string",
"email": "string",
"phone": "string (format: XXX-XXX-XXXX)",
"department": "string",
"status": "active" | "inactive"
}
]
Instructions:
- Normalize phone numbers to XXX-XXX-XXXX format
- Convert emails to lowercase
- Map status variations (Active/Yes/✓) to "active", others to "inactive"
- Skip header rows
- Skip rows with missing name or email
HTML:
{table_html}
Return only the JSON array.
"""
Example 4: Few-Shot Prompting for Complex Extraction
Few-shot prompting provides examples to improve accuracy:
prompt = """
Extract job posting information from HTML.
Example 1:
Input: <div class="job"><h2>Senior Developer</h2><span>$120k-150k</span><p>Remote</p></div>
Output: {"title": "Senior Developer", "salary_min": 120000, "salary_max": 150000, "location": "Remote"}
Example 2:
Input: <div class="job"><h2>Marketing Manager</h2><span>80k/year</span><p>New York, NY</p></div>
Output: {"title": "Marketing Manager", "salary_min": 80000, "salary_max": 80000, "location": "New York, NY"}
Example 3:
Input: <div class="job"><h2>Data Analyst</h2><span>Competitive</span><p>Hybrid - Austin</p></div>
Output: {"title": "Data Analyst", "salary_min": null, "salary_max": null, "location": "Austin"}
Now extract from this HTML:
{html_content}
Return only the JSON object.
"""
Example 5: Handling Dynamic Content with Context
When dealing with JavaScript-rendered content, combine browser automation with GPT:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');
async function scrapeWithContext(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for dynamic content to render before capturing the HTML
  await page.waitForSelector('.product-list');
  const html = await page.content();
  await browser.close();

  const prompt = `
This HTML is from a single-page application that loaded product data dynamically.
Extract all products visible on the page.
Return as JSON array with:
- id: product identifier
- name: product name
- price: numeric price
- availability: "in_stock" | "out_of_stock" | "preorder"
HTML:
${html}
Return only valid JSON array.
`;

  const openai = new OpenAI();
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {role: 'system', content: 'You are a precise web scraping assistant.'},
      {role: 'user', content: prompt}
    ]
  });

  return JSON.parse(response.choices[0].message.content);
}
Similar to handling AJAX requests using Puppeteer, this approach ensures all dynamic content is loaded before extraction.
Example 6: Multi-Step Extraction with Validation
For complex scraping tasks, use chain-of-thought prompting:
prompt = """
You are scraping a product review page. Follow these steps:
Step 1: Identify the overall product rating (1-5 stars)
Step 2: Extract the total number of reviews
Step 3: Find all individual review elements
Step 4: For each review, extract:
- reviewer name
- rating (1-5)
- review date
- review text
- helpful votes count
Step 5: Calculate average rating from individual reviews
Step 6: Verify it matches the overall rating (within 0.5 stars)
Return JSON:
{
"product_rating": number,
"total_reviews": number,
"reviews": [
{
"author": "string",
"rating": number,
"date": "YYYY-MM-DD",
"text": "string",
"helpful_votes": number
}
],
"calculated_average": number,
"validation_passed": boolean
}
HTML:
{html_content}
"""
Best Practices for Prompt Engineering in Web Scraping
1. Be Specific About Output Format
Always specify exact field names, data types, and structure:
# Good
"Return JSON with 'price' as number, 'title' as string"
# Bad
"Extract the price and title"
2. Handle Missing Data Explicitly
prompt = """
If any field is missing or cannot be determined:
- Use null for optional fields
- Use empty string "" for required string fields
- Use 0 for required numeric fields
- Use empty array [] for list fields
"""
3. Provide Data Transformation Rules
prompt = """
Data transformations:
- Dates: Convert to ISO 8601 (YYYY-MM-DD)
- Prices: Extract numbers only, remove currency symbols
- URLs: Convert to absolute URLs
- Text: Trim whitespace, remove HTML entities
- Phone: Format as +1-XXX-XXX-XXXX
"""
4. Use System Messages Effectively
system_message = """
You are a specialized web scraping assistant with these capabilities:
- Extract structured data from HTML with high accuracy
- Handle malformed HTML gracefully
- Normalize data consistently
- Return only valid JSON, never explanations
- Use null for missing data, never omit fields
"""
5. Limit Token Usage for Large Pages
When working with large HTML documents, consider preprocessing:
from bs4 import BeautifulSoup
def preprocess_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove scripts, styles, and navigation
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()
    # Keep only the main content area
    main_content = soup.find('main') or soup.find('article') or soup.body
    return str(main_content)
# Then use in prompt
cleaned_html = preprocess_html(raw_html)
prompt = f"Extract data from: {cleaned_html}"
This approach is particularly useful when crawling single-page applications that generate large amounts of HTML.
Example 7: Schema-Based Extraction
For consistent results across multiple pages, define a strict schema:
const schema = {
  type: "object",
  properties: {
    products: {
      type: "array",
      items: {
        type: "object",
        required: ["id", "name", "price"],
        properties: {
          id: { type: "string" },
          name: { type: "string" },
          price: { type: "number", minimum: 0 },
          currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
          inStock: { type: "boolean" }
        }
      }
    }
  }
};
const prompt = `
Extract product data matching this JSON schema:
${JSON.stringify(schema, null, 2)}
Validation rules:
- All required fields must be present
- price must be positive number
- currency must be one of: USD, EUR, GBP
- id must be unique within the array
HTML:
${htmlContent}
Return valid JSON matching the schema.
`;
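The same schema can be reused to validate whatever the model returns before you trust it. A Python sketch using the jsonschema package, assuming the schema above is available as a Python dict:
import json
from jsonschema import validate, ValidationError

def parse_and_validate(raw_response: str, schema: dict) -> dict:
    data = json.loads(raw_response)
    try:
        validate(instance=data, schema=schema)
    except ValidationError as err:
        # Surface which field broke the contract so the prompt can be refined
        raise ValueError(f"LLM output failed schema validation: {err.message}") from err
    return data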
Handling Errors and Edge Cases
Always include error handling in your prompts:
prompt = """
Extract contact information. Handle these cases:
1. Multiple phone numbers: Return as array
2. Email obfuscated (e.g., "name [at] domain [dot] com"): Reconstruct to proper format
3. No contact info found: Return {"found": false, "data": null}
4. Partial data: Include what's available, mark others as null
Return:
{
"found": boolean,
"data": {
"email": string or null,
"phones": [string] or [],
"address": string or null,
"social": {
"twitter": string or null,
"linkedin": string or null
}
}
}
HTML:
{html_content}
"""
Combining Traditional and AI-Based Scraping
For optimal results, combine traditional selectors with GPT extraction:
import json
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def hybrid_scrape(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Use traditional selectors for the page structure
    product_cards = soup.select('.product-card')
    results = []
    for card in product_cards:
        # Let the LLM handle complex or variable content inside each card
        prompt = f"""
Extract product details from this HTML fragment.
Focus on: name, price, features list, specifications.
Return JSON:
{{
    "name": "string",
    "price": number,
    "features": ["string"],
    "specs": {{"key": "value"}}
}}
HTML:
{str(card)}
"""
        # Send each fragment to the model for extraction
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        results.append(json.loads(response.choices[0].message.content))
    return results
Conclusion
Effective prompt engineering for web scraping requires clear instructions, well-defined schemas, and thoughtful handling of edge cases. By combining specific output formats, few-shot examples, and validation rules, you can create robust scraping solutions that adapt to varying website structures without constant maintenance.
The key is to treat prompts as code: version them, test them against different inputs, and refine them based on results. Start with simple extraction tasks and gradually increase complexity as you understand how the model interprets different HTML structures.
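One lightweight way to do that is to keep prompts in version-controlled files and pin regression tests to known HTML fixtures. A minimal sketch, where the file path and the extract_products wrapper are hypothetical:
from pathlib import Path

def test_product_prompt_against_fixture():
    # Prompt templates and HTML fixtures live in the repo, versioned like code
    fixture_html = Path("tests/fixtures/sample_product_page.html").read_text()
    result = extract_products(fixture_html)  # wrapper around the versioned prompt and LLM call
    assert isinstance(result, list)
    assert all({"title", "price"} <= set(item) for item in result)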
When working with complex websites that require browser automation, consider integrating GPT-based extraction with tools like Puppeteer for handling dynamic content and browser events before passing the rendered HTML to your language model.