What are examples of AI prompt optimization for web scraping?
AI prompt optimization is crucial for effective web scraping with Large Language Models (LLMs) like GPT-4, Claude, or Gemini. Well-crafted prompts can dramatically improve extraction accuracy, reduce API costs, and minimize hallucinations. This guide explores proven prompt optimization techniques with practical examples.
Understanding Prompt Optimization Basics
Prompt optimization for web scraping involves structuring your instructions so the AI model understands exactly what data to extract, in what format, and with what level of precision. Unlike traditional CSS or XPath selectors, AI models interpret natural language, making prompt quality the primary determinant of success.
Key Principles
- Be specific and explicit about what you want
- Provide clear output format specifications
- Include examples when possible
- Set constraints to prevent hallucinations
- Use structured output formats like JSON
Example 1: Basic Product Information Extraction
Unoptimized Prompt
import openai

html_content = """
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$299.99</span>
    <p class="description">Noise-canceling over-ear headphones with 30-hour battery life</p>
</div>
"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Extract product info from: {html_content}"
    }]
)
This vague prompt may return inconsistent results or miss important fields.
Optimized Prompt
import openai
import json
prompt = f"""Extract product information from the following HTML and return ONLY a valid JSON object with these exact fields:
Required fields:
- name (string): The product name
- price (number): Price as a decimal number without currency symbols
- description (string): Product description
- currency (string): Currency code (e.g., "USD")
Rules:
- Return ONLY the JSON object, no other text
- If a field is not found, use null
- For price, extract only the numeric value
HTML:
{html_content}"""
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0  # Reduce randomness for consistent extraction
)
data = json.loads(response.choices[0].message.content)
print(data)
# Output: {"name": "Premium Wireless Headphones", "price": 299.99, "description": "Noise-canceling over-ear headphones with 30-hour battery life", "currency": "USD"}
Example 2: Few-Shot Learning for Complex Structures
Few-shot prompting provides examples to guide the AI model's understanding of your extraction requirements.
const { OpenAI } = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function extractArticleData(html) {
  const prompt = `Extract article metadata from HTML and return JSON.
Example 1:
Input: <article><h1>Getting Started with AI</h1><span class="author">John Doe</span><time>2024-01-15</time></article>
Output: {"title": "Getting Started with AI", "author": "John Doe", "date": "2024-01-15", "tags": []}
Example 2:
Input: <article><h2>Web Scraping Best Practices</h2><p class="by">Jane Smith</p><time datetime="2024-02-20">Feb 20</time><div class="tags"><span>scraping</span><span>tutorial</span></div></article>
Output: {"title": "Web Scraping Best Practices", "author": "Jane Smith", "date": "2024-02-20", "tags": ["scraping", "tutorial"]}
Now extract from this HTML:
${html}
Return ONLY the JSON object.`;

  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    temperature: 0
  });

  return JSON.parse(response.choices[0].message.content);
}
Few-shot learning is particularly effective when dealing with AI-powered data extraction from websites with varying HTML structures.
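When you maintain many extraction targets, the few-shot examples can also be assembled programmatically from labeled pairs. Below is a minimal sketch of this idea (the `build_few_shot_prompt` helper is illustrative, not part of any library):

```python
import json

def build_few_shot_prompt(task, examples, html):
    """Assemble a few-shot extraction prompt from labeled pairs.

    `examples` is a list of (html_snippet, expected_dict) tuples; each
    expected dict is serialized to JSON so the model sees the exact
    target format it should reproduce.
    """
    parts = [f"{task}\n"]
    for i, (snippet, expected) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Input: {snippet}")
        parts.append(f"Output: {json.dumps(expected)}\n")
    parts.append("Now extract from this HTML:")
    parts.append(html)
    parts.append("Return ONLY the JSON object.")
    return "\n".join(parts)

# Usage with one labeled pair
examples = [
    ("<article><h1>Getting Started with AI</h1></article>",
     {"title": "Getting Started with AI", "author": None}),
]
prompt = build_few_shot_prompt(
    "Extract article metadata from HTML and return JSON.",
    examples,
    "<article><h1>New Post</h1></article>",
)
```

This keeps prompts consistent across sites and makes it easy to add or swap examples as HTML structures change.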
Example 3: Constraint-Based Prompting to Prevent Hallucinations
AI models can sometimes generate plausible but incorrect data. Use constraints to minimize this:
import anthropic

def extract_product_reviews(html_content):
    prompt = f"""Extract customer reviews from this HTML. Follow these strict rules:
CONSTRAINTS:
1. ONLY extract text that is explicitly present in the HTML
2. Do NOT infer, guess, or generate any content
3. If you cannot find a field, use null - NEVER make up data
4. Extract exactly as written - do not paraphrase or summarize
5. Return ONLY valid JSON array format
Required fields per review:
- reviewer_name (string or null): Exact name as shown
- rating (number or null): Numeric rating only (e.g., 5, 4.5)
- review_text (string or null): Exact review text
- date (string or null): Date in ISO format if parseable
HTML:
{html_content}
Return format: [{{"reviewer_name": "...", "rating": 5, "review_text": "...", "date": "..."}}]
"""
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text
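Constraints reduce hallucinations but do not eliminate them, so it is worth verifying extractions against the source after the fact. The sketch below (the `verify_reviews_grounded` helper is illustrative) accepts a review only if its text appears verbatim in the original HTML:

```python
import json
import re

def verify_reviews_grounded(reviews_json, html_content):
    """Split extracted reviews into (grounded, suspect) lists.

    A review is grounded only if its review_text appears verbatim in
    the source HTML (with whitespace normalized); anything else is
    flagged as a possible hallucination.
    """
    normalized_html = re.sub(r"\s+", " ", html_content)
    grounded, suspect = [], []
    for review in json.loads(reviews_json):
        text = review.get("review_text")
        if text and re.sub(r"\s+", " ", text) in normalized_html:
            grounded.append(review)
        else:
            suspect.append(review)
    return grounded, suspect

# Usage: the second review does not appear in the HTML, so it is flagged
html = '<div class="review"><p>Great sound quality!</p></div>'
data = ('[{"reviewer_name": null, "rating": 5, '
        '"review_text": "Great sound quality!", "date": null}, '
        '{"reviewer_name": "Bob", "rating": 1, '
        '"review_text": "Terrible product", "date": null}]')
good, bad = verify_reviews_grounded(data, html)
```

This kind of post-hoc grounding check is cheap insurance: suspect records can be dropped or routed to a retry with a stricter prompt.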
Example 4: Chain-of-Thought Prompting for Complex Extraction
For complex data extraction tasks, chain-of-thought prompting encourages the AI to reason through the problem:
import json
import openai

def extract_pricing_tiers(html):
    prompt = f"""Extract pricing tier information from this HTML. Think step-by-step:
Step 1: Identify all pricing tier sections
Step 2: For each tier, extract the name
Step 3: Extract the price and billing period
Step 4: List all features for that tier
Step 5: Format as JSON
HTML:
{html}
Provide your reasoning, then output the final JSON in this format:
[{{
"tier_name": "string",
"price": number,
"billing_period": "monthly|yearly",
"features": ["feature1", "feature2"]
}}]
Begin with "Reasoning:" followed by your analysis, then "Output:" followed by JSON.
"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    # Extract JSON from response
    content = response.choices[0].message.content
    json_start = content.find('[')
    json_data = content[json_start:]
    return json.loads(json_data)
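Scanning for the first `[` is fragile when the reasoning itself contains brackets. Since the prompt asks the model to label its sections, splitting on the "Output:" marker first is more robust. A minimal sketch (the `parse_cot_response` helper is illustrative):

```python
import json

def parse_cot_response(content):
    """Parse a chain-of-thought response of the form
    'Reasoning: ... Output: <json>'.

    Splits on the 'Output:' marker first so brackets inside the
    reasoning text cannot confuse the parser, then falls back to
    scanning for the first bracket if the marker is missing.
    """
    if "Output:" in content:
        json_part = content.split("Output:", 1)[1]
    else:
        json_part = content
    start = json_part.find("[")
    if start == -1:
        start = json_part.find("{")
    return json.loads(json_part[start:])

# Usage: reasoning contains brackets, but parsing still succeeds
response_text = (
    "Reasoning: I found 2 tiers [basic, pro] on the page.\n"
    'Output: [{"tier_name": "Basic", "price": 9, '
    '"billing_period": "monthly", "features": ["api"]}]'
)
tiers = parse_cot_response(response_text)
```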
Example 5: Role-Based Prompting
Assigning a specific role can improve extraction quality:
async function extractTechnicalSpecs(html) {
  const prompt = `You are a data extraction specialist with expertise in technical specifications.
Your task: Extract ALL technical specifications from this product page HTML with perfect accuracy.
Requirements:
- Extract specification names and values as key-value pairs
- Preserve exact units (GB, MHz, inches, etc.)
- Convert measurements to standard formats when obvious
- Group related specs logically
HTML:
${html}
Return JSON format:
{
"processor": "...",
"ram": "...",
"storage": "...",
"display": "...",
"dimensions": "...",
"weight": "...",
"other_specs": {"key": "value"}
}`;

  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo-preview",
    messages: [
      { role: "system", content: "You are a precise data extraction specialist." },
      { role: "user", content: prompt }
    ],
    temperature: 0
  });

  return JSON.parse(response.choices[0].message.content);
}
Example 6: Function Calling for Structured Output
Modern LLMs support function calling, which enforces strict output schemas:
import json
import openai

def scrape_with_function_calling(html_content):
    tools = [{
        "type": "function",
        "function": {
            "name": "extract_product_data",
            "description": "Extract structured product data from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "price": {
                        "type": "number",
                        "description": "Numeric price value"
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["USD", "EUR", "GBP"],
                        "description": "Currency code"
                    },
                    "availability": {
                        "type": "string",
                        "enum": ["in_stock", "out_of_stock", "preorder"],
                        "description": "Stock status"
                    },
                    "features": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "List of product features"
                    }
                },
                "required": ["name", "price", "currency"]
            }
        }
    }]
    response = openai.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{
            "role": "user",
            "content": f"Extract product data from this HTML:\n{html_content}"
        }],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
    )
    # Extract function arguments (the structured data)
    function_args = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(function_args)
This approach is highly effective when using ChatGPT API for web scraping because it guarantees schema compliance.
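Even with a forced tool choice, it is prudent to validate the returned arguments before trusting them, since how strictly required fields and enums are enforced can vary by model. A minimal JSON-schema-style check (the `validate_args` helper is illustrative; in production a library such as `jsonschema` or Pydantic is the usual choice):

```python
def validate_args(args, parameters):
    """Check function-call arguments against a JSON-schema-like dict:
    required keys must be present, enum values must be respected.
    Returns a list of problems (an empty list means the payload passed).
    """
    problems = []
    for field in parameters.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
    for field, spec in parameters["properties"].items():
        if field in args and "enum" in spec and args[field] not in spec["enum"]:
            problems.append(f"{field}: {args[field]!r} not in {spec['enum']}")
    return problems

# Hypothetical check against a schema like the one in the example above
params = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["name", "price", "currency"],
}
problems = validate_args(
    {"name": "Headphones", "price": 299.99, "currency": "USD"}, params
)
# problems is empty for a conforming payload
```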
Example 7: Context Window Optimization
For large HTML documents, optimize what you send to the AI:
from bs4 import BeautifulSoup

def optimize_html_for_llm(html_content, target_selectors):
    """Extract only relevant portions of HTML to reduce token usage"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove unnecessary elements
    for element in soup(['script', 'style', 'svg', 'path']):
        element.decompose()
    # Extract only target sections
    relevant_sections = []
    for selector in target_selectors:
        elements = soup.select(selector)
        relevant_sections.extend([str(el) for el in elements])
    optimized_html = '\n'.join(relevant_sections)
    prompt = f"""Extract data from these relevant HTML sections:
{optimized_html}
Return JSON with: {{"products": [{{"name": "...", "price": ..., "rating": ...}}]}}
"""
    return prompt

# Usage (fetch_webpage stands in for your HTTP client of choice)
html = fetch_webpage("https://example.com/products")
optimized_prompt = optimize_html_for_llm(html, ['.product-card', '.product-info'])
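To see how much the trimming saves, a rough token estimate is enough (for exact counts against a specific model, OpenAI's tiktoken library is the standard tool; the ~4 characters per token heuristic below is an assumption that holds only roughly for English text):

```python
def estimate_tokens(text):
    """Rough token estimate: ~4 characters per token for English text.
    Use tiktoken for exact counts against a specific model."""
    return max(1, len(text) // 4)

# Compare a noisy full page against the trimmed, relevant section
raw_html = "<html>" + "<div class='noise'>x</div>" * 500 + "</html>"
trimmed = "<div class='product-card'>Widget $9.99</div>"
saved = estimate_tokens(raw_html) - estimate_tokens(trimmed)
# saved is large here: most of the page was irrelevant markup
```

Estimating before each call also lets you reject or chunk documents that would exceed the model's context window instead of failing mid-request.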
Example 8: Template-Based Extraction
Create reusable prompt templates for consistent results:
import json
import openai

class LLMScraperTemplate:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def create_extraction_prompt(self, html, schema, constraints=None):
        """Generate optimized prompts from schema definition"""
        schema_description = "\n".join([
            f"- {field['name']} ({field['type']}): {field['description']}"
            for field in schema
        ])
        required_fields = [f['name'] for f in schema if f.get('required', False)]
        constraint_text = ""
        if constraints:
            constraint_text = "\nCONSTRAINTS:\n" + "\n".join(
                f"{i+1}. {c}" for i, c in enumerate(constraints)
            )
        prompt = f"""Extract the following fields from the HTML:
{schema_description}
Required fields: {', '.join(required_fields)}
{constraint_text}
HTML:
{html}
Return ONLY valid JSON matching this structure.
"""
        return prompt

    def extract(self, html, schema, constraints=None):
        prompt = self.create_extraction_prompt(html, schema, constraints)
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        return json.loads(response.choices[0].message.content)

# Usage
scraper = LLMScraperTemplate(api_key="your-key")
schema = [
    {"name": "title", "type": "string", "description": "Article title", "required": True},
    {"name": "author", "type": "string", "description": "Author name", "required": True},
    {"name": "published_date", "type": "string", "description": "ISO date format", "required": False},
    {"name": "tags", "type": "array", "description": "Article tags", "required": False}
]
constraints = [
    "Extract only explicitly visible text",
    "Do not infer missing information",
    "Use null for missing optional fields"
]
result = scraper.extract(html_content, schema, constraints)
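Because the template already carries the schema, the same definition can drive post-extraction validation. A minimal sketch (the `validate_result` helper is illustrative) that checks required fields and basic types against the schema-list format used above:

```python
# Map the schema's type names onto Python types
PY_TYPES = {"string": str, "number": (int, float), "array": list, "boolean": bool}

def validate_result(result, schema):
    """Check an extraction result against a schema list of
    {"name", "type", "required"} entries: required fields must be
    present and non-null, and present values must match the declared
    type. Returns a list of error strings (empty means valid).
    """
    errors = []
    for field in schema:
        name, declared = field["name"], field["type"]
        value = result.get(name)
        if field.get("required") and value is None:
            errors.append(f"missing required field: {name}")
        elif value is not None and not isinstance(value, PY_TYPES[declared]):
            errors.append(f"{name}: expected {declared}, got {type(value).__name__}")
    return errors

# Usage with a hypothetical article schema
article_schema = [
    {"name": "title", "type": "string", "required": True},
    {"name": "tags", "type": "array", "required": False},
]
errors = validate_result({"title": "Hello", "tags": ["a"]}, article_schema)
# errors is [] for a conforming result
```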
Best Practices for Prompt Optimization
1. Use Low Temperature Settings
Set temperature=0 or a very low value for deterministic extraction:
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # Maximum consistency
    top_p=1.0
)
2. Specify Output Format Explicitly
Always tell the AI exactly how to format responses:
Return ONLY a valid JSON object with no markdown formatting, no explanations, no additional text.
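Even with an explicit instruction, models occasionally wrap JSON in markdown fences anyway, so parse defensively. A small sketch (the `parse_json_response` helper is illustrative):

```python
import json
import re

def parse_json_response(content):
    """Parse a model response as JSON, stripping any markdown code
    fences the model may have added despite instructions."""
    text = content.strip()
    # Remove a leading ```json / ``` fence and a trailing ``` fence
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    return json.loads(text)

# Both fenced and plain responses parse to the same object
fenced = '```json\n{"price": 299.99}\n```'
plain = '{"price": 299.99}'
```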
3. Include Validation Rules
prompt = f"""Extract email addresses from HTML.
Validation rules:
- Must match standard email format (user@domain.com)
- Ignore mailto: links and email images
- Return unique emails only
- Exclude generic placeholders like info@example.com
HTML: {html}
Return: {{"emails": ["email1@domain.com", "email2@domain.com"]}}
"""
4. Test and Iterate
When implementing GPT-based web scraping, always test prompts with diverse HTML samples:
def test_prompt_variations(html_samples, prompt_templates):
    """Compare different prompt strategies"""
    results = {}
    for template_name, template in prompt_templates.items():
        accuracy_scores = []
        for sample in html_samples:
            prompt = template.format(html=sample['html'])
            # call_llm and evaluate_accuracy are placeholders for your
            # model client and scoring logic
            response = call_llm(prompt)
            score = evaluate_accuracy(response, sample['expected'])
            accuracy_scores.append(score)
        results[template_name] = {
            'avg_accuracy': sum(accuracy_scores) / len(accuracy_scores),
            'scores': accuracy_scores
        }
    return results
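The evaluate_accuracy function above is left undefined; a simple field-level implementation (illustrative, comparing extracted values against a hand-labeled expected dict) might look like this:

```python
import json

def evaluate_accuracy(response_text, expected):
    """Score an extraction as the fraction of expected fields whose
    extracted value matches exactly. Unparseable responses score 0."""
    try:
        extracted = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not expected:
        return 1.0
    matches = sum(1 for k, v in expected.items() if extracted.get(k) == v)
    return matches / len(expected)

# Usage: one of two expected fields matches, so the score is 0.5
score = evaluate_accuracy(
    '{"title": "Hello", "author": "Bob"}',
    {"title": "Hello", "author": "Jane"},
)
```

Exact-match scoring is strict; for free-text fields you may prefer a normalized or fuzzy comparison, but it is a reasonable baseline for A/B testing prompt templates.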
Conclusion
AI prompt optimization for web scraping requires careful consideration of specificity, constraints, output format, and validation. By using techniques like few-shot learning, function calling, and structured templates, you can achieve reliable extraction while minimizing costs and hallucinations.
The key is to be explicit, provide examples, set clear constraints, and iterate based on real-world results. As AI models continue to evolve, prompt engineering remains the critical skill for effective LLM-based web scraping.