What Are Some Effective ChatGPT Prompts for Data Extraction?
Effective ChatGPT prompts for data extraction are the cornerstone of successful AI-powered web scraping. The quality of your prompts directly determines the accuracy, consistency, and reliability of extracted data. This guide provides battle-tested prompt templates and techniques for extracting structured data from HTML using ChatGPT and other AI models.
Understanding Effective Data Extraction Prompts
A well-crafted data extraction prompt has several essential components:
- Clear context - Defines the AI's role and objective
- Specific instructions - Details exactly what data to extract
- Schema definition - Specifies the output structure and data types
- Handling rules - Explains how to deal with missing or ambiguous data
- Format requirements - Ensures consistent JSON output
The difference between a weak and strong prompt can mean the difference between 60% and 95% extraction accuracy.
Essential Prompt Templates
Template 1: Basic Product Extraction
import openai
client = openai.OpenAI(api_key="your-api-key")
prompt = """
Extract all product information from this HTML page.
For each product, extract:
- name (string): The full product title
- price (number): Price as a decimal number, remove currency symbols
- currency (string): Currency code (USD, EUR, GBP, etc.)
- availability (boolean): true if in stock, false otherwise
- rating (number or null): Rating from 0-5, null if not available
- review_count (integer or null): Number of reviews, null if not available
- image_url (string or null): Main product image URL
Return JSON in this exact format:
{
"products": [
{
"name": "Product Name",
"price": 29.99,
"currency": "USD",
"availability": true,
"rating": 4.5,
"review_count": 127,
"image_url": "https://..."
}
]
}
Rules:
- If a field is missing, use null (not empty string)
- Parse star ratings like "4.5 stars" to 4.5
- Remove thousands separators from prices (1,299.99 becomes 1299.99)
- Convert "In Stock", "Available Now" to true
- Convert "Out of Stock", "Sold Out" to false
HTML Content:
{html_content}
"""
import json

response = client.chat.completions.create(
    model="gpt-4o",  # JSON mode requires gpt-4-turbo or a newer model, not base gpt-4
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction specialist. Extract structured data from HTML and return valid JSON only."
        },
        {
            "role": "user",
            # str.format() would trip over the literal JSON braces in the template,
            # so substitute the placeholder directly
            "content": prompt.replace("{html_content}", html)
        }
    ],
    temperature=0,  # deterministic output
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
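Whatever the template, the reply arrives as a JSON string that can occasionally be malformed (truncated output, stray prose around the braces). A small defensive parser, sketched here assuming a top-level products array as in the template above, makes failures explicit rather than letting bad data propagate:

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse model output defensively: malformed JSON or a missing
    top-level array should fail loudly, not silently."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model returned invalid JSON: {e}") from e
    if not isinstance(data.get("products"), list):
        raise ValueError("Missing expected top-level 'products' array")
    return data

# Well-formed response parses normally:
parsed = parse_extraction('{"products": [{"name": "Widget", "price": 29.99}]}')
```

Wrap the call in a retry loop in production; a single re-request with the same prompt often recovers from transient formatting failures.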
Template 2: Article/Blog Post Extraction
prompt = """
Extract article information from this webpage.
Required fields:
- title (string): The main article headline
- author (string or null): Author name, null if not found
- publish_date (string or null): Date in YYYY-MM-DD format, null if not available or relative
- updated_date (string or null): Last updated date in YYYY-MM-DD format
- category (string or null): Article category or section
- tags (array): List of tags/keywords, empty array if none
- content (string): Full article body text, excluding ads and navigation
- excerpt (string or null): Article summary or description
- reading_time (integer or null): Estimated reading time in minutes
- featured_image (string or null): URL of main article image
Output format:
{
"article": {
"title": "...",
"author": "...",
...
}
}
Important rules:
1. Extract only the main article content, exclude:
- Navigation menus
- Sidebar widgets
- Advertisement text
- Footer content
- Comments section
2. For publish_date, only extract if in absolute format (not "2 days ago")
3. Multiple authors should be comma-separated: "John Doe, Jane Smith"
4. Tags should be lowercase without # symbols
5. Preserve paragraph breaks in content using \n\n
HTML:
{html_content}
"""
Template 3: E-commerce Listing with Variants
prompt = """
Extract product listing with all variants and specifications.
Schema:
{
"product": {
"name": "string",
"brand": "string or null",
"base_price": "number",
"currency": "string",
"description": "string",
"specifications": {
"key": "value"
},
"variants": [
{
"id": "string",
"name": "string (e.g., 'Red - Large')",
"attributes": {
"color": "string",
"size": "string"
},
"price": "number",
"sku": "string or null",
"in_stock": "boolean"
}
],
"images": ["array of image URLs"],
"rating": {
"average": "number (0-5)",
"count": "integer"
}
}
}
Extraction instructions:
1. Parse all color/size combinations as separate variants
2. Extract technical specifications into key-value pairs
3. Normalize specification keys (e.g., "RAM", "Memory" → "ram")
4. Each variant should have its own price and availability
5. Main images array should include all product photos
HTML:
{html_content}
"""
Template 4: Job Listings Extraction
const prompt = `
Extract all job postings from this page.
For each job, extract:
- title (string): Job title/position name
- company (string): Company name
- location (string): Job location (city, state/country, or "Remote")
- employment_type (string): "Full-time", "Part-time", "Contract", "Internship", etc.
- experience_level (string or null): "Entry", "Mid", "Senior", "Executive", etc.
- salary_min (number or null): Minimum salary as annual amount
- salary_max (number or null): Maximum salary as annual amount
- salary_currency (string or null): Currency code
- salary_period (string or null): "annual", "monthly", "hourly"
- posted_date (string or null): Date in YYYY-MM-DD format
- description (string): Job description text
- requirements (array): List of requirements/qualifications
- benefits (array): List of benefits, empty array if none
- application_url (string or null): URL to apply
Output:
{
"jobs": [...]
}
Salary parsing rules:
- "$120k-$160k" → min: 120000, max: 160000, currency: "USD", period: "annual"
- "$50/hour" → min: 50, max: null, currency: "USD", period: "hourly"
- "€60,000 per year" → min: 60000, max: null, currency: "EUR", period: "annual"
- If no salary info, all salary fields should be null
HTML:
${htmlContent}
`;
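The salary rules above are easy to double-check deterministically after extraction. A minimal sanity check (a sketch; field names follow the template) flags swapped or negative ranges before they reach your database:

```python
def salary_sane(job: dict) -> bool:
    """Cheap post-check on extracted salary fields: nulls are fine
    together, but a max below the min signals a parse error."""
    lo, hi = job.get("salary_min"), job.get("salary_max")
    if any(v is not None and v < 0 for v in (lo, hi)):
        return False
    if lo is not None and hi is not None and hi < lo:
        return False
    return True
```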
Template 5: Real Estate Listings
prompt = """
Extract real estate property listings.
Schema per property:
- id (string or null): Property ID or listing number
- title (string): Property listing title
- price (number): Listing price as number
- currency (string): Currency code
- price_type (string): "sale", "rent_monthly", "rent_weekly"
- property_type (string): "house", "apartment", "condo", "townhouse", etc.
- bedrooms (number or null): Number of bedrooms
- bathrooms (number or null): Number of bathrooms
- square_feet (number or null): Interior square footage
- lot_size (number or null): Lot size in square feet
- year_built (integer or null): Year property was built
- address (object): {
"street": "string or null",
"city": "string",
"state": "string or null",
"zip": "string or null",
"country": "string"
}
- features (array): List of property features/amenities
- description (string): Property description
- images (array): Array of image URLs
- agent_name (string or null): Listing agent name
- agent_phone (string or null): Agent contact number
Return as: {"properties": [...]}
Parsing rules:
- "3 bed, 2 bath" → bedrooms: 3, bathrooms: 2
- "1,850 sq ft" → square_feet: 1850
- "Built in 2015" → year_built: 2015
- Extract all amenities as features array: ["Pool", "Garage", "Fireplace"]
HTML:
{html_content}
"""
Advanced Prompt Techniques
Few-Shot Learning for Complex Data
Few-shot prompts provide examples to guide the AI:
prompt = """
Extract event information from this HTML. Use these examples as a guide:
EXAMPLE 1:
Input HTML: "<div class='event'><h3>Tech Conference 2024</h3><p>March 15-17, 2024 | San Francisco, CA</p><span>$299</span></div>"
Output:
{
"name": "Tech Conference 2024",
"start_date": "2024-03-15",
"end_date": "2024-03-17",
"location": "San Francisco, CA",
"price": 299,
"currency": "USD"
}
EXAMPLE 2:
Input HTML: "<div><h2>Free Webinar: AI Trends</h2><time>2024-04-01 2:00 PM EST</time><p>Online Event</p></div>"
Output:
{
"name": "Free Webinar: AI Trends",
"start_date": "2024-04-01",
"end_date": "2024-04-01",
"location": "Online",
"price": 0,
"currency": null
}
EXAMPLE 3:
Input HTML: "<article><h1>Summer Music Festival</h1><div>June 20-22, 2024 | Central Park</div><p>$75-$150</p></article>"
Output:
{
"name": "Summer Music Festival",
"start_date": "2024-06-20",
"end_date": "2024-06-22",
"location": "Central Park",
"price": 75,
"currency": "USD"
}
Now extract all events from this HTML following the same pattern:
{html_content}
Return as: {"events": [...]}
"""
Chain-of-Thought for Complex Extraction
Guide the AI through reasoning steps:
prompt = """
Extract structured data from this product specification page.
Follow these reasoning steps:
STEP 1: Identify the specifications section
Look for tables, lists, or div sections containing technical specifications.
STEP 2: Parse each specification
For each spec, extract the label and value. Common patterns:
- "Display: 15.6 inch LED" → {"display_size": 15.6, "display_type": "LED"}
- "RAM: 16GB DDR4" → {"ram_gb": 16, "ram_type": "DDR4"}
- "Storage: 512GB SSD" → {"storage_gb": 512, "storage_type": "SSD"}
- "Battery: Up to 10 hours" → {"battery_hours": 10}
- "Weight: 3.5 lbs" → {"weight_lbs": 3.5}
STEP 3: Normalize specification names
Convert various labels to consistent keys:
- "Screen Size", "Display", "Monitor" → "display_size"
- "Memory", "RAM" → "ram_gb"
- "Hard Drive", "Storage", "SSD" → "storage_gb"
- "Processor", "CPU" → "processor"
STEP 4: Extract numeric values and units
Parse numbers and preserve units where important:
- "2.4 GHz" → 2.4
- "16GB" → 16
- "1920x1080" → {"width": 1920, "height": 1080}
STEP 5: Return structured output
{
"specifications": {
"display_size": 15.6,
"display_type": "LED",
"ram_gb": 16,
"storage_gb": 512,
...
}
}
Now extract specifications from:
{html_content}
"""
Multi-Level Nested Data Extraction
prompt = """
Extract the complete category hierarchy with products.
This page shows categories, subcategories, and products within each.
Target structure:
{
"categories": [
{
"name": "Main Category Name",
"url": "category URL",
"subcategories": [
{
"name": "Subcategory Name",
"url": "subcategory URL",
"product_count": 0
}
],
"featured_products": [
{
"name": "Product Name",
"price": 0,
"url": "product URL",
"image": "image URL"
}
]
}
]
}
Extraction rules:
1. Maintain the hierarchy: category → subcategories → products
2. Extract product counts from text like "Electronics (245 items)" → 245
3. Featured products should only include products explicitly shown, not just linked
4. URLs should be absolute paths (include domain if relative)
5. If no featured products are visible, use empty array
HTML:
{html_content}
"""
Prompts for Specific Data Types
Extracting Tables
table_prompt = """
Extract data from the HTML table.
Instructions:
1. First row typically contains headers
2. Convert headers to snake_case keys (e.g., "Product Name" → "product_name")
3. Parse numeric columns as numbers
4. Parse date columns to YYYY-MM-DD format if possible
5. Keep text columns as strings
6. Empty cells should be null
Return format:
{
"table": {
"headers": ["column1", "column2", ...],
"rows": [
{"column1": "value", "column2": 123, ...}
]
}
}
HTML:
{html_content}
"""
Extracting Contact Information
contact_prompt = """
Extract all contact information from this page.
Find and extract:
- email (string or array): Email address(es)
- phone (string or array): Phone number(s) with country code
- address (object): {
"street": "string",
"city": "string",
"state": "string",
"zip": "string",
"country": "string"
}
- social_media (object): {
"facebook": "URL or null",
"twitter": "URL or null",
"linkedin": "URL or null",
"instagram": "URL or null"
}
- business_hours (array or null): [
{"day": "Monday", "hours": "9:00 AM - 5:00 PM"}
]
Parsing rules:
- Validate email format (must contain @ and domain)
- Phone numbers should include country code when available
- Extract social media URLs from links or embedded widgets
- Parse business hours even if in paragraph format
HTML:
{html_content}
"""
Extracting Reviews and Ratings
reviews_prompt = """
Extract customer reviews from this product page.
For each review, extract:
- reviewer_name (string): Name of reviewer
- rating (number): Rating from 1-5
- date (string): Review date in YYYY-MM-DD format if possible
- title (string or null): Review headline/title
- content (string): Full review text
- verified_purchase (boolean or null): true if marked as verified buyer
- helpful_count (integer or null): Number of "helpful" votes
- images (array): URLs of review images, empty array if none
Summary data:
- total_reviews (integer): Total number of reviews
- average_rating (number): Overall average rating
- rating_distribution (object): {
"5_star": 120,
"4_star": 45,
"3_star": 10,
"2_star": 3,
"1_star": 2
}
Return format:
{
"summary": {
"total_reviews": 180,
"average_rating": 4.5,
"rating_distribution": {...}
},
"reviews": [...]
}
HTML:
{html_content}
"""
Optimizing Prompts for Accuracy
Handling Missing Data
prompt = """
Extract restaurant information with strict null handling.
Fields:
- name (string, REQUIRED): Restaurant name
- cuisine (string or null): Type of cuisine
- price_range (string or null): "$", "$$", "$$$", or "$$$$"
- rating (number or null): Rating 0-5
- review_count (integer or null): Number of reviews
- phone (string or null): Phone number
- address (string or null): Full address
- website (string or null): Website URL
- hours (object or null): Business hours
CRITICAL RULES FOR MISSING DATA:
1. Use null for missing fields (NEVER guess or invent data)
2. Use null for unclear or ambiguous data
3. Use null for fields marked "Coming Soon" or "TBD"
4. If rating is shown as "New" or "Not yet rated", use null
5. Empty strings are NOT acceptable - use null instead
Example of correct null handling:
{
"name": "Restaurant Name",
"cuisine": "Italian",
"price_range": "$$",
"rating": null, // Marked as "New"
"review_count": null, // Not shown
"phone": null, // Not provided
"address": "123 Main St",
"website": null, // No link found
"hours": null // Not available
}
HTML:
{html_content}
"""
Adding Validation Rules
validation_prompt = """
Extract product data with built-in validation.
Fields and validation rules:
- sku (string): Must be alphanumeric, 6-20 characters
- name (string): Required, minimum 3 characters
- price (number): Must be positive, maximum 2 decimal places
- compare_price (number or null): If present, must be >= price
- quantity (integer): Must be non-negative integer
- weight (number or null): If present, must be positive
- dimensions (object or null): {
"length": number,
"width": number,
"height": number,
"unit": "in" or "cm"
}
- url (string): Must be valid HTTP/HTTPS URL
- email (string or null): Must be valid email format if present
Validation instructions:
1. If price has more than 2 decimals, round to 2
2. If compare_price < price, set compare_price to null (data error)
3. Remove any non-alphanumeric characters from sku
4. Validate URL format - must start with http:// or https://
5. If any required field is missing, skip that entire product
Return only products that pass all validation rules.
HTML:
{html_content}
"""
Prompts for Different Content Types
News Articles and Press Releases
news_prompt = """
Extract news article or press release information.
Fields to extract:
- headline (string): Main article headline
- subheadline (string or null): Secondary headline or deck
- byline (string or null): Author attribution line
- dateline (string or null): Location and date (e.g., "NEW YORK, Jan 15")
- publish_datetime (string): ISO 8601 format (YYYY-MM-DDTHH:MM:SS)
- update_datetime (string or null): Last updated timestamp
- section (string or null): News section/category
- tags (array): Article tags and topics
- body_text (string): Full article text, paragraphs separated by \n\n
- lead_paragraph (string): First/lead paragraph
- image_caption (string or null): Main image caption
- image_credit (string or null): Photo credit
- related_articles (array): [{title: "", url: ""}]
Press release specific fields:
- company_name (string or null): Company issuing release
- contact_info (object or null): {name: "", email: "", phone: ""}
- boilerplate (string or null): Company description paragraph
HTML:
{html_content}
"""
Forum Posts and Comments
forum_prompt = """
Extract forum thread with all posts/comments.
Thread data:
- title (string): Thread title
- category (string or null): Forum category/section
- created_date (string): Thread creation date YYYY-MM-DD
- view_count (integer or null): Number of views
- reply_count (integer): Number of replies
- is_locked (boolean): Whether thread is locked
- is_pinned (boolean): Whether thread is pinned/sticky
- tags (array): Thread tags
For each post:
- post_id (string or null): Post ID if available
- author (string): Username of poster
- author_role (string or null): "Admin", "Moderator", "Member", etc.
- post_date (string): Post timestamp YYYY-MM-DD HH:MM
- content (string): Post content text
- quote_text (string or null): Quoted text if replying to another post
- upvotes (integer or null): Upvote/like count
- is_solution (boolean): Whether marked as solution/answer
Return format:
{
"thread": {
"title": "...",
...
},
"posts": [...]
}
HTML:
{html_content}
"""
Using Prompts with Browser Automation
When scraping dynamic content, combine ChatGPT prompts with browser automation so that AJAX-loaded and lazy-loaded content finishes rendering before extraction:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function dynamicScrapeWithGPT(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate and wait for dynamic content
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Scroll to load lazy-loaded content
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight);
  });
  // page.waitForTimeout was removed in recent Puppeteer versions
  await new Promise((resolve) => setTimeout(resolve, 2000));

  // Get fully rendered HTML
  const html = await page.content();
  await browser.close();

  // Extract with ChatGPT
  const prompt = `
Extract all product listings from this page.
For each product:
- name (string)
- price (number)
- image_url (string)
- product_url (string)
Return as: {"products": [...]}
HTML:
${html.substring(0, 15000)}
`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // JSON mode requires gpt-4-turbo or a newer model
    messages: [
      {
        role: 'system',
        content: 'Extract structured product data from HTML. Return valid JSON only.'
      },
      { role: 'user', content: prompt }
    ],
    temperature: 0,
    response_format: { type: 'json_object' }
  });

  return JSON.parse(response.choices[0].message.content);
}
Testing and Iterating Prompts
Create a systematic testing framework:
import json

def test_extraction_prompt(prompt_template, test_cases):
    """Test a prompt against multiple HTML samples."""
    results = []
    for i, test_case in enumerate(test_cases):
        html = test_case['html']
        expected = test_case.get('expected_fields', [])

        response = client.chat.completions.create(
            model="gpt-4o",  # JSON mode requires gpt-4-turbo or a newer model
            messages=[
                {
                    "role": "system",
                    "content": "Extract structured data as valid JSON."
                },
                {
                    "role": "user",
                    # .replace() instead of .format(): templates contain literal JSON braces
                    "content": prompt_template.replace("{html_content}", html)
                }
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )

        extracted = json.loads(response.choices[0].message.content)

        # Rough check that every expected field appears somewhere in the output
        missing_fields = [f for f in expected if f not in str(extracted)]

        results.append({
            'test_case': i + 1,
            'success': len(missing_fields) == 0,
            'missing_fields': missing_fields,
            'token_count': response.usage.total_tokens,
            'extracted_data': extracted
        })

    # Calculate success rate
    success_rate = sum(1 for r in results if r['success']) / len(results)
    avg_tokens = sum(r['token_count'] for r in results) / len(results)

    return {
        'success_rate': success_rate,
        'average_tokens': avg_tokens,
        'results': results
    }

# Example usage
test_cases = [
    {
        'html': '<div class="product">...</div>',
        'expected_fields': ['name', 'price', 'availability']
    },
    # Add more test cases
]

results = test_extraction_prompt(product_prompt, test_cases)
print(f"Success rate: {results['success_rate'] * 100:.1f}%")
print(f"Average tokens: {results['average_tokens']:.0f}")
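The substring check in the framework (f not in str(extracted)) is deliberately loose: a field name appearing anywhere in the serialized output counts as present. When exact keys matter, walking the parsed structure is stricter. A sketch:

```python
def collect_keys(obj, found=None):
    """Recursively gather every dict key in a parsed JSON structure."""
    if found is None:
        found = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            found.add(k)
            collect_keys(v, found)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, found)
    return found

# Fields absent from the actual key set are flagged, even if the text appears elsewhere
missing = [f for f in ["name", "price"] if f not in collect_keys({"products": [{"name": "A"}]})]
```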
Cost Optimization Strategies
Minimize HTML Before Sending
import re
from bs4 import BeautifulSoup, Comment

def optimize_html_for_extraction(html, target_section=None):
    """Clean and minimize HTML to reduce token usage."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unnecessary elements
    for element in soup(['script', 'style', 'svg', 'noscript', 'iframe']):
        element.decompose()

    # Extract only the target section if specified
    if target_section:
        target = soup.select_one(target_section)
        if target:
            soup = BeautifulSoup(str(target), 'html.parser')

    # Remove comments (bs4 exposes them as Comment nodes, not raw "<!--" text)
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Simplify attributes: keep only the useful ones
    for tag in soup.find_all():
        keep = ['class', 'id', 'href', 'src', 'alt', 'title', 'data-price', 'data-id']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}

    # Minimize whitespace
    html_str = str(soup)
    html_str = re.sub(r'\s+', ' ', html_str)
    html_str = re.sub(r'>\s+<', '><', html_str)

    return html_str.strip()

# Use before sending to ChatGPT
cleaned_html = optimize_html_for_extraction(raw_html, '.product-grid')
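The payoff is measurable in tokens. A rough stdlib-only estimate (about four characters per token for English text and HTML is a common rule of thumb; use tiktoken for exact counts) makes the savings easy to log:

```python
def rough_token_count(text: str) -> int:
    """Rule-of-thumb estimate: ~4 characters per token for English/HTML.
    Use tiktoken when exact counts matter."""
    return max(1, len(text) // 4)

before = "<div class='product'>  <script>track()</script>  <span>$29.99</span>  </div>"
after = "<div class='product'><span>$29.99</span></div>"
saved = rough_token_count(before) - rough_token_count(after)
```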
Common Pitfalls and Solutions
Pitfall 1: Inconsistent Field Names
Problem: AI returns different key names across requests
// Request 1
{"product_name": "Item"}
// Request 2
{"name": "Item"}
Solution: Explicitly define exact key names in prompt
"Return JSON with EXACTLY these keys: product_name, product_price, product_url"
Pitfall 2: Invalid JSON Output
Problem: AI includes explanatory text with JSON
Solution: Use the response_format parameter and reinforce it in the prompt
response_format={"type": "json_object"}
# In prompt: "Return ONLY valid JSON, no other text or explanation"
Pitfall 3: Hallucinated Data
Problem: AI invents data when it's not present
Solution: Explicitly instruct to use null
"CRITICAL: If data is not present in the HTML, use null. Never guess or invent information."
Conclusion
Effective ChatGPT prompts for data extraction require precision, clear structure, and comprehensive instructions. The templates and techniques in this guide provide a solid foundation for building reliable AI-powered web scrapers.
Key takeaways:
- Be explicit about data types, formats, and structure
- Provide examples through few-shot learning for complex extractions
- Handle edge cases by defining rules for missing and ambiguous data
- Validate output against expected schemas
- Test systematically across multiple HTML samples
- Optimize for cost by cleaning HTML and using appropriate models
When working with JavaScript-heavy websites, combine these prompts with browser automation to interact with DOM elements before extraction.
Remember that prompt engineering is iterative. Start with basic templates, test against real-world data, and refine based on results. Monitor accuracy, token usage, and costs to find the optimal balance for your use case.