What Are Some Effective ChatGPT Prompts for Data Extraction?
Effective ChatGPT prompts for data extraction are the cornerstone of successful AI-powered web scraping. The quality of your prompts directly determines the accuracy, consistency, and reliability of extracted data. This guide provides battle-tested prompt templates and techniques for extracting structured data from HTML using ChatGPT and other AI models.
Understanding Effective Data Extraction Prompts
A well-crafted data extraction prompt has several essential components:
- Clear context - Defines the AI's role and objective
- Specific instructions - Details exactly what data to extract
- Schema definition - Specifies the output structure and data types
- Handling rules - Explains how to deal with missing or ambiguous data
- Format requirements - Ensures consistent JSON output
The difference between a weak and strong prompt can mean the difference between 60% and 95% extraction accuracy.
Essential Prompt Templates
Template 1: Basic Product Extraction
import openai
client = openai.OpenAI(api_key="your-api-key")
prompt = """
Extract all product information from this HTML page.
For each product, extract:
- name (string): The full product title
- price (number): Price as a decimal number, remove currency symbols
- currency (string): Currency code (USD, EUR, GBP, etc.)
- availability (boolean): true if in stock, false otherwise
- rating (number or null): Rating from 0-5, null if not available
- review_count (integer or null): Number of reviews, null if not available
- image_url (string or null): Main product image URL
Return JSON in this exact format:
{
"products": [
{
"name": "Product Name",
"price": 29.99,
"currency": "USD",
"availability": true,
"rating": 4.5,
"review_count": 127,
"image_url": "https://..."
}
]
}
Rules:
- If a field is missing, use null (not empty string)
- Parse star ratings like "4.5 stars" to 4.5
- Remove thousands separators from prices (1,299.99 becomes 1299.99)
- Convert "In Stock", "Available Now" to true
- Convert "Out of Stock", "Sold Out" to false
HTML Content:
{html_content}
"""
import json

response = client.chat.completions.create(
    model="gpt-4o",  # JSON mode requires gpt-4-turbo or a newer model, not base gpt-4
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction specialist. Extract structured data from HTML and return valid JSON only."
        },
        {
            "role": "user",
            # str.format() would trip over the literal JSON braces in the template,
            # so substitute the placeholder directly
            "content": prompt.replace("{html_content}", html)
        }
    ],
    temperature=0,  # deterministic output
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
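Whatever the template, the reply arrives as a JSON string that can occasionally be malformed (truncated output, stray prose around the braces). A small defensive parser, sketched here assuming a top-level products array as in the template above, makes failures explicit rather than letting bad data propagate:

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse model output defensively: malformed JSON or a missing
    top-level array should fail loudly, not silently."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model returned invalid JSON: {e}") from e
    if not isinstance(data.get("products"), list):
        raise ValueError("Missing expected top-level 'products' array")
    return data

# Well-formed response parses normally:
parsed = parse_extraction('{"products": [{"name": "Widget", "price": 29.99}]}')
```

Wrap the call in a retry loop in production; a single re-request with the same prompt often recovers from transient formatting failures.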
Template 2: Article/Blog Post Extraction
prompt = """
Extract article information from this webpage.
Required fields:
- title (string): The main article headline
- author (string or null): Author name, null if not found
- publish_date (string or null): Date in YYYY-MM-DD format, null if not available or relative
- updated_date (string or null): Last updated date in YYYY-MM-DD format
- category (string or null): Article category or section
- tags (array): List of tags/keywords, empty array if none
- content (string): Full article body text, excluding ads and navigation
- excerpt (string or null): Article summary or description
- reading_time (integer or null): Estimated reading time in minutes
- featured_image (string or null): URL of main article image
Output format:
{
"article": {
"title": "...",
"author": "...",
...
}
}
Important rules:
1. Extract only the main article content, exclude:
- Navigation menus
- Sidebar widgets
- Advertisement text
- Footer content
- Comments section
2. For publish_date, only extract if in absolute format (not "2 days ago")
3. Multiple authors should be comma-separated: "John Doe, Jane Smith"
4. Tags should be lowercase without # symbols
5. Preserve paragraph breaks in content using \n\n
HTML:
{html_content}
"""
Template 3: E-commerce Listing with Variants
prompt = """
Extract product listing with all variants and specifications.
Schema:
{
"product": {
"name": "string",
"brand": "string or null",
"base_price": "number",
"currency": "string",
"description": "string",
"specifications": {
"key": "value"
},
"variants": [
{
"id": "string",
"name": "string (e.g., 'Red - Large')",
"attributes": {
"color": "string",
"size": "string"
},
"price": "number",
"sku": "string or null",
"in_stock": "boolean"
}
],
"images": ["array of image URLs"],
"rating": {
"average": "number (0-5)",
"count": "integer"
}
}
}
Extraction instructions:
1. Parse all color/size combinations as separate variants
2. Extract technical specifications into key-value pairs
3. Normalize specification keys (e.g., "RAM", "Memory" → "ram")
4. Each variant should have its own price and availability
5. Main images array should include all product photos
HTML:
{html_content}
"""
Template 4: Job Listings Extraction
const prompt = `
Extract all job postings from this page.
For each job, extract:
- title (string): Job title/position name
- company (string): Company name
- location (string): Job location (city, state/country, or "Remote")
- employment_type (string): "Full-time", "Part-time", "Contract", "Internship", etc.
- experience_level (string or null): "Entry", "Mid", "Senior", "Executive", etc.
- salary_min (number or null): Minimum salary as annual amount
- salary_max (number or null): Maximum salary as annual amount
- salary_currency (string or null): Currency code
- salary_period (string or null): "annual", "monthly", "hourly"
- posted_date (string or null): Date in YYYY-MM-DD format
- description (string): Job description text
- requirements (array): List of requirements/qualifications
- benefits (array): List of benefits, empty array if none
- application_url (string or null): URL to apply
Output:
{
"jobs": [...]
}
Salary parsing rules:
- "$120k-$160k" → min: 120000, max: 160000, currency: "USD", period: "annual"
- "$50/hour" → min: 50, max: null, currency: "USD", period: "hourly"
- "€60,000 per year" → min: 60000, max: null, currency: "EUR", period: "annual"
- If no salary info, all salary fields should be null
HTML:
${htmlContent}
`;
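The salary rules above are easy to double-check deterministically after extraction. A minimal sanity check (a sketch; field names follow the template) flags swapped or negative ranges before they reach your database:

```python
def salary_sane(job: dict) -> bool:
    """Cheap post-check on extracted salary fields: nulls are fine
    together, but a max below the min signals a parse error."""
    lo, hi = job.get("salary_min"), job.get("salary_max")
    if any(v is not None and v < 0 for v in (lo, hi)):
        return False
    if lo is not None and hi is not None and hi < lo:
        return False
    return True
```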
Template 5: Real Estate Listings
prompt = """
Extract real estate property listings.
Schema per property:
- id (string or null): Property ID or listing number
- title (string): Property listing title
- price (number): Listing price as number
- currency (string): Currency code
- price_type (string): "sale", "rent_monthly", "rent_weekly"
- property_type (string): "house", "apartment", "condo", "townhouse", etc.
- bedrooms (number or null): Number of bedrooms
- bathrooms (number or null): Number of bathrooms
- square_feet (number or null): Interior square footage
- lot_size (number or null): Lot size in square feet
- year_built (integer or null): Year property was built
- address (object): {
"street": "string or null",
"city": "string",
"state": "string or null",
"zip": "string or null",
"country": "string"
}
- features (array): List of property features/amenities
- description (string): Property description
- images (array): Array of image URLs
- agent_name (string or null): Listing agent name
- agent_phone (string or null): Agent contact number
Return as: {"properties": [...]}
Parsing rules:
- "3 bed, 2 bath" → bedrooms: 3, bathrooms: 2
- "1,850 sq ft" → square_feet: 1850
- "Built in 2015" → year_built: 2015
- Extract all amenities as features array: ["Pool", "Garage", "Fireplace"]
HTML:
{html_content}
"""
Advanced Prompt Techniques
Few-Shot Learning for Complex Data
Few-shot prompts provide examples to guide the AI:
prompt = """
Extract event information from this HTML. Use these examples as a guide:
EXAMPLE 1:
Input HTML: "<div class='event'><h3>Tech Conference 2024</h3><p>March 15-17, 2024 | San Francisco, CA</p><span>$299</span></div>"
Output:
{
"name": "Tech Conference 2024",
"start_date": "2024-03-15",
"end_date": "2024-03-17",
"location": "San Francisco, CA",
"price": 299,
"currency": "USD"
}
EXAMPLE 2:
Input HTML: "<div><h2>Free Webinar: AI Trends</h2><time>2024-04-01 2:00 PM EST</time><p>Online Event</p></div>"
Output:
{
"name": "Free Webinar: AI Trends",
"start_date": "2024-04-01",
"end_date": "2024-04-01",
"location": "Online",
"price": 0,
"currency": null
}
EXAMPLE 3:
Input HTML: "<article><h1>Summer Music Festival</h1><div>June 20-22, 2024 | Central Park</div><p>$75-$150</p></article>"
Output:
{
"name": "Summer Music Festival",
"start_date": "2024-06-20",
"end_date": "2024-06-22",
"location": "Central Park",
"price": 75,
"currency": "USD"
}
Now extract all events from this HTML following the same pattern:
{html_content}
Return as: {"events": [...]}
"""
Chain-of-Thought for Complex Extraction
Guide the AI through reasoning steps:
prompt = """
Extract structured data from this product specification page.
Follow these reasoning steps:
STEP 1: Identify the specifications section
Look for tables, lists, or div sections containing technical specifications.
STEP 2: Parse each specification
For each spec, extract the label and value. Common patterns:
- "Display: 15.6 inch LED" → {"display_size": 15.6, "display_type": "LED"}
- "RAM: 16GB DDR4" → {"ram_gb": 16, "ram_type": "DDR4"}
- "Storage: 512GB SSD" → {"storage_gb": 512, "storage_type": "SSD"}
- "Battery: Up to 10 hours" → {"battery_hours": 10}
- "Weight: 3.5 lbs" → {"weight_lbs": 3.5}
STEP 3: Normalize specification names
Convert various labels to consistent keys:
- "Screen Size", "Display", "Monitor" → "display_size"
- "Memory", "RAM" → "ram_gb"
- "Hard Drive", "Storage", "SSD" → "storage_gb"
- "Processor", "CPU" → "processor"
STEP 4: Extract numeric values and units
Parse numbers and preserve units where important:
- "2.4 GHz" → 2.4
- "16GB" → 16
- "1920x1080" → {"width": 1920, "height": 1080}
STEP 5: Return structured output
{
"specifications": {
"display_size": 15.6,
"display_type": "LED",
"ram_gb": 16,
"storage_gb": 512,
...
}
}
Now extract specifications from:
{html_content}
"""
Multi-Level Nested Data Extraction
prompt = """
Extract the complete category hierarchy with products.
This page shows categories, subcategories, and products within each.
Target structure:
{
"categories": [
{
"name": "Main Category Name",
"url": "category URL",
"subcategories": [
{
"name": "Subcategory Name",
"url": "subcategory URL",
"product_count": 0
}
],
"featured_products": [
{
"name": "Product Name",
"price": 0,
"url": "product URL",
"image": "image URL"
}
]
}
]
}
Extraction rules:
1. Maintain the hierarchy: category → subcategories → products
2. Extract product counts from text like "Electronics (245 items)" → 245
3. Featured products should only include products explicitly shown, not just linked
4. URLs should be absolute paths (include domain if relative)
5. If no featured products are visible, use empty array
HTML:
{html_content}
"""
Prompts for Specific Data Types
Extracting Tables
table_prompt = """
Extract data from the HTML table.
Instructions:
1. First row typically contains headers
2. Convert headers to snake_case keys (e.g., "Product Name" → "product_name")
3. Parse numeric columns as numbers
4. Parse date columns to YYYY-MM-DD format if possible
5. Keep text columns as strings
6. Empty cells should be null
Return format:
{
"table": {
"headers": ["column1", "column2", ...],
"rows": [
{"column1": "value", "column2": 123, ...}
]
}
}
HTML:
{html_content}
"""
Extracting Contact Information
contact_prompt = """
Extract all contact information from this page.
Find and extract:
- email (string or array): Email address(es)
- phone (string or array): Phone number(s) with country code
- address (object): {
"street": "string",
"city": "string",
"state": "string",
"zip": "string",
"country": "string"
}
- social_media (object): {
"facebook": "URL or null",
"twitter": "URL or null",
"linkedin": "URL or null",
"instagram": "URL or null"
}
- business_hours (array or null): [
{"day": "Monday", "hours": "9:00 AM - 5:00 PM"}
]
Parsing rules:
- Validate email format (must contain @ and domain)
- Phone numbers should include country code when available
- Extract social media URLs from links or embedded widgets
- Parse business hours even if in paragraph format
HTML:
{html_content}
"""
Extracting Reviews and Ratings
reviews_prompt = """
Extract customer reviews from this product page.
For each review, extract:
- reviewer_name (string): Name of reviewer
- rating (number): Rating from 1-5
- date (string): Review date in YYYY-MM-DD format if possible
- title (string or null): Review headline/title
- content (string): Full review text
- verified_purchase (boolean or null): true if marked as verified buyer
- helpful_count (integer or null): Number of "helpful" votes
- images (array): URLs of review images, empty array if none
Summary data:
- total_reviews (integer): Total number of reviews
- average_rating (number): Overall average rating
- rating_distribution (object): {
"5_star": 120,
"4_star": 45,
"3_star": 10,
"2_star": 3,
"1_star": 2
}
Return format:
{
"summary": {
"total_reviews": 180,
"average_rating": 4.5,
"rating_distribution": {...}
},
"reviews": [...]
}
HTML:
{html_content}
"""
Optimizing Prompts for Accuracy
Handling Missing Data
prompt = """
Extract restaurant information with strict null handling.
Fields:
- name (string, REQUIRED): Restaurant name
- cuisine (string or null): Type of cuisine
- price_range (string or null): "$", "$$", "$$$", or "$$$$"
- rating (number or null): Rating 0-5
- review_count (integer or null): Number of reviews
- phone (string or null): Phone number
- address (string or null): Full address
- website (string or null): Website URL
- hours (object or null): Business hours
CRITICAL RULES FOR MISSING DATA:
1. Use null for missing fields (NEVER guess or invent data)
2. Use null for unclear or ambiguous data
3. Use null for fields marked "Coming Soon" or "TBD"
4. If rating is shown as "New" or "Not yet rated", use null
5. Empty strings are NOT acceptable - use null instead
Example of correct null handling:
{
"name": "Restaurant Name",
"cuisine": "Italian",
"price_range": "$$",
"rating": null, // Marked as "New"
"review_count": null, // Not shown
"phone": null, // Not provided
"address": "123 Main St",
"website": null, // No link found
"hours": null // Not available
}
HTML:
{html_content}
"""
Adding Validation Rules
validation_prompt = """
Extract product data with built-in validation.
Fields and validation rules:
- sku (string): Must be alphanumeric, 6-20 characters
- name (string): Required, minimum 3 characters
- price (number): Must be positive, maximum 2 decimal places
- compare_price (number or null): If present, must be >= price
- quantity (integer): Must be non-negative integer
- weight (number or null): If present, must be positive
- dimensions (object or null): {
"length": number,
"width": number,
"height": number,
"unit": "in" or "cm"
}
- url (string): Must be valid HTTP/HTTPS URL
- email (string or null): Must be valid email format if present
Validation instructions:
1. If price has more than 2 decimals, round to 2
2. If compare_price < price, set compare_price to null (data error)
3. Remove any non-alphanumeric characters from sku
4. Validate URL format - must start with http:// or https://
5. If any required field is missing, skip that entire product
Return only products that pass all validation rules.
HTML:
{html_content}
"""
Prompts for Different Content Types
News Articles and Press Releases
news_prompt = """
Extract news article or press release information.
Fields to extract:
- headline (string): Main article headline
- subheadline (string or null): Secondary headline or deck
- byline (string or null): Author attribution line
- dateline (string or null): Location and date (e.g., "NEW YORK, Jan 15")
- publish_datetime (string): ISO 8601 format (YYYY-MM-DDTHH:MM:SS)
- update_datetime (string or null): Last updated timestamp
- section (string or null): News section/category
- tags (array): Article tags and topics
- body_text (string): Full article text, paragraphs separated by \n\n
- lead_paragraph (string): First/lead paragraph
- image_caption (string or null): Main image caption
- image_credit (string or null): Photo credit
- related_articles (array): [{title: "", url: ""}]
Press release specific fields:
- company_name (string or null): Company issuing release
- contact_info (object or null): {name: "", email: "", phone: ""}
- boilerplate (string or null): Company description paragraph
HTML:
{html_content}
"""
Forum Posts and Comments
forum_prompt = """
Extract forum thread with all posts/comments.
Thread data:
- title (string): Thread title
- category (string or null): Forum category/section
- created_date (string): Thread creation date YYYY-MM-DD
- view_count (integer or null): Number of views
- reply_count (integer): Number of replies
- is_locked (boolean): Whether thread is locked
- is_pinned (boolean): Whether thread is pinned/sticky
- tags (array): Thread tags
For each post:
- post_id (string or null): Post ID if available
- author (string): Username of poster
- author_role (string or null): "Admin", "Moderator", "Member", etc.
- post_date (string): Post timestamp YYYY-MM-DD HH:MM
- content (string): Post content text
- quote_text (string or null): Quoted text if replying to another post
- upvotes (integer or null): Upvote/like count
- is_solution (boolean): Whether marked as solution/answer
Return format:
{
"thread": {
"title": "...",
...
},
"posts": [...]
}
HTML:
{html_content}
"""
Using Prompts with Browser Automation
When scraping dynamic content, combine ChatGPT prompts with browser automation so that AJAX-loaded and lazy-loaded content finishes rendering before extraction:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function dynamicScrapeWithGPT(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate and wait for dynamic content
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Scroll to load lazy-loaded content
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight);
  });
  // page.waitForTimeout was removed in recent Puppeteer versions
  await new Promise((resolve) => setTimeout(resolve, 2000));

  // Get fully rendered HTML
  const html = await page.content();
  await browser.close();

  // Extract with ChatGPT
  const prompt = `
Extract all product listings from this page.
For each product:
- name (string)
- price (number)
- image_url (string)
- product_url (string)
Return as: {"products": [...]}
HTML:
${html.substring(0, 15000)}
`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // JSON mode requires gpt-4-turbo or a newer model
    messages: [
      {
        role: 'system',
        content: 'Extract structured product data from HTML. Return valid JSON only.'
      },
      { role: 'user', content: prompt }
    ],
    temperature: 0,
    response_format: { type: 'json_object' }
  });

  return JSON.parse(response.choices[0].message.content);
}
Testing and Iterating Prompts
Create a systematic testing framework:
import json

def test_extraction_prompt(prompt_template, test_cases):
    """Test a prompt against multiple HTML samples."""
    results = []
    for i, test_case in enumerate(test_cases):
        html = test_case['html']
        expected = test_case.get('expected_fields', [])

        response = client.chat.completions.create(
            model="gpt-4o",  # JSON mode requires gpt-4-turbo or a newer model
            messages=[
                {
                    "role": "system",
                    "content": "Extract structured data as valid JSON."
                },
                {
                    "role": "user",
                    # .replace() instead of .format(): templates contain literal JSON braces
                    "content": prompt_template.replace("{html_content}", html)
                }
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )

        extracted = json.loads(response.choices[0].message.content)

        # Rough check that every expected field appears somewhere in the output
        missing_fields = [f for f in expected if f not in str(extracted)]

        results.append({
            'test_case': i + 1,
            'success': len(missing_fields) == 0,
            'missing_fields': missing_fields,
            'token_count': response.usage.total_tokens,
            'extracted_data': extracted
        })

    # Calculate success rate
    success_rate = sum(1 for r in results if r['success']) / len(results)
    avg_tokens = sum(r['token_count'] for r in results) / len(results)

    return {
        'success_rate': success_rate,
        'average_tokens': avg_tokens,
        'results': results
    }

# Example usage
test_cases = [
    {
        'html': '<div class="product">...</div>',
        'expected_fields': ['name', 'price', 'availability']
    },
    # Add more test cases
]

results = test_extraction_prompt(product_prompt, test_cases)
print(f"Success rate: {results['success_rate'] * 100:.1f}%")
print(f"Average tokens: {results['average_tokens']:.0f}")
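The substring check in the framework (f not in str(extracted)) is deliberately loose: a field name appearing anywhere in the serialized output counts as present. When exact keys matter, walking the parsed structure is stricter. A sketch:

```python
def collect_keys(obj, found=None):
    """Recursively gather every dict key in a parsed JSON structure."""
    if found is None:
        found = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            found.add(k)
            collect_keys(v, found)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, found)
    return found

# Fields absent from the actual key set are flagged, even if the text appears elsewhere
missing = [f for f in ["name", "price"] if f not in collect_keys({"products": [{"name": "A"}]})]
```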
Cost Optimization Strategies
Minimize HTML Before Sending
import re
from bs4 import BeautifulSoup, Comment

def optimize_html_for_extraction(html, target_section=None):
    """Clean and minimize HTML to reduce token usage."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unnecessary elements
    for element in soup(['script', 'style', 'svg', 'noscript', 'iframe']):
        element.decompose()

    # Extract only the target section if specified
    if target_section:
        target = soup.select_one(target_section)
        if target:
            soup = BeautifulSoup(str(target), 'html.parser')

    # Remove comments (bs4 exposes them as Comment nodes, not raw "<!--" text)
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Simplify attributes: keep only the useful ones
    for tag in soup.find_all():
        keep = ['class', 'id', 'href', 'src', 'alt', 'title', 'data-price', 'data-id']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}

    # Minimize whitespace
    html_str = str(soup)
    html_str = re.sub(r'\s+', ' ', html_str)
    html_str = re.sub(r'>\s+<', '><', html_str)

    return html_str.strip()

# Use before sending to ChatGPT
cleaned_html = optimize_html_for_extraction(raw_html, '.product-grid')
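The payoff is measurable in tokens. A rough stdlib-only estimate (about four characters per token for English text and HTML is a common rule of thumb; use tiktoken for exact counts) makes the savings easy to log:

```python
def rough_token_count(text: str) -> int:
    """Rule-of-thumb estimate: ~4 characters per token for English/HTML.
    Use tiktoken when exact counts matter."""
    return max(1, len(text) // 4)

before = "<div class='product'>  <script>track()</script>  <span>$29.99</span>  </div>"
after = "<div class='product'><span>$29.99</span></div>"
saved = rough_token_count(before) - rough_token_count(after)
```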
Common Pitfalls and Solutions
Pitfall 1: Inconsistent Field Names
Problem: AI returns different key names across requests
// Request 1
{"product_name": "Item"}
// Request 2
{"name": "Item"}
Solution: Explicitly define exact key names in prompt
"Return JSON with EXACTLY these keys: product_name, product_price, product_url"
Pitfall 2: Invalid JSON Output
Problem: AI includes explanatory text with JSON
Solution: Use the response_format parameter and reinforce it in the prompt
response_format={"type": "json_object"}
# In prompt: "Return ONLY valid JSON, no other text or explanation"
Pitfall 3: Hallucinated Data
Problem: AI invents data when it's not present
Solution: Explicitly instruct to use null
"CRITICAL: If data is not present in the HTML, use null. Never guess or invent information."
Conclusion
Effective ChatGPT prompts for data extraction require precision, clear structure, and comprehensive instructions. The templates and techniques in this guide provide a solid foundation for building reliable AI-powered web scrapers.
Key takeaways:
- Be explicit about data types, formats, and structure
- Provide examples through few-shot learning for complex extractions
- Handle edge cases by defining rules for missing and ambiguous data
- Validate output against expected schemas
- Test systematically across multiple HTML samples
- Optimize for cost by cleaning HTML and using appropriate models
When working with JavaScript-heavy websites, combine these prompts with browser automation to interact with DOM elements before extraction.
Remember that prompt engineering is iterative. Start with basic templates, test against real-world data, and refine based on results. Monitor accuracy, token usage, and costs to find the optimal balance for your use case.