What are prompt engineering examples for web scraping?
Prompt engineering is the art of crafting effective instructions for Large Language Models (LLMs) like GPT to extract structured data from web pages. Unlike traditional web scraping that relies on CSS selectors or XPath, prompt-based scraping uses natural language instructions to tell AI models what data to extract and how to format it.
Understanding Prompt Engineering for Web Scraping
Prompt engineering for web scraping involves creating clear, specific instructions that guide an LLM to:
- Identify relevant data on a web page
- Extract information accurately
- Format the output in a structured way (JSON, CSV, etc.)
- Handle edge cases and missing data
The key advantage is that well-crafted prompts can adapt to varying HTML structures without needing to update selectors when websites change their layouts.
Basic Prompt Structure for Data Extraction
A good web scraping prompt typically includes:
1. Context: What type of page you're scraping
2. Task: What data to extract
3. Format: How to structure the output
4. Constraints: Rules for handling edge cases
Here's a basic example:
from openai import OpenAI

client = OpenAI()
html_content = """
<div class="product">
<h1>Wireless Headphones</h1>
<span class="price">$79.99</span>
<p class="description">High-quality Bluetooth headphones</p>
</div>
"""
prompt = f"""
Extract product information from the following HTML.
HTML:
{html_content}
Return the data as JSON with these fields:
- name: product name
- price: numeric price value
- description: product description
If a field is missing, use null.
"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt}
    ]
)
print(response.choices[0].message.content)
Example 1: E-commerce Product Scraping
For scraping product listings, use specific prompts that handle common variations:
prompt = """
You are extracting product data from an e-commerce website.
Extract ALL products from this HTML and return as a JSON array.
For each product, extract:
- title (string): Product name
- price (number): Price in USD, extract just the number
- currency (string): Currency code
- rating (number): Star rating if present, null otherwise
- reviewCount (number): Number of reviews if present, null otherwise
- inStock (boolean): true if in stock, false if out of stock, null if unknown
- imageUrl (string): Primary product image URL
Rules:
- If price has a discount, use the discounted price
- Convert all prices to numbers (remove $ and commas)
- Extract full absolute URLs for images
- Return empty array [] if no products found
HTML:
{html_content}
Return only valid JSON, no explanations.
"""
Example 2: Article Metadata Extraction
When scraping blog posts or news articles, focus on semantic content:
const prompt = `
Extract article metadata from this HTML page.
Return JSON with:
{
"title": "Article title",
"author": "Author name or null",
"publishDate": "ISO 8601 date or null",
"tags": ["tag1", "tag2"] or [],
"content": "Main article text, cleaned",
"readTime": "Estimated read time or null",
"category": "Article category or null"
}
Instructions:
- Extract the main article content, removing ads and navigation
- Clean up extra whitespace in content
- Parse dates to ISO 8601 format (YYYY-MM-DD)
- Tags should be lowercase
- If content is too long, include first 500 words only
HTML:
${htmlContent}
Return only valid JSON.
`;
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    messages: [
      {role: 'system', content: 'You extract structured data from HTML accurately.'},
      {role: 'user', content: prompt}
    ]
  })
});
const data = await response.json();
const article = JSON.parse(data.choices[0].message.content);
Example 3: Table Data Extraction
For structured data in HTML tables, provide clear schema definitions:
prompt = """
Extract data from the HTML table below.
Expected columns: Name, Email, Phone, Department, Status
Return as JSON array where each object represents a row:
[
{
"name": "string",
"email": "string",
"phone": "string (format: XXX-XXX-XXXX)",
"department": "string",
"status": "active" | "inactive"
}
]
Instructions:
- Normalize phone numbers to XXX-XXX-XXXX format
- Convert emails to lowercase
- Map status variations (Active/Yes/✓) to "active", others to "inactive"
- Skip header rows
- Skip rows with missing name or email
HTML:
{table_html}
Return only the JSON array.
"""
Example 4: Few-Shot Prompting for Complex Extraction
Few-shot prompting provides examples to improve accuracy:
prompt = """
Extract job posting information from HTML.
Example 1:
Input: <div class="job"><h2>Senior Developer</h2><span>$120k-150k</span><p>Remote</p></div>
Output: {"title": "Senior Developer", "salary_min": 120000, "salary_max": 150000, "location": "Remote"}
Example 2:
Input: <div class="job"><h2>Marketing Manager</h2><span>80k/year</span><p>New York, NY</p></div>
Output: {"title": "Marketing Manager", "salary_min": 80000, "salary_max": 80000, "location": "New York, NY"}
Example 3:
Input: <div class="job"><h2>Data Analyst</h2><span>Competitive</span><p>Hybrid - Austin</p></div>
Output: {"title": "Data Analyst", "salary_min": null, "salary_max": null, "location": "Austin"}
Now extract from this HTML:
{html_content}
Return only the JSON object.
"""
Example 5: Handling Dynamic Content with Context
When dealing with JavaScript-rendered content, combine browser automation with GPT:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');
async function scrapeWithContext(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for dynamic content to render before capturing the HTML
  await page.waitForSelector('.product-list');
  const html = await page.content();
  await browser.close();

  const prompt = `
This HTML is from a single-page application that loaded product data dynamically.
Extract all products visible on the page.
Return as JSON array with:
- id: product identifier
- name: product name
- price: numeric price
- availability: "in_stock" | "out_of_stock" | "preorder"
HTML:
${html}
Return only valid JSON array.
`;

  const openai = new OpenAI();
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {role: 'system', content: 'You are a precise web scraping assistant.'},
      {role: 'user', content: prompt}
    ]
  });

  return JSON.parse(response.choices[0].message.content);
}
Similar to handling AJAX requests using Puppeteer, this approach ensures all dynamic content is loaded before extraction.
Example 6: Multi-Step Extraction with Validation
For complex scraping tasks, use chain-of-thought prompting:
prompt = """
You are scraping a product review page. Follow these steps:
Step 1: Identify the overall product rating (1-5 stars)
Step 2: Extract the total number of reviews
Step 3: Find all individual review elements
Step 4: For each review, extract:
- reviewer name
- rating (1-5)
- review date
- review text
- helpful votes count
Step 5: Calculate average rating from individual reviews
Step 6: Verify it matches the overall rating (within 0.5 stars)
Return JSON:
{
"product_rating": number,
"total_reviews": number,
"reviews": [
{
"author": "string",
"rating": number,
"date": "YYYY-MM-DD",
"text": "string",
"helpful_votes": number
}
],
"calculated_average": number,
"validation_passed": boolean
}
HTML:
{html_content}
"""
Best Practices for Prompt Engineering in Web Scraping
1. Be Specific About Output Format
Always specify exact field names, data types, and structure:
# Good
"Return JSON with 'price' as number, 'title' as string"
# Bad
"Extract the price and title"
2. Handle Missing Data Explicitly
prompt = """
If any field is missing or cannot be determined:
- Use null for optional fields
- Use empty string "" for required string fields
- Use 0 for required numeric fields
- Use empty array [] for list fields
"""
3. Provide Data Transformation Rules
prompt = """
Data transformations:
- Dates: Convert to ISO 8601 (YYYY-MM-DD)
- Prices: Extract numbers only, remove currency symbols
- URLs: Convert to absolute URLs
- Text: Trim whitespace, remove HTML entities
- Phone: Format as +1-XXX-XXX-XXXX
"""
4. Use System Messages Effectively
system_message = """
You are a specialized web scraping assistant with these capabilities:
- Extract structured data from HTML with high accuracy
- Handle malformed HTML gracefully
- Normalize data consistently
- Return only valid JSON, never explanations
- Use null for missing data, never omit fields
"""
5. Limit Token Usage for Large Pages
When working with large HTML documents, consider preprocessing:
from bs4 import BeautifulSoup
def preprocess_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove scripts, styles, and navigation
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()
    # Keep only the main content area
    main_content = soup.find('main') or soup.find('article') or soup.body
    return str(main_content)
# Then use in prompt
cleaned_html = preprocess_html(raw_html)
prompt = f"Extract data from: {cleaned_html}"
This approach is particularly useful when crawling single-page applications that generate large amounts of HTML.
Example 7: Schema-Based Extraction
For consistent results across multiple pages, define a strict schema:
const schema = {
  type: "object",
  properties: {
    products: {
      type: "array",
      items: {
        type: "object",
        required: ["id", "name", "price"],
        properties: {
          id: { type: "string" },
          name: { type: "string" },
          price: { type: "number", minimum: 0 },
          currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
          inStock: { type: "boolean" }
        }
      }
    }
  }
};
const prompt = `
Extract product data matching this JSON schema:
${JSON.stringify(schema, null, 2)}
Validation rules:
- All required fields must be present
- price must be positive number
- currency must be one of: USD, EUR, GBP
- id must be unique within the array
HTML:
${htmlContent}
Return valid JSON matching the schema.
`;
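The same schema can be reused to validate whatever the model returns before you trust it. A Python sketch using the jsonschema package, assuming the schema above is available as a Python dict:
import json
from jsonschema import validate, ValidationError

def parse_and_validate(raw_response: str, schema: dict) -> dict:
    data = json.loads(raw_response)
    try:
        validate(instance=data, schema=schema)
    except ValidationError as err:
        # Surface which field broke the contract so the prompt can be refined
        raise ValueError(f"LLM output failed schema validation: {err.message}") from err
    return data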
Handling Errors and Edge Cases
Always include error handling in your prompts:
prompt = """
Extract contact information. Handle these cases:
1. Multiple phone numbers: Return as array
2. Email obfuscated (e.g., "name [at] domain [dot] com"): Reconstruct to proper format
3. No contact info found: Return {"found": false, "data": null}
4. Partial data: Include what's available, mark others as null
Return:
{
"found": boolean,
"data": {
"email": string or null,
"phones": [string] or [],
"address": string or null,
"social": {
"twitter": string or null,
"linkedin": string or null
}
}
}
HTML:
{html_content}
"""
Combining Traditional and AI-Based Scraping
For optimal results, combine traditional selectors with GPT extraction:
import json
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def hybrid_scrape(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Use traditional selectors for the page structure
    product_cards = soup.select('.product-card')
    results = []
    for card in product_cards:
        # Let the LLM handle complex or variable content inside each card
        prompt = f"""
Extract product details from this HTML fragment.
Focus on: name, price, features list, specifications.
Return JSON:
{{
    "name": "string",
    "price": number,
    "features": ["string"],
    "specs": {{"key": "value"}}
}}
HTML:
{str(card)}
"""
        # Send each fragment to the model for extraction
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        results.append(json.loads(response.choices[0].message.content))
    return results
Conclusion
Effective prompt engineering for web scraping requires clear instructions, well-defined schemas, and thoughtful handling of edge cases. By combining specific output formats, few-shot examples, and validation rules, you can create robust scraping solutions that adapt to varying website structures without constant maintenance.
The key is to treat prompts as code: version them, test them against different inputs, and refine them based on results. Start with simple extraction tasks and gradually increase complexity as you understand how the model interprets different HTML structures.
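One lightweight way to do that is to keep prompts in version-controlled files and pin regression tests to known HTML fixtures. A minimal sketch, where the file path and the extract_products wrapper are hypothetical:
from pathlib import Path

def test_product_prompt_against_fixture():
    # Prompt templates and HTML fixtures live in the repo, versioned like code
    fixture_html = Path("tests/fixtures/sample_product_page.html").read_text()
    result = extract_products(fixture_html)  # wrapper around the versioned prompt and LLM call
    assert isinstance(result, list)
    assert all({"title", "price"} <= set(item) for item in result)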
When working with complex websites that require browser automation, consider integrating GPT-based extraction with tools like Puppeteer for handling dynamic content and browser events before passing the rendered HTML to your language model.