How do I create effective GPT instructions for web scraping?

Creating effective GPT instructions for web scraping requires understanding how to communicate your data extraction needs clearly and precisely to language models. Well-crafted prompts can significantly improve accuracy, reduce hallucinations, and make your scraping workflow more reliable.

Understanding GPT-Based Web Scraping

GPT and other large language models (LLMs) can parse HTML content and extract structured data without writing traditional CSS selectors or XPath expressions. Instead, you provide natural language instructions that describe what data you want to extract. This approach is particularly useful for:

  • Unstructured or inconsistently formatted content
  • Dynamic websites where selectors frequently change
  • Complex data extraction requiring context understanding
  • Multi-field extraction from varied layouts
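To make the contrast concrete, here is a minimal sketch (the markup and class name are invented for illustration). A selector-based extraction breaks whenever the markup changes, while the same requirement expressed as a natural-language instruction does not depend on page structure:

from bs4 import BeautifulSoup

html = '<div class="product"><span class="price--large">$19.99</span></div>'

# Selector-based: breaks as soon as the class name changes
price = BeautifulSoup(html, 'html.parser').select_one('.price--large').text

# Instruction-based: the same requirement described in plain language for an LLM
instruction = "Extract the product's price as a number without the currency symbol."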

Core Components of Effective GPT Instructions

1. Clear Data Structure Definition

Always specify the exact structure you want the extracted data to follow. Include field names, data types, and format requirements.

Example for Python:

import json
from openai import OpenAI

client = OpenAI()

def scrape_with_gpt(html_content):
    prompt = """
    Extract product information from the following HTML and return it as JSON.

    Required fields:
    - product_name (string): The name of the product
    - price (number): The price as a numeric value without currency symbols
    - rating (number): The average rating (0-5)
    - availability (boolean): Whether the product is in stock
    - reviews_count (integer): The number of reviews

    HTML:
    {html}

    Return only valid JSON, no additional text.
    """.format(html=html_content)

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a precise data extraction assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    return json.loads(response.choices[0].message.content)

2. Provide Context and Examples

Include examples of the expected output format, or better yet, complete input/output pairs. This technique, known as "few-shot prompting," dramatically improves accuracy.

Example for JavaScript:

const OpenAI = require('openai');

async function scrapeArticleData(htmlContent) {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const prompt = `
Extract article information from the HTML below.

Example output:
{
  "title": "Understanding Web Scraping",
  "author": "John Doe",
  "publish_date": "2024-01-15",
  "tags": ["web scraping", "automation", "data extraction"],
  "word_count": 1500
}

Now extract from this HTML:
${htmlContent}

Return only valid JSON.
  `;

  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "Extract structured data from HTML." },
      { role: "user", content: prompt }
    ],
    temperature: 0
  });

  return JSON.parse(completion.choices[0].message.content);
}
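Note that the example above anchors only the output shape. A full few-shot prompt pairs a small sample input with its expected output, which anchors the model even more firmly. Here is a sketch in the same Python template style used elsewhere on this page (the sample markup is invented, and literal braces are doubled for use with .format):

few_shot_prompt = """
Extract article information from the HTML below.

Example input:
<article><h1>Understanding Web Scraping</h1>
<span class="byline">By John Doe, 2024-01-15</span></article>

Example output:
{{"title": "Understanding Web Scraping", "author": "John Doe", "publish_date": "2024-01-15"}}

Now extract from this HTML:
{html}

Return only valid JSON.
"""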

3. Be Specific About Edge Cases

Explicitly handle missing data, null values, and alternative formats.

prompt = """
Extract user profile data from the HTML.

Fields:
- username (string): Required, return null if not found
- email (string): Return null if not displayed
- join_date (string): Format as YYYY-MM-DD, return null if unavailable
- bio (string): Full biography text, return empty string if none
- verified (boolean): true if profile shows verification badge, false otherwise
- follower_count (integer): Extract number only, return 0 if not shown

Important:
- If a field is missing, use null or the specified default
- Remove all HTML tags from text fields
- Extract numbers from formatted strings (e.g., "1.2K followers" -> 1200)

HTML:
{html}

Return valid JSON only.
"""

Best Practices for GPT Scraping Instructions

Use System Messages Effectively

Set the system message to establish the model's role and constraints:

system_message = """You are a data extraction expert. Your tasks:
1. Extract only the requested information
2. Return valid JSON without markdown formatting
3. Use null for missing values
4. Never make up information
5. Preserve original data types"""

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": extraction_prompt}
    ],
    temperature=0  # Use 0 for deterministic output
)

Minimize HTML Input Size

GPT models have limited context windows, so preprocess pages to include only the relevant content. With a browser automation tool like Puppeteer, you can extract just the section you need:

const puppeteer = require('puppeteer');

async function getRelevantHTML(url, selector) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Extract only the relevant section
  const relevantHTML = await page.$eval(selector, el => el.innerHTML);

  await browser.close();
  return relevantHTML;
}

// Use the extracted HTML with GPT
const productHTML = await getRelevantHTML('https://example.com/product', '.product-details');
const extractedData = await scrapeWithGPT(productHTML);
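To confirm the trimmed HTML actually fits your model's context window, count tokens before sending it. A sketch using the tiktoken library (the function name and the 8,000-token budget are illustrative, not properties of any particular model):

import tiktoken

def fits_budget(text, model="gpt-4", budget=8000):
    """Return True if the text's token count is within the given budget."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text)) <= budget

# Trim further, or split into chunks, if this returns False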

Request Structured Output with JSON Schema

For better reliability, specify the exact JSON schema:

prompt = """
Extract data following this JSON schema:

{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "features": {
      "type": "array",
      "items": {"type": "string"}
    }
  },
  "required": ["title", "price"]
}

HTML:
{html}
"""

Advanced Techniques

Chain of Thought Prompting

For complex extractions, ask the model to explain its reasoning:

prompt = """
Extract product specifications from the HTML below.

Process:
1. First, identify the specifications table or section
2. Then, extract each specification key-value pair
3. Standardize units (e.g., convert all weights to kg)
4. Finally, return the structured data

Think step by step, then provide the final JSON output.

HTML:
{html}
"""

Validation Instructions

Include validation rules in your prompt:

prompt = """
Extract email addresses from the contact page HTML.

Validation rules:
- Each email must match standard email format
- Exclude generic emails like info@, support@, noreply@
- Remove duplicates
- Return maximum 5 emails
- Sort alphabetically

HTML:
{html}

Return: {{"emails": ["email1@domain.com", "email2@domain.com"]}}
"""

Multi-Step Extraction

For complex pages, break extraction into multiple GPT calls:

async def scrape_complex_page(html):
    # Step 1: Ask for a JSON array so the result can be iterated directly
    sections_prompt = "Return the HTML of each article section on this page as a JSON array of strings."
    sections = await call_gpt(sections_prompt, html)

    # Step 2: Extract data from each section individually
    results = []
    for section in sections:
        detail_prompt = "Extract detailed information from this section."
        data = await call_gpt(detail_prompt, section)
        results.append(data)

    return results
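The snippet above assumes a call_gpt helper that sends a prompt plus some content and returns the parsed JSON reply. One possible implementation with the async OpenAI client (the helper's name and exact behavior are assumptions made to keep the example runnable):

import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def call_gpt(prompt, content):
    """Assumed helper: send prompt plus content, return the parsed JSON reply."""
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Return valid JSON only."},
            {"role": "user", "content": f"{prompt}\n\n{content}"}
        ],
        temperature=0
    )
    return json.loads(response.choices[0].message.content)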

Optimizing for Cost and Performance

Reduce Token Usage

from bs4 import BeautifulSoup, Comment

def clean_html(html_content):
    """Remove unnecessary HTML to reduce tokens"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles, and noscript blocks
    for element in soup(['script', 'style', 'noscript']):
        element.decompose()

    # Remove HTML comments as well
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Get text with minimal formatting (note: this drops markup and attributes such as hrefs)
    return soup.get_text(separator=' ', strip=True)

# Use cleaned HTML
cleaned = clean_html(raw_html)
result = scrape_with_gpt(cleaned)

Batch Processing

Process multiple items in a single API call when possible:

prompt = """
Extract data from multiple product listings below.
Return an array of objects.

Expected format:
[
  {{"name": "Product 1", "price": 29.99}},
  {{"name": "Product 2", "price": 39.99}}
]

HTML containing multiple products:
{html}
"""

Error Handling and Retry Logic

Implement robust error handling when working with GPT for web scraping:

import time
import json
from openai import OpenAI

def scrape_with_retry(html, max_retries=3):
    client = OpenAI()

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "Extract data as JSON."},
                    {"role": "user", "content": f"Extract: {html}"}
                ],
                temperature=0
            )

            result = response.choices[0].message.content

            # Validate JSON
            parsed = json.loads(result)
            return parsed

        except json.JSONDecodeError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
                continue
            else:
                raise ValueError("Failed to get valid JSON after retries")

        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            else:
                raise e

Testing and Validation

Always validate GPT extraction results:

def validate_extraction(data, schema):
    """Validate extracted data against expected schema"""
    required_fields = schema.get('required', [])
    properties = schema.get('properties', {})

    # Check required fields
    for field in required_fields:
        if field not in data or data[field] is None:
            raise ValueError(f"Missing required field: {field}")

    # Check data types (skip fields the schema doesn't define)
    for field, value in data.items():
        expected_type = properties.get(field, {}).get('type')
        if expected_type == 'number' and not isinstance(value, (int, float)):
            raise TypeError(f"Field {field} should be number, got {type(value)}")

    return True

# Use in your scraping workflow
extracted = scrape_with_gpt(html)
validate_extraction(extracted, schema)

Conclusion

Effective GPT instructions for web scraping combine clear specifications, concrete examples, explicit handling of edge cases, and robust validation. Start with simple prompts and iteratively refine them based on actual results. Monitor extraction accuracy and adjust your instructions to handle the specific patterns in your target websites.

For production environments, consider combining traditional web scraping techniques with GPT-based extraction to balance cost, speed, and reliability. Use GPT for complex or unstructured data while relying on conventional selectors for simple, consistent elements.
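As an illustration of that hybrid approach, the sketch below tries a conventional selector first and falls back to GPT only when the selector misses (the selector and field name are hypothetical):

from bs4 import BeautifulSoup

def extract_title(html):
    """Cheap, deterministic selector first; LLM extraction only as a fallback."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('h1.product-title')  # hypothetical stable selector
    if node and node.get_text(strip=True):
        return node.get_text(strip=True)
    # Fall back to the LLM for pages where the selector fails
    return scrape_with_gpt(html).get("product_name")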

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
