How Can I Extract Structured Data Using GPT?
GPT (Generative Pre-trained Transformer) models can transform unstructured HTML content into structured data formats like JSON, CSV, or XML. By leveraging GPT's natural language understanding capabilities, you can extract specific information from web pages without writing complex CSS selectors or XPath expressions.
Understanding GPT-Based Data Extraction
Traditional web scraping relies on selectors to target specific HTML elements. GPT-based extraction takes a different approach: you provide the raw HTML or text content along with instructions about what data you want to extract, and the model returns structured output.
This method is particularly useful when:
- Website structures change frequently
- Data is embedded in natural language text
- Multiple pages have different layouts but similar content
- You need to extract semantic meaning, not just visible text
Setting Up GPT for Data Extraction
Using OpenAI API (Python)
First, install the OpenAI library:
pip install openai
Here's a basic example of extracting product information from HTML:
import openai
import json

openai.api_key = "your-api-key"

html_content = """
<div class="product">
    <h2>Wireless Bluetooth Headphones</h2>
    <p>Premium sound quality with active noise cancellation</p>
    <span class="price">$129.99</span>
    <div class="rating">4.5 stars (234 reviews)</div>
</div>
"""

response = openai.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode requires gpt-4-turbo or newer; the base gpt-4 model does not support response_format
    messages=[
        {
            "role": "system",
            "content": "Extract product information from HTML and return as JSON with fields: name, description, price, rating, review_count"
        },
        {
            "role": "user",
            "content": html_content
        }
    ],
    response_format={"type": "json_object"}
)

product_data = json.loads(response.choices[0].message.content)
print(json.dumps(product_data, indent=2))
Output:
{
  "name": "Wireless Bluetooth Headphones",
  "description": "Premium sound quality with active noise cancellation",
  "price": 129.99,
  "rating": 4.5,
  "review_count": 234
}
Using OpenAI API (JavaScript/Node.js)
npm install openai
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractProductData(html) {
  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo", // JSON mode requires gpt-4-turbo or newer
    messages: [
      {
        role: "system",
        content: "Extract product information from HTML and return as JSON with fields: name, description, price, rating, review_count"
      },
      {
        role: "user",
        content: html
      }
    ],
    response_format: { type: "json_object" }
  });
  return JSON.parse(response.choices[0].message.content);
}

const htmlContent = `
<div class="product">
  <h2>Wireless Bluetooth Headphones</h2>
  <p>Premium sound quality with active noise cancellation</p>
  <span class="price">$129.99</span>
  <div class="rating">4.5 stars (234 reviews)</div>
</div>
`;

extractProductData(htmlContent).then(data => {
  console.log(JSON.stringify(data, null, 2));
});
Advanced Extraction Techniques
Using Function Calling for Schema Validation
OpenAI's function calling feature ensures GPT returns data in a specific structure:
import openai
import json

def extract_with_schema(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Extract product details from this HTML: {html_content}"
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "save_product",
                    "description": "Save extracted product information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "name": {
                                "type": "string",
                                "description": "Product name"
                            },
                            "price": {
                                "type": "number",
                                "description": "Product price in USD"
                            },
                            "description": {
                                "type": "string",
                                "description": "Product description"
                            },
                            "rating": {
                                "type": "number",
                                "description": "Average rating (0-5)"
                            },
                            "in_stock": {
                                "type": "boolean",
                                "description": "Whether product is in stock"
                            }
                        },
                        "required": ["name", "price"]
                    }
                }
            }
        ],
        tool_choice={"type": "function", "function": {"name": "save_product"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

# Extract data with strict schema
html = """
<div>
    <h1>Gaming Laptop Pro</h1>
    <p class="price">$1,299.00</p>
    <p>High-performance gaming laptop with RTX 4070</p>
    <span class="stock">In Stock</span>
    <div class="stars">★★★★☆ 4.2/5</div>
</div>
"""

result = extract_with_schema(html)
print(json.dumps(result, indent=2))
Batch Processing Multiple Elements
When scraping lists of items, you can extract multiple records in a single API call:
import openai
import json

def extract_multiple_products(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": """Extract all products from the HTML and return a JSON array.
Each product should have: name, price, rating, availability.
Return format: {"products": [...]}"""
            },
            {
                "role": "user",
                "content": html_content
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

html = """
<div class="product-list">
    <div class="item">
        <h3>Laptop Stand</h3>
        <span class="price">$39.99</span>
        <div class="rating">4.7★</div>
        <p class="stock">Available</p>
    </div>
    <div class="item">
        <h3>USB-C Hub</h3>
        <span class="price">$24.99</span>
        <div class="rating">4.3★</div>
        <p class="stock">Out of Stock</p>
    </div>
    <div class="item">
        <h3>Wireless Mouse</h3>
        <span class="price">$19.99</span>
        <div class="rating">4.8★</div>
        <p class="stock">Available</p>
    </div>
</div>
"""

results = extract_multiple_products(html)
print(json.dumps(results, indent=2))
Combining GPT with Traditional Web Scraping
For optimal results, combine GPT extraction with traditional scraping tools. Use libraries like Puppeteer or Playwright to handle JavaScript-rendered pages, then use GPT to extract structured data:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function scrapeWithGPT(url) {
  // Launch browser and get content
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Get the HTML content
  const htmlContent = await page.content();
  await browser.close();

  // Extract structured data with GPT
  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "system",
        content: "Extract article data: title, author, publish_date, content, tags. Return as JSON."
      },
      {
        role: "user",
        content: htmlContent
      }
    ],
    response_format: { type: "json_object" }
  });
  return JSON.parse(response.choices[0].message.content);
}

// Usage
scrapeWithGPT('https://example.com/article').then(data => {
  console.log(data);
});
When handling AJAX requests using Puppeteer, you can wait for dynamic content to load before passing it to GPT for extraction.
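As a sketch of that flow (using Playwright's Python API for consistency with the Python examples in this article; Puppeteer's page.waitForSelector works the same way), where .product-list is a hypothetical selector:

from playwright.sync_api import sync_playwright

def get_rendered_html(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Block until the AJAX-loaded content appears in the DOM
        page.wait_for_selector(".product-list")  # hypothetical selector
        html = page.content()
        browser.close()
        return html

# The rendered HTML can then be passed to GPT as in the examples above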
Optimizing Token Usage and Costs
GPT API calls are priced by token usage. Here are strategies to minimize costs:
1. Clean HTML Before Processing
Remove unnecessary HTML tags, scripts, and styles:
from bs4 import BeautifulSoup

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script and style elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()
    # Get text with minimal formatting
    return soup.get_text(separator='\n', strip=True)

# Use cleaned content
cleaned = clean_html(html_content)
# Pass cleaned content to GPT...
2. Use GPT-3.5-Turbo for Simple Extractions
For straightforward data extraction, GPT-3.5-Turbo is significantly cheaper than GPT-4:
response = openai.chat.completions.create(
    model="gpt-3.5-turbo",  # Cheaper alternative
    messages=[...],
    response_format={"type": "json_object"}
)
3. Extract Only Target Sections
Instead of sending entire pages, use CSS selectors to isolate relevant sections:
from bs4 import BeautifulSoup
def extract_section(html, selector):
soup = BeautifulSoup(html, 'html.parser')
section = soup.select_one(selector)
return str(section) if section else ""
# Extract only the product section
product_html = extract_section(full_html, '.product-details')
# Send only relevant section to GPT
Handling Complex Data Types
Extracting Dates and Numbers
GPT can parse and normalize dates and prices in various formats:
prompt = """
Extract and normalize the following data:
- Convert all dates to ISO 8601 format (YYYY-MM-DD)
- Convert all prices to numeric values (remove currency symbols)
- Parse relative dates like "2 days ago"
Return JSON with: event_name, event_date, ticket_price
"""
html = """
<div class="event">
<h2>Summer Music Festival</h2>
<p>Date: July 15th, 2024</p>
<span class="price">$75.00 USD</span>
</div>
"""
# GPT will return:
# {
# "event_name": "Summer Music Festival",
# "event_date": "2024-07-15",
# "ticket_price": 75.00
# }
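To complete the example, the call itself can follow the same JSON-mode pattern as the earlier snippets (a minimal sketch; the commented output above assumes the model normalizes the values as instructed, and exact output can vary):

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": html}
    ],
    response_format={"type": "json_object"}
)
event_data = json.loads(response.choices[0].message.content)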
Extracting Nested Structures
For complex hierarchical data:
response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {
            "role": "system",
            "content": """Extract company data as JSON with this nested structure:
            {
                "company_name": string,
                "employees": [
                    {
                        "name": string,
                        "position": string,
                        "contact": {
                            "email": string,
                            "phone": string
                        }
                    }
                ]
            }"""
        },
        {
            "role": "user",
            "content": html_content
        }
    ],
    response_format={"type": "json_object"}
)
Error Handling and Validation
Always validate GPT output to ensure data quality:
import json
from jsonschema import validate, ValidationError

# Define expected schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "rating": {"type": "number", "minimum": 0, "maximum": 5}
    },
    "required": ["name", "price"]
}

def extract_with_validation(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[...],
        response_format={"type": "json_object"}
    )
    try:
        data = json.loads(response.choices[0].message.content)
        validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Validation error: {e}")
        return None
Real-World Use Cases
E-commerce Product Scraping
def scrape_product_page(url):
    # Fetch HTML (using requests, Selenium, or Puppeteer)
    html = fetch_html(url)

    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": """Extract e-commerce product data:
                - product_name
                - brand
                - price (numeric)
                - original_price (if discounted)
                - discount_percentage
                - availability (in_stock/out_of_stock/pre_order)
                - specifications (as array of {key, value} objects)
                - images (array of URLs)
                Return as JSON."""
            },
            {"role": "user", "content": html}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
News Article Extraction
def extract_article(html):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": """Extract news article data:
                - headline
                - subheadline
                - author
                - publish_date (ISO format)
                - category
                - tags (array)
                - content (main article text)
                - summary (2-3 sentences)
                Return as JSON."""
            },
            {"role": "user", "content": html}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
Best Practices
- Be Specific in Prompts: Clearly define the structure and data types you expect
- Use JSON Mode: Enable response_format={"type": "json_object"} for structured output
- Implement Retry Logic: Handle API rate limits and transient errors (see the sketch after this list)
- Cache Results: Store extracted data to avoid redundant API calls
- Monitor Costs: Track token usage and implement usage limits
- Validate Output: Always check that GPT returns data in the expected format
- Combine Approaches: Use traditional selectors for navigation and GPT for complex extraction
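As a minimal sketch of the retry point above, using the exception classes exposed by the openai Python library, exponential backoff might look like this:

import time
import openai

def extract_with_retry(messages, max_retries=3):
    # Note: messages must mention "JSON" somewhere when using JSON mode
    for attempt in range(max_retries):
        try:
            return openai.chat.completions.create(
                model="gpt-4-turbo",
                messages=messages,
                response_format={"type": "json_object"}
            )
        except (openai.RateLimitError, openai.APITimeoutError):
            # Exponential backoff: wait 1s, 2s, 4s between attempts
            time.sleep(2 ** attempt)
    raise RuntimeError("Extraction failed after retries")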
When monitoring network requests in Puppeteer, you can capture API responses and use GPT to structure the data, even from dynamically loaded content.
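As one illustration, here is a sketch using Playwright's Python API (chosen to match the Python examples above; Puppeteer's page.on('response') event serves the same purpose). The /api/products URL fragment is a hypothetical placeholder:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Capture the backend response that delivers the dynamic content
    with page.expect_response(lambda r: "/api/products" in r.url) as resp_info:
        page.goto("https://example.com/products")  # hypothetical URL
    payload = resp_info.value.text()
    browser.close()

# payload (often JSON or an HTML fragment) can now be passed to GPT for structuring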
Limitations and Considerations
- Token Limits: GPT models have maximum token limits (context windows), so long pages may need truncating or chunking (see the sketch after this list)
- Cost: API calls can become expensive for large-scale scraping
- Latency: API calls add 1-5 seconds per request compared to traditional parsing
- Accuracy: GPT may occasionally misinterpret or hallucinate data
- Rate Limits: OpenAI enforces rate limits on API requests
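One way to stay under a context window, sketched with tiktoken (OpenAI's tokenizer library; the 100,000-token budget is an arbitrary example):

import tiktoken

def truncate_to_token_limit(text, max_tokens=100_000):
    # cl100k_base is the encoding used by GPT-4-class models
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])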
For high-volume scraping, consider using GPT selectively for complex pages while using traditional parsing for simple, structured content.
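One way to implement that split is to try cheap selector-based parsing first and fall back to GPT only when the selectors miss (a sketch; the selectors are hypothetical, and extract_with_validation is the function from the validation example above):

from bs4 import BeautifulSoup

def extract_product(html):
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.select_one('.product h2')  # hypothetical selector
    price = soup.select_one('.price')      # hypothetical selector
    if name and price:
        # Simple, well-structured page: traditional parsing is enough
        return {
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True)
        }
    # Complex or unfamiliar layout: fall back to GPT extraction
    return extract_with_validation(html)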
Conclusion
GPT-based data extraction offers a flexible, adaptive approach to web scraping that can handle diverse page structures and natural language content. By combining GPT with traditional scraping tools and following best practices for prompt engineering and validation, you can build robust data extraction pipelines that adapt to changing website layouts while maintaining data quality.
The key is knowing when to use GPT—leverage it for complex, unstructured data extraction while relying on traditional methods for simple, repetitive tasks to optimize both performance and cost.