How Can I Extract Structured Data Using GPT?
GPT (Generative Pre-trained Transformer) models can transform unstructured HTML content into structured data formats like JSON, CSV, or XML. By leveraging GPT's natural language understanding capabilities, you can extract specific information from web pages without writing complex CSS selectors or XPath expressions.
Understanding GPT-Based Data Extraction
Traditional web scraping relies on selectors to target specific HTML elements. GPT-based extraction takes a different approach: you provide the raw HTML or text content along with instructions about what data you want to extract, and the model returns structured output.
This method is particularly useful when:
- Website structures change frequently
- Data is embedded in natural language text
- Multiple pages have different layouts but similar content
- You need to extract semantic meaning, not just visible text
Setting Up GPT for Data Extraction
Using OpenAI API (Python)
First, install the OpenAI library:
pip install openai
Here's a basic example of extracting product information from HTML:
import openai
import json

openai.api_key = "your-api-key"

html_content = """
<div class="product">
    <h2>Wireless Bluetooth Headphones</h2>
    <p>Premium sound quality with active noise cancellation</p>
    <span class="price">$129.99</span>
    <div class="rating">4.5 stars (234 reviews)</div>
</div>
"""

response = openai.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode requires gpt-4-turbo or newer; the base gpt-4 model does not support response_format
    messages=[
        {
            "role": "system",
            "content": "Extract product information from HTML and return as JSON with fields: name, description, price, rating, review_count"
        },
        {
            "role": "user",
            "content": html_content
        }
    ],
    response_format={"type": "json_object"}
)

product_data = json.loads(response.choices[0].message.content)
print(json.dumps(product_data, indent=2))
Output:
{
  "name": "Wireless Bluetooth Headphones",
  "description": "Premium sound quality with active noise cancellation",
  "price": 129.99,
  "rating": 4.5,
  "review_count": 234
}
Using OpenAI API (JavaScript/Node.js)
npm install openai
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractProductData(html) {
  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo", // JSON mode requires gpt-4-turbo or newer
    messages: [
      {
        role: "system",
        content: "Extract product information from HTML and return as JSON with fields: name, description, price, rating, review_count"
      },
      {
        role: "user",
        content: html
      }
    ],
    response_format: { type: "json_object" }
  });
  return JSON.parse(response.choices[0].message.content);
}

const htmlContent = `
<div class="product">
  <h2>Wireless Bluetooth Headphones</h2>
  <p>Premium sound quality with active noise cancellation</p>
  <span class="price">$129.99</span>
  <div class="rating">4.5 stars (234 reviews)</div>
</div>
`;

extractProductData(htmlContent).then(data => {
  console.log(JSON.stringify(data, null, 2));
});
Advanced Extraction Techniques
Using Function Calling for Schema Validation
OpenAI's function calling feature ensures GPT returns data in a specific structure:
import openai
import json

def extract_with_schema(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Extract product details from this HTML: {html_content}"
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "save_product",
                    "description": "Save extracted product information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "name": {
                                "type": "string",
                                "description": "Product name"
                            },
                            "price": {
                                "type": "number",
                                "description": "Product price in USD"
                            },
                            "description": {
                                "type": "string",
                                "description": "Product description"
                            },
                            "rating": {
                                "type": "number",
                                "description": "Average rating (0-5)"
                            },
                            "in_stock": {
                                "type": "boolean",
                                "description": "Whether product is in stock"
                            }
                        },
                        "required": ["name", "price"]
                    }
                }
            }
        ],
        tool_choice={"type": "function", "function": {"name": "save_product"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

# Extract data with strict schema
html = """
<div>
    <h1>Gaming Laptop Pro</h1>
    <p class="price">$1,299.00</p>
    <p>High-performance gaming laptop with RTX 4070</p>
    <span class="stock">In Stock</span>
    <div class="stars">★★★★☆ 4.2/5</div>
</div>
"""

result = extract_with_schema(html)
print(json.dumps(result, indent=2))
Batch Processing Multiple Elements
When scraping lists of items, you can extract multiple records in a single API call:
import openai
import json

def extract_multiple_products(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": """Extract all products from the HTML and return a JSON array.
Each product should have: name, price, rating, availability.
Return format: {"products": [...]}"""
            },
            {
                "role": "user",
                "content": html_content
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

html = """
<div class="product-list">
    <div class="item">
        <h3>Laptop Stand</h3>
        <span class="price">$39.99</span>
        <div class="rating">4.7★</div>
        <p class="stock">Available</p>
    </div>
    <div class="item">
        <h3>USB-C Hub</h3>
        <span class="price">$24.99</span>
        <div class="rating">4.3★</div>
        <p class="stock">Out of Stock</p>
    </div>
    <div class="item">
        <h3>Wireless Mouse</h3>
        <span class="price">$19.99</span>
        <div class="rating">4.8★</div>
        <p class="stock">Available</p>
    </div>
</div>
"""

results = extract_multiple_products(html)
print(json.dumps(results, indent=2))
Combining GPT with Traditional Web Scraping
For optimal results, combine GPT extraction with traditional scraping tools. Use libraries like Puppeteer or Playwright to handle JavaScript-rendered pages, then use GPT to extract structured data:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function scrapeWithGPT(url) {
  // Launch browser and get content
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Get the HTML content
  const htmlContent = await page.content();
  await browser.close();

  // Extract structured data with GPT
  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "system",
        content: "Extract article data: title, author, publish_date, content, tags. Return as JSON."
      },
      {
        role: "user",
        content: htmlContent
      }
    ],
    response_format: { type: "json_object" }
  });
  return JSON.parse(response.choices[0].message.content);
}

// Usage
scrapeWithGPT('https://example.com/article').then(data => {
  console.log(data);
});
When handling AJAX requests using Puppeteer, you can wait for dynamic content to load before passing it to GPT for extraction.
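As a sketch of that flow (using Playwright's Python API for consistency with the Python examples in this article; Puppeteer's page.waitForSelector works the same way), where .product-list is a hypothetical selector:

from playwright.sync_api import sync_playwright

def get_rendered_html(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Block until the AJAX-loaded content appears in the DOM
        page.wait_for_selector(".product-list")  # hypothetical selector
        html = page.content()
        browser.close()
        return html

# The rendered HTML can then be passed to GPT as in the examples above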
Optimizing Token Usage and Costs
GPT API calls are priced by token usage. Here are strategies to minimize costs:
1. Clean HTML Before Processing
Remove unnecessary HTML tags, scripts, and styles:
from bs4 import BeautifulSoup

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script and style elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()
    # Get text with minimal formatting
    return soup.get_text(separator='\n', strip=True)

# Use cleaned content
cleaned = clean_html(html_content)
# Pass cleaned content to GPT...
2. Use GPT-3.5-Turbo for Simple Extractions
For straightforward data extraction, GPT-3.5-Turbo is significantly cheaper than GPT-4:
response = openai.chat.completions.create(
    model="gpt-3.5-turbo",  # Cheaper alternative
    messages=[...],
    response_format={"type": "json_object"}
)
3. Extract Only Target Sections
Instead of sending entire pages, use CSS selectors to isolate relevant sections:
from bs4 import BeautifulSoup
def extract_section(html, selector):
soup = BeautifulSoup(html, 'html.parser')
section = soup.select_one(selector)
return str(section) if section else ""
# Extract only the product section
product_html = extract_section(full_html, '.product-details')
# Send only relevant section to GPT
Handling Complex Data Types
Extracting Dates and Numbers
GPT can parse and normalize dates and prices in various formats:
prompt = """
Extract and normalize the following data:
- Convert all dates to ISO 8601 format (YYYY-MM-DD)
- Convert all prices to numeric values (remove currency symbols)
- Parse relative dates like "2 days ago"
Return JSON with: event_name, event_date, ticket_price
"""
html = """
<div class="event">
<h2>Summer Music Festival</h2>
<p>Date: July 15th, 2024</p>
<span class="price">$75.00 USD</span>
</div>
"""
# GPT will return:
# {
# "event_name": "Summer Music Festival",
# "event_date": "2024-07-15",
# "ticket_price": 75.00
# }
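To complete the example, the call itself can follow the same JSON-mode pattern as the earlier snippets (a minimal sketch; the commented output above assumes the model normalizes the values as instructed, and exact output can vary):

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": html}
    ],
    response_format={"type": "json_object"}
)
event_data = json.loads(response.choices[0].message.content)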
Extracting Nested Structures
For complex hierarchical data:
response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {
            "role": "system",
            "content": """Extract company data as JSON with this nested structure:
            {
                "company_name": string,
                "employees": [
                    {
                        "name": string,
                        "position": string,
                        "contact": {
                            "email": string,
                            "phone": string
                        }
                    }
                ]
            }"""
        },
        {
            "role": "user",
            "content": html_content
        }
    ],
    response_format={"type": "json_object"}
)
Error Handling and Validation
Always validate GPT output to ensure data quality:
import json
from jsonschema import validate, ValidationError

# Define expected schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "rating": {"type": "number", "minimum": 0, "maximum": 5}
    },
    "required": ["name", "price"]
}

def extract_with_validation(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[...],
        response_format={"type": "json_object"}
    )
    try:
        data = json.loads(response.choices[0].message.content)
        validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Validation error: {e}")
        return None
Real-World Use Cases
E-commerce Product Scraping
def scrape_product_page(url):
    # Fetch HTML (using requests, Selenium, or Puppeteer)
    html = fetch_html(url)

    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": """Extract e-commerce product data:
                - product_name
                - brand
                - price (numeric)
                - original_price (if discounted)
                - discount_percentage
                - availability (in_stock/out_of_stock/pre_order)
                - specifications (as array of {key, value} objects)
                - images (array of URLs)
                Return as JSON."""
            },
            {"role": "user", "content": html}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
News Article Extraction
def extract_article(html):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": """Extract news article data:
                - headline
                - subheadline
                - author
                - publish_date (ISO format)
                - category
                - tags (array)
                - content (main article text)
                - summary (2-3 sentences)
                Return as JSON."""
            },
            {"role": "user", "content": html}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
Best Practices
- Be Specific in Prompts: Clearly define the structure and data types you expect
- Use JSON Mode: Enable response_format={"type": "json_object"} for structured output
- Implement Retry Logic: Handle API rate limits and transient errors (see the sketch after this list)
- Cache Results: Store extracted data to avoid redundant API calls
- Monitor Costs: Track token usage and implement usage limits
- Validate Output: Always check that GPT returns data in the expected format
- Combine Approaches: Use traditional selectors for navigation and GPT for complex extraction
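As a minimal sketch of the retry point above, using the exception classes exposed by the openai Python library, exponential backoff might look like this:

import time
import openai

def extract_with_retry(messages, max_retries=3):
    # Note: messages must mention "JSON" somewhere when using JSON mode
    for attempt in range(max_retries):
        try:
            return openai.chat.completions.create(
                model="gpt-4-turbo",
                messages=messages,
                response_format={"type": "json_object"}
            )
        except (openai.RateLimitError, openai.APITimeoutError):
            # Exponential backoff: wait 1s, 2s, 4s between attempts
            time.sleep(2 ** attempt)
    raise RuntimeError("Extraction failed after retries")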
When monitoring network requests in Puppeteer, you can capture API responses and use GPT to structure the data, even from dynamically loaded content.
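As one illustration, here is a sketch using Playwright's Python API (chosen to match the Python examples above; Puppeteer's page.on('response') event serves the same purpose). The /api/products URL fragment is a hypothetical placeholder:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Capture the backend response that delivers the dynamic content
    with page.expect_response(lambda r: "/api/products" in r.url) as resp_info:
        page.goto("https://example.com/products")  # hypothetical URL
    payload = resp_info.value.text()
    browser.close()

# payload (often JSON or an HTML fragment) can now be passed to GPT for structuring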
Limitations and Considerations
- Token Limits: GPT models have maximum token limits (context windows), so long pages may need truncating or chunking (see the sketch after this list)
- Cost: API calls can become expensive for large-scale scraping
- Latency: API calls add 1-5 seconds per request compared to traditional parsing
- Accuracy: GPT may occasionally misinterpret or hallucinate data
- Rate Limits: OpenAI enforces rate limits on API requests
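One way to stay under a context window, sketched with tiktoken (OpenAI's tokenizer library; the 100,000-token budget is an arbitrary example):

import tiktoken

def truncate_to_token_limit(text, max_tokens=100_000):
    # cl100k_base is the encoding used by GPT-4-class models
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])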
For high-volume scraping, consider using GPT selectively for complex pages while using traditional parsing for simple, structured content.
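One way to implement that split is to try cheap selector-based parsing first and fall back to GPT only when the selectors miss (a sketch; the selectors are hypothetical, and extract_with_validation is the function from the validation example above):

from bs4 import BeautifulSoup

def extract_product(html):
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.select_one('.product h2')  # hypothetical selector
    price = soup.select_one('.price')      # hypothetical selector
    if name and price:
        # Simple, well-structured page: traditional parsing is enough
        return {
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True)
        }
    # Complex or unfamiliar layout: fall back to GPT extraction
    return extract_with_validation(html)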
Conclusion
GPT-based data extraction offers a flexible, adaptive approach to web scraping that can handle diverse page structures and natural language content. By combining GPT with traditional scraping tools and following best practices for prompt engineering and validation, you can build robust data extraction pipelines that adapt to changing website layouts while maintaining data quality.
The key is knowing when to use GPT—leverage it for complex, unstructured data extraction while relying on traditional methods for simple, repetitive tasks to optimize both performance and cost.