What are web scraping examples using ChatGPT?
ChatGPT and the OpenAI API can transform web scraping from a rigid, selector-based process into an intelligent, context-aware data extraction workflow. By leveraging large language models, you can extract structured data from HTML without writing complex parsing logic or maintaining fragile CSS selectors.
This guide explores practical examples of using ChatGPT for web scraping tasks, from simple HTML parsing to complex data extraction scenarios.
Understanding ChatGPT for Web Scraping
ChatGPT excels at understanding unstructured content and converting it into structured data. Unlike traditional web scraping that relies on XPath or CSS selectors, ChatGPT can interpret the semantic meaning of content, making it resilient to HTML structure changes.
Key Advantages
- Flexibility: Adapts to different page layouts without code changes
- Context awareness: Understands content meaning, not just structure
- Natural language instructions: Define extraction rules in plain English
- Robust to changes: Less brittle than selector-based approaches
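To make the contrast concrete, here is a minimal sketch (the two HTML snippets, the API key, and the model choice are placeholders, not taken from any real site) in which a single plain-English instruction extracts the same fields from two very different layouts; a selector-based scraper would need separate rules for each:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Two pages that present the same product with completely different markup
page_a = '<div class="product"><h1>Acme Mouse</h1><span class="cost">$24.99</span></div>'
page_b = '<table><tr><td>Item</td><td>Acme Mouse</td></tr><tr><td>Price</td><td>$24.99</td></tr></table>'

# One plain-English instruction covers both layouts
instruction = 'Extract the product name and price from this HTML and return JSON like {"name": "...", "price": "..."}.'

def extract_product(html):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You extract structured data from HTML and reply with valid JSON."},
            {"role": "user", "content": f"{instruction}\n\nHTML:\n{html}"},
        ],
        response_format={"type": "json_object"},
    )
    return completion.choices[0].message.content

print(extract_product(page_a))  # Same call for both pages, no selectors to update
print(extract_product(page_b))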
Example 1: Extracting Product Information
Let's start with a common use case: extracting product details from an e-commerce page.
Python Implementation
import json

import openai
import requests

# Fetch the HTML content
url = "https://example.com/product/laptop"
response = requests.get(url)
html_content = response.text

# Initialize OpenAI client
client = openai.OpenAI(api_key="your-api-key")

# Create a prompt for ChatGPT (truncate the HTML to stay within token limits)
prompt = f"""
Extract the following product information from this HTML and return it as JSON:
- Product name
- Price
- Rating (out of 5)
- Number of reviews
- Main features (as an array)
- Availability status

HTML:
{html_content[:4000]}

Return only valid JSON, no additional text.
"""

# Call ChatGPT API
completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)

# Parse the response
product_data = json.loads(completion.choices[0].message.content)
print(json.dumps(product_data, indent=2))
Expected Output
{
  "product_name": "Dell XPS 15 Laptop",
  "price": "$1,299.99",
  "rating": 4.5,
  "number_of_reviews": 328,
  "main_features": [
    "15.6-inch 4K display",
    "Intel Core i7 processor",
    "16GB RAM",
    "512GB SSD"
  ],
  "availability_status": "In Stock"
}
Example 2: Scraping Article Metadata
Extract metadata from blog posts or news articles, including author, publication date, and tags.
JavaScript/Node.js Implementation
const OpenAI = require('openai');
const axios = require('axios');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function scrapeArticleMetadata(url) {
  // Fetch HTML content
  const response = await axios.get(url);
  const html = response.data;

  // Prepare the extraction prompt
  const prompt = `
Extract the following article metadata from this HTML:
- Title
- Author name
- Publication date (in ISO 8601 format)
- Reading time (in minutes)
- Tags/Categories (as array)
- Article summary (2-3 sentences)

HTML:
${html.substring(0, 5000)}

Return as JSON only.
`;

  // Call ChatGPT
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'system',
        content: 'You extract structured metadata from article HTML. Always return valid JSON.'
      },
      {
        role: 'user',
        content: prompt
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(completion.choices[0].message.content);
}

// Usage
scrapeArticleMetadata('https://example.com/blog/ai-trends-2024')
  .then(data => console.log(data))
  .catch(err => console.error(err));
Example 3: Batch Processing Multiple Pages
When scraping multiple pages, you can fetch them in parallel with traditional HTTP requests and hand each page to ChatGPT for data extraction.
Python Batch Scraping
import json
from concurrent.futures import ThreadPoolExecutor

import openai
import requests

client = openai.OpenAI(api_key="your-api-key")

def extract_with_chatgpt(html_content, schema):
    """Extract data using ChatGPT based on a schema"""
    prompt = f"""
Extract data according to this schema:
{json.dumps(schema, indent=2)}

From this HTML:
{html_content[:3000]}

Return valid JSON matching the schema.
"""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(completion.choices[0].message.content)

def scrape_listing_page(url):
    """Scrape a single listing page"""
    response = requests.get(url)
    schema = {
        "listings": [
            {
                "title": "string",
                "price": "number",
                "location": "string",
                "bedrooms": "number",
                "bathrooms": "number"
            }
        ]
    }
    return extract_with_chatgpt(response.text, schema)

# Scrape multiple pages in parallel
urls = [
    "https://example.com/listings?page=1",
    "https://example.com/listings?page=2",
    "https://example.com/listings?page=3"
]

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(scrape_listing_page, urls))

# Combine all results
all_listings = []
for result in results:
    all_listings.extend(result.get('listings', []))

print(f"Scraped {len(all_listings)} total listings")
Example 4: Intelligent Table Extraction
ChatGPT excels at extracting and structuring data from HTML tables, even when they have complex layouts.
import json

import openai
import requests

client = openai.OpenAI(api_key="your-api-key")

def scrape_table_data(url):
    """Extract table data intelligently"""
    response = requests.get(url)
    html = response.text

    prompt = f"""
Find all tables in this HTML and extract their data.
For each table:
1. Identify what the table represents
2. Extract headers
3. Extract all rows as structured data

Return as JSON with this structure:
{{
    "tables": [
        {{
            "description": "what this table shows",
            "headers": ["col1", "col2", ...],
            "rows": [
                {{"col1": "value", "col2": "value"}},
                ...
            ]
        }}
    ]
}}

HTML:
{html[:4000]}
"""
    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are an expert at extracting tabular data from HTML."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(completion.choices[0].message.content)

# Example usage
table_data = scrape_table_data("https://example.com/statistics")
print(json.dumps(table_data, indent=2))
Example 5: Form Data Extraction and Validation
Extract form fields and their validation rules from HTML forms.
const OpenAI = require('openai');
const axios = require('axios');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractFormSchema(url) {
  const response = await axios.get(url);

  const prompt = `
Analyze this HTML form and extract:
1. All input fields with their names and types
2. Required fields
3. Validation rules (max length, patterns, etc.)
4. Select/dropdown options
5. Form action URL and method

Return as structured JSON.

HTML:
${response.data.substring(0, 4000)}
`;

  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'system',
        content: 'You analyze HTML forms and extract their structure and validation rules.'
      },
      {
        role: 'user',
        content: prompt
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(completion.choices[0].message.content);
}

// Usage
extractFormSchema('https://example.com/contact-form')
  .then(schema => {
    console.log('Form Schema:', schema);
    // Use schema to programmatically fill and submit forms
  });
Example 6: Sentiment and Content Analysis
Combine web scraping with AI-powered content analysis to extract not just data, but insights.
import json

import openai
import requests

client = openai.OpenAI(api_key="your-api-key")

def scrape_with_analysis(url):
    """Scrape reviews with sentiment analysis"""
    response = requests.get(url)

    prompt = f"""
Extract all customer reviews from this page.
For each review, provide:
- Reviewer name
- Rating (out of 5)
- Review text
- Review date
- Sentiment (positive/negative/neutral)
- Key topics mentioned (as array)

Return as JSON with a "reviews" array.

HTML:
{response.text[:4000]}
"""
    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You extract and analyze customer reviews from HTML."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(completion.choices[0].message.content)

# Analyze reviews
reviews_data = scrape_with_analysis("https://example.com/product/reviews")

# Calculate statistics
positive_reviews = sum(1 for r in reviews_data['reviews'] if r['sentiment'] == 'positive')
print(f"Positive reviews: {positive_reviews}/{len(reviews_data['reviews'])}")
Best Practices and Optimization
1. Token Management
ChatGPT has token limits, so preprocessing HTML is crucial:
from bs4 import BeautifulSoup

def clean_html_for_gpt(html):
    """Remove unnecessary HTML to reduce tokens"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script, style, meta, and link tags
    for tag in soup(['script', 'style', 'meta', 'link']):
        tag.decompose()

    # Return only the visible text content
    return soup.get_text(separator=' ', strip=True)
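Character slicing such as html_content[:4000] is only a rough proxy for tokens. If you want to trim by actual token count, a tokenizer library such as tiktoken can do it. This is a minimal sketch that assumes tiktoken is installed and reuses clean_html_for_gpt from above; html_content stands in for a fetched page from the earlier examples:

import tiktoken

def truncate_to_tokens(text, model="gpt-3.5-turbo", max_tokens=3000):
    """Cut text down to a fixed token budget before sending it to the API."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Keep only the first max_tokens tokens and decode them back to text
    return encoding.decode(tokens[:max_tokens])

# Clean first, then truncate by tokens rather than characters
cleaned = clean_html_for_gpt(html_content)
prompt_text = truncate_to_tokens(cleaned, max_tokens=3000)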
2. Cost Optimization
Use GPT-3.5-turbo for simple extractions and GPT-4 for complex tasks:
def choose_model(complexity):
    """Select appropriate model based on task complexity"""
    if complexity == 'simple':
        return 'gpt-3.5-turbo'
    elif complexity == 'complex':
        return 'gpt-4-turbo-preview'
    return 'gpt-3.5-turbo'

# Use the cheaper model when possible
model = choose_model('simple')
3. Error Handling
Always implement robust error handling when integrating ChatGPT into your web scraping workflow:
import json

from openai import OpenAI, OpenAIError

def safe_extract(html_content, prompt, max_retries=3):
    """Extract data with retry logic"""
    client = OpenAI(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You extract structured data from HTML."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            # Validate JSON response
            data = json.loads(completion.choices[0].message.content)
            return data
        except OpenAIError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
        except json.JSONDecodeError as e:
            print(f"Invalid JSON response: {e}")
            if attempt == max_retries - 1:
                return None

    return None
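Rate limits are a common failure mode in scraping workloads, so it can also help to back off between retries rather than retrying immediately. This is a minimal sketch, assuming the v1 openai client and a prompt that already asks for JSON (required by json_object mode):

import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key="your-api-key")

def extract_with_backoff(prompt, max_retries=5):
    """Retry rate-limited requests with exponential backoff."""
    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You extract structured data from HTML."},
                    {"role": "user", "content": prompt},
                ],
                response_format={"type": "json_object"},
            )
            return completion.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... before trying again
            time.sleep(2 ** attempt)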
Combining with Traditional Tools
For production systems, combine ChatGPT with traditional scraping tools:
import json

import openai
import requests
from bs4 import BeautifulSoup

def hybrid_scraping(url):
    """Use BeautifulSoup for structure, ChatGPT for content understanding"""
    # Fetch and parse HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract article body using selectors
    article_body = soup.find('article') or soup.find('div', class_='content')

    if article_body:
        # Use ChatGPT to understand and structure the content
        client = openai.OpenAI(api_key="your-api-key")

        prompt = f"""
Analyze this article content and extract:
- Main topic
- Key points (as bullet array)
- Mentioned entities (people, companies, places)
- Technical terms defined

Content:
{article_body.get_text()[:3000]}

Return as JSON.
"""
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You analyze article content."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)

    return None
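A quick usage sketch for the hybrid approach (the URL is a placeholder):

# Usage
insights = hybrid_scraping("https://example.com/blog/ai-trends-2024")
if insights:
    print(json.dumps(insights, indent=2))
else:
    print("No article body found on the page")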
Conclusion
ChatGPT transforms web scraping by adding intelligence and flexibility to data extraction. These examples demonstrate practical applications from simple product scraping to complex content analysis. While ChatGPT adds API costs, it significantly reduces development time and creates more maintainable scraping solutions that adapt to website changes.
For production deployments, consider combining ChatGPT with traditional tools, implementing proper error handling, and optimizing token usage to balance cost and performance. The key is choosing the right tool for each part of your scraping pipeline: traditional selectors for structure, ChatGPT for semantic understanding.