What are web scraping examples using ChatGPT?

ChatGPT and the OpenAI API can transform web scraping from a rigid, selector-based process into an intelligent, context-aware data extraction workflow. By leveraging large language models, you can extract structured data from HTML without writing complex parsing logic or maintaining fragile CSS selectors.

This guide explores practical examples of using ChatGPT for web scraping tasks, from simple HTML parsing to complex data extraction scenarios.

Understanding ChatGPT for Web Scraping

ChatGPT excels at understanding unstructured content and converting it into structured data. Unlike traditional web scraping that relies on XPath or CSS selectors, ChatGPT can interpret the semantic meaning of content, making it resilient to HTML structure changes.

Key Advantages

  • Flexibility: Adapts to different page layouts without code changes
  • Context awareness: Understands content meaning, not just structure
  • Natural language instructions: Define extraction rules in plain English
  • Robust to changes: Less brittle than selector-based approaches
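To see the difference concretely, compare a selector-based extraction with a prompt-based one. This is a minimal illustrative sketch; the product-title class name and the toy HTML snippet are made up for demonstration:

from bs4 import BeautifulSoup

# A toy snippet standing in for a real product page
html_content = '<h1 class="product-title">Dell XPS 15 Laptop</h1>'

# Selector-based extraction breaks as soon as the class is renamed
soup = BeautifulSoup(html_content, 'html.parser')
title = soup.select_one('h1.product-title').get_text(strip=True)

# Prompt-based extraction targets meaning, not markup, so the same
# instruction keeps working after a site redesign
prompt = f"Return the product name as JSON.\n\nHTML:\n{html_content}"

The examples below all build on this prompt-based approach.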

Example 1: Extracting Product Information

Let's start with a common use case: extracting product details from an e-commerce page.

Python Implementation

import json

import openai
import requests

# Fetch the HTML content
url = "https://example.com/product/laptop"
response = requests.get(url)
html_content = response.text

# Initialize OpenAI client
client = openai.OpenAI(api_key="your-api-key")

# Create a prompt for ChatGPT
prompt = f"""
Extract the following product information from this HTML and return it as JSON:
- Product name
- Price
- Rating (out of 5)
- Number of reviews
- Main features (as an array)
- Availability status

HTML (truncated below to stay within the model's context window):
{html_content[:4000]}

Return only valid JSON, no additional text.
"""

# Call ChatGPT API
completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant that extracts structured data from HTML."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)

# Parse the JSON response
product_data = json.loads(completion.choices[0].message.content)
print(json.dumps(product_data, indent=2))

Expected Output

{
  "product_name": "Dell XPS 15 Laptop",
  "price": "$1,299.99",
  "rating": 4.5,
  "number_of_reviews": 328,
  "main_features": [
    "15.6-inch 4K display",
    "Intel Core i7 processor",
    "16GB RAM",
    "512GB SSD"
  ],
  "availability_status": "In Stock"
}
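One practical note on this output: the price arrives as a display string rather than a number. If you need a numeric value for comparisons or storage, a small post-processing step helps (a sketch assuming the JSON above):

import re

# Strip currency symbols and thousands separators before converting
price_value = float(re.sub(r'[^\d.]', '', product_data['price']))
print(price_value)  # 1299.99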

Example 2: Scraping Article Metadata

Extract metadata from blog posts or news articles, including author, publication date, and tags.

JavaScript/Node.js Implementation

const OpenAI = require('openai');
const axios = require('axios');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function scrapeArticleMetadata(url) {
  // Fetch HTML content
  const response = await axios.get(url);
  const html = response.data;

  // Prepare the extraction prompt
  const prompt = `
Extract the following article metadata from this HTML:
- Title
- Author name
- Publication date (in ISO 8601 format)
- Reading time (in minutes)
- Tags/Categories (as array)
- Article summary (2-3 sentences)

HTML:
${html.substring(0, 5000)}

Return as JSON only.
  `;

  // Call ChatGPT
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'system',
        content: 'You extract structured metadata from article HTML. Always return valid JSON.'
      },
      {
        role: 'user',
        content: prompt
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(completion.choices[0].message.content);
}

// Usage
scrapeArticleMetadata('https://example.com/blog/ai-trends-2024')
  .then(data => console.log(data))
  .catch(err => console.error(err));

Example 3: Batch Processing Multiple Pages

When scraping multiple pages, you can combine traditional HTTP requests with ChatGPT-powered data extraction for efficiency.

Python Batch Scraping

import openai
import requests
from concurrent.futures import ThreadPoolExecutor
import json

client = openai.OpenAI(api_key="your-api-key")

def extract_with_chatgpt(html_content, schema):
    """Extract data using ChatGPT based on a schema"""
    prompt = f"""
Extract data according to this schema:
{json.dumps(schema, indent=2)}

From this HTML:
{html_content[:3000]}

Return valid JSON matching the schema.
    """

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(completion.choices[0].message.content)

def scrape_listing_page(url):
    """Scrape a single listing page"""
    response = requests.get(url)

    schema = {
        "listings": [
            {
                "title": "string",
                "price": "number",
                "location": "string",
                "bedrooms": "number",
                "bathrooms": "number"
            }
        ]
    }

    return extract_with_chatgpt(response.text, schema)

# Scrape multiple pages in parallel
urls = [
    "https://example.com/listings?page=1",
    "https://example.com/listings?page=2",
    "https://example.com/listings?page=3"
]

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(scrape_listing_page, urls))

# Combine all results
all_listings = []
for result in results:
    all_listings.extend(result.get('listings', []))

print(f"Scraped {len(all_listings)} total listings")

Example 4: Intelligent Table Extraction

ChatGPT excels at extracting and structuring data from HTML tables, even when they have complex layouts.

import json

import openai
import requests

client = openai.OpenAI(api_key="your-api-key")

def scrape_table_data(url):
    """Extract table data intelligently"""
    response = requests.get(url)
    html = response.text

    prompt = f"""
Find all tables in this HTML and extract their data.
For each table:
1. Identify what the table represents
2. Extract headers
3. Extract all rows as structured data

Return as JSON with this structure:
{{
  "tables": [
    {{
      "description": "what this table shows",
      "headers": ["col1", "col2", ...],
      "rows": [
        {{"col1": "value", "col2": "value"}},
        ...
      ]
    }}
  ]
}}

HTML:
{html[:4000]}
    """

    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are an expert at extracting tabular data from HTML."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(completion.choices[0].message.content)

# Example usage
table_data = scrape_table_data("https://example.com/statistics")
print(json.dumps(table_data, indent=2))

Example 5: Form Data Extraction and Validation

Extract form fields and their validation rules from HTML forms.

const OpenAI = require('openai');
const axios = require('axios');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractFormSchema(url) {
  const response = await axios.get(url);

  const prompt = `
Analyze this HTML form and extract:
1. All input fields with their names and types
2. Required fields
3. Validation rules (max length, patterns, etc.)
4. Select/dropdown options
5. Form action URL and method

Return as structured JSON.

HTML:
${response.data.substring(0, 4000)}
  `;

  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'system',
        content: 'You analyze HTML forms and extract their structure and validation rules.'
      },
      {
        role: 'user',
        content: prompt
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(completion.choices[0].message.content);
}

// Usage
extractFormSchema('https://example.com/contact-form')
  .then(schema => {
    console.log('Form Schema:', schema);
    // Use schema to programmatically fill and submit forms
  });

Example 6: Sentiment and Content Analysis

Combine web scraping with AI-powered content analysis to extract not just data, but insights.

import json

import openai
import requests

client = openai.OpenAI(api_key="your-api-key")

def scrape_with_analysis(url):
    """Scrape reviews with sentiment analysis"""
    response = requests.get(url)

    prompt = f"""
Extract all customer reviews from this page.
For each review, provide:
- Reviewer name
- Rating (out of 5)
- Review text
- Review date
- Sentiment (positive/negative/neutral)
- Key topics mentioned (as array)

Return as JSON with a "reviews" array.

HTML:
{response.text[:4000]}
    """

    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You extract and analyze customer reviews from HTML."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(completion.choices[0].message.content)

# Analyze reviews
reviews_data = scrape_with_analysis("https://example.com/product/reviews")

# Calculate statistics
positive_reviews = sum(1 for r in reviews_data['reviews'] if r['sentiment'] == 'positive')
print(f"Positive reviews: {positive_reviews}/{len(reviews_data['reviews'])}")

Best Practices and Optimization

1. Token Management

ChatGPT has token limits, so preprocessing HTML is crucial:

from bs4 import BeautifulSoup, Comment

def clean_html_for_gpt(html):
    """Remove non-content markup to reduce token usage"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and other non-content tags
    for tag in soup(['script', 'style', 'meta', 'link']):
        tag.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Return text content with collapsed whitespace
    return soup.get_text(separator=' ', strip=True)
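A quick usage sketch (the URL is a placeholder): cleaning the page before building the prompt typically shrinks the payload dramatically. Note that get_text() discards markup entirely, so keep the raw HTML around if attributes such as href matter to your extraction:

import requests

raw_html = requests.get("https://example.com/product/laptop").text
cleaned = clean_html_for_gpt(raw_html)
print(f"Reduced from {len(raw_html):,} to {len(cleaned):,} characters")

# Use `cleaned` in place of raw HTML when building extraction prompts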

2. Cost Optimization

Use GPT-3.5-turbo for simple extractions and GPT-4 for complex tasks:

def choose_model(complexity):
    """Select appropriate model based on task complexity"""
    if complexity == 'simple':
        return 'gpt-3.5-turbo'
    elif complexity == 'complex':
        return 'gpt-4-turbo-preview'
    return 'gpt-3.5-turbo'

# Use the cheaper model when possible
model = choose_model('simple')

3. Error Handling

Always implement robust error handling when integrating ChatGPT into your web scraping workflow:

import json
from openai import OpenAI, OpenAIError

def safe_extract(prompt, max_retries=3):
    """Extract data with retry logic; the prompt should already embed the HTML"""
    client = OpenAI(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You extract structured data from HTML."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )

            # Validate JSON response
            data = json.loads(completion.choices[0].message.content)
            return data

        except OpenAIError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
        except json.JSONDecodeError as e:
            print(f"Invalid JSON response: {e}")
            if attempt == max_retries - 1:
                return None

    return None
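A hypothetical usage sketch (the HTML placeholder stands in for a page fetched earlier with requests):

html_content = "<html>...</html>"  # placeholder for a fetched page

prompt = f"""
Extract the product name and price as JSON.

HTML:
{html_content[:3000]}
"""

data = safe_extract(prompt)
if data is None:
    print("Extraction failed after all retries")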

Combining with Traditional Tools

For production systems, combine ChatGPT with traditional scraping tools:

import json

import openai
import requests
from bs4 import BeautifulSoup

def hybrid_scraping(url):
    """Use BeautifulSoup for structure, ChatGPT for content understanding"""

    # Fetch and parse HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract article body using selectors
    article_body = soup.find('article') or soup.find('div', class_='content')

    if article_body:
        # Use ChatGPT to understand and structure the content
        client = openai.OpenAI(api_key="your-api-key")

        prompt = f"""
Analyze this article content and extract:
- Main topic
- Key points (as bullet array)
- Mentioned entities (people, companies, places)
- Technical terms defined

Content:
{article_body.get_text()[:3000]}

Return as JSON.
        """

        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You analyze article content."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}
        )

        return json.loads(completion.choices[0].message.content)

    return None

Conclusion

ChatGPT transforms web scraping by adding intelligence and flexibility to data extraction. These examples demonstrate practical applications from simple product scraping to complex content analysis. While ChatGPT adds API costs, it significantly reduces development time and creates more maintainable scraping solutions that adapt to website changes.

For production deployments, consider combining ChatGPT with traditional tools, implementing proper error handling, and optimizing token usage to balance cost and performance. The key is choosing the right tool for each part of your scraping pipeline: traditional selectors for structure, ChatGPT for semantic understanding.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
