How to Scrape Product Data Using ChatGPT

ChatGPT and other Large Language Models (LLMs) have revolutionized web scraping by providing intelligent data extraction capabilities that can understand context, handle complex layouts, and convert unstructured HTML into structured data. This guide shows you how to leverage ChatGPT's API to scrape product data from e-commerce websites effectively.

Why Use ChatGPT for Product Data Scraping?

Traditional web scraping relies on CSS selectors or XPath to extract data, which breaks when website layouts change. ChatGPT offers several advantages:

  • Layout-agnostic extraction: No need to write complex selectors
  • Intelligent parsing: Understands product attributes even when markup varies
  • Automatic data normalization: Converts prices, sizes, and other attributes to consistent formats
  • Multi-language support: Extracts data from international e-commerce sites
  • Context awareness: Distinguishes between regular price and sale price, product images vs. thumbnails, etc.

Prerequisites

Before you begin, you'll need:

  1. An OpenAI API key from platform.openai.com
  2. Python 3.7+ or Node.js 14+ installed
  3. A web scraping tool to fetch HTML content (requests, axios, or a headless browser)
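If you're following along in Python, the dependencies used throughout this guide can be installed in one go (package names as published on PyPI; Playwright also needs a one-time browser download):

pip install openai requests beautifulsoup4 playwright tenacity pydantic
playwright install chromium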

Basic Workflow for Product Data Scraping

The typical workflow involves three steps:

  1. Fetch the HTML content from the product page
  2. Send the HTML to ChatGPT with a prompt specifying what data to extract
  3. Parse the structured response and store the product data

Python Implementation

Here's a complete example using Python with the OpenAI API:

import requests
import json
from openai import OpenAI

# Initialize OpenAI client (in production, read the key from the OPENAI_API_KEY environment variable)
client = OpenAI(api_key="your-api-key-here")

def fetch_product_page(url):
    """Fetch HTML content from a product page"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    return response.text

def extract_product_data(html_content):
    """Use ChatGPT to extract structured product data"""
    truncated_html = html_content[:8000]  # Truncate to stay within token limits
    prompt = f"""
    Extract product information from the following HTML and return it as JSON.

    Required fields:
    - name: product name
    - price: current price (numeric value only)
    - currency: currency code
    - description: product description
    - images: array of image URLs
    - availability: in stock status (boolean)
    - sku: product SKU or ID
    - brand: brand name
    - rating: average rating (numeric)
    - reviews_count: number of reviews

    HTML content:
    {truncated_html}

    Return only valid JSON, no additional text.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cost-effective for scraping
        messages=[
            {"role": "system", "content": "You are a data extraction assistant that returns only valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # Deterministic output
        response_format={"type": "json_object"}  # JSON mode guarantees parseable output
    )

    # Parse the JSON response
    product_data = json.loads(response.choices[0].message.content)
    return product_data

# Example usage
url = "https://example.com/products/wireless-headphones"
html = fetch_product_page(url)
product = extract_product_data(html)

print(json.dumps(product, indent=2))
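For a typical product page, the parsed result might look like this (illustrative values only; the actual fields depend on what the page exposes):

{
  "name": "Wireless Headphones X200",
  "price": 79.99,
  "currency": "USD",
  "description": "Over-ear wireless headphones with active noise cancellation.",
  "images": ["https://example.com/images/x200-front.jpg"],
  "availability": true,
  "sku": "X200-BLK",
  "brand": "ExampleAudio",
  "rating": 4.5,
  "reviews_count": 312
}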

JavaScript/Node.js Implementation

Here's the equivalent implementation in JavaScript:

const axios = require('axios');
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function fetchProductPage(url) {
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  });
  return response.data;
}

async function extractProductData(htmlContent) {
  const prompt = `
    Extract product information from the following HTML and return it as JSON.

    Required fields:
    - name: product name
    - price: current price (numeric value only)
    - currency: currency code
    - description: product description
    - images: array of image URLs
    - availability: in stock status (boolean)
    - sku: product SKU or ID
    - brand: brand name
    - rating: average rating (numeric)
    - reviews_count: number of reviews

    HTML content:
    ${htmlContent.substring(0, 8000)}

    Return only valid JSON, no additional text.
  `;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a data extraction assistant that returns only valid JSON.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0,
    response_format: { type: 'json_object' } // JSON mode guarantees parseable output
  });

  const productData = JSON.parse(response.choices[0].message.content);
  return productData;
}

// Example usage
(async () => {
  const url = 'https://example.com/products/wireless-headphones';
  const html = await fetchProductPage(url);
  const product = await extractProductData(html);

  console.log(JSON.stringify(product, null, 2));
})();

Using Function Calling for Structured Output

OpenAI's function calling feature ensures more reliable structured output. Current SDK versions express functions through the tools parameter (the older functions/function_call parameters are deprecated):

def extract_product_with_function_calling(html_content):
    """Extract product data using function calling (the tools API)"""

    tools = [{
        "type": "function",
        "function": {
            "name": "save_product_data",
            "description": "Save extracted product information",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Product name"},
                    "price": {"type": "number", "description": "Current price"},
                    "currency": {"type": "string", "description": "Currency code (USD, EUR, etc.)"},
                    "description": {"type": "string", "description": "Product description"},
                    "images": {"type": "array", "items": {"type": "string"}, "description": "Image URLs"},
                    "availability": {"type": "boolean", "description": "Is product in stock"},
                    "sku": {"type": "string", "description": "Product SKU"},
                    "brand": {"type": "string", "description": "Brand name"},
                    "rating": {"type": "number", "description": "Average rating"},
                    "reviews_count": {"type": "integer", "description": "Number of reviews"}
                },
                "required": ["name", "price", "currency"]
            }
        }
    }]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract product data from HTML."},
            {"role": "user", "content": f"Extract product data from:\n\n{html_content[:8000]}"}
        ],
        tools=tools,
        # Force the model to call our function so the output is always structured
        tool_choice={"type": "function", "function": {"name": "save_product_data"}}
    )

    # The arguments arrive as a JSON string on the tool call
    function_args = json.loads(
        response.choices[0].message.tool_calls[0].function.arguments
    )

    return function_args
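Usage mirrors the earlier example; the URL is a hypothetical stand-in for a real product page:

html = fetch_product_page("https://example.com/products/wireless-headphones")
product = extract_product_with_function_calling(html)
print(product["name"], product["price"])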

Handling JavaScript-Rendered Pages

Many modern e-commerce sites render content with JavaScript. For these cases, combine a headless browser with ChatGPT for intelligent data extraction:

from playwright.sync_api import sync_playwright

def scrape_dynamic_product_page(url):
    """Scrape JavaScript-rendered product pages"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content (the selector is site-specific; adjust per target)
        page.goto(url)
        page.wait_for_selector('.product-details', timeout=5000)

        # Get the rendered HTML
        html_content = page.content()
        browser.close()

        # Extract data with ChatGPT
        return extract_product_data(html_content)
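If you don't know a reliable selector for the target site up front, waiting for network activity to settle is a common fallback. Here's a sketch (note that networkidle can stall on pages with long-polling or analytics beacons, so prefer a concrete selector when you have one):

def scrape_dynamic_page_no_selector(url):
    """Fallback: wait for network idle instead of a specific selector"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # waits until the network is quiet for ~500ms
        html_content = page.content()
        browser.close()
        return extract_product_data(html_content)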

Optimizing Token Usage and Costs

Product pages often contain large HTML files. Here's how to optimize ChatGPT token usage:

1. Pre-filter HTML Content

from bs4 import BeautifulSoup

def extract_relevant_html(full_html):
    """Extract only product-relevant sections"""
    soup = BeautifulSoup(full_html, 'html.parser')

    # Remove unnecessary elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Keep only product-related sections
    product_section = soup.find(['div', 'section'],
                                class_=lambda x: x and 'product' in x.lower())

    return str(product_section) if product_section else str(soup)
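To confirm the trimmed HTML actually fits your budget before calling the API, you can count tokens locally with tiktoken. A minimal sketch, assuming a recent tiktoken where o200k_base is the encoding used by the gpt-4o model family:

import tiktoken

def count_tokens(text):
    """Count tokens locally before sending content to the API"""
    encoding = tiktoken.get_encoding("o200k_base")
    return len(encoding.encode(text))

relevant_html = extract_relevant_html(full_html)  # full_html fetched earlier
print(f"Prompt will use ~{count_tokens(relevant_html)} tokens")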

2. Use Smaller Models

For straightforward product pages, use gpt-4o-mini instead of gpt-4o:

# gpt-4o-mini: ~$0.15 per 1M input tokens
# gpt-4o: ~$2.50 per 1M input tokens
model = "gpt-4o-mini"  # ~94% cheaper on input tokens

3. Batch Processing

Process multiple products in a single request when possible:

def extract_multiple_products(html_contents):
    """Extract data from multiple product pages in one call"""
    combined_prompt = (
        "Extract product data from these HTML sections. "
        'Return a JSON object with a "products" array, one object per product.\n\n'
    )

    for i, html in enumerate(html_contents):
        combined_prompt += f"Product {i+1}:\n{html[:2000]}\n\n"

    # Process all sections in a single request
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": combined_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)["products"]
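Calling it with a batch of pages (hypothetical URLs) looks like this:

urls = [
    "https://example.com/products/item-1",
    "https://example.com/products/item-2",
]
pages = [fetch_product_page(u) for u in urls]
products = extract_multiple_products(pages)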

Error Handling and Validation

Always validate the extracted data before storing it; LLMs can occasionally return malformed, missing, or hallucinated fields:

from pydantic import BaseModel, ValidationError
from typing import List, Optional

class Product(BaseModel):
    name: str
    price: float
    currency: str
    description: Optional[str] = None
    images: List[str] = []
    availability: bool = True
    sku: Optional[str] = None
    brand: Optional[str] = None
    rating: Optional[float] = None
    reviews_count: Optional[int] = None

def validate_product_data(raw_data):
    """Validate and clean extracted product data"""
    try:
        product = Product(**raw_data)
        return product.model_dump()  # Pydantic v2; use product.dict() on v1
    except ValidationError as e:
        print(f"Validation error: {e}")
        return None
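Wiring validation into the earlier pipeline takes only a few lines:

raw = extract_product_data(fetch_product_page("https://example.com/products/item-123"))
validated = validate_product_data(raw)
if validated:
    print(f"{validated['name']}: {validated['price']} {validated['currency']}")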

Complete Production-Ready Example

Here's a full implementation with error handling, retries, and logging:

import logging
from tenacity import retry, stop_after_attempt, wait_exponential
# Also uses requests, json, BeautifulSoup, OpenAI, and validate_product_data
# from the earlier sections

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductScraper:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    def fetch_html(self, url):
        """Fetch HTML with retry logic"""
        try:
            response = requests.get(url, timeout=10, headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            })
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            logger.error(f"Failed to fetch {url}: {e}")
            raise

    def clean_html(self, html):
        """Remove unnecessary HTML elements"""
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup(['script', 'style', 'nav', 'footer']):
            tag.decompose()
        return str(soup)[:8000]

    @retry(stop=stop_after_attempt(2), reraise=True)
    def extract_data(self, html):
        """Extract product data using ChatGPT"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Extract product data as JSON."},
                {"role": "user", "content": f"Extract from:\n{html}"}
            ],
            temperature=0,
            max_tokens=1000,
            response_format={"type": "json_object"}  # JSON mode guarantees parseable output
        )

        # Let parse errors propagate: swallowing them here would silently
        # defeat the @retry decorator
        data = json.loads(response.choices[0].message.content)
        return validate_product_data(data)

    def scrape_product(self, url):
        """Main scraping method"""
        logger.info(f"Scraping {url}")

        html = self.fetch_html(url)
        cleaned_html = self.clean_html(html)

        try:
            product_data = self.extract_data(cleaned_html)
        except Exception as e:
            logger.error(f"Extraction failed: {e}")
            product_data = None

        if product_data:
            logger.info(f"Successfully extracted: {product_data.get('name')}")
            return product_data
        else:
            logger.warning(f"Failed to extract data from {url}")
            return None

# Usage
scraper = ProductScraper(api_key="your-api-key")
product = scraper.scrape_product("https://example.com/products/item-123")

Best Practices

  1. Respect robots.txt: Always check the site's robots.txt file
  2. Rate limiting: Add delays between requests to avoid overwhelming servers (see the sketch after this list)
  3. Use caching: Cache HTML responses to reduce duplicate API calls (also sketched below)
  4. Monitor costs: Track token usage and set spending limits in OpenAI dashboard
  5. Fallback mechanisms: Have traditional CSS selector-based scraping as backup
  6. Legal compliance: Ensure you have permission to scrape the target website
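Here's a minimal sketch of points 2 and 3 combined: a file-based HTML cache plus a fixed delay between live requests. The delay value and cache directory are illustrative; tune them per site and respect any crawl-delay the site declares:

import time
import hashlib
from pathlib import Path

CACHE_DIR = Path("html_cache")
CACHE_DIR.mkdir(exist_ok=True)
REQUEST_DELAY = 2  # seconds between live requests (illustrative; tune per site)

def fetch_with_cache(url):
    """Return cached HTML if available; otherwise fetch politely and cache it"""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    time.sleep(REQUEST_DELAY)  # crude rate limiting
    html = fetch_product_page(url)  # defined in the Python implementation above
    cache_file.write_text(html, encoding="utf-8")
    return html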

Cost Estimation

For typical product pages (~5,000 input tokens, ~500 output tokens), input costs alone come to:

  • GPT-4o-mini: ~$0.0008 per product
  • GPT-4o: ~$0.013 per product

Scraping 10,000 products:

  • GPT-4o-mini: ~$8
  • GPT-4o: ~$130

Output tokens add roughly 40% on top of these figures at current rates, so budget accordingly.
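A small helper makes it easy to re-run these numbers as prices or token counts change (defaults are the gpt-4o-mini rates per 1M tokens at the time of writing; substitute current pricing):

def estimate_cost(n_products, input_tokens=5000, output_tokens=500,
                  input_price=0.15, output_price=0.60):
    """Estimate total API cost in USD; prices are per 1M tokens"""
    per_product = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return n_products * per_product

print(f"${estimate_cost(10_000):.2f}")  # ~$10.50 for 10,000 products on gpt-4o-mini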

Alternative: Using WebScrapingAI API

For production use cases, consider using a dedicated AI-powered web scraping API that combines headless browsing with built-in LLM extraction, offering better reliability and easier implementation than building your own solution.

Conclusion

ChatGPT provides a powerful alternative to traditional web scraping methods for extracting product data. By combining intelligent parsing with structured output formats, you can build robust scrapers that adapt to layout changes and handle complex e-commerce sites. Remember to optimize token usage, implement proper error handling, and always respect website terms of service.

For more advanced scenarios, explore combining ChatGPT with headless browsers for JavaScript-heavy sites, or consider using specialized web scraping APIs that integrate LLM capabilities out of the box.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

