How Do I Use GPT for Web Scraping Tasks?

GPT (Generative Pre-trained Transformer) models from OpenAI can revolutionize web scraping by enabling intelligent data extraction through natural language instructions. Instead of writing complex parsing logic with brittle CSS selectors, you can describe what data you want to extract and let GPT understand the HTML contextually. This approach combines traditional web scraping for fetching content with AI-powered parsing for extracting structured data.

Understanding GPT-Based Web Scraping

GPT models excel at understanding unstructured content and extracting meaningful information based on context rather than rigid patterns. When applied to web scraping, GPT can:

  • Extract data from complex layouts: Parse information scattered across multiple elements without knowing exact selectors
  • Handle layout changes: Adapt to website redesigns since it understands content semantically
  • Process unstructured text: Extract specific facts from paragraphs, articles, or poorly structured HTML
  • Interpret relationships: Understand how different page elements relate to each other
  • Support multiple languages: Work with content in any language and optionally translate results

Traditional web scraping fails when websites change their structure or when data isn't consistently formatted. GPT-based scraping remains resilient because it comprehends the meaning of content rather than just its HTML structure.
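
As a quick illustration, here is a minimal contrast between selector-based extraction and a GPT prompt for the same data (the HTML snippet and class names are invented for the example):

from bs4 import BeautifulSoup

html = "<div class='product'><h1 class='product-title'>Acme Laptop</h1><span class='price'>$999</span></div>"

# Traditional approach: breaks as soon as the site renames .product-title or .price
soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('.product-title').get_text(strip=True)
price = soup.select_one('.price').get_text(strip=True)

# GPT approach: describe the data instead of its location; the prompt keeps working
# after a redesign as long as the meaning of the content is unchanged
prompt = f"""Return JSON with keys "name" and "price" extracted from this HTML:

{html}"""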

Core Approaches to Using GPT for Web Scraping

1. Direct OpenAI API Integration

The most straightforward approach is fetching web content with standard HTTP libraries and then using OpenAI's API to parse and extract data.

Python Implementation

import requests
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key='YOUR_OPENAI_API_KEY')

def scrape_with_gpt(url, extraction_fields):
    """
    Scrape a webpage using GPT for data extraction

    Args:
        url: The webpage URL to scrape
        extraction_fields: Dictionary describing what data to extract
    """
    # Fetch the webpage
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    html_content = response.text

    # Build extraction prompt
    field_descriptions = '\n'.join([f"- {key}: {value}" for key, value in extraction_fields.items()])

    prompt = f"""
    Extract the following information from this HTML content:

    {field_descriptions}

    Return the data as a valid JSON object with these exact keys: {', '.join(extraction_fields.keys())}

    HTML Content:
    {html_content[:12000]}
    """

    # Use GPT to extract structured data
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a precise web scraping assistant. Extract only the requested information from HTML and return it as valid JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        response_format={"type": "json_object"},
        temperature=0  # Use 0 for consistent, deterministic results
    )

    # Parse and return the extracted data
    result = json.loads(completion.choices[0].message.content)
    return result

# Example usage
product_data = scrape_with_gpt(
    'https://example.com/products/laptop',
    {
        'product_name': 'Full product name',
        'price': 'Current price as a number (extract just the numeric value)',
        'currency': 'Currency code (USD, EUR, etc.)',
        'in_stock': 'Boolean - whether the product is available',
        'specifications': 'List of key technical specifications',
        'rating': 'Average customer rating out of 5',
        'review_count': 'Number of customer reviews'
    }
)

print(json.dumps(product_data, indent=2))

JavaScript Implementation

const axios = require('axios');
const OpenAI = require('openai');

const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY
});

async function scrapeWithGPT(url, extractionFields) {
    // Fetch the webpage
    const response = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    });

    const html = response.data;

    // Build field descriptions
    const fieldDescriptions = Object.entries(extractionFields)
        .map(([key, desc]) => `- ${key}: ${desc}`)
        .join('\n');

    const prompt = `
Extract the following information from this HTML:

${fieldDescriptions}

Return as valid JSON with keys: ${Object.keys(extractionFields).join(', ')}

HTML:
${html.substring(0, 12000)}
    `;

    // Use GPT for extraction
    const completion = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [
            {
                role: "system",
                content: "You are a web scraping assistant. Extract data from HTML and return valid JSON only."
            },
            {
                role: "user",
                content: prompt
            }
        ],
        response_format: { type: "json_object" },
        temperature: 0
    });

    const data = JSON.parse(completion.choices[0].message.content);
    return data;
}

// Example usage
scrapeWithGPT('https://example.com/article', {
    'title': 'Article title',
    'author': 'Author name',
    'publish_date': 'Publication date',
    'reading_time': 'Estimated reading time in minutes',
    'tags': 'Array of article tags or categories',
    'summary': 'Brief summary (2-3 sentences)'
})
.then(data => console.log(JSON.stringify(data, null, 2)))
.catch(error => console.error('Scraping error:', error));

2. Combining Browser Automation with GPT

For dynamic websites requiring JavaScript execution, combine browser automation tools with GPT. This is essential when handling AJAX requests and dynamic content.

from playwright.sync_api import sync_playwright
from openai import OpenAI
import json

client = OpenAI(api_key='YOUR_OPENAI_API_KEY')

def scrape_dynamic_with_gpt(url, extraction_instructions, wait_for_selector=None):
    """
    Scrape JavaScript-heavy websites using Playwright + GPT

    Args:
        url: Target URL
        extraction_instructions: What data to extract
        wait_for_selector: Optional CSS selector to wait for before scraping
    """
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate to page
        page.goto(url, wait_until='networkidle')

        # Wait for specific content if needed
        if wait_for_selector:
            page.wait_for_selector(wait_for_selector, timeout=10000)

        # Get fully rendered HTML
        html_content = page.content()

        # Close browser
        browser.close()

        # Use GPT to extract data
        prompt = f"""
        {extraction_instructions}

        Return the data as valid JSON.

        HTML Content:
        {html_content[:15000]}
        """

        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract structured data from HTML. Return only valid JSON."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0
        )

        return json.loads(completion.choices[0].message.content)

# Example: Scraping a single-page application
reviews = scrape_dynamic_with_gpt(
    'https://example.com/product/reviews',
    """
    Extract all customer reviews from this page. For each review, get:
    - reviewer_name: Name of the reviewer
    - rating: Star rating (1-5)
    - review_date: Date the review was posted
    - review_text: Full review text
    - helpful_votes: Number of people who found the review helpful

    Return as JSON with a "reviews" array containing these objects.
    """,
    wait_for_selector='.review-list'
)

print(json.dumps(reviews, indent=2))

3. Using Puppeteer with GPT in JavaScript

const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function scrapeDynamicPageWithGPT(url, extractionPrompt, options = {}) {
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });

    const page = await browser.newPage();

    // Set viewport for consistent rendering
    await page.setViewport({ width: 1920, height: 1080 });

    // Navigate and wait for content
    await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
    });

    // Wait for specific element if provided
    if (options.waitForSelector) {
        await page.waitForSelector(options.waitForSelector);
    }

    // Additional wait time for lazy-loaded content
    // (page.waitForTimeout was removed in recent Puppeteer versions, so use a plain delay)
    if (options.additionalWait) {
        await new Promise(resolve => setTimeout(resolve, options.additionalWait));
    }

    // Get rendered HTML
    const html = await page.content();
    await browser.close();

    // Extract data with GPT
    const completion = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [
            {
                role: "system",
                content: "Extract data from HTML and return as valid JSON."
            },
            {
                role: "user",
                content: `${extractionPrompt}\n\nHTML:\n${html.substring(0, 15000)}`
            }
        ],
        response_format: { type: "json_object" },
        temperature: 0
    });

    return JSON.parse(completion.choices[0].message.content);
}

// Example usage
scrapeDynamicPageWithGPT(
    'https://example.com/products',
    `Extract all products displayed on this page. For each product get:
    - name: Product name
    - price: Price (numeric value only)
    - image_url: Main product image URL
    - availability: "in_stock" or "out_of_stock"

    Return as JSON with a "products" array.`,
    {
        waitForSelector: '.product-card',
        additionalWait: 2000
    }
)
.then(data => console.log(JSON.stringify(data, null, 2)))
.catch(error => console.error('Error:', error));

Advanced GPT Scraping Techniques

Using Structured Outputs with JSON Schema

For more reliable data extraction, define exact schemas using OpenAI's structured outputs feature:

from openai import OpenAI
import requests

client = OpenAI(api_key='YOUR_OPENAI_API_KEY')

def scrape_with_schema(url):
    """Use JSON schema for guaranteed output structure"""

    html = requests.get(url).text

    # Define exact schema (strict mode requires every property to be listed in
    # "required" and additionalProperties: false on every object; optional data
    # is expressed by allowing null)
    schema = {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY"]},
            "in_stock": {"type": "boolean"},
            "features": {
                "type": "array",
                "items": {"type": "string"}
            },
            "dimensions": {
                "type": ["object", "null"],
                "properties": {
                    "width": {"type": "number"},
                    "height": {"type": "number"},
                    "depth": {"type": "number"},
                    "unit": {"type": "string"}
                },
                "required": ["width", "height", "depth", "unit"],
                "additionalProperties": False
            }
        },
        "required": ["product_name", "price", "currency", "in_stock",
                     "features", "dimensions"],
        "additionalProperties": False
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract product data from HTML according to the provided schema."
            },
            {
                "role": "user",
                "content": f"Extract product information from this HTML:\n\n{html[:10000]}"
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "product_data",
                "strict": True,
                "schema": schema
            }
        }
    )

    return response.choices[0].message.content

# The output is guaranteed to match the schema
product = scrape_with_schema('https://example.com/product/123')
print(product)

Intelligent Pagination Handling

Use GPT to identify and navigate pagination when working with multiple pages:

import requests
import json
import time
from openai import OpenAI
from urllib.parse import urljoin

client = OpenAI(api_key='YOUR_OPENAI_API_KEY')

def extract_pagination_info(html, current_url):
    """Use GPT to find pagination details"""

    prompt = f"""
    Analyze this HTML and extract pagination information:

    1. next_page_url: URL of the next page (full URL, not relative)
    2. current_page: Current page number
    3. total_pages: Total number of pages (if available)
    4. has_next_page: Boolean indicating if there's a next page

    Current page URL: {current_url}

    Return as JSON. If there's no next page, set next_page_url to null and has_next_page to false.

    HTML (pagination section):
    {html[:8000]}
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for simple tasks
        messages=[
            {"role": "system", "content": "Analyze HTML pagination. Return valid JSON."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"},
        temperature=0
    )

    return json.loads(response.choices[0].message.content)

def scrape_all_pages(start_url, data_extraction_prompt, max_pages=20):
    """Scrape data across multiple pages automatically"""

    all_results = []
    current_url = start_url

    for page_num in range(1, max_pages + 1):
        print(f"Scraping page {page_num}: {current_url}")

        # Fetch page
        html = requests.get(current_url).text

        # Extract data from current page
        page_data = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract data and return JSON."},
                {"role": "user", "content": f"{data_extraction_prompt}\n\nHTML:\n{html[:12000]}"}
            ],
            response_format={"type": "json_object"}
        )

        all_results.append(json.loads(page_data.choices[0].message.content))

        # Find next page
        pagination = extract_pagination_info(html, current_url)

        if not pagination.get('has_next_page') or not pagination.get('next_page_url'):
            print(f"Reached last page at page {page_num}")
            break

        current_url = pagination['next_page_url']

        # Respectful delay between requests
        time.sleep(2)

    return all_results

# Usage
results = scrape_all_pages(
    'https://example.com/blog',
    """
    Extract all blog posts on this page. For each post get:
    - title: Post title
    - author: Author name
    - date: Publication date
    - excerpt: Brief excerpt or summary
    - url: Link to full post

    Return as JSON with a "posts" array.
    """
)

print(f"Scraped {len(results)} pages")

Processing Large HTML Documents

For large HTML documents, implement chunking strategies:

const cheerio = require('cheerio');
const OpenAI = require('openai');
const axios = require('axios');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function scrapeLargeDocument(url) {
    // Fetch HTML
    const response = await axios.get(url);
    const html = response.data;

    // Use Cheerio to extract relevant sections
    const $ = cheerio.load(html);

    // Remove unnecessary elements to reduce size
    $('script, style, nav, header, footer, aside, .advertisement').remove();

    // Extract main content area
    const mainContent = $('main, article, .content, #content').html() || $.html();

    // If still too large, split into chunks
    const maxChunkSize = 12000;
    const chunks = [];

    if (mainContent.length > maxChunkSize) {
        // Split by paragraphs to maintain context
        const paragraphs = mainContent.split(/<\/p>|<\/div>|<\/section>/);

        let currentChunk = '';
        for (const para of paragraphs) {
            if ((currentChunk + para).length > maxChunkSize) {
                chunks.push(currentChunk);
                currentChunk = para;
            } else {
                currentChunk += para;
            }
        }
        if (currentChunk) chunks.push(currentChunk);
    } else {
        chunks.push(mainContent);
    }

    // Process each chunk
    const results = [];
    for (let i = 0; i < chunks.length; i++) {
        const completion = await openai.chat.completions.create({
            model: "gpt-4o",
            messages: [
                {
                    role: "system",
                    content: "Extract key information from this HTML chunk."
                },
                {
                    role: "user",
                    content: `Extract main topics, key facts, and important data from this content (chunk ${i + 1} of ${chunks.length}):\n\n${chunks[i]}`
                }
            ],
            response_format: { type: "json_object" }
        });

        results.push(JSON.parse(completion.choices[0].message.content));
    }

    return results;
}

Best Practices for GPT Web Scraping

1. Optimize Costs with Smart Token Management

GPT APIs charge based on tokens. Minimize costs while maintaining effectiveness:

from bs4 import BeautifulSoup, Comment
import requests
from openai import OpenAI

client = OpenAI(api_key='YOUR_OPENAI_API_KEY')

def clean_html_for_gpt(html):
    """Remove unnecessary elements to reduce token usage"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove elements that don't contain useful data
    for tag in soup(['script', 'style', 'nav', 'header', 'footer',
                     'aside', 'iframe', 'noscript', 'meta', 'link']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove empty tags
    for tag in soup.find_all():
        if not tag.get_text(strip=True) and not tag.find('img'):
            tag.decompose()

    # Reduce token usage further by extracting just the text with minimal structure
    # (keep str(soup) instead if the extraction relies on HTML attributes)
    text_content = soup.get_text(separator='\n', strip=True)

    return text_content

def cost_effective_scrape(url, extraction_prompt):
    """Scrape with minimal token usage"""

    html = requests.get(url).text
    cleaned_content = clean_html_for_gpt(html)

    # Use cheaper model for simple tasks
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # 10x cheaper than gpt-4o
        messages=[
            {"role": "system", "content": "Extract data and return JSON."},
            {"role": "user", "content": f"{extraction_prompt}\n\nContent:\n{cleaned_content[:8000]}"}
        ],
        response_format={"type": "json_object"},
        temperature=0
    )

    return response.choices[0].message.content

# Usage
data = cost_effective_scrape(
    'https://example.com/article',
    'Extract: article_title, author, publish_date, main_topic, key_points (array)'
)
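
A further refinement is to budget by tokens rather than characters: the character slice above is a rough proxy, and counting tokens directly avoids over- or under-trimming. A minimal sketch using the tiktoken library (assumes tiktoken is installed; o200k_base is the tokenizer used by the gpt-4o model family):

import tiktoken

def truncate_to_tokens(text, max_tokens=4000, encoding_name='o200k_base'):
    """Trim text to a token budget instead of a raw character count."""
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

# Hypothetical drop-in for the character slice in cost_effective_scrape:
# cleaned_content = truncate_to_tokens(clean_html_for_gpt(html), max_tokens=4000)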

2. Implement Robust Error Handling

When handling timeouts and errors, implement comprehensive retry logic:

import time
import json
import logging
import requests
from openai import OpenAI, APIStatusError, RateLimitError, APITimeoutError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

client = OpenAI(api_key='YOUR_OPENAI_API_KEY')

def scrape_with_retry(url, extraction_prompt, max_retries=3):
    """Robust scraping with exponential backoff"""

    for attempt in range(max_retries):
        try:
            # Fetch HTML
            html = requests.get(url, timeout=15).text

            # Extract with GPT
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "Extract data, return JSON."},
                    {"role": "user", "content": f"{extraction_prompt}\n\nHTML:\n{html[:10000]}"}
                ],
                response_format={"type": "json_object"},
                timeout=30
            )

            return json.loads(response.choices[0].message.content)

        except RateLimitError as e:
            wait_time = (2 ** attempt) * 2  # Exponential backoff
            logger.warning(f"Rate limited. Waiting {wait_time}s before retry {attempt + 1}/{max_retries}")
            time.sleep(wait_time)

        except APITimeoutError as e:
            logger.error(f"Timeout error: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)
            else:
                raise

        except APIStatusError as e:
            logger.error(f"API error: {e}")
            if e.status_code >= 500:  # Server error, retry
                time.sleep(2)
            else:
                raise

        except requests.RequestException as e:
            logger.error(f"HTTP request failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)
            else:
                raise

        except json.JSONDecodeError as e:
            logger.error(f"Invalid JSON from GPT: {e}")
            # Don't retry on invalid JSON, it's likely a prompt issue
            raise

        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            raise

    raise Exception(f"Failed after {max_retries} retries")

# Usage
try:
    data = scrape_with_retry(
        'https://example.com/product',
        'Extract: name, price, description'
    )
    print(data)
except Exception as e:
    logger.error(f"Scraping failed completely: {e}")

3. Validate Extracted Data

Always validate GPT's output to ensure data quality:

from typing import Dict
from jsonschema import validate, ValidationError
import json

def validate_scraped_data(data, expected_schema: Dict) -> Dict:
    """
    Validate JSON data against a schema

    Args:
        data: JSON string or already-parsed dict from GPT
        expected_schema: JSON schema to validate against

    Returns:
        Validated and parsed data

    Raises:
        ValueError: If data doesn't match schema
    """
    try:
        # Accept either a JSON string or a dict that was already parsed
        parsed_data = json.loads(data) if isinstance(data, str) else data
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON: {e}")

    # Validate against schema
    try:
        validate(instance=parsed_data, schema=expected_schema)
    except ValidationError as e:
        raise ValueError(f"Data validation failed: {e.message}")

    return parsed_data

# Define expected data structure
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "in_stock": {"type": "boolean"},
        "rating": {"type": "number", "minimum": 0, "maximum": 5},
        "features": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1
        }
    },
    "required": ["name", "price", "currency", "in_stock"]
}

# Scrape and validate
result = scrape_with_gpt('https://example.com/product', {...})
validated_data = validate_scraped_data(result, product_schema)

print("Validated data:", validated_data)

4. Cache Results to Reduce API Calls

Implement caching for frequently accessed pages:

const crypto = require('crypto');
const fs = require('fs').promises;
const path = require('path');

class GPTScraperCache {
    constructor(cacheDir = './scrape_cache') {
        this.cacheDir = cacheDir;
    }

    async init() {
        await fs.mkdir(this.cacheDir, { recursive: true });
    }

    getCacheKey(url, extractionPrompt) {
        // Stringify so both prompt strings and field objects produce distinct keys
        const combined = `${url}:${JSON.stringify(extractionPrompt)}`;
        return crypto.createHash('md5').update(combined).digest('hex');
    }

    async getCachePath(key) {
        return path.join(this.cacheDir, `${key}.json`);
    }

    async get(url, extractionPrompt) {
        const key = this.getCacheKey(url, extractionPrompt);
        const cachePath = await this.getCachePath(key);

        try {
            const data = await fs.readFile(cachePath, 'utf8');
            const cached = JSON.parse(data);

            // Check if cache is still valid (24 hours)
            const age = Date.now() - cached.timestamp;
            if (age < 24 * 60 * 60 * 1000) {
                console.log('Cache hit:', url);
                return cached.data;
            }
        } catch (error) {
            // Cache miss or error reading cache
        }

        return null;
    }

    async set(url, extractionPrompt, data) {
        const key = this.getCacheKey(url, extractionPrompt);
        const cachePath = await this.getCachePath(key);

        const cacheData = {
            url,
            timestamp: Date.now(),
            data
        };

        await fs.writeFile(cachePath, JSON.stringify(cacheData, null, 2));
    }
}

// Usage
const cache = new GPTScraperCache();

async function scrapeWithCache(url, extractionPrompt) {
    // Ensure the cache directory exists (top-level await is not available in CommonJS)
    await cache.init();

    // Check cache first
    const cached = await cache.get(url, extractionPrompt);
    if (cached) return cached;

    // Cache miss - scrape fresh data
    const data = await scrapeWithGPT(url, extractionPrompt);

    // Store in cache
    await cache.set(url, extractionPrompt, data);

    return data;
}

5. Use Specific, Detailed Prompts

Prompt quality directly impacts extraction accuracy:

# ❌ Vague prompt - poor results
bad_prompt = "Get product info"

# ✅ Specific prompt - excellent results
good_prompt = """
Extract the following product information with high precision:

1. product_name: The main product title (string, from h1 or primary heading)
2. price: Current price as numeric value only, without currency symbol (number)
3. currency: Three-letter currency code (string: USD, EUR, GBP, etc.)
4. original_price: Original price before discount, if shown (number or null)
5. discount_percentage: Percentage discount if on sale (number or null)
6. in_stock: Availability status (boolean: true if available, false otherwise)
7. stock_quantity: Number of units available, if shown (number or null)
8. features: Key product features and specifications (array of strings, max 10)
9. dimensions: Product dimensions if available (object with width, height, depth, unit)
10. weight: Product weight if shown (object with value and unit)
11. rating: Average customer rating (number 0-5, or null)
12. review_count: Total number of reviews (number or null)
13. brand: Product brand or manufacturer (string)
14. model_number: Model or SKU number (string or null)
15. images: URLs of product images (array of strings, main image first)

Return as JSON with these exact keys. Use null for unavailable data.
"""

# Use the detailed prompt
result = scrape_with_gpt(url, good_prompt)

Comparison: GPT Models for Web Scraping

| Model | Best For | Speed | Cost | Accuracy |
|-------|----------|-------|------|----------|
| gpt-4o | Complex extraction, high accuracy needs | Medium | $$$ | Excellent |
| gpt-4o-mini | Simple extraction, bulk scraping | Fast | $ | Very Good |
| gpt-4-turbo | Complex tasks, larger contexts | Medium | $$$$ | Excellent |
| gpt-3.5-turbo | Basic extraction, budget projects | Very Fast | $ | Good |

Recommendations:

  • Production scrapers: Use gpt-4o-mini for most tasks, gpt-4o for complex extraction
  • Prototyping: Start with gpt-4o-mini to test feasibility
  • High-value data: Use gpt-4o or gpt-4-turbo for maximum accuracy
  • Bulk operations: Use gpt-4o-mini with caching and rate limiting
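
If you switch models per task, a small helper keeps the choice in one place. A minimal sketch encoding the recommendations above (the task labels are assumptions for illustration):

def pick_model(task_complexity='simple', high_value=False):
    """Map a scraping task to a model, following the recommendations above."""
    if high_value or task_complexity == 'complex':
        return 'gpt-4o'       # complex extraction or high-value data
    return 'gpt-4o-mini'      # default for simple and bulk extraction

# Example: pick_model(task_complexity='complex') -> 'gpt-4o'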

Real-World Use Cases

E-commerce Competitor Analysis

from datetime import datetime

def analyze_competitor_products(competitor_urls):
    """Extract product details from competitor websites"""

    results = []

    for url in competitor_urls:
        data = scrape_with_gpt(url, {
            'product_name': 'Full product name',
            'brand': 'Product brand',
            'price': 'Current price (numeric value)',
            'currency': 'Currency code',
            'in_stock': 'Stock availability (boolean)',
            'shipping_cost': 'Shipping cost if displayed',
            'delivery_time': 'Estimated delivery time',
            'features': 'Key product features (array)',
            'warranty': 'Warranty information',
            'return_policy': 'Return policy details'
        })

        data['competitor_url'] = url
        data['scraped_at'] = datetime.now().isoformat()
        results.append(data)

    return results

News Article Aggregation

async function aggregateNews(newsUrls) {
    const articles = [];

    for (const url of newsUrls) {
        const data = await scrapeWithGPT(url, {
            'headline': 'Main article headline',
            'subheadline': 'Subheadline or deck',
            'author': 'Author name(s)',
            'publish_date': 'Publication date and time',
            'update_date': 'Last updated date if shown',
            'category': 'Article category or section',
            'tags': 'Article tags (array)',
            'summary': 'First paragraph or summary',
            'reading_time': 'Estimated reading time',
            'image_url': 'Main article image URL',
            'video_url': 'Embedded video URL if present'
        });

        articles.push({ ...data, source_url: url });
    }

    return articles;
}

Conclusion

GPT-powered web scraping represents a paradigm shift from brittle, selector-based extraction to intelligent, context-aware data harvesting. By combining traditional web scraping techniques for content retrieval with GPT's natural language understanding for data extraction, you can build scrapers that are more resilient to layout changes, capable of handling unstructured content, and maintainable through simple prompt adjustments rather than complex code rewrites.

The key to successful GPT-based scraping is strategic application—use AI for complex, unstructured, or frequently-changing content where traditional methods struggle, while reserving simpler parsing techniques for straightforward, well-structured data. Always implement proper error handling, validation, caching, and cost optimization to build production-ready scraping systems.

As GPT models continue to evolve with improved accuracy, speed, and lower costs, AI-powered web scraping will become an increasingly essential tool for developers working with web data extraction at any scale.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

