How Do I Combine Traditional Web Scraping with LLM-Based Extraction?
Combining traditional web scraping techniques with LLM-based extraction creates a powerful hybrid approach that leverages the strengths of both methods. Traditional scrapers excel at efficiently navigating websites and pulling data out of predictably structured HTML, while LLMs are exceptional at understanding context and extracting data from unstructured or inconsistent content.
Why Use a Hybrid Approach?
A hybrid scraping architecture offers several advantages:
- Cost Efficiency: Traditional methods handle navigation and basic extraction, reducing expensive LLM API calls
- Performance: Classic scrapers are faster for simple, repetitive tasks
- Reliability: Structured selectors work well for consistent page layouts
- Flexibility: LLMs handle variations, unstructured text, and complex extraction logic
- Scalability: Process only relevant HTML sections through the LLM to manage token limits
Architectural Patterns
1. Pre-Processing with Traditional Scrapers
Use traditional tools to fetch and clean HTML before sending it to an LLM:
import requests
from bs4 import BeautifulSoup
import openai

# Traditional scraping: fetch and extract relevant section
response = requests.get('https://example.com/products/laptop-pro')
soup = BeautifulSoup(response.text, 'html.parser')

# Extract only the product details section
product_section = soup.find('div', class_='product-details')
product_html = str(product_section)

# LLM extraction: parse the structured data
client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",  # JSON mode (response_format) requires a model that supports it
    messages=[
        {
            "role": "system",
            "content": "Extract product information as JSON with fields: name, price, specs, reviews_count"
        },
        {
            "role": "user",
            "content": f"Extract data from this HTML:\n{product_html}"
        }
    ],
    response_format={"type": "json_object"}
)

product_data = completion.choices[0].message.content
print(product_data)
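Because the model returns product_data as a JSON string, it is worth parsing and sanity-checking it before anything downstream relies on it. A minimal sketch (the expected field names simply mirror the system prompt above):

import json

def parse_product_json(raw: str) -> dict:
    """Parse the LLM response and confirm the expected fields are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError if the output is malformed
    missing = {"name", "price", "specs", "reviews_count"} - data.keys()
    if missing:
        raise ValueError(f"LLM response is missing fields: {missing}")
    return data

product = parse_product_json(product_data)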
2. Navigation with Puppeteer, Extraction with LLM
For JavaScript-heavy sites, use browser automation for navigation and rendering, then apply LLM extraction:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI();

async function scrapeWithHybridApproach() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Traditional scraping: navigate and wait for content
  await page.goto('https://example.com/articles', {
    waitUntil: 'networkidle2'
  });

  // Wait for dynamic content to load
  await page.waitForSelector('.article-content');

  // Extract specific sections with traditional methods
  const articleSections = await page.$$eval('.article-content', sections => {
    return sections.map(section => section.innerText);
  });

  await browser.close();

  // LLM extraction: process each section
  const extractedData = [];
  for (const section of articleSections) {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o", // JSON mode requires a model that supports response_format
      messages: [
        {
          role: "system",
          content: "Extract: title, author, date, summary, key_points as JSON"
        },
        {
          role: "user",
          content: section
        }
      ],
      response_format: { type: "json_object" }
    });
    extractedData.push(JSON.parse(completion.choices[0].message.content));
  }

  return extractedData;
}

scrapeWithHybridApproach().then(data => console.log(data));
This pattern is particularly useful when you need to handle AJAX requests using Puppeteer before extracting data with an LLM.
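If the article data arrives via an XHR/fetch call, you can often capture that response directly instead of scraping the rendered DOM. Here is a minimal sketch of the same idea in Python with Playwright (standing in for Puppeteer here; the /api/articles endpoint is a hypothetical placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Capture the AJAX response fired while the page loads
    with page.expect_response(lambda r: "/api/articles" in r.url) as resp_info:
        page.goto("https://example.com/articles")

    articles = resp_info.value.json()  # already structured; may need little or no LLM post-processing
    browser.close()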
3. Fallback Strategy
Use traditional selectors as the primary method, with LLM extraction as a fallback:
import requests
from bs4 import BeautifulSoup
from anthropic import Anthropic
import json

def extract_with_fallback(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Try traditional extraction first
    try:
        product = {
            'name': soup.select_one('h1.product-name').text.strip(),
            'price': soup.select_one('span.price').text.strip(),
            'description': soup.select_one('div.description').text.strip()
        }
        return product
    except AttributeError:
        # Fallback to LLM if selectors fail
        print("Traditional extraction failed, using LLM...")
        client = Anthropic()
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": f"Extract product name, price, and description as JSON from:\n{soup.get_text()}"
                }
            ]
        )
        return json.loads(message.content[0].text)

# Usage
data = extract_with_fallback('https://example.com/product/123')
print(data)
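In production, log every time the fallback path fires: a sudden spike in LLM usage usually means the site's markup changed and the selectors need updating, and catching that early keeps API costs predictable.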
4. Batch Processing Pipeline
Process multiple pages efficiently by batching LLM requests:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import openai
from typing import List

async def fetch_page(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls_traditional(urls: List[str]):
    """Traditional scraping: fetch all pages concurrently"""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url) for url in urls]
        html_contents = await asyncio.gather(*tasks)

    # Extract relevant sections
    extracted_sections = []
    for html in html_contents:
        soup = BeautifulSoup(html, 'html.parser')
        content = soup.find('article') or soup.find('main')
        if content:
            extracted_sections.append(content.get_text()[:2000])  # Limit tokens
    return extracted_sections

async def process_with_llm(sections: List[str]):
    """LLM extraction: batch process sections"""
    client = openai.AsyncOpenAI()  # async client so the requests can be awaited with asyncio.gather
    results = []

    # Process in batches to manage rate limits
    batch_size = 5
    for i in range(0, len(sections), batch_size):
        batch = sections[i:i+batch_size]
        tasks = [
            client.chat.completions.create(
                model="gpt-4o",  # JSON mode requires a model that supports response_format
                messages=[
                    {
                        "role": "system",
                        "content": "Extract headline, summary, and category as JSON"
                    },
                    {
                        "role": "user",
                        "content": section
                    }
                ],
                response_format={"type": "json_object"}
            )
            for section in batch
        ]
        # Wait for the batch to complete
        batch_results = await asyncio.gather(*tasks)
        results.extend([r.choices[0].message.content for r in batch_results])

        # Rate limiting
        await asyncio.sleep(1)

    return results

# Main pipeline
async def hybrid_pipeline(urls):
    sections = await scrape_urls_traditional(urls)
    extracted_data = await process_with_llm(sections)
    return extracted_data

# Usage
urls = ['https://example.com/page1', 'https://example.com/page2']
results = asyncio.run(hybrid_pipeline(urls))
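Note that asyncio.gather only overlaps the LLM requests because they go through openai.AsyncOpenAI in process_with_llm; with the synchronous client the calls would block the event loop and the batch would effectively run one at a time.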
Best Practices
1. Optimize HTML Before Sending to LLM
Strip unnecessary elements to reduce tokens and costs:
from bs4 import BeautifulSoup

def clean_html_for_llm(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and navigation
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Remove attributes to reduce size
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Get text with minimal whitespace
    text = soup.get_text(separator=' ', strip=True)
    return text
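To gauge the savings, compare token counts before and after cleaning. The sketch below assumes the tiktoken package and reuses clean_html_for_llm from above:

import requests
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
raw_html = requests.get("https://example.com/products/laptop-pro").text
cleaned = clean_html_for_llm(raw_html)

print(f"Raw HTML tokens:     {len(enc.encode(raw_html))}")
print(f"Cleaned text tokens: {len(enc.encode(cleaned))}")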
2. Use Traditional Methods for Pagination
When scraping paginated or multi-page listings, handle the page-to-page navigation traditionally and reserve LLM processing for content extraction:
const puppeteer = require('puppeteer');

async function scrapeMultiplePages(baseUrl, maxPages) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const allContent = [];

  for (let i = 1; i <= maxPages; i++) {
    // Traditional: navigate to each page
    await page.goto(`${baseUrl}?page=${i}`);

    // Traditional: extract content sections
    const sections = await page.$$eval('.content-item', items =>
      items.map(item => ({
        html: item.innerHTML,
        text: item.innerText
      }))
    );

    allContent.push(...sections);
  }

  await browser.close();

  // LLM: Process all extracted content
  // (LLM processing code here)
  return allContent;
}
3. Implement Caching
Cache LLM responses to avoid redundant API calls:
import hashlib
import json
import os

class LLMCache:
    def __init__(self, cache_dir='./llm_cache'):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get_cache_key(self, content):
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, content):
        key = self.get_cache_key(content)
        cache_file = f"{self.cache_dir}/{key}.json"
        if os.path.exists(cache_file):
            with open(cache_file, 'r') as f:
                return json.load(f)
        return None

    def set(self, content, result):
        key = self.get_cache_key(content)
        cache_file = f"{self.cache_dir}/{key}.json"
        with open(cache_file, 'w') as f:
            json.dump(result, f)

# Usage
cache = LLMCache()

def extract_with_cache(html_content):
    # Check cache first
    cached = cache.get(html_content)
    if cached:
        return cached

    # Call LLM if not cached
    result = call_llm_api(html_content)
    cache.set(html_content, result)
    return result
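Because the cache key is an MD5 hash of the raw content, pages that embed timestamps, session IDs, or rotating ad markup will never hit the cache; running the HTML through clean_html_for_llm (or a similar normalization step) before hashing makes the cache far more effective.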
4. Handle Authentication Traditionally
Use traditional methods to handle authentication before extracting protected content with LLMs:
import requests
from bs4 import BeautifulSoup

class HybridScraper:
    def __init__(self):
        self.session = requests.Session()

    def login(self, login_url, credentials):
        # Traditional: handle authentication
        response = self.session.post(login_url, data=credentials)
        return response.status_code == 200

    def scrape_protected_page(self, url):
        # Traditional: fetch with authenticated session
        response = self.session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract relevant content
        content = soup.find('div', class_='protected-content')

        # LLM: extract structured data
        return self.extract_with_llm(str(content))

    def extract_with_llm(self, content):
        # LLM extraction logic here
        pass
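The extract_with_llm method is left as a stub above. One way to fill it in, reusing the Anthropic client from the fallback example (the prompt wording and field handling are illustrative, not prescriptive):

    def extract_with_llm(self, content):
        # Assumes `from anthropic import Anthropic` and `import json` at the top of the module
        client = Anthropic()
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Extract the key fields from this content as JSON:\n{content}"
            }]
        )
        return json.loads(message.content[0].text)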
Real-World Example: E-commerce Scraper
Here's a complete example combining both approaches:
import requests
from bs4 import BeautifulSoup
import openai
import json

class HybridProductScraper:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def scrape_product_page(self, url):
        # Phase 1: Traditional scraping
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract structured data with traditional methods
        basic_data = {
            'url': url,
            'title': soup.find('title').text if soup.find('title') else None,
            'images': [img.get('src') for img in soup.find_all('img', class_='product-image') if img.get('src')]
        }

        # Extract the product description and review sections
        description_section = soup.find('div', {'id': 'product-description'})
        reviews_section = soup.find('div', {'id': 'customer-reviews'})

        # Phase 2: LLM extraction for complex data
        if description_section:
            product_details = self.extract_product_details(str(description_section))
            basic_data.update(product_details)

        if reviews_section:
            review_summary = self.extract_review_insights(reviews_section.get_text())
            basic_data['review_insights'] = review_summary

        return basic_data

    def extract_product_details(self, html):
        completion = self.client.chat.completions.create(
            model="gpt-4o",  # JSON mode requires a model that supports response_format
            messages=[
                {
                    "role": "system",
                    "content": """Extract product details as JSON:
                    - name: product name
                    - price: current price
                    - original_price: if on sale
                    - specs: key specifications as object
                    - features: list of key features
                    """
                },
                {
                    "role": "user",
                    "content": html
                }
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)

    def extract_review_insights(self, review_text):
        completion = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Analyze reviews and return JSON:
                    - sentiment: overall sentiment (positive/neutral/negative)
                    - common_praises: list of commonly praised features
                    - common_complaints: list of common complaints
                    - summary: brief summary of customer feedback
                    """
                },
                {
                    "role": "user",
                    "content": f"Reviews: {review_text[:3000]}"  # Limit tokens
                }
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)

# Usage
scraper = HybridProductScraper(api_key='your-api-key')
product = scraper.scrape_product_page('https://example.com/product/123')
print(json.dumps(product, indent=2))
Monitoring and Debugging
Track performance and costs in your hybrid pipeline:
import time
from functools import wraps

class ScraperMetrics:
    def __init__(self):
        self.traditional_time = 0
        self.llm_time = 0
        self.llm_calls = 0
        self.total_tokens = 0

    def track_traditional(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            self.traditional_time += time.time() - start
            return result
        return wrapper

    def track_llm(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            self.llm_time += time.time() - start
            self.llm_calls += 1
            # Estimate tokens (actual implementation would use tiktoken)
            self.total_tokens += len(str(args)) // 4
            return result
        return wrapper

    def report(self):
        print(f"""
Scraping Metrics:
- Traditional scraping time: {self.traditional_time:.2f}s
- LLM processing time: {self.llm_time:.2f}s
- LLM API calls: {self.llm_calls}
- Estimated tokens: {self.total_tokens}
- Estimated cost: ${self.total_tokens * 0.00001:.4f}
""")
Conclusion
Combining traditional web scraping with LLM-based extraction creates robust, efficient pipelines that leverage the best of both worlds. Use traditional methods for navigation, page rendering, and structured data extraction, while deploying LLMs for complex extraction, unstructured data, and adaptive parsing. This hybrid approach reduces costs, improves performance, and increases reliability in production web scraping systems.