How Do I Combine Traditional Web Scraping with LLM-Based Extraction?
Combining traditional web scraping techniques with LLM-based extraction creates a powerful hybrid approach that leverages the strengths of both methods. Traditional scrapers excel at efficiently navigating websites and pulling data out of predictably structured HTML, while LLMs are exceptional at understanding context and extracting data from unstructured or inconsistent content.
Why Use a Hybrid Approach?
A hybrid scraping architecture offers several advantages:
- Cost Efficiency: Traditional methods handle navigation and basic extraction, reducing expensive LLM API calls
- Performance: Classic scrapers are faster for simple, repetitive tasks
- Reliability: Structured selectors work well for consistent page layouts
- Flexibility: LLMs handle variations, unstructured text, and complex extraction logic
- Scalability: Process only relevant HTML sections through the LLM to manage token limits
Architectural Patterns
1. Pre-Processing with Traditional Scrapers
Use traditional tools to fetch and clean HTML before sending it to an LLM:
import requests
from bs4 import BeautifulSoup
import openai

# Traditional scraping: fetch and extract relevant section
response = requests.get('https://example.com/products/laptop-pro')
soup = BeautifulSoup(response.text, 'html.parser')

# Extract only the product details section
product_section = soup.find('div', class_='product-details')
product_html = str(product_section)

# LLM extraction: parse the structured data
client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",  # JSON mode (response_format) requires a model that supports it
    messages=[
        {
            "role": "system",
            "content": "Extract product information as JSON with fields: name, price, specs, reviews_count"
        },
        {
            "role": "user",
            "content": f"Extract data from this HTML:\n{product_html}"
        }
    ],
    response_format={"type": "json_object"}
)

product_data = completion.choices[0].message.content
print(product_data)
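Because the model returns product_data as a JSON string, it is worth parsing and sanity-checking it before anything downstream relies on it. A minimal sketch (the expected field names simply mirror the system prompt above):

import json

def parse_product_json(raw: str) -> dict:
    """Parse the LLM response and confirm the expected fields are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError if the output is malformed
    missing = {"name", "price", "specs", "reviews_count"} - data.keys()
    if missing:
        raise ValueError(f"LLM response is missing fields: {missing}")
    return data

product = parse_product_json(product_data)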
2. Navigation with Puppeteer, Extraction with LLM
For JavaScript-heavy sites, use browser automation for navigation and rendering, then apply LLM extraction:
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI();

async function scrapeWithHybridApproach() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Traditional scraping: navigate and wait for content
  await page.goto('https://example.com/articles', {
    waitUntil: 'networkidle2'
  });

  // Wait for dynamic content to load
  await page.waitForSelector('.article-content');

  // Extract specific sections with traditional methods
  const articleSections = await page.$$eval('.article-content', sections => {
    return sections.map(section => section.innerText);
  });

  await browser.close();

  // LLM extraction: process each section
  const extractedData = [];
  for (const section of articleSections) {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o", // JSON mode requires a model that supports response_format
      messages: [
        {
          role: "system",
          content: "Extract: title, author, date, summary, key_points as JSON"
        },
        {
          role: "user",
          content: section
        }
      ],
      response_format: { type: "json_object" }
    });
    extractedData.push(JSON.parse(completion.choices[0].message.content));
  }

  return extractedData;
}

scrapeWithHybridApproach().then(data => console.log(data));
This pattern is particularly useful when you need to handle AJAX requests using Puppeteer before extracting data with an LLM.
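If the article data arrives via an XHR/fetch call, you can often capture that response directly instead of scraping the rendered DOM. Here is a minimal sketch of the same idea in Python with Playwright (standing in for Puppeteer here; the /api/articles endpoint is a hypothetical placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Capture the AJAX response fired while the page loads
    with page.expect_response(lambda r: "/api/articles" in r.url) as resp_info:
        page.goto("https://example.com/articles")

    articles = resp_info.value.json()  # already structured; may need little or no LLM post-processing
    browser.close()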
3. Fallback Strategy
Use traditional selectors as the primary method, with LLM extraction as a fallback:
import requests
from bs4 import BeautifulSoup
from anthropic import Anthropic
import json

def extract_with_fallback(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Try traditional extraction first
    try:
        product = {
            'name': soup.select_one('h1.product-name').text.strip(),
            'price': soup.select_one('span.price').text.strip(),
            'description': soup.select_one('div.description').text.strip()
        }
        return product
    except AttributeError:
        # Fallback to LLM if selectors fail
        print("Traditional extraction failed, using LLM...")
        client = Anthropic()
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": f"Extract product name, price, and description as JSON from:\n{soup.get_text()}"
                }
            ]
        )
        return json.loads(message.content[0].text)

# Usage
data = extract_with_fallback('https://example.com/product/123')
print(data)
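In production, log every time the fallback path fires: a sudden spike in LLM usage usually means the site's markup changed and the selectors need updating, and catching that early keeps API costs predictable.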
4. Batch Processing Pipeline
Process multiple pages efficiently by batching LLM requests:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import openai
from typing import List

async def fetch_page(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls_traditional(urls: List[str]):
    """Traditional scraping: fetch all pages concurrently"""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url) for url in urls]
        html_contents = await asyncio.gather(*tasks)

    # Extract relevant sections
    extracted_sections = []
    for html in html_contents:
        soup = BeautifulSoup(html, 'html.parser')
        content = soup.find('article') or soup.find('main')
        if content:
            extracted_sections.append(content.get_text()[:2000])  # Limit tokens
    return extracted_sections

async def process_with_llm(sections: List[str]):
    """LLM extraction: batch process sections"""
    client = openai.AsyncOpenAI()  # async client so the requests can be awaited with asyncio.gather
    results = []

    # Process in batches to manage rate limits
    batch_size = 5
    for i in range(0, len(sections), batch_size):
        batch = sections[i:i+batch_size]
        tasks = [
            client.chat.completions.create(
                model="gpt-4o",  # JSON mode requires a model that supports response_format
                messages=[
                    {
                        "role": "system",
                        "content": "Extract headline, summary, and category as JSON"
                    },
                    {
                        "role": "user",
                        "content": section
                    }
                ],
                response_format={"type": "json_object"}
            )
            for section in batch
        ]
        # Wait for the batch to complete
        batch_results = await asyncio.gather(*tasks)
        results.extend([r.choices[0].message.content for r in batch_results])

        # Rate limiting
        await asyncio.sleep(1)

    return results

# Main pipeline
async def hybrid_pipeline(urls):
    sections = await scrape_urls_traditional(urls)
    extracted_data = await process_with_llm(sections)
    return extracted_data

# Usage
urls = ['https://example.com/page1', 'https://example.com/page2']
results = asyncio.run(hybrid_pipeline(urls))
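Note that asyncio.gather only overlaps the LLM requests because they go through openai.AsyncOpenAI in process_with_llm; with the synchronous client the calls would block the event loop and the batch would effectively run one at a time.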
Best Practices
1. Optimize HTML Before Sending to LLM
Strip unnecessary elements to reduce tokens and costs:
from bs4 import BeautifulSoup

def clean_html_for_llm(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and navigation
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Remove attributes to reduce size
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Get text with minimal whitespace
    text = soup.get_text(separator=' ', strip=True)
    return text
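To gauge the savings, compare token counts before and after cleaning. The sketch below assumes the tiktoken package and reuses clean_html_for_llm from above:

import requests
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
raw_html = requests.get("https://example.com/products/laptop-pro").text
cleaned = clean_html_for_llm(raw_html)

print(f"Raw HTML tokens:     {len(enc.encode(raw_html))}")
print(f"Cleaned text tokens: {len(enc.encode(cleaned))}")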
2. Use Traditional Methods for Pagination
When scraping paginated or multi-page listings, handle the page-to-page navigation traditionally and reserve LLM processing for content extraction:
const puppeteer = require('puppeteer');

async function scrapeMultiplePages(baseUrl, maxPages) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const allContent = [];

  for (let i = 1; i <= maxPages; i++) {
    // Traditional: navigate to each page
    await page.goto(`${baseUrl}?page=${i}`);

    // Traditional: extract content sections
    const sections = await page.$$eval('.content-item', items =>
      items.map(item => ({
        html: item.innerHTML,
        text: item.innerText
      }))
    );

    allContent.push(...sections);
  }

  await browser.close();

  // LLM: Process all extracted content
  // (LLM processing code here)
  return allContent;
}
3. Implement Caching
Cache LLM responses to avoid redundant API calls:
import hashlib
import json
import os

class LLMCache:
    def __init__(self, cache_dir='./llm_cache'):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get_cache_key(self, content):
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, content):
        key = self.get_cache_key(content)
        cache_file = f"{self.cache_dir}/{key}.json"
        if os.path.exists(cache_file):
            with open(cache_file, 'r') as f:
                return json.load(f)
        return None

    def set(self, content, result):
        key = self.get_cache_key(content)
        cache_file = f"{self.cache_dir}/{key}.json"
        with open(cache_file, 'w') as f:
            json.dump(result, f)

# Usage
cache = LLMCache()

def extract_with_cache(html_content):
    # Check cache first
    cached = cache.get(html_content)
    if cached:
        return cached

    # Call LLM if not cached
    result = call_llm_api(html_content)
    cache.set(html_content, result)
    return result
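Because the cache key is an MD5 hash of the raw content, pages that embed timestamps, session IDs, or rotating ad markup will never hit the cache; running the HTML through clean_html_for_llm (or a similar normalization step) before hashing makes the cache far more effective.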
4. Handle Authentication Traditionally
Use traditional methods to handle authentication before extracting protected content with LLMs:
import requests
from bs4 import BeautifulSoup

class HybridScraper:
    def __init__(self):
        self.session = requests.Session()

    def login(self, login_url, credentials):
        # Traditional: handle authentication
        response = self.session.post(login_url, data=credentials)
        return response.status_code == 200

    def scrape_protected_page(self, url):
        # Traditional: fetch with authenticated session
        response = self.session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract relevant content
        content = soup.find('div', class_='protected-content')

        # LLM: extract structured data
        return self.extract_with_llm(str(content))

    def extract_with_llm(self, content):
        # LLM extraction logic here
        pass
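The extract_with_llm method is left as a stub above. One way to fill it in, reusing the Anthropic client from the fallback example (the prompt wording and field handling are illustrative, not prescriptive):

    def extract_with_llm(self, content):
        # Assumes `from anthropic import Anthropic` and `import json` at the top of the module
        client = Anthropic()
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Extract the key fields from this content as JSON:\n{content}"
            }]
        )
        return json.loads(message.content[0].text)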
Real-World Example: E-commerce Scraper
Here's a complete example combining both approaches:
import requests
from bs4 import BeautifulSoup
import openai
import json

class HybridProductScraper:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def scrape_product_page(self, url):
        # Phase 1: Traditional scraping
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract structured data with traditional methods
        basic_data = {
            'url': url,
            'title': soup.find('title').text if soup.find('title') else None,
            'images': [img.get('src') for img in soup.find_all('img', class_='product-image') if img.get('src')]
        }

        # Extract the product description and review sections
        description_section = soup.find('div', {'id': 'product-description'})
        reviews_section = soup.find('div', {'id': 'customer-reviews'})

        # Phase 2: LLM extraction for complex data
        if description_section:
            product_details = self.extract_product_details(str(description_section))
            basic_data.update(product_details)

        if reviews_section:
            review_summary = self.extract_review_insights(reviews_section.get_text())
            basic_data['review_insights'] = review_summary

        return basic_data

    def extract_product_details(self, html):
        completion = self.client.chat.completions.create(
            model="gpt-4o",  # JSON mode requires a model that supports response_format
            messages=[
                {
                    "role": "system",
                    "content": """Extract product details as JSON:
                    - name: product name
                    - price: current price
                    - original_price: if on sale
                    - specs: key specifications as object
                    - features: list of key features
                    """
                },
                {
                    "role": "user",
                    "content": html
                }
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)

    def extract_review_insights(self, review_text):
        completion = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Analyze reviews and return JSON:
                    - sentiment: overall sentiment (positive/neutral/negative)
                    - common_praises: list of commonly praised features
                    - common_complaints: list of common complaints
                    - summary: brief summary of customer feedback
                    """
                },
                {
                    "role": "user",
                    "content": f"Reviews: {review_text[:3000]}"  # Limit tokens
                }
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)

# Usage
scraper = HybridProductScraper(api_key='your-api-key')
product = scraper.scrape_product_page('https://example.com/product/123')
print(json.dumps(product, indent=2))
Monitoring and Debugging
Track performance and costs in your hybrid pipeline:
import time
from functools import wraps

class ScraperMetrics:
    def __init__(self):
        self.traditional_time = 0
        self.llm_time = 0
        self.llm_calls = 0
        self.total_tokens = 0

    def track_traditional(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            self.traditional_time += time.time() - start
            return result
        return wrapper

    def track_llm(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            self.llm_time += time.time() - start
            self.llm_calls += 1
            # Estimate tokens (actual implementation would use tiktoken)
            self.total_tokens += len(str(args)) // 4
            return result
        return wrapper

    def report(self):
        print(f"""
Scraping Metrics:
- Traditional scraping time: {self.traditional_time:.2f}s
- LLM processing time: {self.llm_time:.2f}s
- LLM API calls: {self.llm_calls}
- Estimated tokens: {self.total_tokens}
- Estimated cost: ${self.total_tokens * 0.00001:.4f}
""")
Conclusion
Combining traditional web scraping with LLM-based extraction creates robust, efficient pipelines that leverage the best of both worlds. Use traditional methods for navigation, page rendering, and structured data extraction, while deploying LLMs for complex extraction, unstructured data, and adaptive parsing. This hybrid approach reduces costs, improves performance, and increases reliability in production web scraping systems.