Can I use multiple LLM providers for web scraping to improve reliability?
Yes, you can and should use multiple LLM providers for web scraping to improve reliability, reduce costs, and avoid single points of failure. By implementing a multi-provider strategy, you can automatically fall back to alternative LLMs when one provider experiences downtime, rate limiting, or performance issues. This approach ensures continuous operation of your AI-powered web scraping workflows.
Using multiple LLM providers also allows you to optimize for different use cases—some models excel at structured data extraction while others are better at answering complex questions about webpage content. You can route requests to the most appropriate model based on task complexity, cost, or performance requirements.
Why Use Multiple LLM Providers?
1. Improved Reliability and Uptime
LLM APIs experience occasional outages, degraded performance, or unexpected downtime. By distributing requests across multiple providers, your scraping pipeline remains operational even when one service fails.
2. Rate Limit Management
Each provider has different rate limits. When you hit the limit with one provider, you can automatically route requests to another, maintaining consistent throughput.
3. Cost Optimization
Different providers have varying pricing structures. You can route simple extraction tasks to cheaper models and reserve expensive, high-capability models for complex operations.
4. Performance Optimization
Some models are faster but less accurate, while others are more precise but slower. Multi-provider setups let you balance speed and accuracy based on your needs.
5. Feature-Based Routing
Different LLMs have unique strengths—GPT-4 Vision for image analysis, Claude for large context windows, or specialized models for specific data types.
Available LLM Providers for Web Scraping
Here are the major LLM providers you can integrate:
OpenAI (GPT-4, GPT-3.5)
- Best for: General-purpose extraction, JSON generation
- Pricing: $0.01-0.03 per 1K tokens (input), $0.03-0.06 per 1K tokens (output)
- Rate limits: Tier-based, 3-10,000 RPM

Anthropic (Claude)
- Best for: Large documents, complex reasoning, LLM data extraction
- Pricing: $0.003-0.015 per 1K tokens
- Rate limits: 50-1,000 RPM depending on tier
- Advantage: 200K token context window

Google (Gemini)
- Best for: Multimodal content, video/image analysis
- Pricing: Free tier available, $0.0005-0.002 per 1K tokens
- Rate limits: 60-1,000 RPM

Cohere
- Best for: Classification, semantic search
- Pricing: Free tier, pay-as-you-go available
- Rate limits: 100-10,000 RPM

DeepSeek
- Best for: Cost-effective extraction
- Pricing: Competitive pricing
- Rate limits: Varies by plan
Implementation Strategies
1. Simple Fallback Pattern
The most basic approach: try the primary provider, fall back to secondary if it fails.
Python Example:
import anthropic
from openai import OpenAI
from typing import Optional


class MultiProviderLLM:
    def __init__(self, openai_key: str, anthropic_key: str):
        # Create each SDK client once and reuse it across calls
        self.openai_client = OpenAI(api_key=openai_key)
        self.anthropic_client = anthropic.Anthropic(api_key=anthropic_key)

    def extract_data(self, html_content: str, prompt: str) -> Optional[str]:
        """Try OpenAI first, fall back to Anthropic"""
        try:
            return self._call_openai(html_content, prompt)
        except Exception as e:
            print(f"OpenAI failed: {e}, trying Anthropic...")
            try:
                return self._call_anthropic(html_content, prompt)
            except Exception as e2:
                print(f"Anthropic also failed: {e2}")
                return None

    def _call_openai(self, html_content: str, prompt: str) -> str:
        """Call OpenAI GPT-4"""
        # openai>=1.0 uses a client object; openai.ChatCompletion was removed
        response = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Extract structured data from HTML."},
                # Truncate the HTML to keep the request within the context window
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content[:5000]}"}
            ],
            temperature=0,
            timeout=30
        )
        return response.choices[0].message.content

    def _call_anthropic(self, html_content: str, prompt: str) -> str:
        """Call Anthropic Claude"""
        message = self.anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": f"{prompt}\n\nHTML:\n{html_content[:5000]}"
                }
            ]
        )
        return message.content[0].text


# Usage
scraper = MultiProviderLLM(
    openai_key="sk-...",
    anthropic_key="sk-ant-..."
)
html = "<div class='product'><h1>Laptop</h1><span class='price'>$999</span></div>"
result = scraper.extract_data(html, "Extract the product name and price as JSON")
print(result)
JavaScript Example:
const OpenAI = require('openai');
const Anthropic = require('@anthropic-ai/sdk');

class MultiProviderLLM {
  constructor(openaiKey, anthropicKey) {
    this.openai = new OpenAI({ apiKey: openaiKey });
    this.anthropic = new Anthropic({ apiKey: anthropicKey });
  }

  async extractData(htmlContent, prompt) {
    try {
      return await this.callOpenAI(htmlContent, prompt);
    } catch (error) {
      console.log(`OpenAI failed: ${error.message}, trying Anthropic...`);
      try {
        return await this.callAnthropic(htmlContent, prompt);
      } catch (error2) {
        console.error(`Anthropic also failed: ${error2.message}`);
        return null;
      }
    }
  }

  async callOpenAI(htmlContent, prompt) {
    const response = await this.openai.chat.completions.create(
      {
        model: 'gpt-4',
        messages: [
          { role: 'system', content: 'Extract structured data from HTML.' },
          { role: 'user', content: `${prompt}\n\nHTML:\n${htmlContent.slice(0, 5000)}` }
        ],
        temperature: 0
      },
      // The timeout (in milliseconds) is an SDK request option, not an API parameter
      { timeout: 30000 }
    );
    return response.choices[0].message.content;
  }

  async callAnthropic(htmlContent, prompt) {
    const message = await this.anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: `${prompt}\n\nHTML:\n${htmlContent.slice(0, 5000)}`
      }]
    });
    return message.content[0].text;
  }
}

// Usage
const scraper = new MultiProviderLLM('sk-...', 'sk-ant-...');
const html = "<div class='product'><h1>Laptop</h1><span class='price'>$999</span></div>";
scraper.extractData(html, 'Extract the product name and price as JSON')
  .then(result => console.log(result));
2. Round-Robin Load Distribution
Distribute requests evenly across providers to balance load and costs:
Python Example:
from itertools import cycle
from typing import List, Dict, Callable
import time


class RoundRobinLLM:
    def __init__(self, providers: List[Dict[str, Callable]]):
        """
        providers: List of dicts with 'name' and 'call' function
        Example: [
            {'name': 'openai', 'call': openai_function},
            {'name': 'anthropic', 'call': anthropic_function}
        ]
        """
        self.providers = cycle(providers)
        self.stats = {p['name']: {'calls': 0, 'errors': 0} for p in providers}

    def extract_data(self, html_content: str, prompt: str, max_retries: int = 3) -> str:
        """Round-robin through providers with retry logic"""
        errors = []
        for attempt in range(max_retries):
            # Advance on every attempt, including after a success, so that
            # consecutive requests actually rotate across providers
            provider = next(self.providers)
            provider_name = provider['name']
            try:
                print(f"Attempt {attempt + 1}: Using {provider_name}")
                result = provider['call'](html_content, prompt)
                # Track success
                self.stats[provider_name]['calls'] += 1
                return result
            except Exception as e:
                print(f"{provider_name} failed: {e}")
                self.stats[provider_name]['errors'] += 1
                errors.append(f"{provider_name}: {str(e)}")
                # Exponential backoff before trying the next provider
                # (skipped after the final attempt, when there is nothing left to wait for)
                if attempt < max_retries - 1:
                    time.sleep(2 ** (attempt + 1))
        raise Exception(f"All providers failed after {max_retries} attempts: {errors}")

    def get_statistics(self) -> Dict:
        """Get usage statistics for all providers"""
        return self.stats


# Define provider functions (placeholders: fill in with real API calls)
def call_openai_provider(html: str, prompt: str) -> str:
    # OpenAI implementation
    pass

def call_anthropic_provider(html: str, prompt: str) -> str:
    # Anthropic implementation
    pass

def call_gemini_provider(html: str, prompt: str) -> str:
    # Google Gemini implementation
    pass


# Setup round-robin scraper
providers = [
    {'name': 'openai', 'call': call_openai_provider},
    {'name': 'anthropic', 'call': call_anthropic_provider},
    {'name': 'gemini', 'call': call_gemini_provider}
]
scraper = RoundRobinLLM(providers)

# Scrape multiple pages - requests distributed evenly
urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']
for url in urls:
    html = fetch_page(url)  # Your scraping function
    result = scraper.extract_data(html, "Extract product information")
    print(result)

print("Statistics:", scraper.get_statistics())
3. Smart Routing Based on Task Type
Route requests to the most appropriate provider based on task complexity or content type:
Python Example:
import json
from enum import Enum
from typing import Dict


class TaskType(Enum):
    SIMPLE_EXTRACTION = "simple"
    COMPLEX_REASONING = "complex"
    LARGE_DOCUMENT = "large"
    IMAGE_ANALYSIS = "image"


class SmartRoutingLLM:
    def __init__(self, api_keys: Dict[str, str]):
        self.keys = api_keys
        # Define which provider is best for each task type
        self.routing_map = {
            TaskType.SIMPLE_EXTRACTION: 'gemini',    # Fast and cheap
            TaskType.COMPLEX_REASONING: 'gpt4',      # Most capable
            TaskType.LARGE_DOCUMENT: 'claude',       # Large context window
            TaskType.IMAGE_ANALYSIS: 'gpt4_vision'   # Vision capabilities
        }

    def extract_data(self, content: str, prompt: str, task_type: TaskType) -> str:
        """Route to appropriate provider based on task type"""
        primary_provider = self.routing_map[task_type]
        try:
            return self._call_provider(primary_provider, content, prompt)
        except Exception as e:
            print(f"{primary_provider} failed ({e}), trying fallback...")
            # Fallback to GPT-4 for most tasks
            fallback = 'gpt4' if primary_provider != 'gpt4' else 'claude'
            return self._call_provider(fallback, content, prompt)

    def _call_provider(self, provider: str, content: str, prompt: str) -> str:
        """Call the specified provider"""
        if provider == 'gemini':
            return self._call_gemini(content, prompt)
        elif provider == 'gpt4':
            return self._call_gpt4(content, prompt)
        elif provider == 'claude':
            return self._call_claude(content, prompt)
        elif provider == 'gpt4_vision':
            return self._call_gpt4_vision(content, prompt)
        else:
            raise ValueError(f"Unknown provider: {provider}")

    def _call_gemini(self, content: str, prompt: str) -> str:
        """Google Gemini - fast and cheap for simple tasks"""
        import google.generativeai as genai
        genai.configure(api_key=self.keys['gemini'])
        model = genai.GenerativeModel('gemini-pro')
        response = model.generate_content(f"{prompt}\n\n{content[:5000]}")
        return response.text

    def _call_gpt4(self, content: str, prompt: str) -> str:
        """OpenAI GPT-4 - best for complex reasoning"""
        # openai>=1.0 uses a client object; openai.ChatCompletion was removed
        from openai import OpenAI
        client = OpenAI(api_key=self.keys['openai'])
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"{prompt}\n\n{content[:8000]}"}]
        )
        return response.choices[0].message.content

    def _call_claude(self, content: str, prompt: str) -> str:
        """Anthropic Claude - best for large documents"""
        import anthropic
        client = anthropic.Anthropic(api_key=self.keys['anthropic'])
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": f"{prompt}\n\n{content[:100000]}"}]
        )
        return message.content[0].text

    def _call_gpt4_vision(self, image_url: str, prompt: str) -> str:
        """GPT-4 Vision for image analysis"""
        from openai import OpenAI
        client = OpenAI(api_key=self.keys['openai'])
        # Model IDs are retired over time; substitute a current
        # vision-capable model if this one is unavailable
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }]
        )
        return response.choices[0].message.content


# Usage
api_keys = {
    'openai': 'sk-...',
    'anthropic': 'sk-ant-...',
    'gemini': 'AI...'
}
scraper = SmartRoutingLLM(api_keys)

# Simple extraction - routed to Gemini (cheap/fast)
simple_html = "<div><span class='price'>$99</span></div>"
price = scraper.extract_data(
    simple_html,
    "Extract the price",
    TaskType.SIMPLE_EXTRACTION
)

# Complex reasoning - routed to GPT-4
complex_html = fetch_large_product_page()  # Your scraping function
analysis = scraper.extract_data(
    complex_html,
    "Analyze the product reviews and summarize sentiment",
    TaskType.COMPLEX_REASONING
)

# Large document - routed to Claude
large_doc = fetch_full_article()  # 50K tokens
summary = scraper.extract_data(
    large_doc,
    "Extract all mentioned companies and their relationships",
    TaskType.LARGE_DOCUMENT
)
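The usage above passes the task type by hand on every call. One way this decision might be automated is a small heuristic that inspects the input before routing. The sketch below is an illustration, not part of the router: the function name `choose_task_type`, the keyword list, the image suffixes, and the 8,000-token threshold are all assumptions, and the token count uses the rough "4 characters per token" estimate.

```python
from enum import Enum


class TaskType(Enum):
    SIMPLE_EXTRACTION = "simple"
    COMPLEX_REASONING = "complex"
    LARGE_DOCUMENT = "large"
    IMAGE_ANALYSIS = "image"


# Hypothetical thresholds and hints; tune them against your own pages and models.
LARGE_DOC_TOKENS = 8000
REASONING_HINTS = ("analyze", "summarize", "compare", "explain", "sentiment")
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def choose_task_type(content: str, prompt: str) -> TaskType:
    """Guess a task type from the content and the prompt (heuristic sketch)."""
    lowered = content.lower()
    # Vision tasks receive an image URL in place of HTML
    if lowered.startswith(("http://", "https://")) and lowered.endswith(IMAGE_SUFFIXES):
        return TaskType.IMAGE_ANALYSIS
    # Rough token estimate: about 4 characters per token
    if len(content) / 4 > LARGE_DOC_TOKENS:
        return TaskType.LARGE_DOCUMENT
    # Prompts that ask for judgment rather than lookup go to the stronger model
    if any(hint in prompt.lower() for hint in REASONING_HINTS):
        return TaskType.COMPLEX_REASONING
    return TaskType.SIMPLE_EXTRACTION
```

The result can be passed straight through as the `task_type` argument. A keyword heuristic like this will misclassify some prompts, so it is worth logging its choices and reviewing them before relying on it for cost decisions.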
4. Health Monitoring and Circuit Breaker
Automatically disable unhealthy providers and route around failures:
Python Example:
import time
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional


class ProviderHealthMonitor:
    def __init__(self, failure_threshold: int = 3, recovery_time: int = 300):
        """
        failure_threshold: Number of consecutive failures before circuit opens
        recovery_time: Seconds to wait before retrying failed provider
        """
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.health_status = {}  # provider -> {'failures': int, 'disabled_until': datetime}

    def is_healthy(self, provider_name: str) -> bool:
        """Check if provider is healthy and available"""
        if provider_name not in self.health_status:
            return True
        status = self.health_status[provider_name]
        # Check if recovery period has passed
        if status.get('disabled_until'):
            if datetime.now() > status['disabled_until']:
                # Reset health status
                self.health_status[provider_name] = {'failures': 0, 'disabled_until': None}
                print(f"{provider_name} recovered, re-enabling")
                return True
            return False
        return status.get('failures', 0) < self.failure_threshold

    def record_success(self, provider_name: str):
        """Record successful call"""
        if provider_name in self.health_status:
            self.health_status[provider_name]['failures'] = 0

    def record_failure(self, provider_name: str):
        """Record failed call and potentially disable provider"""
        if provider_name not in self.health_status:
            self.health_status[provider_name] = {'failures': 0, 'disabled_until': None}
        self.health_status[provider_name]['failures'] += 1
        if self.health_status[provider_name]['failures'] >= self.failure_threshold:
            disabled_until = datetime.now() + timedelta(seconds=self.recovery_time)
            self.health_status[provider_name]['disabled_until'] = disabled_until
            print(f"{provider_name} disabled until {disabled_until} due to repeated failures")

    def get_status(self) -> Dict:
        """Get current health status of all providers"""
        return self.health_status


class ResilientMultiProviderLLM:
    def __init__(self, providers: Dict[str, Callable]):
        self.providers = providers
        self.health_monitor = ProviderHealthMonitor(failure_threshold=3, recovery_time=300)

    def extract_data(self, html_content: str, prompt: str) -> Optional[str]:
        """Try providers in order, skipping unhealthy ones"""
        healthy_providers = [
            (name, func) for name, func in self.providers.items()
            if self.health_monitor.is_healthy(name)
        ]
        if not healthy_providers:
            print("No healthy providers available!")
            return None
        for provider_name, provider_func in healthy_providers:
            try:
                print(f"Trying {provider_name}...")
                result = provider_func(html_content, prompt)
                # Record success
                self.health_monitor.record_success(provider_name)
                return result
            except Exception as e:
                print(f"{provider_name} failed: {e}")
                self.health_monitor.record_failure(provider_name)
                continue
        return None

    def get_health_status(self) -> Dict:
        """Get health status of all providers"""
        return self.health_monitor.get_status()


# Usage
providers = {
    'openai': call_openai_provider,
    'anthropic': call_anthropic_provider,
    'gemini': call_gemini_provider
}
scraper = ResilientMultiProviderLLM(providers)

# Scrape multiple pages
for i in range(100):
    html = fetch_page(f"https://example.com/page{i}")
    result = scraper.extract_data(html, "Extract product data")
    if result:
        save_result(result)
    # Check health status periodically
    if i % 10 == 0:
        print("Health status:", scraper.get_health_status())
Best Practices for Multi-Provider Scraping
1. Standardize Output Format
Ensure all providers return data in the same format:
import json
import re
from typing import Dict


def normalize_llm_response(response: str, provider: str) -> Dict:
    """Normalize responses from different providers"""
    try:
        # Try to parse as JSON
        return json.loads(response)
    except json.JSONDecodeError:
        # Extract JSON from markdown code blocks (with or without a "json" tag)
        json_match = re.search(r'```(?:json)?\s*\n(.*?)\n?```', response, re.DOTALL)
        if json_match:
            return json.loads(json_match.group(1))
        # Provider-specific normalization
        if provider == 'claude':
            # Claude might wrap in XML tags
            # (parse_claude_response is a helper you supply)
            return parse_claude_response(response)
        # Fallback: return as plain text
        return {'text': response}
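To make the normalization step concrete, here is a self-contained variant that can be run without any provider SDK. It is a sketch under a few assumptions: the name `extract_json` is illustrative, fenced JSON that fails to parse is treated as plain text rather than raising, and non-object JSON (such as a bare list) is wrapped under a hypothetical `data` key so callers always receive a dict.

```python
import json
import re
from typing import Any, Dict

# Matches a markdown code fence, with or without a "json" language tag
FENCE_RE = re.compile(r"```(?:json)?\s*\n(.*?)\n?```", re.DOTALL)


def extract_json(response: str) -> Dict[str, Any]:
    """Return a dict for any LLM response: parsed JSON if possible, else raw text."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        match = FENCE_RE.search(response)
        if not match:
            return {"text": response}
        try:
            parsed = json.loads(match.group(1))
        except json.JSONDecodeError:
            # The fence did not contain valid JSON; keep the raw text
            return {"text": response}
    # Wrap lists, numbers, and strings so the return type stays a dict
    return parsed if isinstance(parsed, dict) else {"data": parsed}


# The same product, as three providers might phrase it
raw = '{"name": "Laptop", "price": "$999"}'
fenced = 'Here is the data:\n```json\n{"name": "Laptop", "price": "$999"}\n```'
chatty = "The product is a Laptop priced at $999."

print(extract_json(raw))
print(extract_json(fenced))
print(extract_json(chatty))
```

The first two calls yield the same dict, while the third falls through to the `text` wrapper, which downstream code can treat as a signal to retry with a stricter prompt or a different provider.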
2. Implement Caching
Cache responses to avoid redundant API calls across providers:
import hashlib
import json
import os
import time
from typing import Dict, Optional


class CachedMultiProviderLLM:
    def __init__(self, providers: Dict, cache_dir: str = './cache'):
        self.providers = providers
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def extract_data(self, html_content: str, prompt: str) -> Optional[str]:
        # Generate cache key from the page content and the prompt
        cache_key = hashlib.md5(f"{html_content}{prompt}".encode()).hexdigest()
        cache_file = f"{self.cache_dir}/{cache_key}.json"
        # Check cache
        if os.path.exists(cache_file):
            with open(cache_file, 'r') as f:
                cached = json.load(f)
            print(f"Cache hit (from {cached['provider']})")
            return cached['result']
        # Try providers in order until one succeeds
        for provider_name, provider_func in self.providers.items():
            try:
                result = provider_func(html_content, prompt)
                # Cache the result
                with open(cache_file, 'w') as f:
                    json.dump({
                        'provider': provider_name,
                        'result': result,
                        'timestamp': time.time()
                    }, f)
                return result
            except Exception as e:
                print(f"{provider_name} failed: {e}")
                continue
        return None
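Each cache entry records a `timestamp`, but nothing reads it back, so cached results never expire. If pages change over time, a freshness check along these lines might be added before returning a cached result. This is a sketch: the helper name `load_fresh` and the `max_age_seconds` parameter are assumptions, and the optional `now` argument exists only to make the function easy to test.

```python
import json
import os
import time
from typing import Optional


def load_fresh(cache_file: str, max_age_seconds: float,
               now: Optional[float] = None) -> Optional[str]:
    """Return the cached result if it exists and is recent enough, else None."""
    if not os.path.exists(cache_file):
        return None
    try:
        with open(cache_file, "r") as f:
            cached = json.load(f)
    except (json.JSONDecodeError, OSError):
        # Treat unreadable or corrupt entries as a cache miss
        return None
    current = time.time() if now is None else now
    # Entries without a timestamp are treated as expired
    if current - cached.get("timestamp", 0) > max_age_seconds:
        return None
    return cached.get("result")
```

A call such as `load_fresh(cache_file, max_age_seconds=3600)` could replace the plain `os.path.exists` check, with a `None` return falling through to the provider loop so the stale file is overwritten.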
3. Monitor Costs Across Providers
Track spending to optimize your multi-provider strategy:
from typing import Dict, Optional


class CostTrackingLLM:
    def __init__(self, providers: Dict, pricing: Dict):
        """
        pricing: Dict of provider -> {'input': cost_per_1k, 'output': cost_per_1k}
        """
        self.providers = providers
        self.pricing = pricing
        self.usage_stats = {name: {'requests': 0, 'tokens': 0, 'cost': 0}
                            for name in providers}

    def extract_data(self, html_content: str, prompt: str) -> Optional[str]:
        for provider_name, provider_func in self.providers.items():
            try:
                result = provider_func(html_content, prompt)
                # Estimate tokens (rough approximation: ~4 characters per token)
                input_tokens = len(html_content + prompt) / 4
                output_tokens = len(result) / 4
                # Calculate cost from per-1K-token prices
                cost = (
                    (input_tokens / 1000) * self.pricing[provider_name]['input'] +
                    (output_tokens / 1000) * self.pricing[provider_name]['output']
                )
                # Update stats
                self.usage_stats[provider_name]['requests'] += 1
                self.usage_stats[provider_name]['tokens'] += input_tokens + output_tokens
                self.usage_stats[provider_name]['cost'] += cost
                return result
            except Exception:
                continue
        return None

    def get_cost_report(self) -> Dict:
        """Generate cost report"""
        return self.usage_stats
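A worked example helps show what the cost formula produces. The sketch below isolates the arithmetic in a standalone function; the name `estimate_cost` is illustrative, the prices ($0.01 input and $0.03 output per 1K tokens) are placeholders from the low end of the OpenAI range quoted earlier, and the 4-characters-per-token ratio is the same rough approximation the tracker uses.

```python
def estimate_cost(input_chars: int, output_chars: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost of one request from character counts."""
    # Rough approximation: about 4 characters per token
    input_tokens = input_chars / 4
    output_tokens = output_chars / 4
    return ((input_tokens / 1000) * input_price_per_1k +
            (output_tokens / 1000) * output_price_per_1k)


# A 20,000-character page with an 800-character JSON answer:
#   input:  20000 / 4 = 5000 tokens -> 5.0 * $0.01 = $0.050
#   output:   800 / 4 =  200 tokens -> 0.2 * $0.03 = $0.006
cost = estimate_cost(20000, 800, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(f"${cost:.4f}")  # prints "$0.0560"
```

At roughly 5.6 cents per page, input tokens dominate the bill, which is why trimming HTML before sending it usually saves more than shortening the response. For billing-grade numbers, prefer the token counts that providers return with each response over a character-based estimate.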
4. Handle Rate Limiting Across Providers
Implement proper rate limiting strategies for each provider:
import asyncio
from asyncio import Semaphore
from typing import Dict, Optional


class RateLimitedMultiProvider:
    def __init__(self, providers: Dict, rate_limits: Dict):
        """
        providers: Dict of provider -> async callable
        rate_limits: Dict of provider -> maximum concurrent requests
        (a Semaphore caps how many calls are in flight at once,
        not how many are sent per minute)
        """
        self.providers = providers
        self.limiters = {
            name: Semaphore(rate_limits[name])
            for name in providers
        }

    async def extract_data_async(self, html_content: str, prompt: str) -> Optional[str]:
        """Try providers with rate limiting"""
        for provider_name, provider_func in self.providers.items():
            # Wait for a free slot on this provider before calling it
            async with self.limiters[provider_name]:
                try:
                    result = await provider_func(html_content, prompt)
                    return result
                except Exception as e:
                    print(f"{provider_name} failed: {e}")
                    continue
        return None
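A semaphore bounds how many requests are in flight at once, which is not the same as a requests-per-minute quota. To enforce a per-minute limit, one option is a sliding-window limiter that remembers recent call times and sleeps until a slot frees up. The sketch below assumes a single asyncio event loop; the class name `SlidingWindowLimiter`, the helper `call_with_limit`, and the injectable `clock` parameter are illustrative choices rather than part of any provider SDK.

```python
import asyncio
import time
from collections import deque
from typing import Awaitable, Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls per window_seconds (single event loop only)."""

    def __init__(self, max_calls: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls: Deque[float] = deque()  # timestamps of recent calls

    def _wait_time(self) -> float:
        """Seconds until a slot frees up; 0.0 if a call is allowed now."""
        now = self.clock()
        # Drop timestamps that have left the window
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self.calls[0])

    async def acquire(self) -> None:
        """Wait until a call is allowed, then record it."""
        while True:
            wait = self._wait_time()
            if wait <= 0:
                # No await between the check and the append, so two
                # coroutines on the same loop cannot claim the same slot
                self.calls.append(self.clock())
                return
            await asyncio.sleep(wait)


async def call_with_limit(limiter: SlidingWindowLimiter,
                          provider_func: Callable[[str, str], Awaitable[str]],
                          html_content: str, prompt: str) -> str:
    """Wait for a free slot, then call the (async) provider function."""
    await limiter.acquire()
    return await provider_func(html_content, prompt)
```

One limiter per provider, for example `SlidingWindowLimiter(max_calls=60)`, can sit alongside the semaphores: the semaphore caps concurrency while the limiter caps throughput. Providers may count quotas differently (per token, per day, or per organization), so treat the window size as a starting point to adjust against the 429 responses you observe.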
Integration with Web Scraping Workflows
When combining multiple LLM providers with browser automation, implement proper error handling and timeout management:
import json
from typing import Dict, Optional

from playwright.sync_api import sync_playwright


def scrape_with_multi_llm(url: str, scraper: MultiProviderLLM) -> Optional[Dict]:
    """Scrape page and extract data using multi-provider LLM"""
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Navigate with timeout (milliseconds)
            page.goto(url, timeout=30000)
            # Get content
            html = page.content()
            browser.close()
        # Extract data with LLM fallback
        result = scraper.extract_data(
            html,
            "Extract product name, price, and availability as JSON"
        )
        # json.loads raises on non-JSON output; the except below turns that into None
        return json.loads(result) if result else None
    except Exception as e:
        print(f"Scraping failed for {url}: {e}")
        return None


# Usage
scraper = MultiProviderLLM(
    openai_key="sk-...",
    anthropic_key="sk-ant-..."
)
products = []
for url in product_urls:  # product_urls: your list of pages to scrape
    data = scrape_with_multi_llm(url, scraper)
    if data:
        products.append(data)
Conclusion
Using multiple LLM providers for web scraping significantly improves reliability, reduces costs, and optimizes performance. By implementing fallback strategies, smart routing, health monitoring, and proper rate limiting, you can build robust scraping systems that handle failures gracefully and maintain consistent operation.
Start with a simple fallback pattern and gradually add sophistication as your needs grow. Monitor performance and costs across providers to continuously optimize your multi-provider strategy. The key is balancing reliability with complexity—use as many providers as needed to meet your uptime requirements, but avoid over-engineering for simple use cases.
With the right multi-provider architecture, you can scrape at scale with confidence, knowing that no single provider failure will bring your entire operation to a halt.