What is the LLM Context Window and How Does It Affect Web Scraping?
The LLM context window is one of the most critical constraints when using Large Language Models for web scraping and data extraction. Understanding how it works and how to work around its limitations is essential for building effective AI-powered scraping solutions.
Understanding the LLM Context Window
The context window refers to the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes both your input prompt and the model's response combined. Different models have different context window sizes:
- GPT-3.5-turbo: 16,385 tokens (~12,000 words)
- GPT-4: 8,192 tokens (~6,000 words)
- GPT-4-turbo: 128,000 tokens (~96,000 words)
- Claude 3 Haiku: 200,000 tokens (~150,000 words)
- Claude 3 Sonnet: 200,000 tokens (~150,000 words)
- Claude 3 Opus: 200,000 tokens (~150,000 words)
- Gemini 1.5 Pro: 1,000,000 tokens (~750,000 words)
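A quick way to use these numbers in code is a lookup table you can check content against before sending a request. This is a minimal sketch; the model keys and the helper are illustrative, not an official API:

# Context window sizes from the list above, in tokens (illustrative keys)
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-turbo": 128_000,
    "claude-3-opus": 200_000,
    "gemini-1.5-pro": 1_000_000,
}

def fits_in_context(token_count, model="gpt-4"):
    """Check whether a token count fits in the chosen model's window."""
    return token_count <= CONTEXT_WINDOWS[model]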
What Counts as a Token?
Tokens are chunks of text that the model processes. As a rough estimate:
- 1 token ≈ 4 characters in English
- 1 token ≈ ¾ of a word on average
- 100 tokens ≈ 75 words
# Example: Counting tokens using tiktoken (OpenAI's tokenizer)
import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

html_content = "<html><body>Your scraped content here...</body></html>"
token_count = count_tokens(html_content)
print(f"Token count: {token_count}")
How Context Windows Affect Web Scraping
1. Large Page Content Limitations
Modern web pages often contain massive amounts of HTML, CSS, and JavaScript. A typical e-commerce product page might contain 50,000-200,000 characters of HTML, which can easily exceed smaller context windows.
import requests
from bs4 import BeautifulSoup
# Scrape a webpage
response = requests.get("https://example.com/product")
html = response.text
# Check size
print(f"HTML length: {len(html)} characters")
print(f"Estimated tokens: {len(html) // 4}")
# This might exceed your LLM's context window!
2. Prompt Overhead
Your extraction prompt also consumes tokens from the context window. A detailed prompt with examples and instructions might use 500-2000 tokens, leaving less room for the actual content.
prompt = """
Extract the following information from this product page:
- Product name
- Price
- Description
- Availability
- Customer ratings

Return the data in JSON format like this:
{
  "name": "...",
  "price": "...",
  "description": "...",
  "availability": "...",
  "rating": "..."
}

Here is the HTML content:
{html_content}
"""
# This prompt uses tokens before you even add the HTML!
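You can measure that overhead directly by running the template through the count_tokens helper defined earlier; here the {html_content} placeholder is stripped first so only the instructions are counted:

# Count the tokens the instructions consume on their own
prompt_overhead = count_tokens(prompt.replace("{html_content}", ""))
print(f"Prompt overhead: {prompt_overhead} tokens")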
3. Output Space Requirements
The model's response also counts toward the context window. If you're extracting large amounts of data, you need to reserve sufficient tokens for the output.
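A simple way to budget for this is to subtract the prompt size and a reserved output allowance from the window before deciding how much content to send. The numbers below are illustrative assumptions, not fixed values:

# Illustrative token budget (adjust the numbers for your model and prompt)
CONTEXT_WINDOW = 8_192     # e.g. GPT-4
PROMPT_TOKENS = 800        # instructions, examples, formatting rules
RESERVED_OUTPUT = 1_500    # room for the JSON the model returns

max_content_tokens = CONTEXT_WINDOW - PROMPT_TOKENS - RESERVED_OUTPUT
print(f"Tokens available for scraped content: {max_content_tokens}")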
Strategies for Working Within Context Window Limits
Strategy 1: Pre-process and Clean HTML
Remove unnecessary content before sending it to the LLM. Strip out scripts, styles, and irrelevant elements.
from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script and style elements
    for element in soup(['script', 'style', 'noscript', 'svg']):
        element.decompose()
    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Get clean text or simplified HTML
    return soup.get_text(separator=' ', strip=True)

cleaned_content = clean_html_for_llm(html)
print(f"Reduced from {len(html)} to {len(cleaned_content)} characters")
Strategy 2: Extract Relevant Sections First
Use traditional parsing methods to identify and extract only the relevant sections before sending them to the LLM.
from bs4 import BeautifulSoup

def extract_product_section(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Find the specific section containing product info
    product_div = soup.find('div', class_='product-details')
    if product_div:
        return str(product_div)
    return html  # Fallback to full HTML

# Only send the relevant section to the LLM
relevant_html = extract_product_section(html)
Strategy 3: Chunking for Large Documents
Split large content into smaller chunks and process them separately, then combine the results.
def chunk_text(text, max_tokens=3000):
    """Split text into chunks that fit within token limits"""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        word_length = max(1, len(word) // 4)  # Rough token estimate, at least 1 per word
        if current_length + word_length > max_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Process each chunk separately (large_content is e.g. the cleaned page text from Strategy 1)
chunks = chunk_text(large_content, max_tokens=3000)
results = []
for chunk in chunks:
    result = process_with_llm(chunk)
    results.append(result)

# Combine results
final_result = combine_results(results)
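Here, process_with_llm and combine_results are placeholders. A minimal combine step, assuming each chunk returns a dict of extracted fields, could keep the first non-empty value seen for each field:

def combine_results(results):
    """Merge per-chunk dicts, keeping the first non-empty value for each field."""
    combined = {}
    for result in results:
        for key, value in result.items():
            if not combined.get(key):
                combined[key] = value
    return combined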
Strategy 4: Use Markdown Conversion
Convert HTML to Markdown to significantly reduce token count while preserving structure and content.
from markdownify import markdownify as md

def html_to_markdown(html):
    # Convert HTML to Markdown
    markdown = md(html, heading_style="ATX")
    return markdown

markdown_content = html_to_markdown(html)
print(f"Reduced from {len(html)} to {len(markdown_content)} characters")
# Markdown typically uses 40-60% fewer tokens than HTML
Strategy 5: Two-Stage Processing
Use a traditional scraper to extract raw data, then use the LLM only for parsing and structuring the extracted content.
# Stage 1: Traditional scraping
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
raw_data = {
    'title': soup.find('h1', class_='product-title').text,
    'price_text': soup.find('span', class_='price').text,
    'description': soup.find('div', class_='description').text,
}

# Stage 2: Use LLM only for cleaning and structuring
prompt = f"""
Parse this raw product data and return clean, structured JSON:
{raw_data}
Clean the price to a number, summarize the description to 100 words, etc.
"""
JavaScript Example: Managing Context Windows
const axios = require('axios');
const cheerio = require('cheerio');
const { encode, decode } = require('gpt-3-encoder');

async function scrapeWithContextLimit(url, maxTokens = 4000) {
  // Fetch the page
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Remove unnecessary elements
  $('script, style, noscript, svg').remove();

  // Extract main content
  const mainContent = $('main, article, .content').text();

  // Count tokens
  const tokens = encode(mainContent);
  console.log(`Content tokens: ${tokens.length}`);

  // Truncate if necessary
  let finalContent = mainContent;
  if (tokens.length > maxTokens) {
    // Keep only the first maxTokens tokens and decode them back to text
    finalContent = decode(tokens.slice(0, maxTokens));
    console.log(`Truncated to ${maxTokens} tokens`);
  }

  return finalContent;
}

// Usage
scrapeWithContextLimit('https://example.com/article')
  .then(content => {
    // Send to LLM for processing
    return processWithLLM(content);
  });
Choosing the Right Model for Web Scraping
When selecting an LLM for web scraping, consider the context window size:
Small Context Windows (8K-16K tokens)
Best for:
- Single product pages
- Short articles
- Structured data extraction from small pages
- Pre-processed, cleaned content
Medium Context Windows (32K-128K tokens)
Best for:
- Full article extraction
- Multi-section pages
- E-commerce category pages
- Documentation scraping
Large Context Windows (200K-1M+ tokens)
Best for:
- Entire website analysis
- Long-form content with multiple pages
- Bulk data processing
- Complex, nested HTML structures
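One way to turn this guidance into code is a small helper that adds up content, prompt, and output tokens and picks the tier that fits. The thresholds below are illustrative defaults, not hard rules:

def pick_context_tier(content_tokens, prompt_tokens=1_000, output_tokens=2_000):
    """Pick a context-window tier for the total tokens needed (illustrative thresholds)."""
    total = content_tokens + prompt_tokens + output_tokens
    if total <= 16_000:
        return "small (8K-16K)"
    if total <= 128_000:
        return "medium (32K-128K)"
    return "large (200K-1M+)"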
Using WebScraping.AI with LLM Integration
When combining traditional web scraping APIs with LLM-powered data extraction, you can optimize for context windows:
import requests

# Use WebScraping.AI to get clean HTML
api_url = "https://api.webscraping.ai/html"
params = {
    'url': 'https://example.com/product',
    'api_key': 'YOUR_API_KEY'
}
response = requests.get(api_url, params=params)
clean_html = response.text

# Now send to LLM with AI-powered question answering
api_url = "https://api.webscraping.ai/ai/question"
params = {
    'url': 'https://example.com/product',
    'question': 'What is the product name, price, and availability?',
    'api_key': 'YOUR_API_KEY'
}
# This handles context window management automatically
ai_response = requests.get(api_url, params=params)
structured_data = ai_response.json()
Best Practices for Context Window Management
- Measure First: Always count tokens before sending to the LLM
- Clean Aggressively: Remove all unnecessary HTML elements
- Extract Smart: Use CSS selectors to get only relevant sections
- Convert Format: Use Markdown or plain text instead of raw HTML
- Chunk Large Content: Split documents that exceed limits
- Choose the Right Model: Match the model's context window to your needs
- Monitor Costs: Larger context windows often cost more per token
- Cache Results: Don't re-process the same content multiple times
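For the last point, even a minimal cache keyed on a hash of the content avoids paying for the same tokens twice. A sketch using an in-memory dict; a real setup might use Redis or files on disk, and llm_fn stands in for whatever LLM call you make:

import hashlib

llm_cache = {}  # in-memory store; swap for Redis or a database in production

def cached_llm_call(content, llm_fn):
    """Only call the LLM when this exact content has not been processed before."""
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if key not in llm_cache:
        llm_cache[key] = llm_fn(content)
    return llm_cache[key]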
Conclusion
The LLM context window is a fundamental constraint in AI-powered web scraping, but with proper planning and optimization strategies, you can work effectively within these limits. By pre-processing content, using appropriate models, and implementing smart chunking strategies, you can extract structured data from even the largest web pages.
Remember that the context window includes your prompt, the input data, and the model's response—so always leave sufficient room for the output you need. When in doubt, clean more, extract less, and use models with larger context windows for complex scraping tasks.