What is the LLM Context Window and How Does It Affect Web Scraping?

The LLM context window is one of the most critical constraints when using Large Language Models for web scraping and data extraction. Understanding how it works and how to work around its limitations is essential for building effective AI-powered scraping solutions.

Understanding the LLM Context Window

The context window refers to the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes both your input prompt and the model's response combined. Different models have different context window sizes:

  • GPT-3.5-turbo: 16,385 tokens (~12,000 words)
  • GPT-4: 8,192 tokens (~6,000 words)
  • GPT-4-turbo: 128,000 tokens (~96,000 words)
  • Claude 3 Haiku: 200,000 tokens (~150,000 words)
  • Claude 3 Sonnet: 200,000 tokens (~150,000 words)
  • Claude 3 Opus: 200,000 tokens (~150,000 words)
  • Gemini 1.5 Pro: 1,000,000 tokens (~750,000 words)

What Counts as a Token?

Tokens are chunks of text that the model processes. As a rough estimate:

  • 1 token ≈ 4 characters in English
  • 1 token ≈ ¾ of a word on average
  • 100 tokens ≈ 75 words

# Example: Counting tokens using tiktoken (OpenAI's tokenizer)
import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

html_content = "<html><body>Your scraped content here...</body></html>"
token_count = count_tokens(html_content)
print(f"Token count: {token_count}")

How Context Windows Affect Web Scraping

1. Large Page Content Limitations

Modern web pages often contain massive amounts of HTML, CSS, and JavaScript. A typical e-commerce product page might contain 50,000-200,000 characters of HTML, which can easily exceed smaller context windows.

import requests
from bs4 import BeautifulSoup

# Scrape a webpage
response = requests.get("https://example.com/product")
html = response.text

# Check size
print(f"HTML length: {len(html)} characters")
print(f"Estimated tokens: {len(html) // 4}")

# This might exceed your LLM's context window!

2. Prompt Overhead

Your extraction prompt also consumes tokens from the context window. A detailed prompt with examples and instructions might use 500-2000 tokens, leaving less room for the actual content.

prompt = """
Extract the following information from this product page:
- Product name
- Price
- Description
- Availability
- Customer ratings

Return the data in JSON format like this:
{
  "name": "...",
  "price": "...",
  "description": "...",
  "availability": "...",
  "rating": "..."
}

Here is the HTML content:
{html_content}
"""

# This prompt uses tokens before you even add the HTML!

3. Output Space Requirements

The model's response also counts toward the context window. If you're extracting large amounts of data, you need to reserve sufficient tokens for the output.
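
A simple way to handle this is to budget tokens explicitly before each call. Here is a minimal sketch that reuses the count_tokens helper from earlier; the window size and output reservation are illustrative values, not fixed constants:

# Token-budget sketch: the numbers below are examples, not limits
# of any particular API plan
CONTEXT_WINDOW = 8192      # e.g., GPT-4's window
RESERVED_OUTPUT = 1500     # tokens set aside for the model's response

def max_input_tokens(prompt_template):
    # The prompt's own instructions also come out of the budget
    overhead = count_tokens(prompt_template)
    return CONTEXT_WINDOW - RESERVED_OUTPUT - overhead

budget = max_input_tokens(prompt_template)
print(f"Room left for scraped content: {budget} tokens")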

Strategies for Working Within Context Window Limits

Strategy 1: Pre-process and Clean HTML

Remove unnecessary content before sending it to the LLM. Strip out scripts, styles, and irrelevant elements.

from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style elements
    for element in soup(['script', 'style', 'noscript', 'svg']):
        element.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Get clean text or simplified HTML
    return soup.get_text(separator=' ', strip=True)

cleaned_content = clean_html_for_llm(html)
print(f"Reduced from {len(html)} to {len(cleaned_content)} characters")

Strategy 2: Extract Relevant Sections First

Use traditional parsing methods to identify and extract only the relevant sections before sending them to the LLM.

from bs4 import BeautifulSoup

def extract_product_section(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Find the specific section containing product info
    product_div = soup.find('div', class_='product-details')

    if product_div:
        return str(product_div)

    return html  # Fallback to full HTML

# Only send the relevant section to the LLM
relevant_html = extract_product_section(html)

Strategy 3: Chunking for Large Documents

Split large content into smaller chunks and process them separately, then combine the results.

def chunk_text(text, max_tokens=3000):
    """Split text into chunks that fit within token limits.

    Uses the rough 1-token-per-4-characters heuristic; for exact
    counts, tokenize each chunk with tiktoken instead.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0

    for word in words:
        # Rough token estimate; count at least 1 token per word so
        # short words don't register as zero
        word_length = max(1, len(word) // 4)
        if current_length + word_length > max_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Process each chunk separately
chunks = chunk_text(large_content, max_tokens=3000)
results = []

for chunk in chunks:
    result = process_with_llm(chunk)
    results.append(result)

# Combine results
final_result = combine_results(results)
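
Here, process_with_llm stands in for your own LLM-call wrapper. What combine_results should do depends on your output format; assuming each chunk yields a dict of field -> value, a minimal version might keep the first non-empty value per field:

def combine_results(results):
    # Hypothetical merge: each result is assumed to be a dict of
    # field -> value; keep the first non-empty value for each field
    merged = {}
    for result in results:
        for key, value in result.items():
            if value and not merged.get(key):
                merged[key] = value
    return merged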

Strategy 4: Use Markdown Conversion

Convert HTML to Markdown to significantly reduce token count while preserving structure and content.

from markdownify import markdownify as md

def html_to_markdown(html):
    # Convert HTML to Markdown
    markdown = md(html, heading_style="ATX")
    return markdown

markdown_content = html_to_markdown(html)
print(f"Reduced from {len(html)} to {len(markdown_content)} characters")

# Markdown typically uses 40-60% fewer tokens than HTML

Strategy 5: Two-Stage Processing

Use a traditional scraper to extract raw data, then use the LLM only for parsing and structuring the extracted content.

# Stage 1: Traditional scraping
from bs4 import BeautifulSoup

def text_or_none(soup, tag, class_name):
    # Guard against pages where an element is missing, instead of
    # raising AttributeError on .text
    element = soup.find(tag, class_=class_name)
    return element.get_text(strip=True) if element else None

soup = BeautifulSoup(html, 'html.parser')
raw_data = {
    'title': text_or_none(soup, 'h1', 'product-title'),
    'price_text': text_or_none(soup, 'span', 'price'),
    'description': text_or_none(soup, 'div', 'description'),
}

# Stage 2: Use LLM only for cleaning and structuring
prompt = f"""
Parse this raw product data and return clean, structured JSON:
{raw_data}

Clean the price to a number, summarize the description to 100 words, etc.
"""

JavaScript Example: Managing Context Windows

const axios = require('axios');
const cheerio = require('cheerio');
const { encode, decode } = require('gpt-3-encoder');

async function scrapeWithContextLimit(url, maxTokens = 4000) {
    // Fetch the page
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Remove unnecessary elements
    $('script, style, noscript, svg').remove();

    // Extract main content
    const mainContent = $('main, article, .content').text();

    // Count tokens
    const tokens = encode(mainContent);
    console.log(`Content tokens: ${tokens.length}`);

    // Truncate if necessary
    let finalContent = mainContent;
    if (tokens.length > maxTokens) {
        // Decode the truncated token IDs back to text
        finalContent = decode(tokens.slice(0, maxTokens));
        console.log(`Truncated to ${maxTokens} tokens`);
    }

    return finalContent;
}

// Usage (processWithLLM is your own LLM-call wrapper)
scrapeWithContextLimit('https://example.com/article')
    .then(content => {
        // Send to LLM for processing
        return processWithLLM(content);
    })
    .catch(console.error);

Choosing the Right Model for Web Scraping

When selecting an LLM for web scraping, consider the context window size:

Small Context Windows (8K-16K tokens)

Best for:

  • Single product pages
  • Short articles
  • Structured data extraction from small pages
  • Pre-processed, cleaned content

Medium Context Windows (32K-128K tokens)

Best for:

  • Full article extraction
  • Multi-section pages
  • E-commerce category pages
  • Documentation scraping

Large Context Windows (200K-1M+ tokens)

Best for:

  • Entire website analysis
  • Long-form content with multiple pages
  • Bulk data processing
  • Complex, nested HTML structures
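
If you route requests programmatically, the choice can be a simple threshold on the measured token count. A sketch using the models listed above; the thresholds and names are illustrative and should be checked against your provider's current limits:

def pick_model(token_count):
    # Illustrative routing based on the context windows above,
    # leaving headroom for the prompt and the response
    if token_count <= 6000:
        return "gpt-4"            # 8K window
    if token_count <= 100000:
        return "gpt-4-turbo"      # 128K window
    return "gemini-1.5-pro"       # 1M window for very large jobs

model = pick_model(count_tokens(cleaned_content))
print(f"Routing to: {model}")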

Using WebScraping.AI with LLM Integration

When combining traditional web scraping APIs with LLM-powered data extraction, you can optimize for context windows:

import requests

# Option 1: Fetch clean, rendered HTML to feed your own LLM pipeline
api_url = "https://api.webscraping.ai/html"
params = {
    'url': 'https://example.com/product',
    'api_key': 'YOUR_API_KEY'
}

response = requests.get(api_url, params=params)
clean_html = response.text

# Option 2: Ask the AI question endpoint directly; it handles
# context window management automatically
api_url = "https://api.webscraping.ai/ai/question"
params = {
    'url': 'https://example.com/product',
    'question': 'What is the product name, price, and availability?',
    'api_key': 'YOUR_API_KEY'
}

ai_response = requests.get(api_url, params=params)
structured_data = ai_response.json()

Best Practices for Context Window Management

  1. Measure First: Always count tokens before sending to the LLM
  2. Clean Aggressively: Remove all unnecessary HTML elements
  3. Extract Smart: Use CSS selectors to get only relevant sections
  4. Convert Format: Use Markdown or plain text instead of raw HTML
  5. Chunk Large Content: Split documents that exceed limits
  6. Choose the Right Model: Match the model's context window to your needs
  7. Monitor Costs: Larger context windows often cost more per token
  8. Cache Results: Don't re-process the same content multiple times (see the sketch below)
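
Caching is cheap to add. A minimal sketch that keys on a hash of the cleaned content, reusing the hypothetical process_with_llm wrapper from the chunking example:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_process(content):
    # Key on the content itself, not the URL, so unchanged pages
    # are never re-processed
    key = hashlib.sha256(content.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())

    result = process_with_llm(content)  # your LLM-call wrapper
    cache_file.write_text(json.dumps(result))
    return result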

Conclusion

The LLM context window is a fundamental constraint in AI-powered web scraping, but with proper planning and optimization strategies, you can work effectively within these limits. By pre-processing content, using appropriate models, and implementing smart chunking strategies, you can extract structured data from even the largest web pages.

Remember that the context window includes your prompt, the input data, and the model's response—so always leave sufficient room for the output you need. When in doubt, clean more, extract less, and use models with larger context windows for complex scraping tasks.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
