What is the Token Limit for LLMs and How Does It Impact Scraping Large Pages?

When using Large Language Models (LLMs) for web scraping and data extraction, understanding token limits is crucial for building reliable and efficient scraping solutions. Token limits directly impact how much content you can process in a single request and determine the strategies you need to employ when scraping large web pages.

Understanding LLM Token Limits

A token is the basic unit of text that LLMs process. Tokens can be words, parts of words, or individual characters, depending on the model's tokenization algorithm. Most modern LLMs use subword tokenization, where common words are single tokens, while less common words may be split into multiple tokens.
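
You can see tokenization in action with OpenAI's tiktoken library. A minimal sketch (counts are exact only for OpenAI models; other providers use different tokenizers, so treat the numbers as estimates elsewhere):

import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5 Turbo
encoding = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
tokens = encoding.encode(text)

print(f"Token count: {len(tokens)}")
# Decode each token individually to see how the text was split
print([encoding.decode([t]) for t in tokens])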

Token Limits by Popular LLM Providers

Different LLM providers offer varying context window sizes (the maximum number of tokens that can be processed in a single request, including both input and output):

OpenAI Models:
- GPT-4 Turbo: 128,000 tokens (~300 pages of text)
- GPT-4: 8,192 tokens (standard) or 32,768 tokens (extended)
- GPT-3.5 Turbo: 16,385 tokens

Anthropic Claude:
- Claude 3 Opus/Sonnet: 200,000 tokens (~500 pages)
- Claude 3 Haiku: 200,000 tokens

Google Gemini:
- Gemini 1.5 Pro: 1,000,000 tokens (2,000,000 in the latest versions)
- Gemini 1.0 Pro: 32,768 tokens

Other Providers:
- Llama 3: 8,192 tokens
- Mistral Large: 32,000 tokens

How Token Limits Impact Web Scraping

When scraping web pages with LLMs, token limits create several challenges:

1. Large HTML Documents

Modern web pages often contain massive HTML documents with extensive JavaScript, CSS, and embedded content. A typical e-commerce product page might consume 10,000-50,000 tokens once all the HTML markup is included, making it impossible to process in a single request with smaller context windows.
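
You can check this yourself by comparing the token count of a page's raw HTML against its visible text. A quick sketch (the URL is a placeholder):

import requests
import tiktoken
from bs4 import BeautifulSoup

encoding = tiktoken.get_encoding("cl100k_base")

# Placeholder URL -- substitute a real product page
raw_html = requests.get('https://example.com/product').text
visible_text = BeautifulSoup(raw_html, 'html.parser').get_text(separator=' ', strip=True)

print(f"Raw HTML tokens: {len(encoding.encode(raw_html))}")
print(f"Visible text tokens: {len(encoding.encode(visible_text))}")

On typical pages the markup, inline scripts, and styles account for most of the tokens, which is why the preprocessing strategies below pay off.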

2. Multiple Page Scraping

When scraping multiple pages in a single session or conversation, the token count accumulates across all requests and responses, quickly exhausting the available context window.
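
One way to guard against this is to track the running total as you add messages. A minimal sketch of that bookkeeping (the 128,000 limit assumes GPT-4 Turbo, and the count ignores the few tokens of per-message overhead):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 128_000  # assumes GPT-4 Turbo; adjust for your model

messages = []
running_tokens = 0

def add_message(role, content):
    """Append a message while tracking the cumulative token count."""
    global running_tokens
    running_tokens += len(encoding.encode(content))
    if running_tokens > CONTEXT_LIMIT:
        raise ValueError(f"Context window exceeded: {running_tokens} tokens")
    messages.append({"role": role, "content": content})

add_message("system", "Extract product data from each page.")
add_message("user", "<html>...page 1...</html>")
# Every additional page adds its tokens on top of everything before it.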

3. Output Token Requirements

Remember that the token limit includes both input (the HTML you send) and output (the extracted data). You must reserve sufficient tokens for the model's response, typically 1,000-4,000 tokens depending on the complexity of the extraction task.
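
In budgeting terms, the usable input size is the context window minus whatever you reserve for output, and you can enforce the reservation with the max_tokens parameter. A sketch under those assumptions (the numbers are illustrative):

from openai import OpenAI

CONTEXT_WINDOW = 128_000   # e.g. GPT-4 Turbo
RESERVED_OUTPUT = 4_000    # tokens set aside for the model's response
max_input_tokens = CONTEXT_WINDOW - RESERVED_OUTPUT  # budget for the HTML you send

client = OpenAI()
trimmed_html = "<html>...</html>"  # content already trimmed to fit max_input_tokens

response = client.chat.completions.create(
    model="gpt-4-turbo",
    max_tokens=RESERVED_OUTPUT,  # hard cap on the response length
    messages=[
        {"role": "system", "content": "Extract the product name and price as JSON."},
        {"role": "user", "content": trimmed_html}
    ]
)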

Strategies for Handling Large Pages

Strategy 1: HTML Preprocessing and Cleanup

Before sending HTML to an LLM, strip unnecessary content to reduce token count:

from bs4 import BeautifulSoup, Comment
import requests

def clean_html_for_llm(html, target_selectors=None):
    """Clean raw HTML and keep only content relevant for LLM processing."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style elements
    for script in soup(['script', 'style', 'noscript']):
        script.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # If specific selectors are provided, keep only those elements
    if target_selectors:
        relevant_content = []
        for selector in target_selectors:
            relevant_content.extend(soup.select(selector))

        # Create a new document containing only the relevant content
        new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
        for element in relevant_content:
            new_soup.body.append(element)
        soup = new_soup

    # Collapse excessive whitespace
    return ' '.join(str(soup).split())

# Example usage
page_html = requests.get('https://example.com/product').text
cleaned_html = clean_html_for_llm(
    page_html,
    target_selectors=['.product-info', '.reviews', '.specifications']
)

Strategy 2: Convert HTML to Markdown

Converting HTML to markdown significantly reduces token count while preserving structure:

from markdownify import markdownify as md
from bs4 import BeautifulSoup
import requests

def html_to_markdown(url):
    """Convert a page's HTML to markdown for more token-efficient LLM processing."""
    response = requests.get(url)

    # Drop script/style content first so it doesn't leak into the markdown
    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    # Convert to markdown
    return md(str(soup), heading_style="ATX", bullets="-")

# Example usage with OpenAI
from openai import OpenAI

client = OpenAI()
markdown = html_to_markdown('https://example.com/article')

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Extract the main article content and metadata."},
        {"role": "user", "content": markdown}
    ]
)

extracted_data = response.choices[0].message.content

Strategy 3: Chunking Large Pages

For pages that exceed token limits even after cleaning, split the content into chunks:

import tiktoken
from openai import OpenAI

client = OpenAI()

def chunk_content_by_tokens(content, model="gpt-4", max_tokens=6000):
    """Split content into chunks based on token count."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(content)

    # Note: fixed-size windows can split an HTML element across two chunks;
    # for precise extraction, prefer splitting on element boundaries.
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))

    return chunks

def scrape_large_page_with_chunking(html_content, extraction_prompt):
    """Process a large page by chunking it and collecting per-chunk results."""
    chunks = chunk_content_by_tokens(html_content, max_tokens=6000)
    all_results = []

    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": extraction_prompt},
                {"role": "user", "content": f"Process chunk {i+1}/{len(chunks)}:\n\n{chunk}"}
            ]
        )
        all_results.append(response.choices[0].message.content)

    return all_results

# Example usage
large_html = requests.get('https://example.com/large-page').text
cleaned = clean_html_for_llm(large_html)
results = scrape_large_page_with_chunking(
    cleaned,
    "Extract all product names and prices from this HTML."
)
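
The chunked scrape returns one result per chunk, so you still need a merge step. If you prompt the model to return a JSON array per chunk (an assumption you would enforce via the extraction prompt), a minimal merge could look like this:

import json

def merge_chunk_results(all_results):
    """Flatten per-chunk JSON array responses into one list."""
    merged = []
    for result in all_results:
        try:
            merged.extend(json.loads(result))
        except json.JSONDecodeError:
            # Skip chunks where the model didn't return valid JSON
            continue
    return merged

products = merge_chunk_results(results)

Depending on the page, you may also need to deduplicate items that straddle a chunk boundary.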

Strategy 4: Two-Stage Extraction

Use a fast, lightweight tool for initial extraction, then use LLMs for refinement:

// JavaScript example using Cheerio for initial extraction
const cheerio = require('cheerio');
const axios = require('axios');
const { OpenAI } = require('openai');

async function twoStageExtraction(url) {
    // Stage 1: Extract raw data with Cheerio
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    const rawProducts = [];
    $('.product-card').each((i, elem) => {
        rawProducts.push({
            title: $(elem).find('.title').text().trim(),
            price: $(elem).find('.price').text().trim(),
            description: $(elem).find('.description').text().trim()
        });
    });

    // Stage 2: Use LLM to clean and structure data
    const openai = new OpenAI();
    const completion = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        // JSON mode guarantees syntactically valid JSON in the response
        response_format: { type: "json_object" },
        messages: [
            {
                role: "system",
                content: "Clean and normalize product data. Extract numeric prices, standardize titles. Respond with a JSON object containing a 'products' array."
            },
            {
                role: "user",
                content: JSON.stringify(rawProducts)
            }
        ]
    });

    return JSON.parse(completion.choices[0].message.content);
}

// Usage
twoStageExtraction('https://example.com/products')
    .then(cleanedProducts => console.log(cleanedProducts));

Strategy 5: Selective Element Extraction

Instead of processing entire pages, target specific elements for LLM processing. This approach is particularly effective when you know which sections contain the data you need:

from bs4 import BeautifulSoup
from openai import OpenAI
import requests
import json

client = OpenAI()

def selective_llm_extraction(url, target_selector, extraction_schema):
    """Extract only specific elements and send just those to the LLM."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract only the target elements
    target_elements = soup.select(target_selector)

    # Combine the selected elements into one HTML snippet
    combined_html = '\n'.join(str(elem) for elem in target_elements)

    # Send to LLM
    llm_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Extract data according to this schema: {json.dumps(extraction_schema)}"},
            {"role": "user", "content": combined_html}
        ]
    )

    return llm_response.choices[0].message.content

# Example usage
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "rating": "number",
            "availability": "string"
        }
    ]
}

extracted = selective_llm_extraction(
    'https://example.com/store',
    '.product-grid .product-item',
    schema
)

Monitoring Token Usage

Always monitor your token consumption to avoid exceeding limits and optimize costs:

import tiktoken

def estimate_tokens(text, model="gpt-4"):
    """Estimate the token count for a given text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def scrape_with_token_monitoring(html_content, max_input_tokens=100000):
    """Scrape with token-limit checking, cleaning and chunking as needed."""
    token_count = estimate_tokens(html_content)
    print(f"Input token count: {token_count}")

    if token_count > max_input_tokens:
        print(f"Warning: Content exceeds {max_input_tokens} tokens")
        # Apply the cleaning strategy from Strategy 1
        html_content = clean_html_for_llm(html_content)
        token_count = estimate_tokens(html_content)
        print(f"Cleaned token count: {token_count}")

    if token_count > max_input_tokens:
        # Still too large: fall back to chunking (Strategy 3)
        chunks = chunk_content_by_tokens(html_content, max_tokens=max_input_tokens)
        return process_chunks(chunks)  # application-specific handler

    return process_single_request(html_content)  # application-specific handler

Best Practices for Token-Efficient Web Scraping

  1. Pre-process HTML: Always clean HTML before sending to LLMs by removing scripts, styles, and irrelevant content
  2. Use appropriate models: Choose models with sufficient context windows for your use case
  3. Implement fallback strategies: Have chunking or cleanup strategies ready for unexpectedly large pages
  4. Cache cleaned content: Store preprocessed versions of frequently accessed pages (see the caching sketch after this list)
  5. Combine approaches: Use traditional parsing for structure extraction and LLMs for complex interpretation tasks
  6. Monitor costs: Token usage directly impacts API costs, so optimization saves money
  7. Test token counts: Always estimate tokens before sending to production
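
As a concrete illustration of point 4, here is a minimal disk-cache sketch built on the clean_html_for_llm helper from Strategy 1. The cache directory and hashing scheme are illustrative assumptions, not a prescribed layout:

import hashlib
import os
import requests

CACHE_DIR = "html_cache"  # assumed location; adjust to taste
os.makedirs(CACHE_DIR, exist_ok=True)

def get_cleaned_html(url, target_selectors=None):
    """Return cleaned HTML for a URL, reusing a cached copy when available."""
    cache_key = hashlib.sha256(url.encode()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{cache_key}.html")

    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8') as f:
            return f.read()

    raw_html = requests.get(url).text
    cleaned = clean_html_for_llm(raw_html, target_selectors)

    with open(cache_path, 'w', encoding='utf-8') as f:
        f.write(cleaned)
    return cleaned

Add an expiry check if the pages change frequently; for static reference pages, caching avoids paying both the fetch and the cleanup cost on every run.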

When to Use Specialized Web Scraping APIs

For large-scale web scraping projects where token limits become a bottleneck, consider using specialized web scraping APIs that handle JavaScript rendering and HTML extraction efficiently. These APIs can process large pages and return only the data you need, which you can then optionally enhance with LLM-based data extraction for complex fields.

By understanding and working within LLM token limits, you can build robust web scraping solutions that leverage AI capabilities while maintaining efficiency and cost-effectiveness. The key is choosing the right combination of traditional parsing techniques and LLM-powered extraction based on your specific requirements.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What%20is%20the%20main%20topic%3F&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields%5Btitle%5D=Page%20title&fields%5Bprice%5D=Product%20price&api_key=YOUR_API_KEY"
