What is the Token Limit for LLMs and How Does It Impact Scraping Large Pages?

When using Large Language Models (LLMs) for web scraping and data extraction, understanding token limits is crucial for building reliable and efficient scraping solutions. Token limits directly impact how much content you can process in a single request and determine the strategies you need to employ when scraping large web pages.

Understanding LLM Token Limits

A token is the basic unit of text that LLMs process. Tokens can be words, parts of words, or individual characters, depending on the model's tokenization algorithm. Most modern LLMs use subword tokenization, where common words are single tokens, while less common words may be split into multiple tokens.
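
You can see tokenization in action with OpenAI's tiktoken library. A minimal sketch (counts are exact only for OpenAI models; other providers use different tokenizers, so treat the numbers as estimates elsewhere):

import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5 Turbo
encoding = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
tokens = encoding.encode(text)

print(f"Token count: {len(tokens)}")
# Decode each token individually to see how the text was split
print([encoding.decode([t]) for t in tokens])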

Token Limits by Popular LLM Providers

Different LLM providers offer varying context window sizes (the maximum number of tokens that can be processed in a single request, including both input and output):

OpenAI Models:
- GPT-4 Turbo: 128,000 tokens (~300 pages of text)
- GPT-4: 8,192 tokens (standard) or 32,768 tokens (extended)
- GPT-3.5 Turbo: 16,385 tokens

Anthropic Claude:
- Claude 3 Opus/Sonnet: 200,000 tokens (~500 pages)
- Claude 3 Haiku: 200,000 tokens

Google Gemini:
- Gemini 1.5 Pro: 1,000,000 tokens (2,000,000 in the latest versions)
- Gemini 1.0 Pro: 32,768 tokens

Other Providers:
- Llama 3: 8,192 tokens
- Mistral Large: 32,000 tokens

How Token Limits Impact Web Scraping

When scraping web pages with LLMs, token limits create several challenges:

1. Large HTML Documents

Modern web pages often contain massive HTML documents with extensive JavaScript, CSS, and embedded content. A typical e-commerce product page might consume 10,000-50,000 tokens once all the HTML markup is included, making it impossible to process in a single request with smaller context windows.
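
You can check this yourself by comparing the token count of a page's raw HTML against its visible text. A quick sketch (the URL is a placeholder):

import requests
import tiktoken
from bs4 import BeautifulSoup

encoding = tiktoken.get_encoding("cl100k_base")

# Placeholder URL -- substitute a real product page
raw_html = requests.get('https://example.com/product').text
visible_text = BeautifulSoup(raw_html, 'html.parser').get_text(separator=' ', strip=True)

print(f"Raw HTML tokens: {len(encoding.encode(raw_html))}")
print(f"Visible text tokens: {len(encoding.encode(visible_text))}")

On typical pages the markup, inline scripts, and styles account for most of the tokens, which is why the preprocessing strategies below pay off.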

2. Multiple Page Scraping

When scraping multiple pages in a single session or conversation, the token count accumulates across all requests and responses, quickly exhausting the available context window.
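
One way to guard against this is to track the running total as you add messages. A minimal sketch of that bookkeeping (the 128,000 limit assumes GPT-4 Turbo, and the count ignores the few tokens of per-message overhead):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 128_000  # assumes GPT-4 Turbo; adjust for your model

messages = []
running_tokens = 0

def add_message(role, content):
    """Append a message while tracking the cumulative token count."""
    global running_tokens
    running_tokens += len(encoding.encode(content))
    if running_tokens > CONTEXT_LIMIT:
        raise ValueError(f"Context window exceeded: {running_tokens} tokens")
    messages.append({"role": role, "content": content})

add_message("system", "Extract product data from each page.")
add_message("user", "<html>...page 1...</html>")
# Every additional page adds its tokens on top of everything before it.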

3. Output Token Requirements

Remember that the token limit includes both input (the HTML you send) and output (the extracted data). You must reserve sufficient tokens for the model's response, typically 1,000-4,000 tokens depending on the complexity of the extraction task.
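
In budgeting terms, the usable input size is the context window minus whatever you reserve for output, and you can enforce the reservation with the max_tokens parameter. A sketch under those assumptions (the numbers are illustrative):

from openai import OpenAI

CONTEXT_WINDOW = 128_000   # e.g. GPT-4 Turbo
RESERVED_OUTPUT = 4_000    # tokens set aside for the model's response
max_input_tokens = CONTEXT_WINDOW - RESERVED_OUTPUT  # budget for the HTML you send

client = OpenAI()
trimmed_html = "<html>...</html>"  # content already trimmed to fit max_input_tokens

response = client.chat.completions.create(
    model="gpt-4-turbo",
    max_tokens=RESERVED_OUTPUT,  # hard cap on the response length
    messages=[
        {"role": "system", "content": "Extract the product name and price as JSON."},
        {"role": "user", "content": trimmed_html}
    ]
)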

Strategies for Handling Large Pages

Strategy 1: HTML Preprocessing and Cleanup

Before sending HTML to an LLM, strip unnecessary content to reduce token count:

from bs4 import BeautifulSoup, Comment
import requests

def clean_html_for_llm(html, target_selectors=None):
    """Clean raw HTML and keep only content relevant for LLM processing."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style elements
    for script in soup(['script', 'style', 'noscript']):
        script.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # If specific selectors are provided, keep only those elements
    if target_selectors:
        relevant_content = []
        for selector in target_selectors:
            relevant_content.extend(soup.select(selector))

        # Create a new document containing only the relevant content
        new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
        for element in relevant_content:
            new_soup.body.append(element)
        soup = new_soup

    # Collapse excessive whitespace
    return ' '.join(str(soup).split())

# Example usage
page_html = requests.get('https://example.com/product').text
cleaned_html = clean_html_for_llm(
    page_html,
    target_selectors=['.product-info', '.reviews', '.specifications']
)

Strategy 2: Convert HTML to Markdown

Converting HTML to markdown significantly reduces token count while preserving structure:

from markdownify import markdownify as md
from bs4 import BeautifulSoup
import requests

def html_to_markdown(url):
    """Convert a page's HTML to markdown for more token-efficient LLM processing."""
    response = requests.get(url)

    # Drop script/style content first so it doesn't leak into the markdown
    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    # Convert to markdown
    return md(str(soup), heading_style="ATX", bullets="-")

# Example usage with OpenAI
from openai import OpenAI

client = OpenAI()
markdown = html_to_markdown('https://example.com/article')

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Extract the main article content and metadata."},
        {"role": "user", "content": markdown}
    ]
)

extracted_data = response.choices[0].message.content

Strategy 3: Chunking Large Pages

For pages that exceed token limits even after cleaning, split the content into chunks:

import tiktoken
from openai import OpenAI

client = OpenAI()

def chunk_content_by_tokens(content, model="gpt-4", max_tokens=6000):
    """Split content into chunks based on token count."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(content)

    # Note: fixed-size windows can split an HTML element across two chunks;
    # for precise extraction, prefer splitting on element boundaries.
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))

    return chunks

def scrape_large_page_with_chunking(html_content, extraction_prompt):
    """Process a large page by chunking it and collecting per-chunk results."""
    chunks = chunk_content_by_tokens(html_content, max_tokens=6000)
    all_results = []

    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": extraction_prompt},
                {"role": "user", "content": f"Process chunk {i+1}/{len(chunks)}:\n\n{chunk}"}
            ]
        )
        all_results.append(response.choices[0].message.content)

    return all_results

# Example usage
large_html = requests.get('https://example.com/large-page').text
cleaned = clean_html_for_llm(large_html)
results = scrape_large_page_with_chunking(
    cleaned,
    "Extract all product names and prices from this HTML."
)
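
The chunked scrape returns one result per chunk, so you still need a merge step. If you prompt the model to return a JSON array per chunk (an assumption you would enforce via the extraction prompt), a minimal merge could look like this:

import json

def merge_chunk_results(all_results):
    """Flatten per-chunk JSON array responses into one list."""
    merged = []
    for result in all_results:
        try:
            merged.extend(json.loads(result))
        except json.JSONDecodeError:
            # Skip chunks where the model didn't return valid JSON
            continue
    return merged

products = merge_chunk_results(results)

Depending on the page, you may also need to deduplicate items that straddle a chunk boundary.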

Strategy 4: Two-Stage Extraction

Use a fast, lightweight tool for initial extraction, then use LLMs for refinement:

// JavaScript example using Cheerio for initial extraction
const cheerio = require('cheerio');
const axios = require('axios');
const { OpenAI } = require('openai');

async function twoStageExtraction(url) {
    // Stage 1: Extract raw data with Cheerio
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    const rawProducts = [];
    $('.product-card').each((i, elem) => {
        rawProducts.push({
            title: $(elem).find('.title').text().trim(),
            price: $(elem).find('.price').text().trim(),
            description: $(elem).find('.description').text().trim()
        });
    });

    // Stage 2: Use LLM to clean and structure data
    const openai = new OpenAI();
    const completion = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        // JSON mode guarantees syntactically valid JSON in the response
        response_format: { type: "json_object" },
        messages: [
            {
                role: "system",
                content: "Clean and normalize product data. Extract numeric prices, standardize titles. Respond with a JSON object containing a 'products' array."
            },
            {
                role: "user",
                content: JSON.stringify(rawProducts)
            }
        ]
    });

    return JSON.parse(completion.choices[0].message.content);
}

// Usage
twoStageExtraction('https://example.com/products')
    .then(cleanedProducts => console.log(cleanedProducts));

Strategy 5: Selective Element Extraction

Instead of processing entire pages, target specific elements for LLM processing. This approach is particularly effective when you know which sections contain the data you need:

from bs4 import BeautifulSoup
from openai import OpenAI
import requests
import json

client = OpenAI()

def selective_llm_extraction(url, target_selector, extraction_schema):
    """Extract only specific elements and send just those to the LLM."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract only the target elements
    target_elements = soup.select(target_selector)

    # Combine the selected elements into one HTML snippet
    combined_html = '\n'.join(str(elem) for elem in target_elements)

    # Send to LLM
    llm_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Extract data according to this schema: {json.dumps(extraction_schema)}"},
            {"role": "user", "content": combined_html}
        ]
    )

    return llm_response.choices[0].message.content

# Example usage
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "rating": "number",
            "availability": "string"
        }
    ]
}

extracted = selective_llm_extraction(
    'https://example.com/store',
    '.product-grid .product-item',
    schema
)

Monitoring Token Usage

Always monitor your token consumption to avoid exceeding limits and optimize costs:

import tiktoken

def estimate_tokens(text, model="gpt-4"):
    """Estimate the token count for a given text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def scrape_with_token_monitoring(html_content, max_input_tokens=100000):
    """Scrape with token-limit checking, cleaning and chunking as needed."""
    token_count = estimate_tokens(html_content)
    print(f"Input token count: {token_count}")

    if token_count > max_input_tokens:
        print(f"Warning: Content exceeds {max_input_tokens} tokens")
        # Apply the cleaning strategy from Strategy 1
        html_content = clean_html_for_llm(html_content)
        token_count = estimate_tokens(html_content)
        print(f"Cleaned token count: {token_count}")

    if token_count > max_input_tokens:
        # Still too large: fall back to chunking (Strategy 3)
        chunks = chunk_content_by_tokens(html_content, max_tokens=max_input_tokens)
        return process_chunks(chunks)  # application-specific handler

    return process_single_request(html_content)  # application-specific handler

Best Practices for Token-Efficient Web Scraping

  1. Pre-process HTML: Always clean HTML before sending to LLMs by removing scripts, styles, and irrelevant content
  2. Use appropriate models: Choose models with sufficient context windows for your use case
  3. Implement fallback strategies: Have chunking or cleanup strategies ready for unexpectedly large pages
  4. Cache cleaned content: Store preprocessed versions of frequently accessed pages (see the caching sketch after this list)
  5. Combine approaches: Use traditional parsing for structure extraction and LLMs for complex interpretation tasks
  6. Monitor costs: Token usage directly impacts API costs, so optimization saves money
  7. Test token counts: Always estimate tokens before sending to production
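
As a concrete illustration of point 4, here is a minimal disk-cache sketch built on the clean_html_for_llm helper from Strategy 1. The cache directory and hashing scheme are illustrative assumptions, not a prescribed layout:

import hashlib
import os
import requests

CACHE_DIR = "html_cache"  # assumed location; adjust to taste
os.makedirs(CACHE_DIR, exist_ok=True)

def get_cleaned_html(url, target_selectors=None):
    """Return cleaned HTML for a URL, reusing a cached copy when available."""
    cache_key = hashlib.sha256(url.encode()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{cache_key}.html")

    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8') as f:
            return f.read()

    raw_html = requests.get(url).text
    cleaned = clean_html_for_llm(raw_html, target_selectors)

    with open(cache_path, 'w', encoding='utf-8') as f:
        f.write(cleaned)
    return cleaned

Add an expiry check if the pages change frequently; for static reference pages, caching avoids paying both the fetch and the cleanup cost on every run.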

When to Use Specialized Web Scraping APIs

For large-scale web scraping projects where token limits become a bottleneck, consider using specialized web scraping APIs that handle JavaScript rendering and HTML extraction efficiently. These APIs can process large pages and return only the data you need, which you can then optionally enhance with LLM-based data extraction for complex fields.

By understanding and working within LLM token limits, you can build robust web scraping solutions that leverage AI capabilities while maintaining efficiency and cost-effectiveness. The key is choosing the right combination of traditional parsing techniques and LLM-powered extraction based on your specific requirements.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What%20is%20the%20main%20topic%3F&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields%5Btitle%5D=Page%20title&fields%5Bprice%5D=Product%20price&api_key=YOUR_API_KEY"
