What Are the Token Limits for ChatGPT When Scraping Data?

When using ChatGPT for web scraping, understanding token limits is crucial. Token limits determine how much HTML content you can send to the API in a single request and directly impact your scraping strategy, costs, and performance.

Understanding Tokens and Context Windows

Tokens are the basic units of text that language models process. In English, one token roughly equals:

  • 4 characters of text
  • 0.75 words on average
  • A single word, part of a word, or a punctuation mark

For example, the sentence "Web scraping is efficient" splits into 4 tokens: ["Web", " scraping", " is", " efficient"].
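
You can inspect this split yourself with tiktoken, the tokenizer library used in the examples later in this article. A minimal sketch (the exact pieces may differ slightly between encodings):

import tiktoken

# cl100k_base is the encoding used by the gpt-3.5-turbo and gpt-4 model families
encoding = tiktoken.get_encoding("cl100k_base")

token_ids = encoding.encode("Web scraping is efficient")
pieces = [encoding.decode([token_id]) for token_id in token_ids]

print(pieces)          # e.g. ['Web', ' scraping', ' is', ' efficient']
print(len(token_ids))  # e.g. 4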

The context window (or token limit) is the maximum number of tokens a model can process in one request, including both your input (prompt + HTML) and the model's output (extracted data).

Token Limits by ChatGPT Model

Different ChatGPT models have different token limits:

GPT-3.5-Turbo Models

| Model | Total Context Window | Max Input Tokens | Max Output Tokens | Cost per 1K Input Tokens |
|-------|---------------------|------------------|-------------------|--------------------------|
| gpt-3.5-turbo | 16,385 tokens | ~12,000 tokens | 4,096 tokens | $0.0005 |
| gpt-3.5-turbo-16k | 16,385 tokens | ~12,000 tokens | 4,096 tokens | $0.0010 |

GPT-4 Models

| Model | Total Context Window | Max Input Tokens | Max Output Tokens | Cost per 1K Input Tokens |
|-------|---------------------|------------------|-------------------|--------------------------|
| gpt-4 | 8,192 tokens | ~6,000 tokens | 2,192 tokens | $0.03 |
| gpt-4-32k | 32,768 tokens | ~28,000 tokens | 4,768 tokens | $0.06 |
| gpt-4-turbo | 128,000 tokens | ~120,000 tokens | 8,000 tokens | $0.01 |
| gpt-4o | 128,000 tokens | ~120,000 tokens | 8,000 tokens | $0.005 |

Note: The max input tokens are approximate, as you need to reserve space for the system prompt, user prompt, and output tokens.
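
To turn these limits into a concrete input budget before building a request, subtract the reserved output and some prompt overhead from the context window. A minimal sketch using the approximate figures from the tables above (adjust the numbers and the assumed 500-token prompt overhead to your own setup):

# Approximate limits taken from the tables above
MODEL_LIMITS = {
    "gpt-3.5-turbo": {"context": 16_385, "max_output": 4_096},
    "gpt-4": {"context": 8_192, "max_output": 2_192},
    "gpt-4-turbo": {"context": 128_000, "max_output": 8_000},
    "gpt-4o": {"context": 128_000, "max_output": 8_000},
}

def input_budget(model, prompt_overhead=500):
    """Tokens left for HTML after reserving output tokens and prompt overhead."""
    limits = MODEL_LIMITS[model]
    return limits["context"] - limits["max_output"] - prompt_overhead

print(input_budget("gpt-4"))        # 5500 tokens available for HTML
print(input_budget("gpt-4-turbo"))  # 119500 tokens available for HTML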

How HTML Size Affects Token Count

HTML pages are typically much larger in tokens than plain text due to:

  • Tags and attributes: <div class="product-item"> = multiple tokens
  • Inline styles and scripts: CSS and JavaScript consume many tokens
  • Whitespace and formatting: Indentation and newlines add tokens
  • Repeated boilerplate: Headers, footers, navigation menus

Example token counts for typical web pages:

  • Simple blog post (text-heavy): 2,000-5,000 tokens
  • E-commerce product page: 8,000-15,000 tokens
  • Full homepage with navigation: 20,000-50,000 tokens
  • Complex single-page application: 50,000-100,000+ tokens
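
These figures vary widely from page to page, so it helps to estimate before committing to an exact count. A rough pre-check using the ~4-characters-per-token rule from earlier (a heuristic only; the tiktoken-based counting in the next section gives exact numbers):

import requests

def estimate_tokens(text):
    """Very rough estimate: ~4 characters per token for English text and HTML."""
    return len(text) // 4

html = requests.get("https://example.com").text
print(f"Page size: {len(html)} characters, ~{estimate_tokens(html)} tokens (estimate)")

# A 200 KB page (~200,000 characters) is therefore on the order of 50,000 tokens,
# more than every context window above except the 128K variants.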

Calculating Tokens Before Making API Calls

To avoid errors and optimize costs, calculate token counts before sending requests to ChatGPT:

Python Example

import tiktoken

def count_tokens(text, model="gpt-4"):
    """
    Count the number of tokens in a text string
    """
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example: Count tokens in HTML content
html_content = """
<div class="product">
    <h1>Premium Coffee Beans</h1>
    <p class="price">$24.99</p>
    <p class="description">Organic, fair-trade coffee...</p>
</div>
"""

token_count = count_tokens(html_content)
print(f"HTML contains {token_count} tokens")

# Check whether the content fits in the model's input budget
MAX_TOKENS = 6000  # approx. input budget for GPT-4 (8,192-token context minus prompt and output)
if token_count > MAX_TOKENS:
    print(f"Content exceeds limit by {token_count - MAX_TOKENS} tokens")
else:
    print(f"Content fits with {MAX_TOKENS - token_count} tokens to spare")

JavaScript Example

const { encoding_for_model } = require('tiktoken');

function countTokens(text, model = 'gpt-4') {
  const encoding = encoding_for_model(model);
  const tokens = encoding.encode(text);
  encoding.free();
  return tokens.length;
}

// Example: Estimate cost before scraping
const htmlContent = `
<article>
  <h1>Understanding Web Scraping</h1>
  <p>Web scraping is a technique...</p>
</article>
`;

const tokenCount = countTokens(htmlContent);
console.log(`HTML contains ${tokenCount} tokens`);

// Estimate API cost
const costPerToken = 0.03 / 1000; // GPT-4 input pricing; output tokens are billed separately ($0.06/1K)
const estimatedCost = tokenCount * costPerToken;
console.log(`Estimated cost: $${estimatedCost.toFixed(4)}`);

Strategies for Handling Large HTML Pages

When your HTML exceeds token limits, use these strategies:

Strategy 1: Extract Relevant Sections Only

Instead of sending the entire page, extract only the sections containing the data you need:

from openai import OpenAI
from bs4 import BeautifulSoup
import requests

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def scrape_with_section_extraction(url, css_selector, extraction_prompt):
    """
    Extract only relevant HTML sections before sending to ChatGPT
    """
    # Fetch full HTML
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    })
    html = response.text

    # Extract only the relevant section
    soup = BeautifulSoup(html, 'html.parser')
    relevant_section = soup.select_one(css_selector)

    if not relevant_section:
        raise ValueError(f"No element found matching '{css_selector}'")

    # Convert section to string and check token count
    section_html = str(relevant_section)
    token_count = count_tokens(section_html)
    print(f"Extracted section: {token_count} tokens")

    # Send to ChatGPT
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract data from HTML as JSON."},
            {"role": "user", "content": f"{extraction_prompt}\n\nHTML:\n{section_html}"}
        ],
        temperature=0
    )

    return completion.choices[0].message.content

# Example: Extract only the product grid
result = scrape_with_section_extraction(
    url="https://example.com/products",
    css_selector=".product-grid",
    extraction_prompt="Extract all products with name, price, and rating"
)

Strategy 2: Strip Unnecessary HTML Elements

Remove scripts, styles, comments, and other non-essential elements:

import requests
from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html):
    """
    Remove unnecessary elements to reduce token count
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style elements
    for element in soup(['script', 'style', 'noscript', 'svg']):
        element.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove inline styles and unnecessary attributes
    for tag in soup.find_all():
        # Keep only essential attributes
        attrs_to_keep = ['class', 'id', 'href', 'src', 'alt', 'title']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}

    # Get cleaned HTML
    cleaned = str(soup)

    # Calculate token savings
    original_tokens = count_tokens(html)
    cleaned_tokens = count_tokens(cleaned)
    savings = original_tokens - cleaned_tokens

    print(f"Reduced from {original_tokens} to {cleaned_tokens} tokens ({savings} saved)")

    return cleaned

# Example usage
html_content = requests.get("https://example.com").text
cleaned_html = clean_html_for_llm(html_content)

Strategy 3: Convert HTML to Text

For content-heavy pages, convert to plain text while preserving structure:

from bs4 import BeautifulSoup

def html_to_structured_text(html):
    """
    Convert HTML to cleaner text format, reducing tokens significantly
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unwanted elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Extract text with some structure preservation
    text_parts = []

    for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'li', 'td']):
        text = element.get_text(strip=True)
        if text:
            # Add markers for structure
            if element.name in ['h1', 'h2', 'h3']:
                text_parts.append(f"\n## {text}\n")
            else:
                text_parts.append(text)

    structured_text = '\n'.join(text_parts)

    # Compare token counts
    html_tokens = count_tokens(html)
    text_tokens = count_tokens(structured_text)

    print(f"HTML: {html_tokens} tokens → Text: {text_tokens} tokens")
    print(f"Reduction: {((html_tokens - text_tokens) / html_tokens * 100):.1f}%")

    return structured_text

# Example: Convert product page to text
text_content = html_to_structured_text(html_content)

# Use with ChatGPT
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Extract product data as JSON."},
        {"role": "user", "content": f"Extract products from:\n{text_content}"}
    ]
)

Strategy 4: Chunking for Very Large Pages

When a page exceeds the context window of the model you're using (or even GPT-4 Turbo's 128K window), split the content into chunks and aggregate the results:

def scrape_with_chunking(html_content, chunk_size=10000, extraction_prompt=""):
    """
    Process large HTML in chunks and aggregate results
    """
    import json

    # Tokenize the content so chunks are measured in tokens, not characters
    encoding = tiktoken.encoding_for_model("gpt-4")
    tokens = encoding.encode(html_content)

    # Split into chunks (boundaries may cut an HTML element in half,
    # so items near a boundary can occasionally be missed or duplicated)
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

    print(f"Split into {len(chunks)} chunks")

    # Process each chunk
    all_results = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")

        response = client.chat.completions.create(
            model="gpt-4-turbo",  # supports JSON mode and large chunks
            messages=[
                {"role": "system", "content": "Extract data as a JSON object with an 'items' array."},
                {"role": "user", "content": f"{extraction_prompt}\n\n{chunk}"}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )

        chunk_result = json.loads(response.choices[0].message.content)

        # Aggregate results (adjust based on your schema)
        if 'items' in chunk_result:
            all_results.extend(chunk_result['items'])

    return {"items": all_results, "total": len(all_results)}

# Example usage
result = scrape_with_chunking(
    html_content=large_html,
    chunk_size=8000,
    extraction_prompt="Extract all product listings"
)
print(f"Extracted {result['total']} items across all chunks")

Optimizing Token Usage for Cost Efficiency

Token limits directly affect costs. Here's how to optimize:

1. Choose the Right Model

def choose_optimal_model(html_content, complexity='simple'):
    """
    Select the most cost-effective model for the task
    """
    token_count = count_tokens(html_content)

    if complexity == 'simple' and token_count < 12000:
        # Use GPT-3.5-Turbo for simple extractions (roughly 60x cheaper than GPT-4 per input token)
        return "gpt-3.5-turbo", 0.0005
    elif token_count < 6000:
        # Use standard GPT-4 for complex tasks
        return "gpt-4", 0.03
    elif token_count < 120000:
        # Use GPT-4-Turbo for large pages
        return "gpt-4-turbo", 0.01
    else:
        raise ValueError("Content too large, use chunking strategy")

model, cost_per_1k = choose_optimal_model(html_content, complexity='complex')
print(f"Using {model} at ${cost_per_1k}/1K tokens")

2. Monitor and Log Token Usage

import logging
from datetime import datetime

class TokenUsageTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0
        self.requests = []

    def track_request(self, model, input_tokens, output_tokens):
        """Track tokens and costs for each request"""
        # Pricing per 1K tokens
        pricing = {
            'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
            'gpt-4': {'input': 0.03, 'output': 0.06},
            'gpt-4-turbo': {'input': 0.01, 'output': 0.03}
        }

        cost = (input_tokens / 1000 * pricing[model]['input'] +
                output_tokens / 1000 * pricing[model]['output'])

        self.total_tokens += (input_tokens + output_tokens)
        self.total_cost += cost

        self.requests.append({
            'timestamp': datetime.now(),
            'model': model,
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'cost': cost
        })

        logging.info(f"Request: {input_tokens + output_tokens} tokens, ${cost:.4f}")

    def get_summary(self):
        """Get usage summary"""
        return {
            'total_requests': len(self.requests),
            'total_tokens': self.total_tokens,
            'total_cost': self.total_cost,
            'average_cost_per_request': self.total_cost / len(self.requests) if self.requests else 0
        }

# Usage
tracker = TokenUsageTracker()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": html_content}]
)

tracker.track_request(
    model="gpt-4",
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens
)

print(tracker.get_summary())

Handling Token Limit Errors

When you exceed token limits, OpenAI returns an error. Here's how to handle it gracefully:

from openai import OpenAI, BadRequestError

client = OpenAI(api_key="your-api-key")

def scrape_with_fallback(html_content, prompt):
    """
    Attempt scraping with automatic fallback strategies
    """
    strategies = [
        ('full', html_content),
        ('cleaned', clean_html_for_llm(html_content)),
        ('text', html_to_structured_text(html_content)),
    ]

    for strategy_name, content in strategies:
        token_count = count_tokens(content)
        print(f"Trying {strategy_name} strategy ({token_count} tokens)...")

        try:
            response = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {"role": "system", "content": "Extract data as JSON."},
                    {"role": "user", "content": f"{prompt}\n\n{content}"}
                ],
                temperature=0
            )

            print(f"Success with {strategy_name} strategy!")
            return response.choices[0].message.content

        except BadRequestError as e:
            if "maximum context length" in str(e):
                print(f"{strategy_name} strategy failed: content too large")
                continue
            else:
                raise

    # If all strategies fail, use chunking
    print("All strategies failed, falling back to chunking...")
    return scrape_with_chunking(html_content, extraction_prompt=prompt)

# Example usage
result = scrape_with_fallback(large_html, "Extract all product data")

Best Practices for Token Management

  1. Always calculate tokens before API calls: Use tiktoken to estimate costs and avoid errors
  2. Start with smaller models: Use GPT-3.5-Turbo for simple extractions to save costs
  3. Preprocess HTML aggressively: Remove scripts, styles, and non-essential elements
  4. Target specific sections: Use CSS selectors to extract only relevant HTML parts
  5. Monitor usage: Track token consumption and costs across your scraping operations
  6. Set token budgets: Implement alerts when approaching budget limits
  7. Cache results: Store extracted data to avoid re-processing identical pages (a sketch covering points 6 and 7 follows this list)
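
A minimal sketch of the last two practices, keeping an in-memory token budget and caching results by a hash of the HTML (the ScrapingBudget class and its methods are illustrative, not from any library):

import hashlib

class ScrapingBudget:
    """Track token spend against a budget and cache results for identical HTML."""

    def __init__(self, max_tokens=1_000_000):
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.cache = {}  # SHA-256 of the HTML -> extracted result

    def check(self, tokens_needed):
        """Raise before a request that would exceed the budget."""
        if self.used_tokens + tokens_needed > self.max_tokens:
            raise RuntimeError(
                f"Token budget exceeded: {self.used_tokens} used, "
                f"{tokens_needed} requested, limit {self.max_tokens}"
            )

    def record(self, tokens_used):
        self.used_tokens += tokens_used

    def cached_result(self, html):
        """Return a previously extracted result for identical HTML, if any."""
        return self.cache.get(hashlib.sha256(html.encode()).hexdigest())

    def store_result(self, html, result):
        self.cache[hashlib.sha256(html.encode()).hexdigest()] = result

# Usage: check the budget before each request, record actual usage afterwards
budget = ScrapingBudget(max_tokens=500_000)
budget.check(count_tokens(html_content))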

Combining with Browser Automation

For JavaScript-heavy sites, combine browser automation with token-aware scraping: use a headless browser to fetch the rendered HTML, then apply the token optimization strategies above:

from playwright.sync_api import sync_playwright

def scrape_dynamic_with_token_optimization(url, selector, prompt):
    """
    Scrape dynamic content with token optimization
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(selector)

        # Extract only the relevant section
        element = page.query_selector(selector)
        html_content = element.inner_html()
        browser.close()

    # Check token count
    token_count = count_tokens(html_content)
    print(f"Extracted HTML: {token_count} tokens")

    # Clean if necessary
    if token_count > 8000:
        html_content = clean_html_for_llm(html_content)
        print(f"After cleaning: {count_tokens(html_content)} tokens")

    # Send to ChatGPT
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract data as JSON."},
            {"role": "user", "content": f"{prompt}\n\n{html_content}"}
        ]
    )

    return response.choices[0].message.content

Conclusion

Understanding and managing token limits is essential for successful web scraping with ChatGPT. Key takeaways:

  • GPT-3.5-Turbo supports up to 16K tokens, suitable for small to medium pages
  • GPT-4 ranges from 8K to 128K tokens depending on the variant
  • HTML is token-heavy: Always count tokens before making API calls
  • Optimization strategies: Extract relevant sections, clean HTML, convert to text, or use chunking
  • Cost management: Choose the right model, monitor usage, and implement fallback strategies

By implementing these token management strategies, you can scrape large websites efficiently while keeping costs under control. Always test your token counting and optimization logic with sample pages before scaling to production.

For more advanced scraping scenarios, consider combining ChatGPT with traditional parsing tools to balance cost, speed, and flexibility based on your specific needs.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
