What Are the Token Limits for ChatGPT When Scraping Data?
When using ChatGPT for web scraping, understanding token limits is crucial. Token limits determine how much HTML content you can send to the API in a single request and directly impact your scraping strategy, costs, and performance.
Understanding Tokens and Context Windows
Tokens are the basic units of text that language models process. In English, one token roughly equals:
- 4 characters of text
- 0.75 words on average
- A single word, part of a word, or punctuation mark
For example, the sentence "Web scraping is efficient" splits into roughly 4 tokens: ["Web", " scraping", " is", " efficient"]. The exact split depends on the model's tokenizer.
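The rules of thumb above can be turned into a quick estimator for cases where a tokenizer is not available. This is a minimal sketch: the function names are illustrative, and the result is only an approximation, so an exact count still needs a real tokenizer such as tiktoken.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate from the ~4 characters-per-token rule of thumb."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text: str) -> int:
    """Rough estimate from the ~0.75 words-per-token rule of thumb."""
    return math.ceil(len(text.split()) / 0.75)


sample = "Web scraping is efficient"
print("By characters:", estimate_tokens(sample))
print("By words:", estimate_tokens_from_words(sample))
```

HTML usually tokenizes less efficiently than English prose, so treat these estimates as a lower bound for markup-heavy input.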
The context window (or token limit) is the maximum number of tokens a model can process in one request, including both your input (prompt + HTML) and the model's output (extracted data).
Token Limits by ChatGPT Model
Different ChatGPT models have different token limits:
GPT-3.5-Turbo Models
| Model | Total Context Window | Max Input Tokens | Max Output Tokens | Cost per 1K Input Tokens |
|-------|---------------------|------------------|-------------------|--------------------------|
| gpt-3.5-turbo | 16,385 tokens | ~12,000 tokens | 4,096 tokens | $0.0005 |
| gpt-3.5-turbo-16k | 16,385 tokens | ~12,000 tokens | 4,096 tokens | $0.0010 |
GPT-4 Models
| Model | Total Context Window | Max Input Tokens | Max Output Tokens | Cost per 1K Input Tokens |
|-------|---------------------|------------------|-------------------|--------------------------|
| gpt-4 | 8,192 tokens | ~6,000 tokens | 2,192 tokens | $0.03 |
| gpt-4-32k | 32,768 tokens | ~28,000 tokens | 4,768 tokens | $0.06 |
| gpt-4-turbo | 128,000 tokens | ~120,000 tokens | 8,000 tokens | $0.01 |
| gpt-4o | 128,000 tokens | ~120,000 tokens | 8,000 tokens | $0.005 |
Note: The max input tokens are approximate, as you need to reserve space for the system prompt, user prompt, and output tokens.
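The arithmetic behind the "Max Input Tokens" column can be sketched as follows. The `prompt_overhead` default of 200 tokens and the helper name are assumptions for illustration; measure your own system and user prompts for a real figure.

```python
# Context window sizes taken from the tables above.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
    "gpt-4o": 128_000,
}


def max_input_tokens(context_window: int, reserved_output: int,
                     prompt_overhead: int = 200) -> int:
    """Tokens left for HTML after reserving room for the output and the prompts."""
    budget = context_window - reserved_output - prompt_overhead
    if budget <= 0:
        raise ValueError("Reserved output and prompt overhead exceed the context window")
    return budget


# Example: GPT-4 with 2,000 tokens reserved for the extracted JSON
print(max_input_tokens(CONTEXT_WINDOWS["gpt-4"], reserved_output=2_000))
```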
How HTML Size Affects Token Count
HTML pages are typically much larger in tokens than plain text due to:
- Tags and attributes: a single tag such as <div class="product-item"> takes multiple tokens
- Inline styles and scripts: CSS and JavaScript consume many tokens
- Whitespace and formatting: Indentation and newlines add tokens
- Repeated boilerplate: Headers, footers, navigation menus
Example token counts for typical web pages:
- Simple blog post (text-heavy): 2,000-5,000 tokens
- E-commerce product page: 8,000-15,000 tokens
- Full homepage with navigation: 20,000-50,000 tokens
- Complex single-page application: 50,000-100,000+ tokens
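To see how much of a page is markup rather than content, you can compare the visible text length with the total HTML length. The sketch below uses only the standard library; the class and function names are illustrative, and it measures characters rather than tokens, so it is a proxy and not an exact token ratio.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of visible text, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def markup_overhead(html: str) -> float:
    """Fraction of the page's characters that are not visible text (0.0 to 1.0)."""
    if not html:
        return 0.0
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()  # flush any text still buffered by the parser
    return 1 - counter.text_chars / len(html)


page = '<div class="product-item"><p>Hi</p><script>var x = 1;</script></div>'
print(f"Markup overhead: {markup_overhead(page):.0%}")
```

A high overhead suggests that cleaning or text conversion will pay off before the page is sent to the model.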
Calculating Tokens Before Making API Calls
To avoid errors and optimize costs, calculate token counts before sending requests to ChatGPT:
Python Example
import tiktoken
def count_tokens(text, model="gpt-4"):
    """
    Count the number of tokens in a text string
    """
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))
# Example: Count tokens in HTML content
html_content = """
<div class="product">
    <h1>Premium Coffee Beans</h1>
    <p class="price">$24.99</p>
    <p class="description">Organic, fair-trade coffee...</p>
</div>
"""
token_count = count_tokens(html_content)
print(f"HTML contains {token_count} tokens")
# Calculate if content fits in model's context window
MAX_TOKENS = 6000  # Approximate GPT-4 input budget (8,192-token window minus reserved output)
if token_count > MAX_TOKENS:
    print(f"Content exceeds limit by {token_count - MAX_TOKENS} tokens")
else:
    print(f"Content fits with {MAX_TOKENS - token_count} tokens to spare")
JavaScript Example
const { encoding_for_model } = require('tiktoken');
function countTokens(text, model = 'gpt-4') {
  const encoding = encoding_for_model(model);
  const tokens = encoding.encode(text);
  encoding.free();
  return tokens.length;
}
// Example: Estimate cost before scraping
const htmlContent = `
<article>
  <h1>Understanding Web Scraping</h1>
  <p>Web scraping is a technique...</p>
</article>
`;
const tokenCount = countTokens(htmlContent);
console.log(`HTML contains ${tokenCount} tokens`);
// Estimate API cost
const costPerToken = 0.03 / 1000; // GPT-4 pricing
const estimatedCost = tokenCount * costPerToken;
console.log(`Estimated cost: $${estimatedCost.toFixed(4)}`);
Strategies for Handling Large HTML Pages
When your HTML exceeds token limits, use these strategies:
Strategy 1: Extract Relevant Sections Only
Instead of sending the entire page, extract only the sections containing the data you need:
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
client = OpenAI()  # openai>=1.0; reads OPENAI_API_KEY from the environment
def scrape_with_section_extraction(url, css_selector, extraction_prompt):
    """
    Extract only relevant HTML sections before sending to ChatGPT
    """
    # Fetch full HTML
    page = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }, timeout=30)
    page.raise_for_status()
    html = page.text
    # Extract only the relevant section
    soup = BeautifulSoup(html, 'html.parser')
    relevant_section = soup.select_one(css_selector)
    if not relevant_section:
        raise ValueError(f"No element found matching '{css_selector}'")
    # Convert section to string and check token count
    section_html = str(relevant_section)
    token_count = count_tokens(section_html)
    print(f"Extracted section: {token_count} tokens")
    # Send to ChatGPT
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract data from HTML as JSON."},
            {"role": "user", "content": f"{extraction_prompt}\n\nHTML:\n{section_html}"}
        ],
        temperature=0
    )
    return response.choices[0].message.content
# Example: Extract only the product grid
result = scrape_with_section_extraction(
    url="https://example.com/products",
    css_selector=".product-grid",
    extraction_prompt="Extract all products with name, price, and rating"
)
Strategy 2: Strip Unnecessary HTML Elements
Remove scripts, styles, comments, and other non-essential elements:
import requests
from bs4 import BeautifulSoup, Comment
def clean_html_for_llm(html):
    """
    Remove unnecessary elements to reduce token count
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script and style elements
    for element in soup(['script', 'style', 'noscript', 'svg']):
        element.decompose()
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Remove inline styles and unnecessary attributes
    for tag in soup.find_all():
        # Keep only essential attributes
        attrs_to_keep = ['class', 'id', 'href', 'src', 'alt', 'title']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}
    # Get cleaned HTML
    cleaned = str(soup)
    # Calculate token savings
    original_tokens = count_tokens(html)
    cleaned_tokens = count_tokens(cleaned)
    savings = original_tokens - cleaned_tokens
    print(f"Reduced from {original_tokens} to {cleaned_tokens} tokens ({savings} saved)")
    return cleaned
# Example usage
html_content = requests.get("https://example.com").text
cleaned_html = clean_html_for_llm(html_content)
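Since indentation and newlines also cost tokens, collapsing whitespace is a cheap extra step after cleaning. The helper below is a hypothetical addition, not part of the function above. It assumes the page has no whitespace-sensitive content such as `<pre>` blocks, and it removes the space between adjacent inline tags, which can join words that were separated only by markup.

```python
import re


def collapse_whitespace(html: str) -> str:
    """Shrink runs of whitespace and drop whitespace between adjacent tags."""
    # Replace any run of spaces, tabs, or newlines with a single space
    collapsed = re.sub(r"\s+", " ", html)
    # Drop the space left between tags, e.g. "</h1> <p>" becomes "</h1><p>"
    collapsed = re.sub(r">\s+<", "><", collapsed)
    return collapsed.strip()


sample = """
<div>
    <h1>Title</h1>
    <p>Some   text</p>
</div>
"""
print(collapse_whitespace(sample))
```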
Strategy 3: Convert HTML to Text
For content-heavy pages, convert to plain text while preserving structure:
from bs4 import BeautifulSoup
def html_to_structured_text(html):
    """
    Convert HTML to cleaner text format, reducing tokens significantly
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Remove unwanted elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()
    # Extract text with some structure preservation
    text_parts = []
    for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'li', 'td']):
        text = element.get_text(strip=True)
        if text:
            # Add markers for structure
            if element.name in ['h1', 'h2', 'h3']:
                text_parts.append(f"\n## {text}\n")
            else:
                text_parts.append(text)
    structured_text = '\n'.join(text_parts)
    # Compare token counts
    html_tokens = count_tokens(html)
    text_tokens = count_tokens(structured_text)
    print(f"HTML: {html_tokens} tokens → Text: {text_tokens} tokens")
    print(f"Reduction: {((html_tokens - text_tokens) / html_tokens * 100):.1f}%")
    return structured_text
# Example: Convert product page to text
text_content = html_to_structured_text(html_content)
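# Use with ChatGPT (openai>=1.0 client; reads OPENAI_API_KEY from the environment)
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(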
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Extract product data as JSON."},
        {"role": "user", "content": f"Extract products from:\n{text_content}"}
    ]
)
Strategy 4: Chunking for Very Large Pages
When dealing with pages that exceed even GPT-4 Turbo's 128K limit, split the content into chunks:
def scrape_with_chunking(html_content, chunk_size=10000, extraction_prompt=""):
    """
    Process large HTML in chunks and aggregate results
    """
    import json
    import tiktoken
    from openai import OpenAI
    client = OpenAI()  # openai>=1.0; reads OPENAI_API_KEY from the environment
    # Tokenize the content so chunks can be cut at an exact token count
    encoding = tiktoken.encoding_for_model("gpt-4")
    tokens = encoding.encode(html_content)
    # Split into chunks (cutting on raw token offsets can split an HTML tag across two chunks)
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)
    print(f"Split into {len(chunks)} chunks")
    # Process each chunk
    all_results = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # JSON mode is not available on the base gpt-4 model
            messages=[
                {"role": "system", "content": "Extract data as a JSON object with an 'items' array."},
                {"role": "user", "content": f"{extraction_prompt}\n\n{chunk}"}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
        chunk_result = json.loads(response.choices[0].message.content)
        # Aggregate results (adjust based on your schema)
        if 'items' in chunk_result:
            all_results.extend(chunk_result['items'])
    return {"items": all_results, "total": len(all_results)}
# Example usage
result = scrape_with_chunking(
    html_content=large_html,
    chunk_size=8000,
    extraction_prompt="Extract all product listings"
)
print(f"Extracted {result['total']} items across all chunks")
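Cutting on raw token offsets can split an element in the middle, so a record may end up half in one chunk and half in the next. One alternative, sketched here under assumptions, is to split the page into whole blocks first (for example one string per product card) and pack those blocks greedily into chunks. The function name and the default character-based counter are illustrative; pass a tiktoken-based counter for exact budgets.

```python
from typing import Callable, Iterable, List


def pack_blocks(blocks: Iterable[str], max_tokens: int,
                count: Callable[[str], int] = lambda s: (len(s) + 3) // 4) -> List[str]:
    """Greedily group whole blocks into chunks of at most max_tokens each.

    A single block larger than max_tokens still gets its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for block in blocks:
        block_tokens = count(block)
        # Start a new chunk when adding this block would exceed the budget
        if current and current_tokens + block_tokens > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += block_tokens
    if current:
        chunks.append("\n".join(current))
    return chunks


items = ["<li>aaaa</li>"] * 5  # each item is about 4 tokens with the default counter
for chunk in pack_blocks(items, max_tokens=8):
    print(repr(chunk))
```

With BeautifulSoup, the blocks could come from something like `[str(el) for el in soup.select(".product")]`, where the selector is whatever marks one record on your page.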
Optimizing Token Usage for Cost Efficiency
Token limits directly affect costs. Here's how to optimize:
1. Choose the Right Model
def choose_optimal_model(html_content, complexity='simple'):
    """
    Select the most cost-effective model for the task
    """
    token_count = count_tokens(html_content)
    if complexity == 'simple' and token_count < 12000:
        # Use GPT-3.5-Turbo for simple extractions (60x cheaper per input token than GPT-4 at the prices above)
        return "gpt-3.5-turbo", 0.0005
    elif token_count < 6000:
        # Use standard GPT-4 for complex tasks
        return "gpt-4", 0.03
    elif token_count < 120000:
        # Use GPT-4-Turbo for large pages
        return "gpt-4-turbo", 0.01
    else:
        raise ValueError("Content too large, use chunking strategy")
model, cost_per_1k = choose_optimal_model(html_content, complexity='complex')
print(f"Using {model} at ${cost_per_1k}/1K tokens")
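Input price alone can be misleading, because output tokens are billed at a higher rate. The sketch below compares the estimated total cost per request across models, using the per-1K prices quoted in this article. The function names are illustrative, and the prices are assumptions that may be out of date.

```python
# (input, output) USD per 1K tokens, as quoted in this article
PRICES_PER_1K = {
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gpt-4": (0.03, 0.06),
    "gpt-4-turbo": (0.01, 0.03),
}


def estimate_cost(model: str, input_tokens: int, expected_output_tokens: int) -> float:
    """Estimated USD cost of one request, counting both input and output tokens."""
    input_price, output_price = PRICES_PER_1K[model]
    return (input_tokens / 1000 * input_price
            + expected_output_tokens / 1000 * output_price)


def compare_models(input_tokens: int, expected_output_tokens: int):
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = [(model, estimate_cost(model, input_tokens, expected_output_tokens))
             for model in PRICES_PER_1K]
    return sorted(costs, key=lambda pair: pair[1])


for model, cost in compare_models(input_tokens=5000, expected_output_tokens=500):
    print(f"{model}: ${cost:.4f}")
```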
2. Monitor and Log Token Usage
import logging
from datetime import datetime
class TokenUsageTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0
        self.requests = []
    def track_request(self, model, input_tokens, output_tokens):
        """Track tokens and costs for each request"""
        # Pricing per 1K tokens
        pricing = {
            'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
            'gpt-4': {'input': 0.03, 'output': 0.06},
            'gpt-4-turbo': {'input': 0.01, 'output': 0.03}
        }
        cost = (input_tokens / 1000 * pricing[model]['input'] +
                output_tokens / 1000 * pricing[model]['output'])
        self.total_tokens += (input_tokens + output_tokens)
        self.total_cost += cost
        self.requests.append({
            'timestamp': datetime.now(),
            'model': model,
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'cost': cost
        })
        logging.info(f"Request: {input_tokens + output_tokens} tokens, ${cost:.4f}")
    def get_summary(self):
        """Get usage summary"""
        return {
            'total_requests': len(self.requests),
            'total_tokens': self.total_tokens,
            'total_cost': self.total_cost,
            'average_cost_per_request': self.total_cost / len(self.requests) if self.requests else 0
        }
# Usage
tracker = TokenUsageTracker()
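from openai import OpenAI
client = OpenAI()  # openai>=1.0; reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(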
    model="gpt-4",
    messages=[{"role": "user", "content": html_content}]
)
tracker.track_request(
    model="gpt-4",
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens
)
print(tracker.get_summary())
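Tracking becomes more useful when it can stop a run before the bill grows. Below is a minimal sketch of a spending guard that could sit next to the tracker. The `TokenBudget` class, the `BudgetExceeded` exception, and the 80% warning threshold are all hypothetical choices, not part of any library.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spending past the limit."""


class TokenBudget:
    def __init__(self, limit_usd: float, warn_ratio: float = 0.8):
        self.limit_usd = limit_usd
        self.warn_ratio = warn_ratio
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a cost. Returns True once spending reaches the warning threshold.

        Raises BudgetExceeded, without recording the cost, if it would exceed the limit.
        """
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"${self.spent_usd + cost_usd:.4f} would exceed the ${self.limit_usd:.2f} limit"
            )
        self.spent_usd += cost_usd
        return self.spent_usd >= self.warn_ratio * self.limit_usd


budget = TokenBudget(limit_usd=1.00)
if budget.charge(0.85):
    print(f"Warning: ${budget.spent_usd:.2f} of ${budget.limit_usd:.2f} used")
```

Calling `charge` with an estimated cost before each request, rather than the actual cost afterwards, keeps the limit from being overshot.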
Handling Token Limit Errors
When you exceed token limits, OpenAI returns an error. Here's how to handle it gracefully:
from openai import OpenAI, BadRequestError
client = OpenAI(api_key="your-api-key")
def scrape_with_fallback(html_content, prompt, max_retries=3):
    """
    Attempt scraping with automatic fallback strategies
    """
    strategies = [
        ('full', html_content),
        ('cleaned', clean_html_for_llm(html_content)),
        ('text', html_to_structured_text(html_content)),
    ]
    for strategy_name, content in strategies:
        token_count = count_tokens(content)
        print(f"Trying {strategy_name} strategy ({token_count} tokens)...")
        try:
            response = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {"role": "system", "content": "Extract data as JSON."},
                    {"role": "user", "content": f"{prompt}\n\n{content}"}
                ],
                temperature=0
            )
            print(f"Success with {strategy_name} strategy!")
            return response.choices[0].message.content
        except BadRequestError as e:
            if "maximum context length" in str(e):
                print(f"{strategy_name} strategy failed: content too large")
                continue
            else:
                raise
    # If all strategies fail, use chunking
    print("All strategies failed, falling back to chunking...")
    return scrape_with_chunking(html_content, extraction_prompt=prompt)
# Example usage
result = scrape_with_fallback(large_html, "Extract all product data")
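Each failed attempt above still costs a network round trip, so it can help to count tokens locally and skip strategies that cannot fit. The sketch below picks the first candidate within a token budget and builds each candidate lazily. The function name is illustrative, and `count` stands in for a token counter such as `count_tokens`.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_first_fitting(candidates: Iterable[Tuple[str, Callable[[], str]]],
                       budget: int,
                       count: Callable[[str], int]) -> Optional[Tuple[str, str]]:
    """Return (name, content) for the first candidate within budget, else None.

    Each candidate is a (name, zero-argument function) pair, so later and more
    expensive transformations only run if the earlier ones do not fit.
    """
    for name, make_content in candidates:
        content = make_content()
        if count(content) <= budget:
            return name, content
    return None


# Toy example using len() as the counter; real code would pass a token counter
raw = "x" * 100
candidates = [
    ("full", lambda: raw),
    ("cleaned", lambda: raw[:40]),
    ("text", lambda: raw[:10]),
]
print(pick_first_fitting(candidates, budget=50, count=len))
```

If this returns None, no single-request strategy fits and chunking is the remaining option.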
Best Practices for Token Management
- Always calculate tokens before API calls: Use tiktoken to estimate costs and avoid errors
- Start with smaller models: Use GPT-3.5-Turbo for simple extractions to save costs
- Preprocess HTML aggressively: Remove scripts, styles, and non-essential elements
- Target specific sections: Use CSS selectors to extract only relevant HTML parts
- Monitor usage: Track token consumption and costs across your scraping operations
- Set token budgets: Implement alerts when approaching budget limits
- Cache results: Store extracted data to avoid re-processing identical pages
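The caching practice can be sketched with a content hash as the cache key, so an identical page, prompt, and model never triggers a second API call. The function names are illustrative, `extract` stands in for whatever function calls the model, and the in-memory dict is an assumption; a production scraper would more likely persist the cache to disk or a database.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(html: str, prompt: str, model: str) -> str:
    """Stable key derived from everything that influences the extraction result."""
    digest = hashlib.sha256()
    for part in (model, prompt, html):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return digest.hexdigest()


def cached_extract(html: str, prompt: str, model: str,
                   extract: Callable[[str, str, str], str],
                   cache: MutableMapping[str, str]) -> str:
    """Call extract() only when this exact input has not been seen before."""
    key = cache_key(html, prompt, model)
    if key not in cache:
        cache[key] = extract(html, prompt, model)
    return cache[key]


# Demo with a stand-in extractor that records how often it is called
calls = []

def fake_extract(html, prompt, model):
    calls.append(html)
    return '{"items": []}'

cache = {}
cached_extract("<p>a</p>", "Extract items", "gpt-4", fake_extract, cache)
cached_extract("<p>a</p>", "Extract items", "gpt-4", fake_extract, cache)
print(f"API calls made: {len(calls)}")
```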
Combining with Browser Automation
For JavaScript-heavy sites, you can combine browser automation with token-aware scraping. When you need to handle dynamic content, use headless browsers to get the rendered HTML, then apply token optimization strategies:
from playwright.sync_api import sync_playwright
def scrape_dynamic_with_token_optimization(url, selector, prompt):
    """
    Scrape dynamic content with token optimization
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(selector)
        # Extract only the relevant section
        element = page.query_selector(selector)
        html_content = element.inner_html()
        browser.close()
    # Check token count
    token_count = count_tokens(html_content)
    print(f"Extracted HTML: {token_count} tokens")
    # Clean if necessary
    if token_count > 8000:
        html_content = clean_html_for_llm(html_content)
        print(f"After cleaning: {count_tokens(html_content)} tokens")
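    # Send to ChatGPT (openai>=1.0 client; reads OPENAI_API_KEY from the environment)
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(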
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract data as JSON."},
            {"role": "user", "content": f"{prompt}\n\n{html_content}"}
        ]
    )
    return response.choices[0].message.content
Conclusion
Understanding and managing token limits is essential for successful web scraping with ChatGPT. Key takeaways:
- GPT-3.5-Turbo supports up to 16K tokens, suitable for small to medium pages
- GPT-4 ranges from 8K to 128K tokens depending on the variant
- HTML is token-heavy: Always count tokens before making API calls
- Optimization strategies: Extract relevant sections, clean HTML, convert to text, or use chunking
- Cost management: Choose the right model, monitor usage, and implement fallback strategies
By implementing these token management strategies, you can scrape large websites efficiently while keeping costs under control. Always test your token counting and optimization logic with sample pages before scaling to production.
For more advanced scraping scenarios, consider combining ChatGPT with traditional parsing tools to balance cost, speed, and flexibility based on your specific needs.