How Can I Optimize ChatGPT Token Usage for Web Scraping?
When using ChatGPT or other Large Language Models (LLMs) for web scraping, token consumption directly impacts your costs and API rate limits. Since the OpenAI API charges based on the number of tokens processed (both input and output), optimizing token usage is crucial for building cost-effective scraping solutions.
This guide covers practical techniques to minimize token consumption while maintaining extraction accuracy when using ChatGPT for web scraping tasks.
Understanding Token Costs in Web Scraping
ChatGPT pricing is based on tokens, where approximately 4 characters equal 1 token in English text. For web scraping:
- GPT-4 Turbo: ~$10 per 1M input tokens, ~$30 per 1M output tokens
- GPT-3.5 Turbo: ~$0.50 per 1M input tokens, ~$1.50 per 1M output tokens
A typical webpage HTML can contain 50,000-200,000 characters (12,500-50,000 tokens), making raw HTML extraction extremely expensive. For example, scraping 1,000 pages with 25,000 tokens each using GPT-4 would cost ~$250 just for input tokens.
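Before optimizing, it helps to measure where you stand. Here's a minimal sketch using OpenAI's `tiktoken` tokenizer library (an extra dependency not used elsewhere in this guide; `raw_html` is a placeholder for whatever page content you plan to send, and the default rate is the GPT-3.5 Turbo input price quoted above):

```python
import tiktoken

def estimate_input_cost(text, model="gpt-3.5-turbo", price_per_1m_tokens=0.50):
    """Count tokens with the model's tokenizer and estimate input cost in USD."""
    encoding = tiktoken.encoding_for_model(model)
    token_count = len(encoding.encode(text))
    return token_count, token_count / 1_000_000 * price_per_1m_tokens

# Example usage
tokens, cost = estimate_input_cost(raw_html)  # raw_html: the page you plan to send
print(f"{tokens} tokens, ~${cost:.4f} input cost per request")
```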
1. Preprocess and Clean HTML
The most effective optimization is reducing HTML size before sending it to ChatGPT. Raw HTML contains numerous elements that are unnecessary for data extraction.
Remove Unnecessary Tags
Strip out tags that don't contain useful content:
```python
from bs4 import BeautifulSoup, Comment
import re

def clean_html_for_llm(html):
    """Remove unnecessary elements from HTML before sending to LLM"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style tags
    for tag in soup(['script', 'style', 'meta', 'link', 'noscript']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove empty tags
    for tag in soup.find_all():
        if not tag.get_text(strip=True) and not tag.find_all(['img', 'input']):
            tag.decompose()

    return str(soup)

# Example usage
html = """
<html>
<head>
<script>analytics.track()</script>
<style>.hidden { display: none; }</style>
</head>
<body>
<h1>Product Title</h1>
<p class="price">$99.99</p>
</body>
</html>
"""

cleaned_html = clean_html_for_llm(html)
print(f"Original: {len(html)} chars, Cleaned: {len(cleaned_html)} chars")
# Reduction of ~40-60% typical
```
Extract Only Relevant Sections
Instead of sending the entire page, identify and extract only the relevant section:
```python
def extract_main_content(html, selector='main, article, .content'):
    """Extract only the main content area"""
    soup = BeautifulSoup(html, 'html.parser')

    # Try to find main content container
    main_content = soup.select_one(selector)
    if main_content:
        return str(main_content)

    # Fallback: remove header, footer, nav
    for tag in soup(['header', 'footer', 'nav', 'aside']):
        tag.decompose()

    return str(soup.body) if soup.body else str(soup)
```
2. Convert HTML to Simplified Text
Converting HTML to clean text dramatically reduces token count while preserving semantic content:
```python
def html_to_clean_text(html):
    """Convert HTML to clean, structured text"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unwanted elements
    for tag in soup(['script', 'style', 'meta', 'link']):
        tag.decompose()

    # Get text and clean whitespace
    text = soup.get_text(separator='\n', strip=True)

    # Remove excessive newlines
    text = re.sub(r'\n\s*\n', '\n\n', text)
    return text
```

The JavaScript equivalent using Cheerio:

```javascript
const cheerio = require('cheerio');

function htmlToCleanText(html) {
    const $ = cheerio.load(html);

    // Remove unwanted elements
    $('script, style, meta, link').remove();

    // Get text content
    let text = $('body').text();

    // Collapse runs of spaces and tabs, then excess blank lines
    text = text.replace(/[ \t]+/g, ' ');
    text = text.replace(/\n\s*\n/g, '\n\n');
    return text.trim();
}
```
3. Use Structured Markdown Format
Converting HTML to Markdown provides a compact, structured format that LLMs understand well:
```python
import html2text

def html_to_markdown(html):
    """Convert HTML to Markdown for efficient LLM processing"""
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_emphasis = False
    h.body_width = 0  # Don't wrap lines

    markdown = h.handle(html)

    # Further cleanup
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)
    return markdown.strip()

# Example
html = "<h1>Title</h1><p>Price: <strong>$99</strong></p>"
markdown = html_to_markdown(html)
# Output: "# Title\n\nPrice: **$99**"
# ~60% token reduction compared to HTML
```
4. Optimize Prompts for Efficiency
Craft concise, specific prompts that guide the model without unnecessary verbosity:
```python
# ❌ Inefficient prompt (verbose)
inefficient_prompt = """
I need you to carefully read through the following HTML content and extract
information about the product. Please look for the product name, the price,
and the description. Make sure to format your response as JSON with fields
for 'name', 'price', and 'description'. Thank you!

HTML:
{html}
"""

# ✅ Optimized prompt (concise)
optimized_prompt = """Extract product data as JSON:
- name: product title
- price: numeric price
- description: product description

{html}"""

# Tokens saved: ~50-70%
```
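The exact savings depend on the tokenizer, so it's worth measuring your own templates rather than guessing. A quick check of the two prompts above with `tiktoken`:

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
for label, template in [("inefficient", inefficient_prompt), ("optimized", optimized_prompt)]:
    # The {html} placeholder is identical in both, so the difference
    # is pure instruction overhead paid on every request
    print(label, len(encoding.encode(template)))
```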
5. Use Smaller Context Windows with Chunking
For large pages, extract data in chunks rather than processing everything at once:
```python
import openai

def chunk_content(text, max_tokens=2000):
    """Split content into chunks for processing"""
    words = text.split()
    chunks = []
    current_chunk = []
    current_tokens = 0

    for word in words:
        # Rough estimate: ~4 characters per token
        word_tokens = len(word) // 4 + 1
        if current_tokens + word_tokens > max_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_tokens = word_tokens
        else:
            current_chunk.append(word)
            current_tokens += word_tokens

    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Process chunks separately
chunks = chunk_content(large_html_text)  # large_html_text: preprocessed page text
results = []
for chunk in chunks:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Extract product names from text."},
            {"role": "user", "content": chunk}
        ]
    )
    results.append(response.choices[0].message.content)
```
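Because each chunk is processed independently, you'll usually need a merge step afterward. A minimal sketch, assuming each response lists one product name per line (the actual format depends on your prompt):

```python
def merge_chunk_results(results):
    """Combine per-chunk outputs, dropping duplicates while preserving order."""
    seen = set()
    merged = []
    for chunk_output in results:
        for line in chunk_output.splitlines():
            name = line.strip().lstrip('-* ').strip()
            if name and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

product_names = merge_chunk_results(results)
```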
6. Leverage Function Calling for Structured Output
OpenAI's function calling reduces output tokens by enforcing structured responses:
```python
import openai
import json

def extract_with_functions(html_text):
    """Use function calling for token-efficient extraction"""
    functions = [
        {
            "name": "save_product_data",
            "description": "Save extracted product information",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Product name"},
                    "price": {"type": "number", "description": "Price in USD"},
                    "rating": {"type": "number", "description": "Rating out of 5"},
                    "in_stock": {"type": "boolean"}
                },
                "required": ["name", "price"]
            }
        }
    ]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": f"Extract product data:\n\n{html_text}"}
        ],
        functions=functions,
        function_call={"name": "save_product_data"}
    )

    # Parse the function call arguments
    function_args = json.loads(
        response.choices[0].message.function_call.arguments
    )
    return function_args
```
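Because the model is forced to call the function, the completion contains only the JSON arguments, with no conversational framing to pay for. A quick usage sketch (assuming the page text was already cleaned with `html_to_clean_text` from section 2; `product_page_html` is a placeholder for your fetched HTML):

```python
cleaned = html_to_clean_text(product_page_html)  # product_page_html: your fetched HTML
product = extract_with_functions(cleaned)
print(product)  # e.g. {'name': '...', 'price': 99.99, 'in_stock': True}
```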
7. Cache Common Processing Results
Implement caching to avoid reprocessing similar content:
```python
import hashlib
import json

class LLMCache:
    def __init__(self, cache_file='llm_cache.json'):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self):
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get_cache_key(self, content, prompt):
        """Generate hash key for content + prompt"""
        combined = f"{content}{prompt}"
        return hashlib.md5(combined.encode()).hexdigest()

    def get(self, content, prompt):
        """Retrieve cached result"""
        key = self.get_cache_key(content, prompt)
        return self.cache.get(key)

    def set(self, content, prompt, result):
        """Cache result"""
        key = self.get_cache_key(content, prompt)
        self.cache[key] = result
        self._save_cache()

# Usage
cache = LLMCache()
cached_result = cache.get(html_content, prompt)
if cached_result:
    result = cached_result
else:
    result = call_chatgpt(html_content, prompt)  # call_chatgpt: your own API wrapper
    cache.set(html_content, prompt, result)
```
8. Use GPT-3.5-Turbo for Simple Extractions
At the rates above, GPT-3.5 Turbo is roughly 20x cheaper per token than GPT-4 Turbo, and for straightforward data extraction it is often sufficient:
```python
def choose_model_by_complexity(html_content):
    """Select appropriate model based on complexity"""
    # Simple patterns: use GPT-3.5-Turbo
    if is_simple_extraction(html_content):
        return "gpt-3.5-turbo"
    # Complex or ambiguous: use GPT-4
    return "gpt-4-turbo-preview"

def is_simple_extraction(html):
    """Determine if extraction is straightforward"""
    # Heuristics: short content, clear structure
    soup = BeautifulSoup(html, 'html.parser')

    # Check for clear product schema
    if soup.find(attrs={"itemtype": "http://schema.org/Product"}):
        return True

    # Simple if under ~1,000 tokens (~4,000 characters)
    if len(html) < 4000:
        return True

    return False
```
9. Batch Multiple Extractions
Process multiple similar pages in a single API call when possible:
```python
def batch_extract_products(html_pages):
    """Extract data from multiple pages in one call"""
    # Combine multiple pages into one prompt
    combined_content = ""
    for i, page in enumerate(html_pages):
        cleaned = html_to_markdown(page)
        combined_content += f"\n\n--- PAGE {i+1} ---\n{cleaned}"

    prompt = f"""Extract product data from each page below.
Return JSON array with one object per page.

{combined_content}"""

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.choices[0].message.content)

# Process 5 similar pages in one call
results = batch_extract_products(product_pages[:5])
```
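One caveat: calling `json.loads` on the raw message content is fragile, since without function calling the model may wrap the array in prose or code fences. A defensive parsing sketch you could substitute for that last step:

```python
import json
import re

def parse_json_array(content):
    """Parse a JSON array from a model response, tolerating surrounding text."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Fall back to the first [...] span in the response
        match = re.search(r'\[.*\]', content, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```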
10. Monitor and Track Token Usage
Implement monitoring to identify optimization opportunities:
```python
class TokenTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0
        self.calls = []

    def track_call(self, response, model="gpt-3.5-turbo"):
        """Track tokens and cost for each API call"""
        usage = response.usage

        # Calculate cost (rates per 1K tokens, matching the pricing above)
        if model == "gpt-4-turbo-preview":
            cost = (usage.prompt_tokens * 0.01 +
                    usage.completion_tokens * 0.03) / 1000
        else:  # gpt-3.5-turbo
            cost = (usage.prompt_tokens * 0.0005 +
                    usage.completion_tokens * 0.0015) / 1000

        self.total_tokens += usage.total_tokens
        self.total_cost += cost
        self.calls.append({
            'tokens': usage.total_tokens,
            'cost': cost,
            'prompt_tokens': usage.prompt_tokens,
            'completion_tokens': usage.completion_tokens
        })
        return cost

    def get_stats(self):
        """Get usage statistics"""
        return {
            'total_calls': len(self.calls),
            'total_tokens': self.total_tokens,
            'total_cost': self.total_cost,
            'avg_tokens_per_call': self.total_tokens / len(self.calls) if self.calls else 0
        }

# Usage
tracker = TokenTracker()
response = openai.ChatCompletion.create(...)
tracker.track_call(response)
print(tracker.get_stats())
```
Complete Optimization Example
Here's a complete example combining multiple optimization techniques:
```python
import openai
from bs4 import BeautifulSoup
import html2text
import json

class OptimizedLLMScraper:
    def __init__(self, api_key):
        openai.api_key = api_key
        self.h2t = html2text.HTML2Text()
        self.h2t.ignore_images = True
        self.h2t.body_width = 0

    def preprocess_html(self, html):
        """Clean and minimize HTML"""
        soup = BeautifulSoup(html, 'html.parser')

        # Remove unnecessary tags
        for tag in soup(['script', 'style', 'meta', 'link', 'nav', 'footer']):
            tag.decompose()

        # Convert to markdown
        markdown = self.h2t.handle(str(soup))

        # Clean whitespace
        markdown = '\n'.join(line for line in markdown.split('\n') if line.strip())
        return markdown

    def extract_data(self, html, schema):
        """Extract structured data using optimized approach"""
        # Preprocess HTML
        content = self.preprocess_html(html)

        # Token estimation (~4 chars per token)
        estimated_tokens = len(content) // 4
        print(f"Estimated tokens: {estimated_tokens}")

        # Create function definition from schema
        function_def = {
            "name": "save_extracted_data",
            "parameters": {
                "type": "object",
                "properties": schema
            }
        }

        # Minimal prompt
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": f"Extract data:\n\n{content}"}
            ],
            functions=[function_def],
            function_call={"name": "save_extracted_data"},
            temperature=0  # Deterministic output
        )

        # Parse result
        result = json.loads(
            response.choices[0].message.function_call.arguments
        )
        print(f"Tokens used: {response.usage.total_tokens}")
        return result

# Usage
scraper = OptimizedLLMScraper("your-api-key")
schema = {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "rating": {"type": "number"}
}
data = scraper.extract_data(product_html, schema)
```
Conclusion
Optimizing ChatGPT token usage for web scraping requires a multi-faceted approach:
- Preprocess HTML to remove 40-70% of unnecessary content
- Convert to Markdown for additional 30-50% token reduction
- Use concise prompts to minimize instruction tokens
- Leverage function calling for structured, token-efficient outputs
- Choose appropriate models (GPT-3.5 vs GPT-4) based on complexity
- Implement caching to avoid redundant processing
- Monitor usage to identify further optimization opportunities
By combining these techniques, you can typically reduce token consumption by 70-90% compared to sending raw HTML, making AI-powered web scraping cost-effective at scale. For scenarios requiring dynamic page rendering before extraction, consider integrating these optimization techniques with tools for handling JavaScript-heavy pages to maximize efficiency.
Remember that the key to sustainable LLM-based web scraping is balancing extraction accuracy with token efficiency—start with aggressive optimization and relax constraints only when accuracy demands it.