How Do I Extract Text from HTML Using Claude AI?
Claude AI can extract clean, structured text from HTML content by leveraging its natural language understanding capabilities. Unlike traditional parsing methods that rely on selectors or regex, Claude can intelligently identify and extract relevant text while filtering out navigation, ads, and boilerplate content.
Understanding Claude's Text Extraction Approach
Claude AI processes HTML documents and extracts text based on semantic understanding rather than DOM manipulation. This makes it particularly useful for:
- Extracting main content from articles while ignoring sidebars and footers
- Converting HTML to clean markdown or plain text
- Identifying and organizing hierarchical content structures
- Handling dynamic or inconsistently structured pages
Basic Text Extraction with Claude API
Python Implementation
Here's how to extract text from HTML using Claude's API in Python:
import anthropic
import requests

# Initialize the Claude client
client = anthropic.Anthropic(api_key="your-api-key")

# Fetch HTML content
url = "https://example.com/article"
response = requests.get(url)
html_content = response.text

# Extract text using Claude
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": f"""Extract the main text content from this HTML page.
Remove navigation, ads, footers, and other non-essential elements.
Return only the primary article or page content.

HTML:
{html_content}"""
        }
    ]
)

extracted_text = message.content[0].text
print(extracted_text)
JavaScript/Node.js Implementation
For JavaScript applications, use the Anthropic SDK:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function extractTextFromHTML(url) {
  // Fetch HTML content
  const response = await axios.get(url);
  const htmlContent = response.data;

  // Extract text using Claude
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [
      {
        role: 'user',
        content: `Extract the main text content from this HTML page.
Remove navigation, ads, footers, and other non-essential elements.
Return only the primary article or page content.

HTML:
${htmlContent}`
      }
    ]
  });

  return message.content[0].text;
}

// Usage
extractTextFromHTML('https://example.com/article')
  .then(text => console.log(text))
  .catch(err => console.error(err));
Advanced Text Extraction Techniques
Extracting Structured Content
Claude can extract text and organize it into structured formats:
import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

def extract_structured_text(html_content):
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract text from this HTML and structure it as JSON with the following fields:
- title: The main page title
- headings: Array of all section headings
- paragraphs: Array of main content paragraphs
- lists: Any bulleted or numbered lists

HTML:
{html_content}

Return valid JSON only."""
            }
        ]
    )
    return json.loads(message.content[0].text)

# Example usage
html = """
<html>
  <body>
    <h1>Web Scraping Guide</h1>
    <p>Web scraping is the process of extracting data from websites.</p>
    <h2>Getting Started</h2>
    <p>First, choose the right tools for your project.</p>
    <ul>
      <li>Python libraries</li>
      <li>JavaScript frameworks</li>
    </ul>
  </body>
</html>
"""

structured_data = extract_structured_text(html)
print(json.dumps(structured_data, indent=2))
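One caveat: json.loads will raise an error if the model wraps its reply in a markdown code fence, which can happen even when you ask for JSON only. A small defensive parser (a sketch, not an official SDK feature) handles that case:

import json

def parse_json_reply(reply_text):
    # Strip an optional ```json ... ``` wrapper before parsing
    text = reply_text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(text)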
Converting HTML to Markdown
Claude excels at converting HTML to clean markdown:
import anthropic

def html_to_markdown(html_content):
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Convert this HTML to clean markdown format.
Preserve headings, links, lists, and formatting.

HTML:
{html_content}"""
            }
        ]
    )
    return message.content[0].text
Handling Large HTML Documents
For large HTML files, you may need to preprocess the content before sending it to Claude:
import anthropic
import requests
from bs4 import BeautifulSoup

def extract_text_from_large_html(url):
    # Fetch and parse HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style", "nav", "footer"]):
        script.decompose()

    # Get simplified HTML
    simplified_html = str(soup.body) if soup.body else str(soup)

    # Use Claude for intelligent text extraction
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract the main article text from this HTML.
Focus on the primary content and ignore remaining boilerplate.

HTML:
{simplified_html[:50000]}"""  # Limit content size
            }
        ]
    )
    return message.content[0].text
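Truncating at 50,000 characters keeps the request within limits but silently drops whatever falls below the cutoff. If that is a concern, one rough alternative (a sketch of my own, which assumes fixed-offset splits are acceptable even though they can cut through a tag) is to extract each slice separately and join the results:

import anthropic

def extract_in_chunks(simplified_html, chunk_size=50000):
    client = anthropic.Anthropic(api_key="your-api-key")

    # Fixed-offset slices; a split can land mid-tag, so treat the output as a draft
    chunks = [simplified_html[i:i + chunk_size]
              for i in range(0, len(simplified_html), chunk_size)]

    parts = []
    for chunk in chunks:
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": f"Extract the main article text from this HTML fragment:\n{chunk}"
            }]
        )
        parts.append(message.content[0].text)

    return "\n\n".join(parts)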
Best Practices for Text Extraction with Claude
1. Optimize Token Usage
Claude's context window is finite and input tokens cost money, so preprocess HTML to remove obvious non-content elements before sending it:
from bs4 import BeautifulSoup, Comment

def clean_html_for_claude(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove common non-content elements
    for tag in soup(['script', 'style', 'iframe', 'noscript',
                     'svg', 'path', 'meta', 'link']):
        tag.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    return str(soup)
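A quick before-and-after length check makes the savings visible; the URL is a placeholder, and actual savings vary by page:

import requests

raw_html = requests.get("https://example.com/article").text
cleaned = clean_html_for_claude(raw_html)
print(f"Original: {len(raw_html):,} chars, cleaned: {len(cleaned):,} chars")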
2. Use Specific Prompts
Be explicit about what text you want to extract:
# Good prompt - specific and clear
prompt = """Extract the main article text from this HTML.
Include the headline, author byline, publication date if present,
and all body paragraphs. Exclude navigation menus, related articles,
comments, and advertisements."""
# Less effective - too vague
prompt = "Get the text from this HTML"
3. Handle Different Content Types
Different pages require different extraction strategies. When working with dynamic content, you might need to handle AJAX requests using Puppeteer to retrieve the fully rendered HTML before passing it to Claude, as sketched below.
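Since this article's examples are mostly Python, the sketch below uses Playwright's Python API as a comparable headless-browser option in place of Puppeteer (it assumes pip install playwright and playwright install chromium have been run):

from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    # Launch a headless browser so client-side JavaScript actually runs
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait until network activity settles so AJAX-loaded content is present
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

With the rendered HTML in hand, you can tailor the extraction prompt to the page type: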
import anthropic

def extract_by_page_type(html_content, page_type):
    prompts = {
        'article': "Extract article title, author, date, and main content.",
        'product': "Extract product name, price, description, and specifications.",
        'blog': "Extract blog post title, author, date, content, and tags.",
        'forum': "Extract thread title, original post, and all replies."
    }

    prompt = prompts.get(page_type, "Extract main text content")

    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"{prompt}\n\nHTML:\n{html_content}"
            }
        ]
    )
    return message.content[0].text
Combining Claude with Traditional Scraping Tools
For optimal results, combine Claude's AI capabilities with traditional scraping methods:
import anthropic
import requests
from bs4 import BeautifulSoup

def hybrid_text_extraction(url):
    # Step 1: Fetch HTML with proper headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)

    # Step 2: Use BeautifulSoup for basic cleanup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find main content area (common patterns)
    main_content = (
        soup.find('article') or
        soup.find('main') or
        soup.find(class_=['content', 'post', 'article']) or
        soup.body
    )

    # Step 3: Use Claude for intelligent text extraction
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract and clean the text from this content section.

HTML:
{str(main_content)}"""
            }
        ]
    )
    return message.content[0].text
Error Handling and Rate Limiting
Implement robust error handling when extracting text:
import time

import anthropic
from anthropic import APIError, RateLimitError

def extract_with_retry(html_content, max_retries=3):
    client = anthropic.Anthropic(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=4096,
                messages=[
                    {
                        "role": "user",
                        "content": f"Extract main text from:\n{html_content}"
                    }
                ]
            )
            return message.content[0].text
        except RateLimitError:
            # Catch the rate-limit case first; it subclasses APIError
            if attempt < max_retries - 1:
                wait_time = (attempt + 1) * 2
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise

    return None
Cost Optimization Strategies
To minimize API costs when extracting text from multiple pages:
- Cache results: Store extracted text to avoid reprocessing (see the class below)
- Batch similar pages: Use consistent prompts for similar content types
- Prefilter content: Remove non-content HTML before sending to Claude
- Use appropriate models: Claude Haiku for simple extraction, Sonnet for complex analysis (see the helper after the caching example)
import hashlib
from pathlib import Path

import anthropic

class CachedTextExtractor:
    def __init__(self, cache_dir='./cache'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.client = anthropic.Anthropic(api_key="your-api-key")

    def _get_cache_key(self, html_content):
        # Hash the HTML so identical pages map to the same cache file
        return hashlib.md5(html_content.encode()).hexdigest()

    def extract_text(self, html_content):
        cache_key = self._get_cache_key(html_content)
        cache_file = self.cache_dir / f"{cache_key}.txt"

        # Check cache
        if cache_file.exists():
            return cache_file.read_text()

        # Extract using Claude
        message = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": f"Extract main text:\n{html_content}"
                }
            ]
        )
        extracted_text = message.content[0].text

        # Cache result
        cache_file.write_text(extracted_text)
        return extracted_text
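For the last strategy, switching models is a one-line change. The Haiku model ID below was current when this was written; treat it as an assumption and check the model list in Anthropic's documentation before relying on it:

def pick_model(simple_task):
    # Hypothetical helper: cheaper, faster Haiku for routine extraction,
    # Sonnet when the page needs more careful analysis
    return "claude-3-5-haiku-20241022" if simple_task else "claude-3-5-sonnet-20241022"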
Conclusion
Claude AI provides a powerful, flexible approach to extracting text from HTML that goes beyond traditional parsing methods. By understanding semantic context, Claude can intelligently identify and extract relevant content while filtering out noise. When combined with preprocessing techniques and proper error handling, Claude AI for web scraping becomes an invaluable tool for developers working with diverse and complex web content.
For production applications requiring large-scale text extraction, consider using a dedicated web scraping API that combines traditional parsing with AI capabilities for optimal performance and reliability.