How Do I Extract Text from HTML Using Claude AI?

Claude AI can extract clean, structured text from HTML content by leveraging its natural language understanding capabilities. Unlike traditional parsing methods that rely on selectors or regex, Claude can intelligently identify and extract relevant text while filtering out navigation, ads, and boilerplate content.

Understanding Claude's Text Extraction Approach

Claude AI processes HTML documents and extracts text based on semantic understanding rather than DOM manipulation. This makes it particularly useful for:

  • Extracting main content from articles while ignoring sidebars and footers
  • Converting HTML to clean markdown or plain text
  • Identifying and organizing hierarchical content structures
  • Handling dynamic or inconsistently structured pages

Basic Text Extraction with Claude API

Python Implementation

Here's how to extract text from HTML using Claude's API in Python:

import anthropic
import requests

# Initialize the Claude client
client = anthropic.Anthropic(api_key="your-api-key")

# Fetch HTML content
url = "https://example.com/article"
response = requests.get(url)
html_content = response.text

# Extract text using Claude
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": f"""Extract the main text content from this HTML page.
            Remove navigation, ads, footers, and other non-essential elements.
            Return only the primary article or page content.

            HTML:
            {html_content}"""
        }
    ]
)

extracted_text = message.content[0].text
print(extracted_text)

JavaScript/Node.js Implementation

For JavaScript applications, use the Anthropic SDK:

const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function extractTextFromHTML(url) {
  // Fetch HTML content
  const response = await axios.get(url);
  const htmlContent = response.data;

  // Extract text using Claude
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [
      {
        role: 'user',
        content: `Extract the main text content from this HTML page.
        Remove navigation, ads, footers, and other non-essential elements.
        Return only the primary article or page content.

        HTML:
        ${htmlContent}`
      }
    ]
  });

  return message.content[0].text;
}

// Usage
extractTextFromHTML('https://example.com/article')
  .then(text => console.log(text))
  .catch(err => console.error(err));

Advanced Text Extraction Techniques

Extracting Structured Content

Claude can extract text and organize it into structured formats:

import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

def extract_structured_text(html_content):
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract text from this HTML and structure it as JSON with the following fields:
                - title: The main page title
                - headings: Array of all section headings
                - paragraphs: Array of main content paragraphs
                - lists: Any bulleted or numbered lists

                HTML:
                {html_content}

                Return valid JSON only."""
            }
        ]
    )

    return json.loads(message.content[0].text)

# Example usage
html = """
<html>
  <body>
    <h1>Web Scraping Guide</h1>
    <p>Web scraping is the process of extracting data from websites.</p>
    <h2>Getting Started</h2>
    <p>First, choose the right tools for your project.</p>
    <ul>
      <li>Python libraries</li>
      <li>JavaScript frameworks</li>
    </ul>
  </body>
</html>
"""

structured_data = extract_structured_text(html)
print(json.dumps(structured_data, indent=2))
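
Note that json.loads will fail if Claude wraps its response in markdown code fences, which models sometimes do even when asked for "valid JSON only". A defensive parse (a minimal sketch) strips any fences before parsing:

import json
import re

def parse_json_response(response_text):
    """Parse JSON from a Claude response, tolerating markdown code fences."""
    # Remove a leading ```json / ``` fence and a trailing ``` fence if present
    cleaned = re.sub(r'^```(?:json)?\s*|\s*```$', '', response_text.strip())
    return json.loads(cleaned)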

Converting HTML to Markdown

Claude excels at converting HTML to clean markdown:

def html_to_markdown(html_content):
    client = anthropic.Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Convert this HTML to clean markdown format.
                Preserve headings, links, lists, and formatting.

                HTML:
                {html_content}"""
            }
        ]
    )

    return message.content[0].text
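
A quick usage example (the exact output depends on the model, but it will look roughly like this):

html = "<h1>Guide</h1><p>See the <a href='/docs'>docs</a>.</p>"
print(html_to_markdown(html))
# # Guide
#
# See the [docs](/docs).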

Handling Large HTML Documents

For large HTML files, you may need to preprocess the content before sending it to Claude:

import requests
import anthropic
from bs4 import BeautifulSoup

def extract_text_from_large_html(url):
    # Fetch and parse HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style", "nav", "footer"]):
        script.decompose()

    # Get simplified HTML
    simplified_html = str(soup.body) if soup.body else str(soup)

    # Use Claude for intelligent text extraction
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract the main article text from this HTML.
                Focus on the primary content and ignore remaining boilerplate.

                HTML:
                {simplified_html[:50000]}"""  # Limit content size
            }
        ]
    )

    return message.content[0].text

Best Practices for Text Extraction with Claude

1. Optimize Token Usage

Since Claude has token limits, preprocess HTML to remove obvious non-content elements:

from bs4 import BeautifulSoup, Comment

def clean_html_for_claude(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove common non-content elements
    for tag in soup(['script', 'style', 'iframe', 'noscript',
                     'svg', 'path', 'meta', 'link']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    return str(soup)
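
As a rough rule of thumb, English text averages about four characters per token, so you can estimate whether cleaned HTML will fit in the context window before making a call. This is a heuristic, not an exact count:

def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; actual tokenization varies by content."""
    return len(text) // chars_per_token

cleaned = clean_html_for_claude(raw_html)  # raw_html: the fetched page source
if estimate_tokens(cleaned) > 150_000:  # leave headroom under the context window
    cleaned = cleaned[:600_000]  # truncate to roughly 150k tokens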

2. Use Specific Prompts

Be explicit about what text you want to extract:

# Good prompt - specific and clear
prompt = """Extract the main article text from this HTML.
Include the headline, author byline, publication date if present,
and all body paragraphs. Exclude navigation menus, related articles,
comments, and advertisements."""

# Less effective - too vague
prompt = "Get the text from this HTML"

3. Handle Different Content Types

Different pages require different extraction strategies. When working with dynamic content, you might need to handle AJAX requests using Puppeteer to retrieve the fully rendered HTML before passing it to Claude.
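
If the page builds its content client-side, the HTML returned by requests will be mostly empty markup. A minimal sketch using Playwright for Python (a Puppeteer counterpart; pip install playwright, then playwright install chromium) to capture the fully rendered HTML:

from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    """Load a page in headless Chromium and return the rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for AJAX requests to settle
        html = page.content()
        browser.close()
        return html

With the rendered HTML in hand, tailor the prompt to the page type: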

def extract_by_page_type(html_content, page_type):
    prompts = {
        'article': """Extract article title, author, date, and main content.""",
        'product': """Extract product name, price, description, and specifications.""",
        'blog': """Extract blog post title, author, date, content, and tags.""",
        'forum': """Extract thread title, original post, and all replies."""
    }

    prompt = prompts.get(page_type, "Extract main text content")

    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"{prompt}\n\nHTML:\n{html_content}"
            }
        ]
    )

    return message.content[0].text
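
For example, to handle a product page:

html = fetch_rendered_html('https://example.com/product')  # or requests.get(...).text
product_text = extract_by_page_type(html, 'product')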

Combining Claude with Traditional Scraping Tools

For optimal results, combine Claude's AI capabilities with traditional scraping methods:

import requests
import anthropic
from bs4 import BeautifulSoup

def hybrid_text_extraction(url):
    # Step 1: Fetch HTML with proper headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)

    # Step 2: Use BeautifulSoup for basic cleanup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find main content area (common patterns)
    main_content = (
        soup.find('article') or
        soup.find('main') or
        soup.find(class_=['content', 'post', 'article']) or
        soup.body
    )

    # Step 3: Use Claude for intelligent text extraction
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Extract and clean the text from this content section.

                HTML:
                {str(main_content)}"""
            }
        ]
    )

    return message.content[0].text
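
Usage:

text = hybrid_text_extraction('https://example.com/article')
print(text)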

Error Handling and Rate Limiting

Implement robust error handling when extracting text:

import time
import anthropic
from anthropic import APIError, RateLimitError

def extract_with_retry(html_content, max_retries=3):
    client = anthropic.Anthropic(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=4096,
                messages=[
                    {
                        "role": "user",
                        "content": f"Extract main text from:\n{html_content}"
                    }
                ]
            )
            return message.content[0].text

        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = (attempt + 1) * 2
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

        except APIError as e:
            print(f"API error: {e}")
            if attempt < max_retries - 1:
                time.sleep(1)
            else:
                raise

    return None

Cost Optimization Strategies

To minimize API costs when extracting text from multiple pages:

  1. Cache results: Store extracted text to avoid reprocessing
  2. Batch similar pages: Use consistent prompts for similar content types
  3. Prefilter content: Remove non-content HTML before sending to Claude
  4. Use appropriate models: Claude Haiku for simple extraction, Sonnet for complex analysis

The example below implements the first strategy, caching extracted text keyed on a hash of the HTML:

import hashlib
import anthropic
from pathlib import Path

class CachedTextExtractor:
    def __init__(self, cache_dir='./cache'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.client = anthropic.Anthropic(api_key="your-api-key")

    def _get_cache_key(self, html_content):
        return hashlib.md5(html_content.encode()).hexdigest()

    def extract_text(self, html_content):
        cache_key = self._get_cache_key(html_content)
        cache_file = self.cache_dir / f"{cache_key}.txt"

        # Check cache
        if cache_file.exists():
            return cache_file.read_text()

        # Extract using Claude
        message = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": f"Extract main text:\n{html_content}"
                }
            ]
        )

        extracted_text = message.content[0].text

        # Cache result
        cache_file.write_text(extracted_text)

        return extracted_text
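
Repeated calls with identical HTML are then served from the local cache instead of hitting the API:

extractor = CachedTextExtractor()
text = extractor.extract_text(html_content)        # first call goes to the API
text_again = extractor.extract_text(html_content)  # second call reads the cache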

Conclusion

Claude AI provides a powerful, flexible approach to extracting text from HTML that goes beyond traditional parsing methods. By understanding semantic context, Claude can intelligently identify and extract relevant content while filtering out noise. When combined with preprocessing, caching, and proper error handling, using Claude AI for web scraping becomes an invaluable technique for developers working with diverse and complex web content.

For production applications requiring large-scale text extraction, consider using a dedicated web scraping API that combines traditional parsing with AI capabilities for optimal performance and reliability.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

