What is the Best Way to Structure Prompts for Claude AI When Scraping?

Structuring effective prompts for Claude AI is crucial for successful web scraping projects. A well-crafted prompt can dramatically improve extraction accuracy, reduce token usage, and ensure consistent, structured output. This guide covers proven strategies for prompt engineering specifically designed for web scraping tasks with Claude AI.

Core Principles of Prompt Structuring for Web Scraping

When using Claude AI for web scraping, your prompt should follow a clear three-part structure:

  1. Task Definition: Clearly state what data you need to extract
  2. Context Provision: Supply the HTML or text content to analyze
  3. Output Format: Specify exactly how results should be structured

This separation ensures Claude understands both what to do and how to return the results, minimizing ambiguity and improving accuracy.
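
For example, a minimal Python sketch of this skeleton (the build_prompt helper and field names are illustrative placeholders):

def build_prompt(html_content: str) -> str:
    # 1. Task definition: what to extract
    task = "Extract the product name and price from the HTML below."
    # 2. Context provision: the content to analyze
    context = f"HTML Content:\n{html_content}"
    # 3. Output format: how results should be structured
    output_format = 'Return ONLY valid JSON: {"name": "string", "price": number}'
    return f"{task}\n\n{context}\n\n{output_format}"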

Essential Prompt Components

1. Clear Task Instructions

Begin your prompt with explicit instructions about the extraction task:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

prompt = """Extract product information from the following HTML.

Your task:
- Extract the product name, price, availability, and description
- If a field is not found, return null
- Ensure prices are extracted as numbers without currency symbols
- Extract availability as a boolean (true if in stock, false otherwise)

HTML Content:
{html_content}

Return the data as valid JSON."""

The key here is specificity. Rather than saying "extract product data," break down exactly which fields you need and how they should be interpreted.

2. Structured Output Schema

Define a clear schema for the output. Claude works exceptionally well with JSON schemas:

const Anthropic = require('@anthropic-ai/sdk');

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const prompt = `Extract the following fields from the HTML and return ONLY valid JSON:

{
  "title": "string",
  "price": number,
  "currency": "string (ISO code)",
  "inStock": boolean,
  "rating": number,
  "reviewCount": number,
  "images": ["array of image URLs"],
  "specifications": {
    "key": "value pairs of product specs"
  }
}

HTML:
${htmlContent}

Return only the JSON object, no additional text.`;

const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    { role: "user", content: prompt }
  ]
});

This approach ensures consistent output structure across multiple scraping requests.

3. HTML Context Optimization

How you provide HTML context significantly impacts performance and cost:

Minimize HTML Size: Strip unnecessary elements before sending to Claude:

from bs4 import BeautifulSoup, Comment

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style tags
    for tag in soup(['script', 'style', 'noscript', 'svg']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove unnecessary attributes
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items()
                     if k in ['class', 'id', 'href', 'src', 'alt', 'title']}

    return str(soup)

# Use cleaned HTML in prompt
cleaned = clean_html(raw_html)
prompt = f"""Extract product details from this HTML:

{cleaned}

Return as JSON with fields: name, price, description."""

Focus on Relevant Sections: When dealing with large pages, isolate the relevant section:

def extract_main_content(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Try to find main content area
    main_content = (
        soup.find('main') or
        soup.find('article') or
        soup.find(id='content') or
        soup.find(class_='product-details')
    )

    return str(main_content) if main_content else str(soup)

focused_html = extract_main_content(raw_html)

Advanced Prompt Patterns for Web Scraping

Pattern 1: Multi-Field Extraction with Fallbacks

For complex pages where data might appear in different formats:

prompt = """Extract product information with the following fallback rules:

PRICE EXTRACTION:
1. First, look for elements with class 'price' or 'product-price'
2. If not found, search for currency symbols ($, €, £) followed by numbers
3. If multiple prices exist, extract the largest one (usually the original price)

TITLE EXTRACTION:
1. Check <h1> tags first
2. Fall back to og:title meta tag
3. Finally, use the <title> tag content

IMAGES:
1. Look for product image galleries
2. Extract all img src attributes within product containers
3. Filter out icons and thumbnails (e.g., images whose filenames or URLs indicate dimensions below 200x200px)

HTML:
{html_content}

Output Format:
{{
  "title": "string",
  "price": {{
    "current": number,
    "original": number,
    "discount_percentage": number
  }},
  "images": ["array of full-size image URLs"]
}}"""

Pattern 2: List Extraction with Pagination Context

When scraping multiple items from a listing page:

const prompt = `Extract all product listings from this search results page.

For EACH product card, extract:
- Product name
- Price
- Rating (as number from 0-5)
- Number of reviews
- Product URL
- Primary image URL

Return as an array of products. If no products found, return empty array.

Example output format:
[
  {
    "name": "Product Name",
    "price": 29.99,
    "rating": 4.5,
    "reviewCount": 128,
    "url": "/product/abc",
    "imageUrl": "https://example.com/image.jpg"
  }
]

HTML:
${listingPageHtml}

Return ONLY the JSON array.`;

Pattern 3: Conditional Extraction Based on Page Type

For scraping different types of pages with a single prompt:

prompt = """Analyze this HTML and determine the page type, then extract relevant data.

PAGE TYPES:
1. Product Page: Extract name, price, description, specifications
2. Category Page: Extract list of product links and names
3. Article Page: Extract title, author, date, content
4. Other: Return page_type as 'unknown'

Always include a 'page_type' field in your response.

HTML:
{html_content}

Output schema:
{{
  "page_type": "product|category|article|unknown",
  "data": {{
    // Type-specific fields here
  }}
}}"""

Token Optimization Strategies

Claude's API charges based on input and output tokens, so optimization is crucial for cost-effective scraping:

1. Use Markdown Instead of HTML

When possible, convert HTML to markdown before sending to Claude. This can reduce token count by 40-60%:

from markdownify import markdownify as md

# Convert HTML to markdown
markdown_content = md(html_content)

prompt = f"""Extract product information from this content:

{markdown_content}

Return as JSON: {{"name": "", "price": 0, "description": ""}}"""
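
To verify the savings on your own pages, you can compare token counts before and after conversion. A sketch using the Anthropic Python SDK's token-counting endpoint (check that messages.count_tokens is available in your installed SDK version); client is the Anthropic client created earlier:

def count_tokens(text):
    # Counts the tokens the given content would consume as a user message
    result = client.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

print(f"HTML: {count_tokens(html_content)} tokens")
print(f"Markdown: {count_tokens(markdown_content)} tokens")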

2. Implement Intelligent Chunking

For very large pages, extract data in chunks:

def chunk_and_extract(html, chunk_size=4000):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all(class_='product-item')

    results = []
    chunk = []

    for item in items:
        chunk.append(str(item))

        # Character count is a rough proxy for tokens; substitute a real
        # token counter if you need precise limits
        if len(' '.join(chunk)) > chunk_size:
            # claude_api_call is a placeholder for your Messages API wrapper
            prompt = f"Extract products from: {' '.join(chunk)}"
            response = claude_api_call(prompt)
            results.extend(response)
            chunk = []

    # Process any items left in the final partial chunk
    if chunk:
        prompt = f"Extract products from: {' '.join(chunk)}"
        response = claude_api_call(prompt)
        results.extend(response)

    return results

Handling Dynamic Content and JavaScript-Rendered Pages

When scraping pages that rely heavily on JavaScript, you'll need to obtain the fully rendered HTML first. This is where a headless browser such as Puppeteer, which can wait for AJAX requests to settle before capturing the DOM, becomes essential:

const puppeteer = require('puppeteer');
const Anthropic = require('@anthropic-ai/sdk');

async function scrapeWithClaude(url) {
  // First, render the page with Puppeteer
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  const html = await page.content();
  await browser.close();

  // Then, extract data with Claude
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
  });

  const message = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: `Extract product data from this HTML:

${html}

Return JSON: {"name": "", "price": 0, "features": []}`
      }
    ]
  });

  // Note: JSON.parse throws on malformed output; see the error-handling section below
  return JSON.parse(message.content[0].text);
}

For single-page applications, you may also want to explore techniques for crawling SPAs using Puppeteer before passing the content to Claude.

Error Handling and Validation

Always include validation instructions in your prompts:

prompt = """Extract data from the HTML below.

VALIDATION RULES:
- If a required field is missing, set it to null
- If price cannot be extracted as a valid number, return null
- If date format is ambiguous, use ISO 8601 format (YYYY-MM-DD)
- Ensure all URLs are absolute, not relative
- Validate email addresses match standard email format

If the HTML doesn't contain product information, return:
{{"error": "No product data found", "data": null}}

HTML:
{html_content}

Return valid JSON only."""

Then validate the response:

import json

def safe_extract(html_content):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt.format(html_content=html_content)}]
    )

    try:
        result = json.loads(response.content[0].text)

        # Validate required fields
        if "error" in result:
            return None

        required_fields = ["name", "price"]
        # Explicit None check so valid falsy values (e.g., a price of 0) pass
        if all(result.get(field) is not None for field in required_fields):
            return result
        else:
            return None

    except json.JSONDecodeError:
        print("Invalid JSON response from Claude")
        return None
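
When parsing does fail, one recovery pattern is a single corrective retry: return Claude's own invalid output to it and ask for valid JSON. A minimal sketch reusing the client and prompt defined above:

def extract_with_retry(html_content, max_retries=1):
    messages = [{"role": "user", "content": prompt.format(html_content=html_content)}]

    for attempt in range(max_retries + 1):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=messages
        )
        text = response.content[0].text
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Feed the invalid output back and ask for a correction
            messages.append({"role": "assistant", "content": text})
            messages.append({"role": "user",
                             "content": "That was not valid JSON. Return ONLY the corrected JSON object."})
    return None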

Best Practices Summary

  1. Be Specific: Clearly define what data to extract and in what format
  2. Provide Examples: Show Claude the expected output structure
  3. Minimize Input: Clean and reduce HTML before sending
  4. Use JSON Schema: Define exact output structure
  5. Handle Edge Cases: Include instructions for missing data
  6. Validate Output: Always parse and validate responses
  7. Optimize Tokens: Use markdown, chunk large pages, remove unnecessary HTML
  8. Test Iteratively: Refine prompts based on actual results

Complete Working Example

Here's a full implementation combining all best practices:

import anthropic
import json
from markdownify import markdownify as md
from bs4 import BeautifulSoup

class ClaudeWebScraper:
    def __init__(self, api_key):
        self.client = anthropic.Anthropic(api_key=api_key)

    def clean_html(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup(['script', 'style', 'noscript']):
            tag.decompose()
        return str(soup)

    def extract_product(self, html):
        cleaned = self.clean_html(html)
        markdown = md(cleaned)

        prompt = f"""Extract product information from the following content.

Required fields:
- name: Product name (string)
- price: Current price (number, no currency symbol)
- currency: Currency code (string, e.g., USD, EUR)
- description: Product description (string)
- inStock: Availability (boolean)
- images: Array of image URLs (array of strings)

If any field cannot be found, use null.

Content:
{markdown}

Return ONLY valid JSON matching this structure:
{{
  "name": null,
  "price": null,
  "currency": null,
  "description": null,
  "inStock": null,
  "images": []
}}"""

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        try:
            return json.loads(response.content[0].text)
        except json.JSONDecodeError:
            return None

# Usage
scraper = ClaudeWebScraper(api_key="your-api-key")
result = scraper.extract_product(html_content)
print(json.dumps(result, indent=2))

Conclusion

Structuring prompts effectively for Claude AI in web scraping requires a balance between clarity, specificity, and token efficiency. By following these patterns—clear task definition, structured output schemas, HTML optimization, and robust error handling—you can build reliable, cost-effective scraping systems that leverage Claude's powerful language understanding capabilities.

Remember to always test your prompts with various edge cases and refine them based on real-world results. The investment in prompt engineering pays dividends in extraction accuracy and reduced API costs.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
