
When should I use an LLM for web scraping instead of XPath or CSS selectors?

Choosing between LLM-based web scraping and traditional XPath or CSS selectors depends on several factors, including data structure variability, maintenance overhead, and performance requirements. This guide will help you understand when each approach is most appropriate.

Understanding the Fundamental Difference

Traditional web scraping with XPath and CSS selectors relies on the structural consistency of HTML documents. You write precise selectors that target specific DOM elements based on their tags, classes, IDs, or hierarchical position.
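
For example, the same price element can be targeted with either a CSS selector or an XPath expression (a minimal sketch with made-up markup and class names):

# Targeting the same (hypothetical) price element with a CSS selector and XPath
from bs4 import BeautifulSoup
from lxml import html

snippet = '<div class="product-card"><span class="price">$19.99</span></div>'

# CSS selector via BeautifulSoup
soup = BeautifulSoup(snippet, 'html.parser')
css_price = soup.select_one('div.product-card > span.price').get_text()

# XPath via lxml
tree = html.fromstring(snippet)
xpath_price = tree.xpath('//div[@class="product-card"]/span[@class="price"]/text()')[0]

print(css_price, xpath_price)  # $19.99 $19.99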

LLM-based web scraping, on the other hand, uses language models to understand content semantically. Instead of relying on HTML structure, LLMs interpret the page content and extract information based on meaning and context.

When to Use LLMs for Web Scraping

1. Frequently Changing Website Structures

Use LLMs when: Websites frequently redesign their layout or change their HTML structure.

Traditional selectors break when a site updates its CSS classes or restructures its DOM. LLMs can adapt to structural changes as long as the semantic content remains similar.

# Traditional approach - breaks when class names change
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.content, 'html.parser')

# This breaks if "product-price-2024" changes to "product-price-2025"
prices = soup.find_all('span', class_='product-price-2024')

# LLM approach - resilient to structural changes
from openai import OpenAI
import requests

client = OpenAI(api_key='your-api-key')
response = requests.get('https://example.com/products')

completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {
            "role": "system",
            "content": "Extract product prices from the HTML. Return as JSON array."
        },
        {
            "role": "user",
            "content": response.text
        }
    ],
    response_format={"type": "json_object"}
)

prices = completion.choices[0].message.content

2. Inconsistent Data Formatting Across Pages

Use LLMs when: Data appears in different formats across pages or sections of a website.

For example, some product listings might show prices as "$99.99", others as "99.99 USD", and some as "Ninety-nine dollars". LLMs can normalize this data during extraction.

// JavaScript example with Claude API
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function extractProductInfo(url) {
  const response = await axios.get(url);

  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Extract product name, price (normalize to USD decimal format),
                and availability from this HTML: ${response.data}`
    }]
  });

  return message.content[0].text;
}

3. Complex, Human-Readable Content

Use LLMs when: Extracting information that requires interpretation or context understanding.

Examples include:

  • Sentiment analysis from reviews
  • Categorizing products based on descriptions
  • Extracting dates mentioned in natural language ("next Tuesday", "Q1 2024")
  • Understanding relationships between entities

# Extracting complex information with function calling
import openai  # assumes OPENAI_API_KEY is set in the environment

def extract_review_insights(html_content):
    response = openai.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {
                "role": "system",
                "content": "Extract structured review data from product pages."
            },
            {
                "role": "user",
                "content": html_content
            }
        ],
        functions=[
            {
                "name": "save_review_data",
                "description": "Save extracted review information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "rating": {"type": "number"},
                        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                        "key_points": {"type": "array", "items": {"type": "string"}},
                        "mentioned_features": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": ["rating", "sentiment"]
                }
            }
        ],
        function_call={"name": "save_review_data"}
    )

    return response.choices[0].message.function_call.arguments

4. Multi-Site Scraping with Different Structures

Use LLMs when: Scraping similar data from multiple websites with completely different layouts.

Instead of maintaining separate selectors for each site, you can use a single LLM-based approach with consistent prompts.

# Scraping job listings from multiple sites with one approach
from openai import OpenAI

client = OpenAI(api_key='your-api-key')

def scrape_job_listings(url, html_content):
    prompt = """
    Extract all job listings from this page. For each job, provide:
    - Job title
    - Company name
    - Location
    - Salary (if mentioned, otherwise null)
    - Job type (full-time, part-time, contract, etc.)

    Return as a JSON array.
    """

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a web scraping assistant."},
            {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
        ],
        response_format={"type": "json_object"}
    )

    return response.choices[0].message.content

# Works across different job sites without changing code
jobs_indeed = scrape_job_listings("indeed.com", indeed_html)
jobs_linkedin = scrape_job_listings("linkedin.com", linkedin_html)
jobs_glassdoor = scrape_job_listings("glassdoor.com", glassdoor_html)

When to Use XPath and CSS Selectors

1. High-Volume, Cost-Sensitive Projects

Use selectors when: You need to scrape thousands or millions of pages and cost is a concern.

LLM API calls are significantly more expensive than parsing HTML locally. For large-scale projects, traditional selectors are more economical.

# Cost comparison example
# Scraping 1 million product pages

# Traditional approach: ~$0 (just hosting/bandwidth)
# LLM approach (GPT-4): ~$30,000+ (at $0.03 per page)
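
A rough back-of-the-envelope check makes the gap concrete (token counts and per-token prices below are assumptions that vary by model and page size):

# Back-of-the-envelope cost estimate - all figures are assumptions
pages = 1_000_000
tokens_per_page = 3_000           # assumed average HTML sent to the model per page
price_per_1k_input_tokens = 0.01  # assumed GPT-4-class input rate, USD

llm_cost = pages * (tokens_per_page / 1_000) * price_per_1k_input_tokens
print(f"Estimated LLM input cost: ${llm_cost:,.0f}")  # ~$30,000 before output tokens

# Local parsing has no per-page API fee - only bandwidth and compute.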

2. Stable, Well-Structured Websites

Use selectors when: The target website has a consistent structure and rarely changes.

For stable sites, XPath and CSS selectors provide faster, more reliable, and cheaper extraction.

// Efficient scraping with Cheerio (JavaScript)
const cheerio = require('cheerio');
const axios = require('axios');

async function scrapeProducts(url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  const products = [];

  $('.product-item').each((i, element) => {
    products.push({
      title: $(element).find('.product-title').text().trim(),
      price: $(element).find('.product-price').text().trim(),
      url: $(element).find('a').attr('href')
    });
  });

  return products;
}

3. Performance-Critical Applications

Use selectors when: Speed is crucial and you need real-time or near-real-time data.

LLM APIs introduce latency (typically 1-5 seconds per request), while local parsing with selectors completes in milliseconds.
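
A quick local benchmark illustrates the gap (the HTML below is synthetic and the timings are illustrative):

# Minimal timing sketch: local XPath parsing vs. a typical LLM round trip
# (synthetic HTML; actual timings vary by machine)
import time
from lxml import html

sample_html = "<html><body>" + "<p class='item'>value</p>" * 1000 + "</body></html>"

start = time.perf_counter()
tree = html.fromstring(sample_html)
values = tree.xpath("//p[@class='item']/text()")
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Parsed {len(values)} elements locally in {elapsed_ms:.1f} ms")
# The same page sent through an LLM API would typically add 1-5 seconds.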

4. Simple, Predictable Data Extraction

Use selectors when: The data you need is clearly structured and doesn't require interpretation.

For straightforward tasks like extracting all links, images, or clearly marked prices, selectors are simpler and more efficient.

# Simple XPath extraction with lxml
from lxml import html
import requests

response = requests.get('https://example.com/articles')
tree = html.fromstring(response.content)

# Extract all article titles
titles = tree.xpath('//article/h2/text()')

# Extract all article URLs
urls = tree.xpath('//article/a/@href')

Hybrid Approach: Best of Both Worlds

Often, the optimal solution combines both methods:

  1. Use traditional selectors for structural navigation and initial extraction
  2. Use LLMs for complex data interpretation and normalization

# Hybrid approach example
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

def hybrid_product_scraping(url):
    # Step 1: Use traditional scraping for structure
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract product blocks efficiently with selectors
    product_blocks = soup.find_all('div', class_='product-card')

    # Step 2: Use LLM for complex interpretation
    client = OpenAI(api_key='your-api-key')
    products = []

    for block in product_blocks:
        # Extract only the relevant HTML block
        block_html = str(block)

        # Use LLM to interpret complex data
        completion = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {
                    "role": "user",
                    "content": f"Extract product details (name, price in USD, features list, availability status) from: {block_html}"
                }
            ],
            response_format={"type": "json_object"}
        )

        products.append(completion.choices[0].message.content)

    return products

Decision Framework

Use this quick reference to decide which approach to use:

| Factor | Use LLMs | Use Selectors |
|--------|----------|---------------|
| Site stability | Frequently changing | Stable structure |
| Data complexity | Complex, requires interpretation | Simple, structured |
| Scale | < 10,000 pages | > 10,000 pages |
| Budget | Flexible budget | Cost-sensitive |
| Maintenance | Minimize maintenance | Can maintain selectors |
| Speed requirements | Can tolerate latency | Need real-time |
| Data consistency | Inconsistent formats | Consistent formats |
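
As a toy illustration, these heuristics could be encoded as a small helper (the thresholds and factor names are illustrative, not prescriptive):

# Toy sketch of the decision table as code - thresholds are illustrative only
def suggest_approach(pages, data_is_complex, site_changes_often,
                     needs_real_time, budget_is_tight):
    if needs_real_time or budget_is_tight or pages > 10_000:
        # Cost- and speed-sensitive work favors selectors; a hybrid keeps
        # maintenance manageable if the site also changes frequently.
        return "hybrid" if site_changes_often else "selectors"
    if data_is_complex or site_changes_often:
        return "llm"
    return "selectors"

print(suggest_approach(pages=500, data_is_complex=True,
                       site_changes_often=True, needs_real_time=False,
                       budget_is_tight=False))  # -> "llm"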

Conclusion

AI-powered web scraping with LLMs excels when dealing with changing structures, complex content, or multi-site extraction where maintenance overhead would be high. Traditional XPath and CSS selectors remain superior for high-volume, cost-sensitive, and performance-critical applications on stable websites.

For many real-world projects, a hybrid approach leveraging both technologies provides the optimal balance of reliability, cost, and maintainability. Start with traditional selectors for structural navigation, then apply LLMs where semantic understanding adds value.

Understanding how LLMs work for data extraction will help you make informed decisions about when to invest in AI-powered scraping versus traditional techniques.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
