# What is ScrapeGraphAI and how does it use LLMs for web scraping?
ScrapeGraphAI is an innovative open-source Python library that revolutionizes web scraping by leveraging Large Language Models (LLMs) to create intelligent, adaptive scraping pipelines. Unlike traditional web scraping tools that rely on fixed XPath or CSS selectors, ScrapeGraphAI uses AI to understand web page structures dynamically and extract data based on natural language instructions.
## What Makes ScrapeGraphAI Different?
Traditional web scraping requires developers to manually inspect web pages, identify selectors, and write code to extract specific elements. When a website changes its structure, the scraper breaks and needs manual updates. ScrapeGraphAI solves this problem by using LLMs to:
- Understand web page context and structure automatically
- Adapt to layout changes without manual intervention
- Accept natural language queries instead of CSS selectors
- Generate scraping pipelines dynamically based on your requirements (the sketch below contrasts the two approaches)
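To make the contrast concrete, here is a minimal sketch. The CSS selector and page structure are invented for illustration, and the ScrapeGraphAI half mirrors the basic usage shown later in this article:
```python
# Traditional approach: brittle, selector-based (selector is hypothetical)
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/blog").text
soup = BeautifulSoup(html, "html.parser")
# Breaks as soon as the site renames this class
titles = [h2.get_text() for h2 in soup.select("h2.post-title")]

# ScrapeGraphAI approach: describe what you want in plain English
from scrapegraphai.graphs import SmartScraperGraph

scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://example.com/blog",
    config={"llm": {"api_key": "your-api-key-here", "model": "gpt-4o-mini"}},
)
titles = scraper.run()
```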
## How ScrapeGraphAI Works
ScrapeGraphAI uses a graph-based approach where different nodes in the pipeline perform specific tasks. The library constructs these graphs automatically based on your prompt and the target webpage.
### Core Architecture
The framework consists of several key components:
- Graph Nodes: Individual processing units (fetch, parse, generate, etc.)
- LLM Integration: Connects to various LLM providers (OpenAI, Anthropic, local models)
- Scraping Pipeline: An automated flow from URL to structured data, sketched in simplified form below
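As a mental model only, not the library's actual internals, the flow resembles a chain of single-purpose stages:
```python
import re

# A toy illustration of the fetch -> parse -> generate flow.
# This is NOT ScrapeGraphAI's real implementation, just a mental model.

def fetch(url: str) -> str:
    """Fetch stage: retrieve the raw HTML (canned here to stay offline)."""
    return "<html><body><h1>Example Product</h1><p>$19.99</p></body></html>"

def parse(html: str) -> str:
    """Parse stage: reduce markup to text an LLM can reason over."""
    return re.sub(r"<[^>]+>", " ", html).strip()

def generate(text: str, prompt: str) -> dict:
    """Generate stage: in the real library this is an LLM call; stubbed here."""
    return {"prompt": prompt, "context": text}

print(generate(parse(fetch("https://example.com/product")), "Extract the price"))
```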
## Installation and Setup
First, install ScrapeGraphAI using pip:
```bash
pip install scrapegraphai
```
For specific LLM providers, you'll need additional dependencies:
```bash
# For OpenAI
pip install scrapegraphai[openai]

# For local models with Ollama
pip install scrapegraphai[ollama]

# Install Playwright for browser automation
playwright install
```
## Basic Usage Example
Here's a simple example using ScrapeGraphAI with OpenAI's GPT models:
```python
import os
from scrapegraphai.graphs import SmartScraperGraph

# Set your API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Define the configuration
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
}

# Create the scraping graph
smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles and their publication dates",
    source="https://example.com/blog",
    config=graph_config
)

# Run the scraper
result = smart_scraper.run()
print(result)
```
The output is structured data, typically a Python dictionary, containing exactly what you requested in plain English.
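The exact shape depends on the page and the model, but for the prompt above the result might look something like this (illustrative output, not guaranteed):
```python
# Hypothetical output for the prompt above; field names will vary
{
    "articles": [
        {"title": "Getting Started with Python", "publication_date": "2024-01-15"},
        {"title": "Advanced Web Scraping Tips", "publication_date": "2024-02-03"}
    ]
}
```
If your installed version supports the optional `schema` parameter (recent releases accept a pydantic model; check your version's docs), you can pin the structure down explicitly:
```python
from typing import List
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    publication_date: str

class Articles(BaseModel):
    articles: List[Article]

# Assumption: the schema parameter is available in your installed version
smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles and their publication dates",
    source="https://example.com/blog",
    config=graph_config,
    schema=Articles,
)
```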
## Advanced Scraping with Multiple Pages
ScrapeGraphAI can scrape multiple pages and aggregate the results:
```python
from scrapegraphai.graphs import SmartScraperMultiGraph

# List of URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

# Configuration
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
}

# Create the multi-page scraper
multi_scraper = SmartScraperMultiGraph(
    prompt="Find all product names, prices, and ratings",
    source=urls,
    config=graph_config
)

# Execute. Depending on the library version, the result is either a single
# merged answer covering all URLs or a per-URL list, so inspect it first.
results = multi_scraper.run()
print(results)
```
## Using Local LLMs with Ollama
One of ScrapeGraphAI's strengths is support for local LLMs, which reduces costs and improves privacy. Make sure the Ollama server is running and the models have been pulled first (e.g. `ollama pull llama3` and `ollama pull nomic-embed-text`):
```python
from scrapegraphai.graphs import SmartScraperGraph

# Configuration for a local Ollama model
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    }
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all product specifications and features",
    source="https://example.com/product",
    config=graph_config
)

result = smart_scraper.run()
print(result)
```
## JavaScript Rendering and Dynamic Content
For websites that rely heavily on JavaScript, much like handling AJAX requests with Puppeteer, ScrapeGraphAI can render pages in a headless browser:
```python
import os
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
    "headless": True,
    "browser_type": "chromium"
}

# This will use a headless browser to render JavaScript before extraction
smart_scraper = SmartScraperGraph(
    prompt="Extract all dynamically loaded product reviews",
    source="https://example.com/product-reviews",
    config=graph_config
)

result = smart_scraper.run()
```
## Search and Scrape Functionality
ScrapeGraphAI can even perform web searches and scrape the results:
```python
import os
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
}

# Perform a search and scrape the results
search_graph = SearchGraph(
    prompt="Find the top 5 Python web scraping libraries and their GitHub stars",
    config=graph_config
)

result = search_graph.run()
print(result)
```
## Custom Graph Pipelines
For advanced use cases, you can compose your own scraping graphs. The exact node and base-class APIs vary between scrapegraphai versions, so treat the following as a structural sketch rather than copy-paste code:
```python
from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode

class CustomScraperGraph(BaseGraph):
    def __init__(self, prompt, source, config):
        # Note: the base-class constructor signature differs across
        # scrapegraphai versions; check the version you have installed
        super().__init__(prompt, config, source)
        self.input_key = "url"

        # Define the custom pipeline: fetch the page, parse it,
        # then have the LLM generate the answer
        fetch_node = FetchNode(
            input="url",
            output=["document"]
        )
        parse_node = ParseNode(
            input="document",
            output=["parsed_document"]
        )
        generate_node = GenerateAnswerNode(
            input="parsed_document",
            output=["answer"]
        )

        # Add the nodes to the graph in execution order
        self.append_node(fetch_node)
        self.append_node(parse_node)
        self.append_node(generate_node)
```
## Handling Errors and Retries
When working with LLM-based scraping, it's important to handle failures gracefully, much like handling errors in Puppeteer:
```python
import time
from scrapegraphai.graphs import SmartScraperGraph

def scrape_with_retry(url, prompt, config, max_retries=3):
    """Scrape with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            scraper = SmartScraperGraph(
                prompt=prompt,
                source=url,
                config=config
            )
            return scraper.run()
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
            else:
                raise

# Usage (graph_config as defined in the earlier examples)
result = scrape_with_retry(
    url="https://example.com",
    prompt="Extract all product information",
    config=graph_config
)
```
## Cost Optimization Strategies
Since ScrapeGraphAI relies on LLM API calls, costs can add up. Here are some optimization strategies:
### 1. Use Smaller Models
```python
# Instead of GPT-4
graph_config = {
    "llm": {
        "model": "gpt-4o-mini",  # Much cheaper than gpt-4
    }
}
```
### 2. Use Local Models
Running models locally via Ollama eliminates API costs entirely.
### 3. Cache Results
```python
graph_config = {
    "llm": {
        "model": "gpt-4o-mini",
    },
    "cache_path": "./scraping_cache"
}
```
### 4. Limit Context Size
Be specific in your prompts to reduce token usage:
```python
# Instead of: "Extract everything from this page"
# Use: "Extract only product name, price, and SKU"
```
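To see why these levers matter, here is a back-of-the-envelope cost estimator. The per-token prices below are placeholders, so substitute your provider's current pricing:
```python
# Rough cost model for a scraping job. All prices are PLACEHOLDERS;
# check your provider's current price list before relying on this.

def estimate_cost(pages: int, tokens_per_page: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate LLM cost in dollars for a batch of pages."""
    input_cost = pages * tokens_per_page * price_in_per_1k / 1000
    output_cost = pages * output_tokens * price_out_per_1k / 1000
    return input_cost + output_cost

# Example: 1,000 pages at ~4k input tokens and ~300 output tokens each,
# with hypothetical small-model pricing -> about $0.78 total
print(f"${estimate_cost(1000, 4000, 300, 0.00015, 0.0006):.2f}")
```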
## Comparing ScrapeGraphAI to Traditional Methods

| Feature | Traditional Scraping | ScrapeGraphAI |
|---------|---------------------|---------------|
| Setup complexity | Medium to high | Low |
| Adaptability | Manual updates needed | Automatic |
| Learning curve | Requires HTML/CSS knowledge | Natural language prompts |
| Maintenance | High | Low |
| Cost | Low (compute only) | Medium (LLM API calls) |
| Accuracy | High (when working) | High |
| Speed | Fast | Slower (LLM processing) |
## When to Use ScrapeGraphAI
ScrapeGraphAI is ideal for:
- Prototypes and MVPs: Quickly build scraping solutions without deep technical knowledge
- Frequently changing websites: Sites that update their structure regularly
- Complex data extraction: When you need semantic understanding of content
- One-off scraping tasks: When building a traditional scraper isn't worth the effort
- Research projects: Exploratory data collection from various sources
## When NOT to Use ScrapeGraphAI
Consider traditional methods in these cases:
- High-volume scraping: LLM costs can become prohibitive
- Real-time requirements: LLM processing adds latency
- Simple, stable websites: Overkill when CSS selectors work fine
- Budget constraints: API costs may exceed traditional hosting costs
## Integration with Existing Workflows
ScrapeGraphAI can complement traditional tools. For example, you might use a browser automation tool like Playwright (shown below) or Puppeteer to handle authentication and navigation, then use ScrapeGraphAI for intelligent data extraction:
```python
import asyncio
from playwright.async_api import async_playwright
from scrapegraphai.graphs import SmartScraperGraph

async def scrape_authenticated_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Handle authentication
        await page.goto("https://example.com/login")
        await page.fill("#username", "user")
        await page.fill("#password", "pass")
        await page.click("#login-button")
        await page.wait_for_selector("#dashboard")

        # Get the rendered page content
        content = await page.content()
        await browser.close()

    # Now use ScrapeGraphAI to extract data, passing the rendered HTML
    # directly as the source (graph_config as defined in earlier examples)
    scraper = SmartScraperGraph(
        prompt="Extract all user activity data",
        source=content,
        config=graph_config
    )
    return scraper.run()

result = asyncio.run(scrape_authenticated_page())
```
## Conclusion
ScrapeGraphAI represents a paradigm shift in web scraping, making it accessible to developers without deep web scraping expertise while providing powerful adaptability to changing websites. By combining the flexibility of LLM-powered data extraction with traditional scraping techniques, it offers a compelling solution for modern web data collection challenges.
The library continues to evolve with support for more LLM providers, better optimization, and enhanced features. For production use cases requiring high volume or strict performance requirements, consider hybrid approaches that combine ScrapeGraphAI's intelligence with traditional scraping efficiency.