# What is ScrapeGraphAI and how does it use LLMs for web scraping?
ScrapeGraphAI is an innovative open-source Python library that revolutionizes web scraping by leveraging Large Language Models (LLMs) to create intelligent, adaptive scraping pipelines. Unlike traditional web scraping tools that rely on fixed XPath or CSS selectors, ScrapeGraphAI uses AI to understand web page structures dynamically and extract data based on natural language instructions.
## What Makes ScrapeGraphAI Different?
Traditional web scraping requires developers to manually inspect web pages, identify selectors, and write code to extract specific elements. When a website changes its structure, the scraper breaks and needs manual updates. ScrapeGraphAI solves this problem by using LLMs to:
- Understand web page context and structure automatically
- Adapt to layout changes without manual intervention
- Accept natural language queries instead of CSS selectors
- Generate scraping pipelines dynamically based on your requirements (the sketch below contrasts the two approaches)
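To make the contrast concrete, here is a minimal sketch. The CSS selector and page structure are invented for illustration, and the ScrapeGraphAI half mirrors the basic usage shown later in this article:
```python
# Traditional approach: brittle, selector-based (selector is hypothetical)
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/blog").text
soup = BeautifulSoup(html, "html.parser")
# Breaks as soon as the site renames this class
titles = [h2.get_text() for h2 in soup.select("h2.post-title")]

# ScrapeGraphAI approach: describe what you want in plain English
from scrapegraphai.graphs import SmartScraperGraph

scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://example.com/blog",
    config={"llm": {"api_key": "your-api-key-here", "model": "gpt-4o-mini"}},
)
titles = scraper.run()
```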
## How ScrapeGraphAI Works
ScrapeGraphAI uses a graph-based approach where different nodes in the pipeline perform specific tasks. The library constructs these graphs automatically based on your prompt and the target webpage.
### Core Architecture
The framework consists of several key components:
- Graph Nodes: Individual processing units (fetch, parse, generate, etc.)
- LLM Integration: Connects to various LLM providers (OpenAI, Anthropic, local models)
- Scraping Pipeline: An automated flow from URL to structured data, sketched in simplified form below
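As a mental model only, not the library's actual internals, the flow resembles a chain of single-purpose stages:
```python
import re

# A toy illustration of the fetch -> parse -> generate flow.
# This is NOT ScrapeGraphAI's real implementation, just a mental model.

def fetch(url: str) -> str:
    """Fetch stage: retrieve the raw HTML (canned here to stay offline)."""
    return "<html><body><h1>Example Product</h1><p>$19.99</p></body></html>"

def parse(html: str) -> str:
    """Parse stage: reduce markup to text an LLM can reason over."""
    return re.sub(r"<[^>]+>", " ", html).strip()

def generate(text: str, prompt: str) -> dict:
    """Generate stage: in the real library this is an LLM call; stubbed here."""
    return {"prompt": prompt, "context": text}

print(generate(parse(fetch("https://example.com/product")), "Extract the price"))
```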
## Installation and Setup
First, install ScrapeGraphAI using pip:
```bash
pip install scrapegraphai
```
For specific LLM providers, you'll need additional dependencies:
```bash
# For OpenAI
pip install scrapegraphai[openai]

# For local models with Ollama
pip install scrapegraphai[ollama]

# Install Playwright for browser automation
playwright install
```
## Basic Usage Example
Here's a simple example using ScrapeGraphAI with OpenAI's GPT models:
```python
import os
from scrapegraphai.graphs import SmartScraperGraph

# Set your API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Define the configuration
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
}

# Create the scraping graph
smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles and their publication dates",
    source="https://example.com/blog",
    config=graph_config
)

# Run the scraper
result = smart_scraper.run()
print(result)
```
The output is structured data, typically a Python dictionary, containing exactly what you requested in plain English.
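The exact shape depends on the page and the model, but for the prompt above the result might look something like this (illustrative output, not guaranteed):
```python
# Hypothetical output for the prompt above; field names will vary
{
    "articles": [
        {"title": "Getting Started with Python", "publication_date": "2024-01-15"},
        {"title": "Advanced Web Scraping Tips", "publication_date": "2024-02-03"}
    ]
}
```
If your installed version supports the optional `schema` parameter (recent releases accept a pydantic model; check your version's docs), you can pin the structure down explicitly:
```python
from typing import List
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    publication_date: str

class Articles(BaseModel):
    articles: List[Article]

# Assumption: the schema parameter is available in your installed version
smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles and their publication dates",
    source="https://example.com/blog",
    config=graph_config,
    schema=Articles,
)
```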
## Advanced Scraping with Multiple Pages
ScrapeGraphAI can scrape multiple pages and aggregate the results:
```python
from scrapegraphai.graphs import SmartScraperMultiGraph

# List of URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

# Configuration
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
}

# Create the multi-page scraper
multi_scraper = SmartScraperMultiGraph(
    prompt="Find all product names, prices, and ratings",
    source=urls,
    config=graph_config
)

# Execute. Depending on the library version, the result is either a single
# merged answer covering all URLs or a per-URL list, so inspect it first.
results = multi_scraper.run()
print(results)
```
## Using Local LLMs with Ollama
One of ScrapeGraphAI's strengths is support for local LLMs, which reduces costs and improves privacy. Make sure the Ollama server is running and the models have been pulled first (e.g. `ollama pull llama3` and `ollama pull nomic-embed-text`):
```python
from scrapegraphai.graphs import SmartScraperGraph

# Configuration for a local Ollama model
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    }
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all product specifications and features",
    source="https://example.com/product",
    config=graph_config
)

result = smart_scraper.run()
print(result)
```
## JavaScript Rendering and Dynamic Content
For websites that rely heavily on JavaScript, much like handling AJAX requests with Puppeteer, ScrapeGraphAI can render pages in a headless browser:
```python
import os
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
    "headless": True,
    "browser_type": "chromium"
}

# This will use a headless browser to render JavaScript before extraction
smart_scraper = SmartScraperGraph(
    prompt="Extract all dynamically loaded product reviews",
    source="https://example.com/product-reviews",
    config=graph_config
)

result = smart_scraper.run()
```
## Search and Scrape Functionality
ScrapeGraphAI can even perform web searches and scrape the results:
```python
import os
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "gpt-4o-mini",
    },
}

# Perform a search and scrape the results
search_graph = SearchGraph(
    prompt="Find the top 5 Python web scraping libraries and their GitHub stars",
    config=graph_config
)

result = search_graph.run()
print(result)
```
## Custom Graph Pipelines
For advanced use cases, you can compose your own scraping graphs. The exact node and base-class APIs vary between scrapegraphai versions, so treat the following as a structural sketch rather than copy-paste code:
```python
from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode

class CustomScraperGraph(BaseGraph):
    def __init__(self, prompt, source, config):
        # Note: the base-class constructor signature differs across
        # scrapegraphai versions; check the version you have installed
        super().__init__(prompt, config, source)
        self.input_key = "url"

        # Define the custom pipeline: fetch the page, parse it,
        # then have the LLM generate the answer
        fetch_node = FetchNode(
            input="url",
            output=["document"]
        )
        parse_node = ParseNode(
            input="document",
            output=["parsed_document"]
        )
        generate_node = GenerateAnswerNode(
            input="parsed_document",
            output=["answer"]
        )

        # Add the nodes to the graph in execution order
        self.append_node(fetch_node)
        self.append_node(parse_node)
        self.append_node(generate_node)
```
## Handling Errors and Retries
When working with LLM-based scraping, it's important to handle failures gracefully, much like handling errors in Puppeteer:
```python
import time
from scrapegraphai.graphs import SmartScraperGraph

def scrape_with_retry(url, prompt, config, max_retries=3):
    """Scrape with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            scraper = SmartScraperGraph(
                prompt=prompt,
                source=url,
                config=config
            )
            return scraper.run()
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
            else:
                raise

# Usage (graph_config as defined in the earlier examples)
result = scrape_with_retry(
    url="https://example.com",
    prompt="Extract all product information",
    config=graph_config
)
```
## Cost Optimization Strategies
Since ScrapeGraphAI relies on LLM API calls, costs can add up. Here are some optimization strategies:
### 1. Use Smaller Models
```python
# Instead of GPT-4
graph_config = {
    "llm": {
        "model": "gpt-4o-mini",  # Much cheaper than gpt-4
    }
}
```
### 2. Use Local Models
Running models locally via Ollama eliminates API costs entirely.
### 3. Cache Results
```python
graph_config = {
    "llm": {
        "model": "gpt-4o-mini",
    },
    "cache_path": "./scraping_cache"
}
```
### 4. Limit Context Size
Be specific in your prompts to reduce token usage:
```python
# Instead of: "Extract everything from this page"
# Use: "Extract only product name, price, and SKU"
```
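To see why these levers matter, here is a back-of-the-envelope cost estimator. The per-token prices below are placeholders, so substitute your provider's current pricing:
```python
# Rough cost model for a scraping job. All prices are PLACEHOLDERS;
# check your provider's current price list before relying on this.

def estimate_cost(pages: int, tokens_per_page: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate LLM cost in dollars for a batch of pages."""
    input_cost = pages * tokens_per_page * price_in_per_1k / 1000
    output_cost = pages * output_tokens * price_out_per_1k / 1000
    return input_cost + output_cost

# Example: 1,000 pages at ~4k input tokens and ~300 output tokens each,
# with hypothetical small-model pricing -> about $0.78 total
print(f"${estimate_cost(1000, 4000, 300, 0.00015, 0.0006):.2f}")
```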
## Comparing ScrapeGraphAI to Traditional Methods

| Feature | Traditional Scraping | ScrapeGraphAI |
|---------|---------------------|---------------|
| Setup complexity | Medium to high | Low |
| Adaptability | Manual updates needed | Automatic |
| Learning curve | Requires HTML/CSS knowledge | Natural language prompts |
| Maintenance | High | Low |
| Cost | Low (compute only) | Medium (LLM API calls) |
| Accuracy | High (when working) | High |
| Speed | Fast | Slower (LLM processing) |
## When to Use ScrapeGraphAI
ScrapeGraphAI is ideal for:
- Prototypes and MVPs: Quickly build scraping solutions without deep technical knowledge
- Frequently changing websites: Sites that update their structure regularly
- Complex data extraction: When you need semantic understanding of content
- One-off scraping tasks: When building a traditional scraper isn't worth the effort
- Research projects: Exploratory data collection from various sources
## When NOT to Use ScrapeGraphAI
Consider traditional methods in these cases:
- High-volume scraping: LLM costs can become prohibitive
- Real-time requirements: LLM processing adds latency
- Simple, stable websites: Overkill when CSS selectors work fine
- Budget constraints: API costs may exceed traditional hosting costs
## Integration with Existing Workflows
ScrapeGraphAI can complement traditional tools. For example, you might use a browser automation tool like Playwright (shown below) or Puppeteer to handle authentication and navigation, then use ScrapeGraphAI for intelligent data extraction:
```python
import asyncio
from playwright.async_api import async_playwright
from scrapegraphai.graphs import SmartScraperGraph

async def scrape_authenticated_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Handle authentication
        await page.goto("https://example.com/login")
        await page.fill("#username", "user")
        await page.fill("#password", "pass")
        await page.click("#login-button")
        await page.wait_for_selector("#dashboard")

        # Get the rendered page content
        content = await page.content()
        await browser.close()

    # Now use ScrapeGraphAI to extract data, passing the rendered HTML
    # directly as the source (graph_config as defined in earlier examples)
    scraper = SmartScraperGraph(
        prompt="Extract all user activity data",
        source=content,
        config=graph_config
    )
    return scraper.run()

result = asyncio.run(scrape_authenticated_page())
```
## Conclusion
ScrapeGraphAI represents a paradigm shift in web scraping, making it accessible to developers without deep web scraping expertise while providing powerful adaptability to changing websites. By combining the flexibility of LLM-powered data extraction with traditional scraping techniques, it offers a compelling solution for modern web data collection challenges.
The library continues to evolve with support for more LLM providers, better optimization, and enhanced features. For production use cases requiring high volume or strict performance requirements, consider hybrid approaches that combine ScrapeGraphAI's intelligence with traditional scraping efficiency.