How do I use Crawlee with Python for web scraping?
Crawlee for Python is a powerful web scraping and browser automation library that helps developers build reliable crawlers. It provides automatic scaling, proxy rotation, storage management, and handles common crawling challenges like retries and rate limiting. This guide shows you how to get started with Crawlee for Python and build production-ready web scrapers.
What is Crawlee for Python?
Crawlee for Python is the Python implementation of the popular Crawlee framework, originally developed for Node.js. It provides a robust toolkit for web scraping that includes:
- Automatic retries and error handling
- Request queue management
- Proxy rotation and session management
- Data storage and export
- Browser automation support via Playwright
- HTTP client crawling for faster static page scraping
- Crawling context for managing state between requests
Installation and Setup
Installing Crawlee for Python
To install Crawlee for Python, you need Python 3.9 or higher. Install the base package using pip:
pip install crawlee
For browser automation with Playwright:
pip install 'crawlee[playwright]'
playwright install
For BeautifulSoup parsing support:
pip install 'crawlee[beautifulsoup]'
To install all optional extras at once:
pip install 'crawlee[all]'
The plain HTTP crawler used later in this guide works with the base package.
Setting Up Your First Crawler
Create a new Python file for your crawler. Here's a basic example:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        headless=True,
        browser_type='chromium'
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Extract data from the page
        title = await context.page.title()
        url = context.request.url

        # Save the extracted data
        await context.push_data({
            'url': url,
            'title': title
        })
        print(f'Scraped: {title} from {url}')

    # Start crawling
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
Building a Web Scraper with Crawlee
HTTP Client Crawler (Faster for Static Pages)
For static websites that don't require JavaScript rendering, use the HTTP client crawler for better performance:
import asyncio
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(
        max_requests_per_crawl=100,
        max_request_retries=3
    )

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        # Parse HTML with BeautifulSoup
        soup = BeautifulSoup(context.http_response.read(), 'html.parser')

        # Extract data
        title = soup.find('h1').text if soup.find('h1') else 'No title'
        paragraphs = [p.text for p in soup.find_all('p')]

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'paragraphs': paragraphs
        })

        # Enqueue new URLs (resolve relative links against the current URL)
        links = soup.find_all('a', href=True)
        for link in links[:10]:  # Limit to first 10 links
            absolute_url = urljoin(context.request.url, link['href'])
            await context.add_requests([absolute_url])

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
Browser-Based Crawler with Playwright
For dynamic websites that require JavaScript rendering, use the Playwright crawler, which drives a real browser for you, much like Puppeteer does in Node.js:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        browser_type='chromium',
        max_requests_per_crawl=50
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for content to load
        await page.wait_for_selector('h1')

        # Extract data using Playwright selectors
        title = await page.locator('h1').inner_text()

        # Get all article titles
        articles = await page.locator('article').all()
        article_data = []
        for article in articles:
            article_title = await article.locator('h2').inner_text()
            article_link = await article.locator('a').get_attribute('href')
            article_data.append({
                'title': article_title,
                'link': article_link
            })

        # Save scraped data
        await context.push_data({
            'url': context.request.url,
            'page_title': title,
            'articles': article_data
        })

        # Click and navigate (if needed)
        # next_button = page.locator('button.next')
        # if await next_button.count() > 0:
        #     await next_button.click()
        #     await page.wait_for_load_state('networkidle')

    await crawler.run(['https://example.com/articles'])


if __name__ == '__main__':
    asyncio.run(main())
Advanced Crawlee Features
Request Routing with Multiple Handlers
Crawlee supports routing different URL patterns to different handlers:
import asyncio
from urllib.parse import urljoin

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    # Handler for product listing pages
    @crawler.router.handler('listing')
    async def listing_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Find all product links
        product_links = await page.locator('a.product-link').all()
        for link in product_links:
            url = await link.get_attribute('href')
            # Enqueue product pages with 'product' label
            await context.add_requests([{
                'url': urljoin(context.request.url, url),
                'label': 'product'
            }])

    # Handler for individual product pages
    @crawler.router.handler('product')
    async def product_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Extract product details
        product_name = await page.locator('h1.product-name').inner_text()
        price = await page.locator('span.price').inner_text()
        description = await page.locator('div.description').inner_text()

        await context.push_data({
            'name': product_name,
            'price': price,
            'description': description,
            'url': context.request.url
        })

    # Start with listing pages
    await crawler.run([{
        'url': 'https://example.com/products',
        'label': 'listing'
    }])


if __name__ == '__main__':
    asyncio.run(main())
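Many versions of Crawlee for Python also expose an enqueue_links() helper on the crawling context, which finds links matching a CSS selector and enqueues them with a label in one call. The sketch below assumes that helper and its selector/label parameters are available in your installed version, so check the API reference before relying on it:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.handler('listing')
    async def listing_handler(context: PlaywrightCrawlingContext) -> None:
        # Enqueue every matching link and route it to the 'product' handler;
        # the selector and label mirror the routing example above.
        await context.enqueue_links(selector='a.product-link', label='product')

    @crawler.router.handler('product')
    async def product_handler(context: PlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    await crawler.run([{
        'url': 'https://example.com/products',
        'label': 'listing'
    }])


if __name__ == '__main__':
    asyncio.run(main())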
Proxy Configuration
Crawlee makes it easy to configure proxies for your crawlers:
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Configure proxy rotation over a list of proxy URLs
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy1.example.com:8000',
            'http://proxy2.example.com:8000',
        ]
    )

    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        max_requests_per_crawl=100
    )

    # Your request handler here...
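If you mix proxies of different quality (say, cheap datacenter proxies plus a few residential ones), recent Crawlee versions document a tiered proxy feature that escalates to a higher tier only when the cheaper one keeps getting blocked. The tiered_proxy_urls parameter below is a sketch based on that documented feature; confirm the exact name against your version:
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Tier 1 is tried first; Crawlee escalates to tier 2 only for domains
    # where the cheaper proxies keep failing. Parameter name is a sketch;
    # check your version's ProxyConfiguration signature.
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
            ['http://datacenter-proxy1.example.com:8000',
             'http://datacenter-proxy2.example.com:8000'],
            ['http://residential-proxy.example.com:8000'],
        ]
    )

    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)

    # Your request handler here...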
Data Storage and Export
Crawlee automatically stores scraped data in the storage directory. You can access and export this data:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Extract and save data
        data = {
            'url': context.request.url,
            'title': await context.page.title()
        }
        await context.push_data(data)

    await crawler.run(['https://example.com'])

    # Export data after crawling
    dataset = await Dataset.open()
    data = await dataset.get_data()

    # Access the scraped data
    for item in data.items:
        print(item)

    # Export to JSON
    await dataset.export_to('results.json')

    # Export to CSV
    await dataset.export_to('results.csv')


if __name__ == '__main__':
    asyncio.run(main())
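Depending on your Crawlee version, the crawler object itself may also offer an export_data() convenience method so you don't have to open the dataset manually. Treat the following as a sketch and verify the method exists in your installed release:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': await context.page.title()
        })

    await crawler.run(['https://example.com'])

    # Convenience export straight from the crawler object
    # (assumes your installed Crawlee version provides export_data)
    await crawler.export_data('results.json')


if __name__ == '__main__':
    asyncio.run(main())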
Handling Dynamic Content and Waiting
When scraping dynamic websites, content is often loaded by JavaScript after the initial response, so you need to wait for it before extracting data:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for specific selector
        await page.wait_for_selector('div.loaded-content')

        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')

        # Wait for a specific timeout
        await page.wait_for_timeout(2000)  # Wait 2 seconds

        # Wait for a function to return true
        await page.wait_for_function('window.dataLoaded === true')

        # Extract data after everything is loaded
        content = await page.locator('div.loaded-content').inner_text()
        await context.push_data({'content': content})

    await crawler.run(['https://dynamic-site.example.com'])


if __name__ == '__main__':
    asyncio.run(main())
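For pages that keep loading more items as you scroll, you can drive the scrolling yourself with plain Playwright calls inside the request handler. The div.item selector, the scroll count, and the target URL below are placeholders; adapt them to the site you're scraping:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        await page.wait_for_selector('div.item')  # placeholder selector

        # Scroll a few times and give newly loaded items time to render
        for _ in range(5):
            await page.mouse.wheel(0, 2000)
            await page.wait_for_timeout(1000)

        items = await page.locator('div.item').all_inner_texts()
        await context.push_data({'url': context.request.url, 'items': items})

    await crawler.run(['https://dynamic-site.example.com'])


if __name__ == '__main__':
    asyncio.run(main())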
Error Handling and Retries
Crawlee includes built-in error handling and retry mechanisms:
import asyncio

from crawlee.errors import SessionError
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_request_retries=5,  # Retry failed requests up to 5 times
        max_requests_per_crawl=100,
        request_handler_timeout_secs=60  # Timeout after 60 seconds
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            page = context.page
            await page.wait_for_selector('h1', timeout=30000)
            title = await page.locator('h1').inner_text()
            await context.push_data({'title': title})
        except Exception as e:
            context.log.error(f'Error processing {context.request.url}: {str(e)}')
            # Mark request as failed to trigger retry
            raise SessionError(f'Failed to process page: {str(e)}')

    # Add failed request handler
    @crawler.failed_request_handler
    async def failed_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
        context.log.error(f'Request {context.request.url} failed after retries: {error}')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
Performance Optimization
Concurrent Crawling
Crawlee automatically manages concurrent requests. You can control the concurrency level:
from crawlee.playwright_crawler import PlaywrightCrawler


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_concurrency=5,  # Maximum 5 concurrent requests
        min_concurrency=1,  # Minimum 1 concurrent request
        max_requests_per_minute=60  # Rate limiting
    )

    # Your request handler here...
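Depending on your Crawlee version, concurrency may instead be configured through a ConcurrencySettings object passed as concurrency_settings. The field names below follow that API but should be treated as a sketch; check your installed version:
from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler


async def main() -> None:
    crawler = PlaywrightCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=1,        # keep at least one task running
            max_concurrency=5,        # never run more than five tasks in parallel
            max_tasks_per_minute=60   # overall rate limit across all tasks
        )
    )

    # Your request handler here...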
Session Management
For websites that require maintaining state across requests:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        use_session_pool=True,
        persist_cookies_per_session=True
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Sessions are automatically managed
        # Cookies persist across requests in the same session
        page = context.page
        # Your scraping logic here...
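When a response looks blocked (a captcha form, an HTTP 403 page, and so on), it can help to retire the current session so retries run with fresh cookies. The context.session.retire() call below mirrors Crawlee's session pool API but is a sketch; the blocking check selector is a placeholder, and you should confirm the retire() method exists in your version:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(use_session_pool=True)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Placeholder blocking check: adapt the selector to the target site
        if await page.locator('form#captcha').count() > 0:
            if context.session is not None:  # assumption: context exposes the session
                context.session.retire()     # assumption: session exposes retire()
            raise RuntimeError('Blocked page detected; retiring session and retrying')

        await context.push_data({
            'url': context.request.url,
            'title': await page.title()
        })

    # Your run call here...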
Complete Example: E-commerce Scraper
Here's a complete example that combines request routing, pagination, and data export into a single e-commerce scraper:
import asyncio
from urllib.parse import urljoin

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=200,
        max_concurrency=3,
        headless=True
    )

    @crawler.router.handler('category')
    async def category_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        context.log.info(f'Processing category: {context.request.url}')

        # Wait for products to load
        await page.wait_for_selector('.product-card')

        # Extract product URLs
        product_cards = await page.locator('.product-card').all()
        for card in product_cards:
            product_url = await card.locator('a').get_attribute('href')
            await context.add_requests([{
                'url': urljoin(context.request.url, product_url),
                'label': 'product'
            }])

        # Handle pagination
        next_button = page.locator('a.next-page')
        if await next_button.count() > 0:
            next_url = await next_button.get_attribute('href')
            await context.add_requests([{
                'url': urljoin(context.request.url, next_url),
                'label': 'category'
            }])

    @crawler.router.handler('product')
    async def product_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        context.log.info(f'Scraping product: {context.request.url}')

        # Extract product information
        title = await page.locator('h1.product-title').inner_text()
        price = await page.locator('span.price').inner_text()

        # Extract all images
        images = await page.locator('img.product-image').all()
        image_urls = []
        for img in images:
            src = await img.get_attribute('src')
            if src:
                image_urls.append(src)

        # Extract description
        description = await page.locator('div.description').inner_text()

        # Save product data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'price': price,
            'images': image_urls,
            'description': description
        })

    # Start crawling from category pages
    await crawler.run([{
        'url': 'https://example-shop.com/category/electronics',
        'label': 'category'
    }])

    # Export results
    dataset = await Dataset.open()
    await dataset.export_to('products.json')
    print('Crawling completed! Data exported to products.json')


if __name__ == '__main__':
    asyncio.run(main())
Best Practices
- Use HTTP crawlers for static content: they're much faster than browser-based crawlers
- Implement proper error handling: always catch and log exceptions
- Respect rate limits: use max_requests_per_minute to avoid overwhelming servers
- Use request labels: route different page types to appropriate handlers
- Clean your data: validate and sanitize scraped data before storage (see the sketch after this list)
- Test incrementally: start with a small max_requests_per_crawl value
- Monitor resource usage: browser crawlers consume more memory
- Use sessions wisely: enable session pooling only when needed
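To make the 'clean your data' point concrete, here is a minimal validation sketch. The field names, the price parsing, and the save_if_valid helper are hypothetical; adapt them to whatever your handlers actually extract:
from typing import Optional

from crawlee.playwright_crawler import PlaywrightCrawlingContext


def clean_product(raw: dict) -> Optional[dict]:
    """Return a normalized record, or None if required fields are missing."""
    title = (raw.get('title') or '').strip()
    price_text = (raw.get('price') or '').replace('$', '').replace(',', '').strip()

    if not title or not price_text:
        return None  # drop incomplete records instead of storing junk

    try:
        price = float(price_text)
    except ValueError:
        return None

    return {'url': raw.get('url'), 'title': title, 'price': price}


async def save_if_valid(context: PlaywrightCrawlingContext, raw: dict) -> None:
    # Call this from a request handler instead of pushing raw data directly
    cleaned = clean_product(raw)
    if cleaned is not None:
        await context.push_data(cleaned)
    else:
        context.log.warning(f'Skipping incomplete record from {context.request.url}')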
Troubleshooting Common Issues
Issue: Crawler Hangs or Doesn't Complete
Solution: Add timeouts and reduce concurrency:
crawler = PlaywrightCrawler(
    max_concurrency=2,
    request_handler_timeout_secs=60,
    navigation_timeout_secs=30
)
Issue: Data Not Being Saved
Solution: Ensure you're calling await context.push_data() in your handler, then check the storage directory:
ls -la storage/datasets/default/
Issue: Too Many Requests Being Made
Solution: Set appropriate limits:
crawler = PlaywrightCrawler(
    max_requests_per_crawl=100,
    max_requests_per_minute=30
)
Conclusion
Crawlee for Python provides a comprehensive solution for web scraping that handles many common challenges automatically. Whether you're building a simple scraper or a complex crawling system, Crawlee's features like automatic retries, proxy rotation, and data management make it an excellent choice for production web scraping projects. Start with the examples above and gradually add more sophisticated features as your needs grow.