Can I Migrate My Python Web Scraping Code to Crawlee?
Yes, you can migrate your Python web scraping code to Crawlee! Crawlee for Python (also known as crawlee-python) is a relatively new addition to the Crawlee ecosystem, bringing the powerful features of the JavaScript version to Python developers. Whether you're coming from BeautifulSoup, Scrapy, Selenium, or another Python scraping framework, migrating to Crawlee can provide you with better scalability, built-in request management, and modern async/await patterns.
Understanding Crawlee for Python
Crawlee for Python is a web scraping and browser automation library that offers the following features (a minimal example follows the list):
- Automatic request management: Built-in retry logic and request queuing
- Browser automation support: Integration with Playwright for JavaScript-heavy sites
- HTTP crawling: Fast HTML parsing for static content
- Proxy rotation: Built-in support for proxy management
- Session management: Automatic cookie and session handling
- Storage: Built-in data storage capabilities
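To make these features concrete, here is a minimal sketch of what a Crawlee crawler looks like end to end. It uses the same import path as the examples later in this guide; the target URL is just a placeholder:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    # Retries, request queuing, and storage are handled by the crawler itself
    crawler = BeautifulSoupCrawler(max_request_retries=2)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext):
        # Every successfully fetched page lands here with the parsed soup attached
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

Everything else in this article builds on this same shape: a crawler object, one or more handlers registered on its router, and a call to run().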
Installing Crawlee for Python
Before migrating your code, install Crawlee:
```bash
# Basic installation
pip install crawlee

# With Playwright support for browser automation
pip install 'crawlee[playwright]'

# Install Playwright browsers
playwright install
```
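If you want to double-check which versions ended up in your environment, a quick standard-library check works without touching any Crawlee-specific API:

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed version of each package, or flag it as missing
for package in ('crawlee', 'playwright'):
    try:
        print(f'{package}: {version(package)}')
    except PackageNotFoundError:
        print(f'{package}: not installed')
```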
Migration Patterns from Common Python Libraries
Migrating from BeautifulSoup + Requests
If you're using BeautifulSoup with requests, you can migrate to Crawlee's HTTP crawler for better performance and built-in features.
Before (BeautifulSoup + Requests):
```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract data
titles = soup.find_all('h2', class_='title')
for title in titles:
    print(title.text)
```
After (Crawlee):
```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        soup = context.soup

        # Extract data
        titles = soup.find_all('h2', class_='title')
        for title in titles:
            print(title.text)

        # Save data
        await context.push_data({
            'url': context.request.url,
            'titles': [title.text for title in titles],
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
Migrating from Scrapy
Scrapy users will find Crawlee's structure familiar; the main differences are native async/await syntax and tighter integration with browser automation.
Before (Scrapy):
```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
After (Crawlee):
```python
import asyncio
from datetime import timedelta
from urllib.parse import urljoin

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=100,
        # The handler timeout is a timedelta in Crawlee for Python
        request_handler_timeout=timedelta(seconds=60),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        soup = context.soup

        # Extract data
        for product in soup.select('div.product'):
            await context.push_data({
                'name': product.select_one('h2').text if product.select_one('h2') else None,
                'price': product.select_one('span.price').text if product.select_one('span.price') else None,
            })

        # Follow pagination (resolve relative links, as Scrapy's response.follow does)
        next_page = soup.select_one('a.next')
        if next_page and next_page.get('href'):
            await context.add_requests([urljoin(context.request.url, next_page['href'])])

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
Migrating from Selenium to Crawlee with Playwright
If you're using Selenium for browser automation, Crawlee's Playwright integration offers better performance and a more modern API.
Before (Selenium):
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Wait for element
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
)

# Extract data
title = driver.find_element(By.TAG_NAME, 'h1').text
print(title)

driver.quit()
```
After (Crawlee with Playwright):
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        headless=True,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page

        # Wait for element
        await page.wait_for_selector('.dynamic-content')

        # Extract data
        title = await page.locator('h1').text_content()
        await context.push_data({
            'url': context.request.url,
            'title': title,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
Key Migration Considerations
1. Async/Await Pattern
Crawlee for Python uses async/await, which means you'll need to adapt synchronous code:
```python
import asyncio

# Always wrap your crawler in an async main function
async def main():
    crawler = BeautifulSoupCrawler()
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
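If you have existing synchronous helpers, for example a slow parsing or cleanup function you don't want to rewrite yet, you can keep them and offload them to a worker thread so they don't block the event loop. This is plain asyncio (Python 3.9+), not a Crawlee-specific API, and legacy_parse below is just a stand-in for your own code:

```python
import asyncio


def legacy_parse(html: str) -> dict:
    # Stand-in for existing, blocking parsing logic
    return {'length': len(html)}


async def request_handler(context):
    # Run the blocking helper in a thread instead of the event loop
    html = str(context.soup)
    parsed = await asyncio.to_thread(legacy_parse, html)
    await context.push_data(parsed)
```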
2. Request Queue Management
Crawlee automatically manages your request queue, which replaces manual URL tracking:
```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    # Add new URLs to crawl
    await context.add_requests([
        'https://example.com/page1',
        'https://example.com/page2',
    ])
```
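For straightforward link following you may not need to collect URLs by hand at all: recent Crawlee versions expose an enqueue_links() helper on the context that finds matching links on the current page and adds them to the queue. Check your version's documentation for the exact parameters; a minimal sketch:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    # Enqueue every link matching the selector found on the current page
    await context.enqueue_links(selector='a.next')
```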
3. Data Storage
Instead of manual file writing or database connections, use Crawlee's built-in storage:
```python
# Automatically saves to ./storage/datasets/default/
await context.push_data({
    'title': 'Product Name',
    'price': '$29.99',
})
```
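Once the crawl finishes, you can pull the stored items back out or export them in one call. This is a minimal sketch assuming the get_data/export_data helpers available on recent crawler versions; verify the exact signatures against your release:

```python
# After crawler.run(...) has completed:
dataset_page = await crawler.get_data()
print(f'Scraped {len(dataset_page.items)} items')

# Or export the whole default dataset to a file
await crawler.export_data('results.json')
```

Because the default dataset lives under ./storage/datasets/default/, you can also inspect the stored JSON files directly on disk.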
4. Error Handling and Retries
Crawlee includes automatic retry logic, but you can customize it:
```python
from datetime import timedelta

crawler = BeautifulSoupCrawler(
    max_request_retries=3,
    request_handler_timeout=timedelta(seconds=30),
)

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    try:
        # Your scraping logic
        pass
    except Exception as e:
        context.log.error(f'Error processing {context.request.url}: {e}')
        # Re-raise so Crawlee's automatic retry logic kicks in;
        # swallowing the exception marks the request as handled
        raise
```
Advanced Migration Patterns
Handling Pagination
Crawlee makes pagination straightforward, even when a site mixes several navigation patterns:
```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    soup = context.soup

    # Extract data from current page
    items = soup.select('.item')
    for item in items:
        await context.push_data({
            'title': item.select_one('.title').text,
            'description': item.select_one('.description').text,
        })

    # Handle multiple pagination patterns
    next_buttons = soup.select('a.next, a[rel="next"], button.next-page')
    for button in next_buttons:
        href = button.get('href')
        if href:
            await context.add_requests([href])
```
Using Proxies
Crawlee simplifies proxy rotation:
```python
from crawlee.proxy_configuration import ProxyConfiguration

proxy_config = ProxyConfiguration(
    proxy_urls=[
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ]
)

crawler = BeautifulSoupCrawler(
    proxy_configuration=proxy_config,
)
```
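A quick way to confirm that traffic really goes through your proxies is to crawl an IP-echo endpoint with the crawler configured above and log what it returns. The URL below is just one example of such a service:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    # The response body should show the proxy's IP address, not your own
    context.log.info(f'Response via proxy: {context.soup.get_text(strip=True)}')

# Inside your async main function
await crawler.run(['https://api.ipify.org'])
```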
Session Management
For sites requiring authentication or session handling:
```python
from crawlee.sessions import SessionPool

crawler = PlaywrightCrawler(
    use_session_pool=True,
    persist_cookies_per_session=True,
)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext):
    page = context.page
    session = context.session

    # Login if needed
    if not session.user_data.get('logged_in'):
        await page.fill('input[name="username"]', 'user')
        await page.fill('input[name="password"]', 'pass')
        await page.click('button[type="submit"]')
        session.user_data['logged_in'] = True

    # Continue scraping
```
Custom Request Routing
Handle different page types with separate handlers:
```python
from crawlee import Request

crawler = BeautifulSoupCrawler()

@crawler.router.handler('product_list')
async def list_handler(context: BeautifulSoupCrawlingContext):
    soup = context.soup
    products = soup.select('a.product-link')

    # Add product detail pages to the queue, labelled for the detail handler
    await context.add_requests([
        Request.from_url(product['href'], label='product_detail')
        for product in products
    ])

@crawler.router.handler('product_detail')
async def detail_handler(context: BeautifulSoupCrawlingContext):
    soup = context.soup
    await context.push_data({
        'name': soup.select_one('h1.product-name').text,
        'price': soup.select_one('span.price').text,
        'description': soup.select_one('div.description').text,
    })

# Start with the product list pages (inside your async main function)
await crawler.run([
    Request.from_url('https://example.com/products', label='product_list'),
])
```
Performance Optimization Tips
1. Configure Concurrency
```python
from crawlee import ConcurrencySettings

crawler = BeautifulSoupCrawler(
    max_requests_per_crawl=1000,
    # Concurrency and rate limits are grouped in ConcurrencySettings
    concurrency_settings=ConcurrencySettings(
        min_concurrency=5,
        max_concurrency=20,
        max_tasks_per_minute=60,
    ),
)
```
2. Use Request Fingerprinting
Avoid duplicate requests automatically:
```python
from crawlee import Request

# Crawlee automatically fingerprints requests to avoid duplicates
await context.add_requests([
    Request.from_url('https://example.com/page1'),
    Request.from_url('https://example.com/page1'),  # Will be skipped
])
```
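If you ever need the opposite behaviour, re-crawling a URL that has already been seen (for example when polling a status page), you can give each request a distinct unique key so the fingerprint check treats it as new. The unique_key parameter is an assumption based on recent Crawlee versions, so verify it against your release:

```python
import time

from crawlee import Request

# Same URL, but a different unique key on each poll, so it is not deduplicated
await context.add_requests([
    Request.from_url(
        'https://example.com/status',
        unique_key=f'status-poll-{int(time.time())}',  # hypothetical key scheme
    ),
])
```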
3. Configure Browser Pool (for Playwright)
```python
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

# Exact option names can differ between Crawlee versions;
# check the crawlee.browsers reference for the release you are using.
crawler = PlaywrightCrawler(
    browser_pool=BrowserPool(
        plugins=[
            PlaywrightBrowserPlugin(
                browser_type='chromium',
                max_open_pages_per_browser=5,
            ),
        ],
    ),
)
```
Handling JavaScript-Heavy Sites
For sites with dynamic content, use Crawlee's Playwright integration to wait for and interact with JavaScript-rendered elements:
```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page

        # Wait for dynamic content to load
        await page.wait_for_selector('.dynamic-content')

        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')

        # Interact with JavaScript elements
        await page.click('button.load-more')
        await page.wait_for_timeout(1000)

        # Extract data after JS execution
        products = await page.query_selector_all('.product')
        for product in products:
            title = await product.query_selector('.title')
            title_text = await title.text_content() if title else None
            await context.push_data({
                'title': title_text,
            })

    await crawler.run(['https://example.com'])
```
Testing Your Migrated Code
Create a simple test to verify your migration:
```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def test_crawler():
    results = []
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context):
        soup = context.soup
        title = soup.find('h1')
        results.append(title.text if title else None)

    await crawler.run(['https://example.com'])

    assert len(results) > 0
    print(f'Successfully extracted {len(results)} items')


if __name__ == '__main__':
    asyncio.run(test_crawler())
```
Conclusion
Migrating Python web scraping code to Crawlee is straightforward and brings significant benefits in terms of scalability, maintainability, and built-in features. The main changes involve adopting async/await patterns and leveraging Crawlee's request queue, storage, and session management capabilities. Whether you're coming from BeautifulSoup, Scrapy, or Selenium, Crawlee provides a modern, efficient framework for your web scraping needs.
Start by migrating a simple scraper first to familiarize yourself with Crawlee's patterns, then gradually move more complex scrapers as you become comfortable with the framework. The investment in migration pays off through reduced boilerplate code, better error handling, and improved performance at scale.