Can I Migrate My Python Web Scraping Code to Crawlee?
Yes, you can migrate your Python web scraping code to Crawlee! Crawlee for Python (also known as crawlee-python) is a relatively new addition to the Crawlee ecosystem, bringing the powerful features of the JavaScript version to Python developers. Whether you're coming from BeautifulSoup, Scrapy, Selenium, or another Python scraping framework, migrating to Crawlee can provide you with better scalability, built-in request management, and modern async/await patterns.
Understanding Crawlee for Python
Crawlee for Python is a web scraping and browser automation library that offers the following features (a minimal example follows the list):
- Automatic request management: Built-in retry logic and request queuing
- Browser automation support: Integration with Playwright for JavaScript-heavy sites
- HTTP crawling: Fast HTML parsing for static content
- Proxy rotation: Built-in support for proxy management
- Session management: Automatic cookie and session handling
- Storage: Built-in data storage capabilities
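To make these features concrete, here is a minimal sketch of what a Crawlee crawler looks like end to end. It uses the same import path as the examples later in this guide; the target URL is just a placeholder:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    # Retries, request queuing, and storage are handled by the crawler itself
    crawler = BeautifulSoupCrawler(max_request_retries=2)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext):
        # Every successfully fetched page lands here with the parsed soup attached
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

Everything else in this article builds on this same shape: a crawler object, one or more handlers registered on its router, and a call to run().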
Installing Crawlee for Python
Before migrating your code, install Crawlee:
```bash
# Basic installation
pip install crawlee

# With Playwright support for browser automation
pip install 'crawlee[playwright]'

# Install Playwright browsers
playwright install
```
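If you want to double-check which versions ended up in your environment, a quick standard-library check works without touching any Crawlee-specific API:

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed version of each package, or flag it as missing
for package in ('crawlee', 'playwright'):
    try:
        print(f'{package}: {version(package)}')
    except PackageNotFoundError:
        print(f'{package}: not installed')
```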
Migration Patterns from Common Python Libraries
Migrating from BeautifulSoup + Requests
If you're using BeautifulSoup with requests, you can migrate to Crawlee's HTTP crawler for better performance and built-in features.
Before (BeautifulSoup + Requests):
```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract data
titles = soup.find_all('h2', class_='title')
for title in titles:
    print(title.text)
```
After (Crawlee):
```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        soup = context.soup

        # Extract data
        titles = soup.find_all('h2', class_='title')
        for title in titles:
            print(title.text)

        # Save data
        await context.push_data({
            'url': context.request.url,
            'titles': [title.text for title in titles],
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
Migrating from Scrapy
Scrapy users will find Crawlee's structure familiar; the main differences are native async/await syntax and tighter integration with browser automation.
Before (Scrapy):
```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
After (Crawlee):
```python
import asyncio
from datetime import timedelta
from urllib.parse import urljoin

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=100,
        # The handler timeout is a timedelta in Crawlee for Python
        request_handler_timeout=timedelta(seconds=60),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        soup = context.soup

        # Extract data
        for product in soup.select('div.product'):
            await context.push_data({
                'name': product.select_one('h2').text if product.select_one('h2') else None,
                'price': product.select_one('span.price').text if product.select_one('span.price') else None,
            })

        # Follow pagination (resolve relative links, as Scrapy's response.follow does)
        next_page = soup.select_one('a.next')
        if next_page and next_page.get('href'):
            await context.add_requests([urljoin(context.request.url, next_page['href'])])

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
Migrating from Selenium to Crawlee with Playwright
If you're using Selenium for browser automation, Crawlee's Playwright integration offers better performance and a more modern API.
Before (Selenium):
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Wait for element
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
)

# Extract data
title = driver.find_element(By.TAG_NAME, 'h1').text
print(title)

driver.quit()
```
After (Crawlee with Playwright):
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        headless=True,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page

        # Wait for element
        await page.wait_for_selector('.dynamic-content')

        # Extract data
        title = await page.locator('h1').text_content()
        await context.push_data({
            'url': context.request.url,
            'title': title,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
Key Migration Considerations
1. Async/Await Pattern
Crawlee for Python uses async/await, which means you'll need to adapt synchronous code:
```python
import asyncio

# Always wrap your crawler in an async main function
async def main():
    crawler = BeautifulSoupCrawler()
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
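If you have existing synchronous helpers, for example a slow parsing or cleanup function you don't want to rewrite yet, you can keep them and offload them to a worker thread so they don't block the event loop. This is plain asyncio (Python 3.9+), not a Crawlee-specific API, and legacy_parse below is just a stand-in for your own code:

```python
import asyncio


def legacy_parse(html: str) -> dict:
    # Stand-in for existing, blocking parsing logic
    return {'length': len(html)}


async def request_handler(context):
    # Run the blocking helper in a thread instead of the event loop
    html = str(context.soup)
    parsed = await asyncio.to_thread(legacy_parse, html)
    await context.push_data(parsed)
```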
2. Request Queue Management
Crawlee automatically manages your request queue, which replaces manual URL tracking:
```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    # Add new URLs to crawl
    await context.add_requests([
        'https://example.com/page1',
        'https://example.com/page2',
    ])
```
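For straightforward link following you may not need to collect URLs by hand at all: recent Crawlee versions expose an enqueue_links() helper on the context that finds matching links on the current page and adds them to the queue. Check your version's documentation for the exact parameters; a minimal sketch:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    # Enqueue every link matching the selector found on the current page
    await context.enqueue_links(selector='a.next')
```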
3. Data Storage
Instead of manual file writing or database connections, use Crawlee's built-in storage:
```python
# Automatically saves to ./storage/datasets/default/
await context.push_data({
    'title': 'Product Name',
    'price': '$29.99',
})
```
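Once the crawl finishes, you can pull the stored items back out or export them in one call. This is a minimal sketch assuming the get_data/export_data helpers available on recent crawler versions; verify the exact signatures against your release:

```python
# After crawler.run(...) has completed:
dataset_page = await crawler.get_data()
print(f'Scraped {len(dataset_page.items)} items')

# Or export the whole default dataset to a file
await crawler.export_data('results.json')
```

Because the default dataset lives under ./storage/datasets/default/, you can also inspect the stored JSON files directly on disk.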
4. Error Handling and Retries
Crawlee includes automatic retry logic, but you can customize it:
```python
from datetime import timedelta

crawler = BeautifulSoupCrawler(
    max_request_retries=3,
    request_handler_timeout=timedelta(seconds=30),
)

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    try:
        # Your scraping logic
        pass
    except Exception as e:
        context.log.error(f'Error processing {context.request.url}: {e}')
        # Re-raise so Crawlee's automatic retry logic kicks in;
        # swallowing the exception marks the request as handled
        raise
```
Advanced Migration Patterns
Handling Pagination
Crawlee makes pagination straightforward, even when a site mixes several navigation patterns:
```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    soup = context.soup

    # Extract data from current page
    items = soup.select('.item')
    for item in items:
        await context.push_data({
            'title': item.select_one('.title').text,
            'description': item.select_one('.description').text,
        })

    # Handle multiple pagination patterns
    next_buttons = soup.select('a.next, a[rel="next"], button.next-page')
    for button in next_buttons:
        href = button.get('href')
        if href:
            await context.add_requests([href])
```
Using Proxies
Crawlee simplifies proxy rotation:
```python
from crawlee.proxy_configuration import ProxyConfiguration

proxy_config = ProxyConfiguration(
    proxy_urls=[
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ]
)

crawler = BeautifulSoupCrawler(
    proxy_configuration=proxy_config,
)
```
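A quick way to confirm that traffic really goes through your proxies is to crawl an IP-echo endpoint with the crawler configured above and log what it returns. The URL below is just one example of such a service:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext):
    # The response body should show the proxy's IP address, not your own
    context.log.info(f'Response via proxy: {context.soup.get_text(strip=True)}')

# Inside your async main function
await crawler.run(['https://api.ipify.org'])
```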
Session Management
For sites requiring authentication or session handling:
```python
from crawlee.sessions import SessionPool

crawler = PlaywrightCrawler(
    use_session_pool=True,
    persist_cookies_per_session=True,
)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext):
    page = context.page
    session = context.session

    # Login if needed
    if not session.user_data.get('logged_in'):
        await page.fill('input[name="username"]', 'user')
        await page.fill('input[name="password"]', 'pass')
        await page.click('button[type="submit"]')
        session.user_data['logged_in'] = True

    # Continue scraping
```
Custom Request Routing
Handle different page types with separate handlers:
```python
from crawlee import Request

crawler = BeautifulSoupCrawler()

@crawler.router.handler('product_list')
async def list_handler(context: BeautifulSoupCrawlingContext):
    soup = context.soup
    products = soup.select('a.product-link')

    # Add product detail pages to the queue, labelled for the detail handler
    await context.add_requests([
        Request.from_url(product['href'], label='product_detail')
        for product in products
    ])

@crawler.router.handler('product_detail')
async def detail_handler(context: BeautifulSoupCrawlingContext):
    soup = context.soup
    await context.push_data({
        'name': soup.select_one('h1.product-name').text,
        'price': soup.select_one('span.price').text,
        'description': soup.select_one('div.description').text,
    })

# Start with the product list pages (inside your async main function)
await crawler.run([
    Request.from_url('https://example.com/products', label='product_list'),
])
```
Performance Optimization Tips
1. Configure Concurrency
```python
from crawlee import ConcurrencySettings

crawler = BeautifulSoupCrawler(
    max_requests_per_crawl=1000,
    # Concurrency and rate limits are grouped in ConcurrencySettings
    concurrency_settings=ConcurrencySettings(
        min_concurrency=5,
        max_concurrency=20,
        max_tasks_per_minute=60,
    ),
)
```
2. Use Request Fingerprinting
Avoid duplicate requests automatically:
```python
from crawlee import Request

# Crawlee automatically fingerprints requests to avoid duplicates
await context.add_requests([
    Request.from_url('https://example.com/page1'),
    Request.from_url('https://example.com/page1'),  # Will be skipped
])
```
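If you ever need the opposite behaviour, re-crawling a URL that has already been seen (for example when polling a status page), you can give each request a distinct unique key so the fingerprint check treats it as new. The unique_key parameter is an assumption based on recent Crawlee versions, so verify it against your release:

```python
import time

from crawlee import Request

# Same URL, but a different unique key on each poll, so it is not deduplicated
await context.add_requests([
    Request.from_url(
        'https://example.com/status',
        unique_key=f'status-poll-{int(time.time())}',  # hypothetical key scheme
    ),
])
```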
3. Configure Browser Pool (for Playwright)
```python
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

# Exact option names can differ between Crawlee versions;
# check the crawlee.browsers reference for the release you are using.
crawler = PlaywrightCrawler(
    browser_pool=BrowserPool(
        plugins=[
            PlaywrightBrowserPlugin(
                browser_type='chromium',
                max_open_pages_per_browser=5,
            ),
        ],
    ),
)
```
Handling JavaScript-Heavy Sites
For sites with dynamic content, use Crawlee's Playwright integration to wait for and interact with JavaScript-rendered elements:
```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page

        # Wait for dynamic content to load
        await page.wait_for_selector('.dynamic-content')

        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')

        # Interact with JavaScript elements
        await page.click('button.load-more')
        await page.wait_for_timeout(1000)

        # Extract data after JS execution
        products = await page.query_selector_all('.product')
        for product in products:
            title = await product.query_selector('.title')
            title_text = await title.text_content() if title else None
            await context.push_data({
                'title': title_text,
            })

    await crawler.run(['https://example.com'])
```
Testing Your Migrated Code
Create a simple test to verify your migration:
```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def test_crawler():
    results = []
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context):
        soup = context.soup
        title = soup.find('h1')
        results.append(title.text if title else None)

    await crawler.run(['https://example.com'])

    assert len(results) > 0
    print(f'Successfully extracted {len(results)} items')


if __name__ == '__main__':
    asyncio.run(test_crawler())
```
Conclusion
Migrating Python web scraping code to Crawlee is straightforward and brings significant benefits in terms of scalability, maintainability, and built-in features. The main changes involve adopting async/await patterns and leveraging Crawlee's request queue, storage, and session management capabilities. Whether you're coming from BeautifulSoup, Scrapy, or Selenium, Crawlee provides a modern, efficient framework for your web scraping needs.
Start by migrating a simple scraper first to familiarize yourself with Crawlee's patterns, then gradually move more complex scrapers as you become comfortable with the framework. The investment in migration pays off through reduced boilerplate code, better error handling, and improved performance at scale.