How do I use Crawlee with Python for web scraping?

Crawlee for Python is a powerful web scraping and browser automation library that helps developers build reliable crawlers. It provides automatic scaling, proxy rotation, storage management, and handles common crawling challenges like retries and rate limiting. This guide shows you how to get started with Crawlee for Python and build production-ready web scrapers.

What is Crawlee for Python?

Crawlee for Python is the Python implementation of the popular Crawlee framework, originally developed for Node.js. It provides a robust toolkit for web scraping that includes:

  • Automatic retries and error handling
  • Request queue management
  • Proxy rotation and session management
  • Data storage and export
  • Browser automation support (Playwright, Selenium)
  • HTTP client crawling for faster static page scraping
  • Crawling context for managing state between requests

Installation and Setup

Installing Crawlee for Python

To install Crawlee for Python, you need Python 3.9 or higher (check the Crawlee documentation for the current minimum). Install the base package using pip:

pip install crawlee

For browser automation with Playwright:

pip install 'crawlee[playwright]'
playwright install

For Selenium support:

pip install 'crawlee[selenium]'

For additional features like HTTP client crawling:

pip install 'crawlee[httpx]'
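
To confirm the installation, list the installed package (and, if you installed the Playwright extra, check the browser tooling as well):

pip show crawlee
playwright --version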

Setting Up Your First Crawler

Create a new Python file for your crawler. Here's a basic example:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        headless=True,
        browser_type='chromium'
    )

    # Define the default request handler
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Extract data from the page
        title = await context.page.title()
        url = context.request.url

        # Save the extracted data
        await context.push_data({
            'url': url,
            'title': title
        })

        print(f'Scraped: {title} from {url}')

    # Start crawling
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())
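
Run the script like any other asyncio-based Python program (the file name crawler.py below is just an assumed example). Crawlee stores the collected items in the local storage directory by default, so you can inspect them right after the run:

python crawler.py
ls storage/datasets/default/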

Building a Web Scraper with Crawlee

HTTP Client Crawler (Faster for Static Pages)

For static websites that don't require JavaScript rendering, use the HTTP client crawler for better performance:

import asyncio
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext

async def main() -> None:
    crawler = HttpCrawler(
        max_requests_per_crawl=100,
        max_request_retries=3
    )

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        # Parse HTML with BeautifulSoup
        soup = BeautifulSoup(context.http_response.read(), 'html.parser')

        # Extract data
        title = soup.find('h1').text if soup.find('h1') else 'No title'
        paragraphs = [p.text for p in soup.find_all('p')]

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'paragraphs': paragraphs
        })

        # Enqueue new URLs, resolving relative links against the current page URL
        links = soup.find_all('a', href=True)
        for link in links[:10]:  # Limit to first 10 links
            absolute_url = urljoin(context.request.url, link['href'])
            await context.add_requests([absolute_url])

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Browser-Based Crawler with Playwright

For dynamic websites that require JavaScript rendering, use the Playwright crawler; if you have handled browser sessions in Puppeteer before, the workflow will feel familiar:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        browser_type='chromium',
        max_requests_per_crawl=50
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for content to load
        await page.wait_for_selector('h1')

        # Extract data using Playwright selectors
        title = await page.locator('h1').inner_text()

        # Get all article titles
        articles = await page.locator('article').all()
        article_data = []

        for article in articles:
            article_title = await article.locator('h2').inner_text()
            article_link = await article.locator('a').get_attribute('href')
            article_data.append({
                'title': article_title,
                'link': article_link
            })

        # Save scraped data
        await context.push_data({
            'url': context.request.url,
            'page_title': title,
            'articles': article_data
        })

        # Click and navigate (if needed)
        # next_button = page.locator('button.next')
        # if await next_button.count() > 0:
        #     await next_button.click()
        #     await page.wait_for_load_state('networkidle')

    await crawler.run(['https://news.ycombinator.com'])

if __name__ == '__main__':
    asyncio.run(main())

Advanced Crawlee Features

Request Routing with Multiple Handlers

Crawlee supports routing different URL patterns to different handlers:

import asyncio
from urllib.parse import urljoin

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    # Handler for product listing pages
    @crawler.router.handler('listing')
    async def listing_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Find all product links
        product_links = await page.locator('a.product-link').all()

        for link in product_links:
            url = await link.get_attribute('href')
            # Enqueue product pages with the 'product' label so they reach product_handler
            await context.add_requests([{
                'url': urljoin(context.request.url, url),
                'label': 'product'
            }])

    # Handler for individual product pages
    @crawler.router.handler('product')
    async def product_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Extract product details
        product_name = await page.locator('h1.product-name').inner_text()
        price = await page.locator('span.price').inner_text()
        description = await page.locator('div.description').inner_text()

        await context.push_data({
            'name': product_name,
            'price': price,
            'description': description,
            'url': context.request.url
        })

    # Start with listing pages
    await crawler.run([{
        'url': 'https://example.com/products',
        'label': 'listing'
    }])

if __name__ == '__main__':
    asyncio.run(main())

Proxy Configuration

Crawlee makes it easy to configure proxies for your crawlers:

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    # Configure proxy
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy1.example.com:8000',
            'http://proxy2.example.com:8000',
        ]
    )

    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        max_requests_per_crawl=100
    )

    # Your request handler here...
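
To verify that rotation is actually working, you can log which proxy served each request. The fragment below continues the snippet above (with PlaywrightCrawlingContext imported as in the earlier examples) and assumes the crawling context exposes the selected proxy as context.proxy_info; treat that attribute name as an assumption to check against your Crawlee version:

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # proxy_info describes the proxy chosen for this request; it can be None
        # when no proxy configuration is set (attribute name is an assumption).
        proxy_url = context.proxy_info.url if context.proxy_info else 'no proxy'
        context.log.info(f'{context.request.url} fetched via {proxy_url}')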

Data Storage and Export

Crawlee automatically stores scraped data in the storage directory. You can access and export this data:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import Dataset

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Extract and save data
        data = {
            'url': context.request.url,
            'title': await context.page.title()
        }
        await context.push_data(data)

    await crawler.run(['https://example.com'])

    # Export data after crawling
    dataset = await Dataset.open()
    data = await dataset.get_data()

    # Access the scraped data
    for item in data.items:
        print(item)

    # Export the collected dataset to JSON and CSV files
    await crawler.export_data('results.json')
    await crawler.export_data('results.csv')

if __name__ == '__main__':
    asyncio.run(main())

Handling Dynamic Content and Waiting

When scraping dynamic websites, much as when handling AJAX requests with Puppeteer, you need to wait for content to load before extracting it:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for specific selector
        await page.wait_for_selector('div.loaded-content')

        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')

        # Wait for a specific timeout
        await page.wait_for_timeout(2000)  # Wait 2 seconds

        # Wait for a function to return true
        await page.wait_for_function('window.dataLoaded === true')

        # Extract data after everything is loaded
        content = await page.locator('div.loaded-content').inner_text()
        await context.push_data({'content': content})

    await crawler.run(['https://dynamic-site.example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Error Handling and Retries

Crawlee includes built-in error handling and retry mechanisms:

import asyncio
from datetime import timedelta

from crawlee.errors import SessionError
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_request_retries=5,  # Retry failed requests up to 5 times
        max_requests_per_crawl=100,
        request_handler_timeout=timedelta(seconds=60)  # Abort a handler after 60 seconds
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        try:
            page = context.page
            await page.wait_for_selector('h1', timeout=30000)

            title = await page.locator('h1').inner_text()
            await context.push_data({'title': title})

        except Exception as e:
            context.log.error(f'Error processing {context.request.url}: {str(e)}')
            # Re-raise so Crawlee retries the request; SessionError also retires the current session
            raise SessionError(f'Failed to process page: {str(e)}')

    # Add failed request handler
    @crawler.failed_request_handler
    async def failed_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
        context.log.error(f'Request {context.request.url} failed after retries: {error}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Performance Optimization

Concurrent Crawling

Crawlee automatically manages concurrent requests. You can control the concurrency level:

from crawlee.playwright_crawler import PlaywrightCrawler

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_concurrency=5,  # Maximum 5 concurrent requests
        min_concurrency=1,  # Minimum 1 concurrent request
        max_requests_per_minute=60  # Rate limiting
    )

    # Your request handler here...

Session Management

For websites that require maintaining state across requests:

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        use_session_pool=True,
        persist_cookies_per_session=True
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Sessions are automatically managed
        # Cookies persist across requests in the same session
        page = context.page

        # Your scraping logic here...
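
To see the session pool in action, the short sketch below logs which pooled session served each request; it assumes the crawling context exposes the current session as context.session (possibly None) with an id attribute, so verify those names against your Crawlee version:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        use_session_pool=True,
        persist_cookies_per_session=True
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # The pooled session serving this request; cookies set while it is active
        # persist for later requests handled by the same session.
        session_id = context.session.id if context.session else 'no session'
        context.log.info(f'{context.request.url} handled by session {session_id}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())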

Complete Example: E-commerce Scraper

Here's a complete example that ties together routing, pagination handling, and data export while navigating between category and product pages:

import asyncio
from urllib.parse import urljoin

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=200,
        max_concurrency=3,
        headless=True
    )

    @crawler.router.handler('category')
    async def category_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        context.log.info(f'Processing category: {context.request.url}')

        # Wait for products to load
        await page.wait_for_selector('.product-card')

        # Extract product URLs
        product_cards = await page.locator('.product-card').all()

        for card in product_cards:
            product_url = await card.locator('a').get_attribute('href')
            await context.add_requests([{
                'url': urljoin(context.request.url, product_url),
                'label': 'product'
            }])

        # Handle pagination
        next_button = page.locator('a.next-page')
        if await next_button.count() > 0:
            next_url = await next_button.get_attribute('href')
            await context.add_requests([{
                'url': urljoin(context.request.url, next_url),
                'label': 'category'
            }])

    @crawler.router.handler('product')
    async def product_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        context.log.info(f'Scraping product: {context.request.url}')

        # Extract product information
        title = await page.locator('h1.product-title').inner_text()
        price = await page.locator('span.price').inner_text()

        # Extract all images
        images = await page.locator('img.product-image').all()
        image_urls = []
        for img in images:
            src = await img.get_attribute('src')
            if src:
                image_urls.append(src)

        # Extract description
        description = await page.locator('div.description').inner_text()

        # Save product data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'price': price,
            'images': image_urls,
            'description': description
        })

    # Start crawling from category pages
    await crawler.run([{
        'url': 'https://example-shop.com/category/electronics',
        'label': 'category'
    }])

    # Export results once the crawl has finished
    await crawler.export_data('products.json')
    print('Crawling completed! Data exported to products.json')

if __name__ == '__main__':
    asyncio.run(main())

Best Practices

  1. Use HTTP crawlers for static content: They're much faster than browser-based crawlers
  2. Implement proper error handling: Always catch and log exceptions
  3. Respect rate limits: Use max_requests_per_minute to avoid overwhelming servers
  4. Use request labels: Route different page types to appropriate handlers
  5. Clean your data: Validate and sanitize scraped data before storage
  6. Test incrementally: Start with a small max_requests_per_crawl value (see the configuration sketch after this list)
  7. Monitor resource usage: Browser crawlers consume more memory
  8. Use sessions wisely: Enable session pooling only when needed
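
To put several of these tips together, here is a hedged starting configuration for a first test run; the numbers are arbitrary and the parameters mirror the ones used earlier in this guide:

from datetime import timedelta
from crawlee.playwright_crawler import PlaywrightCrawler

# Conservative settings for an initial test crawl: few pages, low concurrency,
# a couple of retries, and a per-request handler timeout.
crawler = PlaywrightCrawler(
    max_requests_per_crawl=10,
    max_concurrency=2,
    max_request_retries=2,
    request_handler_timeout=timedelta(seconds=60)
)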

Troubleshooting Common Issues

Issue: Crawler Hangs or Doesn't Complete

Solution: Add timeouts and reduce concurrency:

from datetime import timedelta

crawler = PlaywrightCrawler(
    max_concurrency=2,
    request_handler_timeout=timedelta(seconds=60)
)

# If navigation itself times out, also check the navigation/browser timeout options
# available for PlaywrightCrawler in your Crawlee version.

Issue: Data Not Being Saved

Solution: Ensure you're using await context.push_data() and check the storage directory:

ls -la storage/datasets/default/

Issue: Too Many Requests Being Made

Solution: Set appropriate limits:

crawler = PlaywrightCrawler(
    max_requests_per_crawl=100,
    max_requests_per_minute=30
)

Conclusion

Crawlee for Python provides a comprehensive solution for web scraping that handles many common challenges automatically. Whether you're building a simple scraper or a complex crawling system, Crawlee's features like automatic retries, proxy rotation, and data management make it an excellent choice for production web scraping projects. Start with the examples above and gradually add more sophisticated features as your needs grow.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
