How do I install Crawlee for my web scraping project?
Crawlee is a powerful web scraping and browser automation library available for both Node.js and Python. It provides a robust framework for building reliable crawlers with automatic scaling, proxy rotation, and smart request management. This guide covers all the installation methods and initial setup steps.
Prerequisites
Before installing Crawlee, ensure you have the following prerequisites installed on your system:
For Node.js (Crawlee for JavaScript/TypeScript)
- Node.js: Version 16.0 or higher
- npm or yarn: Package manager (comes with Node.js)
Check your Node.js version:
node --version
For Python (Crawlee for Python)
- Python: Version 3.9 or higher
- pip: Python package manager
Check your Python version:
python --version
# or
python3 --version
Installing Crawlee for Node.js
There are several ways to install and use Crawlee in your Node.js projects.
Method 1: Using NPM (Recommended for Projects)
If you're building a complete web scraping project, install Crawlee as a dependency:
npm install crawlee
For TypeScript projects, Crawlee comes with built-in TypeScript definitions, so no additional type packages are needed.
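For instance, a handler's context parameter is fully typed out of the box. Here is a minimal TypeScript sketch (CheerioCrawlingContext is one of the context types exported by the crawlee package):

import { CheerioCrawler, type CheerioCrawlingContext } from 'crawlee';

const crawler = new CheerioCrawler({
    // The destructured context is typed without installing any @types package
    requestHandler: async ({ request, $ }: CheerioCrawlingContext) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});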
If you need to use Playwright for browser automation (recommended for modern websites):
npm install crawlee playwright
Or if you prefer Puppeteer:
npm install crawlee puppeteer
For lightweight HTTP requests without a browser:
npm install crawlee cheerio
Method 2: Using Yarn
yarn add crawlee
With browser automation:
yarn add crawlee playwright
# or
yarn add crawlee puppeteer
Method 3: Using npx (Quick Start)
For quick experimentation or one-off scripts, you can create a Crawlee project using the official template:
npx crawlee create my-crawler
This interactive command will:
1. Ask you to choose a template (Playwright, Puppeteer, Cheerio, or JSDOM)
2. Set up a complete project structure
3. Install all dependencies automatically
Method 4: Using Crawlee CLI
The CLI ships with the crawlee package itself, so you can install it globally for easy project creation:
npm install -g crawlee
Then create new projects:
crawlee create my-crawler
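Once the template finishes installing, the generated project can typically be run with its start script (assuming the default template layout):

cd my-crawler
npm start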
Installing Crawlee for Python
Crawlee for Python provides a Pythonic interface to the same powerful crawling features.
Using pip
pip install crawlee
If you want to use Playwright for browser automation (the quotes stop shells such as zsh from expanding the brackets):
pip install 'crawlee[playwright]'
After installation, install Playwright browsers:
playwright install
For HTTP-only crawling with BeautifulSoup:
pip install 'crawlee[beautifulsoup]'
Using pip with Virtual Environment (Recommended)
It's best practice to use a virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install Crawlee
pip install 'crawlee[playwright]'
# Install Playwright browsers
playwright install
Installing with Docker
For containerized deployments, you can use Docker to run Crawlee projects.
Node.js Dockerfile Example
FROM node:18-slim

WORKDIR /app

COPY package*.json ./
RUN npm install

# Install Playwright browsers plus the system libraries they require.
# This is more robust than a hand-maintained apt-get list, since Debian
# package names change between releases.
RUN npx playwright install --with-deps chromium

COPY . .

CMD ["node", "src/main.js"]
Python Dockerfile Example
FROM python:3.11-slim
# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install Playwright browsers
RUN playwright install --with-deps chromium
COPY . .
CMD ["python", "main.py"]
Verifying Your Installation
After installation, verify that Crawlee is working correctly.
Node.js Verification
Create a test file test.js (the top-level await below requires ESM, so add "type": "module" to your package.json or name the file test.mjs):
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        console.log(`Processing ${request.url}...`);
        const title = await page.title();
        console.log(`Page title: ${title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
console.log('Crawlee is working!');
Run it:
node test.js
Python Verification
Create a test file test.py:
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        print(f'Processing {context.request.url}...')
        title = await context.page.title()
        print(f'Page title: {title}')

    await crawler.run(['https://crawlee.dev'])
    print('Crawlee is working!')

if __name__ == '__main__':
    asyncio.run(main())
Run it:
python test.py
Choosing the Right Crawlee Configuration
Crawlee supports multiple HTTP clients and browser automation tools. Here's when to use each:
Cheerio Crawler (Node.js) / BeautifulSoup Crawler (Python)
Best for:
- Static HTML websites
- Fast, lightweight scraping
- Sites that don't require JavaScript execution
Installation:
# Node.js
npm install crawlee cheerio
# Python
pip install 'crawlee[beautifulsoup]'
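For a concrete sense of the lightweight option, here is a minimal CheerioCrawler sketch in JavaScript (the Python BeautifulSoupCrawler follows the same pattern):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // $ is a Cheerio handle to the parsed HTML; no browser is launched
    requestHandler: async ({ request, $ }) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://crawlee.dev']);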
Puppeteer Crawler
Best for:
- JavaScript-heavy websites
- Chrome/Chromium-specific features
- When you need DevTools Protocol access
Installation:
npm install crawlee puppeteer
Playwright Crawler
Best for:
- Modern web applications
- Cross-browser testing (Chrome, Firefox, Safari)
- Advanced browser automation features
- Sites with complex JavaScript rendering
Installation:
# Node.js
npm install crawlee playwright
# Python
pip install 'crawlee[playwright]'
playwright install
Much as you would manage browser sessions by hand in Puppeteer, Crawlee manages browser instances for you, rotating sessions and fingerprints automatically.
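These behaviors can be tuned through crawler options; the sketch below shows the relevant knobs on PlaywrightCrawler (the specific values are illustrative, not recommendations):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Rotate sessions (cookie jars paired with proxies) automatically
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 50 },
    // Generate a realistic browser fingerprint for each session
    browserPoolOptions: { useFingerprints: true },
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});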
Initial Project Structure
After installation, here's a recommended project structure:
Node.js Project
my-crawler/
├── src/
│ ├── main.js # Main crawler script
│ └── routes.js # Request handlers
├── storage/ # Data storage (auto-created)
├── package.json
└── .gitignore
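A routes.js module typically exports a router that maps labeled requests to handlers. Here is a minimal sketch using Crawlee's createPlaywrightRouter (the DETAIL label is a made-up example):

// src/routes.js
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

// Fallback for requests without a label: discover and enqueue detail pages
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ label: 'DETAIL' });
});

// Runs for every request enqueued with the DETAIL label
router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Detail page: ${request.url}`);
});

In src/main.js you would then pass the router to the crawler as requestHandler: router.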
Python Project
my-crawler/
├── main.py # Main crawler script
├── requirements.txt
├── storage/ # Data storage (auto-created)
└── .gitignore
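For the Python layout, requirements.txt only needs the Crawlee extra you picked, pinned however your project prefers:

# requirements.txt
crawlee[playwright]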
Common Installation Issues and Solutions
Issue: Playwright browsers not installing
Solution: Manually install browsers after installing Crawlee:
# Node.js
npx playwright install
# Python
playwright install
Issue: Permission errors on Linux/macOS
Solution: Use pip with the --user flag, or install inside a virtual environment:
pip install --user 'crawlee[playwright]'
Issue: Missing system dependencies
Solution: Install system dependencies for Playwright:
# Ubuntu/Debian
npx playwright install-deps
# Or manually
sudo apt-get install -y \
libnss3 \
libxss1 \
libasound2 \
libatk-bridge2.0-0 \
libgtk-3-0
Issue: Node.js version too old
Solution: Update Node.js using nvm:
# Install nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
# Install latest Node.js
nvm install node
nvm use node
Next Steps After Installation
Once Crawlee is installed, you can:
- Configure request routing: Set up different handlers for different URL patterns
- Add proxy support: Implement proxy rotation for large-scale scraping (see the sketch after this list)
- Set up data storage: Configure datasets, key-value stores, or custom storage
- Handle pagination: Implement logic to navigate to different pages using Puppeteer or Playwright
- Implement error handling: Add retry logic and error recovery mechanisms
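For instance, proxy rotation plugs in through Crawlee's ProxyConfiguration (the proxy URLs below are placeholders for your own):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Placeholder proxies; Crawlee rotates through them automatically
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ request }) => {
        console.log(`Fetched ${request.url}`);
    },
});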
Here's a complete starter example in JavaScript:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50,
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        log.info(`Processing: ${request.url}`);

        // Extract data
        const title = await page.title();
        const data = await page.$$eval('article', articles =>
            articles.map(article => ({
                heading: article.querySelector('h2')?.textContent,
                text: article.querySelector('p')?.textContent,
            }))
        );

        // Save data
        await Dataset.pushData({ url: request.url, title, items: data });

        // Enqueue new links
        await enqueueLinks({
            selector: 'a[href]',
            strategy: 'same-domain',
        });
    },
});

await crawler.run(['https://example.com']);
And in Python:
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing: {context.request.url}')

        # Extract data
        title = await context.page.title()
        articles = await context.page.query_selector_all('article')
        data = []
        for article in articles:
            heading = await article.query_selector('h2')
            text = await article.query_selector('p')
            data.append({
                'heading': await heading.text_content() if heading else None,
                'text': await text.text_content() if text else None,
            })

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'items': data,
        })

        # Enqueue new links
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())
Conclusion
Installing Crawlee is straightforward whether you're using Node.js or Python. The library's flexibility allows you to choose between lightweight HTTP crawling and full browser automation depending on your scraping needs. With proper installation and configuration, Crawlee provides a robust foundation for building scalable web scraping projects with features like automatic retries, proxy rotation, and intelligent request management.
For production deployments, consider using Docker containers and implementing robust error handling so your crawlers run reliably at scale.