Can I use Crawlee with TypeScript for type-safe web scraping?
Yes. Crawlee is written in TypeScript and treats it as a first-class citizen, making it an excellent choice for developers who want type safety, better IDE support, and fewer runtime errors in their web scraping projects.
Why Use TypeScript with Crawlee?
TypeScript offers several advantages for web scraping projects:
- Type Safety: Catch errors at compile time rather than at runtime (see the sketch after this list)
- Better IDE Support: Enhanced autocomplete, intellisense, and refactoring tools
- Self-Documenting Code: Type definitions serve as inline documentation
- Easier Maintenance: Types make it easier to understand and modify code over time
- Reduced Bugs: Type checking prevents many common programming errors
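As a quick illustration of the first point, here is a minimal sketch (the `Product` interface and values are made up for this example) of the kind of mistake TypeScript rejects before your crawler ever runs:

```typescript
interface Product {
    title: string;
    price: number;
}

function formatPrice(product: Product): string {
    return `$${product.price.toFixed(2)}`;
}

// OK: the object matches the interface
console.log(formatPrice({ title: 'Laptop', price: 999.99 }));

// Compile-time error if uncommented: Type 'string' is not assignable
// to type 'number'. In plain JavaScript, this typo would only surface
// at runtime, if at all.
// console.log(formatPrice({ title: 'Laptop', price: '999.99' }));
```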
Installing Crawlee for TypeScript
To get started with Crawlee and TypeScript, you'll need Node.js installed. Then create a new TypeScript project:
```bash
# Create a new directory for your project
mkdir my-crawler
cd my-crawler

# Initialize a new npm project
npm init -y

# Install Crawlee and TypeScript dependencies
npm install crawlee
npm install -D typescript tsx @types/node

# Initialize TypeScript configuration
npx tsc --init
```

This guide uses tsx as the TypeScript runner: the examples below rely on top-level await, which requires ES modules, and tsx supports that out of the box.
Update your `tsconfig.json` with recommended settings. Because the examples use top-level await, `module` is set to `"ES2022"` rather than `"commonjs"`:
```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "lib": ["ES2022", "DOM"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "moduleResolution": "node"
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}
```

The `"DOM"` entry in `lib` lets TypeScript type-check the `page.$eval` callbacks in the Puppeteer examples below, since those callbacks run in the browser.
Basic TypeScript Crawler Example
Here's a complete example of a type-safe crawler using Crawlee with TypeScript:
```typescript
// src/crawler.ts
import { CheerioCrawler } from 'crawlee';

interface ProductData {
    title: string;
    price: number;
    url: string;
    inStock: boolean;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, pushData, log }) {
        const title = $('h1.product-title').text().trim();

        // TypeScript ensures we handle the price correctly
        const priceText = $('.price').text().replace(/[^0-9.]/g, '');
        const price = parseFloat(priceText);

        const inStock = $('.availability').text().includes('In Stock');

        // Type-safe data structure
        const product: ProductData = {
            title,
            price,
            url: request.url,
            inStock,
        };

        log.info(`Scraped product: ${product.title} - $${product.price}`);

        // Store the data with type safety
        await pushData(product);

        // Enqueue more links
        await enqueueLinks({
            globs: ['https://example.com/products/*'],
            label: 'PRODUCT',
        });
    },
    maxRequestsPerCrawl: 100,
    maxConcurrency: 5,
});

await crawler.run(['https://example.com/products']);
```
Using Crawlee with Puppeteer and TypeScript
For handling browser automation tasks, you can use Crawlee's PuppeteerCrawler with full TypeScript support:
```typescript
import { PuppeteerCrawler, Dataset } from 'crawlee';

interface ScrapedArticle {
    heading: string;
    author: string;
    publishDate: Date;
    content: string;
    tags: string[];
}

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, enqueueLinks, log }) {
        log.info(`Processing: ${request.url}`);

        // Type-safe page interactions
        const heading = await page.$eval(
            'h1.article-title',
            (el) => el.textContent?.trim() ?? ''
        );
        const author = await page.$eval(
            '.author-name',
            (el) => el.textContent?.trim() ?? 'Unknown'
        );
        const dateText = await page.$eval(
            'time[datetime]',
            (el) => el.getAttribute('datetime') ?? ''
        );
        const content = await page.$eval(
            '.article-content',
            (el) => el.textContent?.trim() ?? ''
        );
        const tags = await page.$$eval(
            '.tag',
            (elements) => elements.map((el) => el.textContent?.trim() ?? '')
        );

        // Create a type-safe article object
        const article: ScrapedArticle = {
            heading,
            author,
            publishDate: new Date(dateText),
            content,
            tags: tags.filter(Boolean),
        };

        await Dataset.pushData(article);

        // Enqueue pagination links
        await enqueueLinks({
            selector: '.pagination a',
            label: 'LIST',
        });
    },
    headless: true,
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com/blog']);
```
Advanced TypeScript Features with Crawlee
Custom Request Context Types
Crawlee's `Request` class is generic over its `userData` payload, so you can keep custom per-request data type-safe throughout your crawler:
```typescript
import { CheerioCrawler, Request } from 'crawlee';

interface CustomUserData {
    category: string;
    depth: number;
    parentUrl?: string;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        // Cast userData to the interface for type-safe access
        const { category, depth } = request.userData as CustomUserData;
        log.info(`Scraping ${category} at depth ${depth}`);

        if (depth < 3) {
            await enqueueLinks({
                globs: [`https://example.com/${category}/*`],
                userData: {
                    category,
                    depth: depth + 1,
                    parentUrl: request.url,
                } satisfies CustomUserData,
            });
        }
    },
});

const initialRequest = new Request<CustomUserData>({
    url: 'https://example.com/electronics',
    userData: {
        category: 'electronics',
        depth: 0,
    },
});

await crawler.run([initialRequest]);
```
Type-Safe Router Pattern
Crawlee's router pattern works seamlessly with TypeScript:
```typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

interface CategoryData {
    name: string;
    productCount: number;
}

interface ProductDetailData {
    name: string;
    price: number;
    description: string;
    images: string[];
}

const router = createCheerioRouter();

router.addHandler('CATEGORY', async ({ $, enqueueLinks, pushData }) => {
    const categoryName = $('h1.category-name').text().trim();
    const productCount = $('.product-item').length;

    const categoryData: CategoryData = {
        name: categoryName,
        productCount,
    };
    await pushData(categoryData);

    await enqueueLinks({
        selector: '.product-item a',
        label: 'PRODUCT',
    });
});

router.addHandler('PRODUCT', async ({ $, pushData }) => {
    const name = $('h1.product-name').text().trim();
    const priceText = $('.price').text().replace(/[^0-9.]/g, '');
    const description = $('.description').text().trim();
    // Use a type guard to narrow string | undefined to string
    const images = $('.product-image img')
        .map((_, el) => $(el).attr('src'))
        .get()
        .filter((src): src is string => typeof src === 'string');

    const productData: ProductDetailData = {
        name,
        price: parseFloat(priceText),
        description,
        images,
    };
    await pushData(productData);
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    maxRequestsPerCrawl: 100,
});

await crawler.run([{
    url: 'https://example.com/categories/electronics',
    label: 'CATEGORY',
}]);
```
Working with Datasets and Type Safety
Crawlee's Dataset API maintains type safety when storing and retrieving data:
```typescript
import { Dataset } from 'crawlee';

interface Product {
    id: string;
    name: string;
    price: number;
    rating: number;
}

// Open (or create) a typed dataset
const dataset = await Dataset.open<Product>('products');

// Push data with type checking
await dataset.pushData({
    id: 'prod-123',
    name: 'Laptop',
    price: 999.99,
    rating: 4.5,
});

// Get data with proper typing
const data = await dataset.getData();
data.items.forEach((product: Product) => {
    console.log(`${product.name}: $${product.price} (${product.rating}★)`);
});

// Map over items with type safety; map() returns the results as an array
const discounted = await dataset.map((item) => ({
    ...item,
    discountedPrice: item.price * 0.9,
}));
console.log(`Computed ${discounted.length} discounted prices`);
```
Error Handling with TypeScript
TypeScript helps you write more robust error handling for browser automation:
```typescript
import { PuppeteerCrawler } from 'crawlee';

class ScrapingError extends Error {
    constructor(
        message: string,
        public url: string,
        public statusCode?: number
    ) {
        super(message);
        this.name = 'ScrapingError';
    }
}

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, response, log }) {
        try {
            // The crawler has already navigated the page, so inspect the
            // navigation response from the context instead of calling
            // page.goto() again
            if (!response) {
                throw new ScrapingError('No response received', request.url);
            }
            if (response.status() !== 200) {
                throw new ScrapingError(
                    `HTTP ${response.status()}`,
                    request.url,
                    response.status()
                );
            }
            // Your scraping logic here
        } catch (error) {
            if (error instanceof ScrapingError) {
                log.error(`Scraping error for ${error.url}: ${error.message}`);
                if (error.statusCode && error.statusCode >= 500) {
                    // Rethrow so Crawlee retries server errors
                    throw error;
                }
            } else if (error instanceof Error) {
                log.error(`Unexpected error: ${error.message}`);
                // Rethrow so the request is retried rather than
                // silently marked as handled
                throw error;
            }
        }
    },
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
});
```
Configuration with TypeScript
Define your crawler configuration in a type-safe way:
```typescript
import { PuppeteerCrawler, type PuppeteerCrawlerOptions } from 'crawlee';

interface CrawlerConfig {
    maxConcurrency: number;
    maxRequestsPerCrawl: number;
    headless: boolean;
    userAgent: string;
}

const config: CrawlerConfig = {
    maxConcurrency: 10,
    maxRequestsPerCrawl: 1000,
    headless: true,
    userAgent: 'Mozilla/5.0 (compatible; MyCrawler/1.0)',
};

const crawlerOptions: PuppeteerCrawlerOptions = {
    maxConcurrency: config.maxConcurrency,
    maxRequestsPerCrawl: config.maxRequestsPerCrawl,
    launchContext: {
        launchOptions: {
            headless: config.headless,
        },
    },
    preNavigationHooks: [
        async ({ page }) => {
            await page.setUserAgent(config.userAgent);
        },
    ],
    requestHandler: async ({ page, request, log }) => {
        // Your handler logic
    },
};

const crawler = new PuppeteerCrawler(crawlerOptions);
```
Running Your TypeScript Crawler
To run your TypeScript crawler, add these scripts to your `package.json`, and set `"type": "module"` so Node treats the project as ES modules (required for the top-level await used throughout this guide):
```json
{
  "type": "module",
  "scripts": {
    "start": "tsx src/crawler.ts",
    "build": "tsc",
    "dev": "tsx watch src/crawler.ts"
  }
}
```
Then run your crawler:
```bash
# Run directly with tsx
npm start

# Or build with tsc and run the compiled JavaScript
npm run build
node dist/crawler.js

# Development mode with auto-restart on file changes
npm run dev
```
Best Practices for TypeScript and Crawlee
- Define Clear Interfaces: Always define interfaces for your scraped data structures
- Use Strict Mode: Enable `"strict": true` in your `tsconfig.json`
- Type Your Selectors: Use type guards when working with nullable DOM selections (see the sketch after this list)
- Leverage Generics: Use Crawlee's generic types for request handlers and datasets
- Error Types: Create custom error classes for different scraping scenarios
- Config Objects: Define configuration interfaces for reusable crawler setups
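As an example of the type-guard practice, here is a minimal sketch. The data is hypothetical; in a real crawler the raw values would come from calls like `$(el).attr('href')`, which return `string | undefined`:

```typescript
// A reusable type guard that narrows out undefined and empty values
function isNonEmptyString(value: string | undefined): value is string {
    return typeof value === 'string' && value.length > 0;
}

// Hypothetical raw values, e.g. from $('a').map((_, el) => $(el).attr('href')).get()
const rawHrefs: (string | undefined)[] = ['/products/1', undefined, '', '/products/2'];

// After filtering with the guard, TypeScript knows this is string[]
const hrefs: string[] = rawHrefs.filter(isNonEmptyString);

console.log(hrefs); // ['/products/1', '/products/2']
```

Extracting the guard into a named helper keeps inline `(src): src is string` annotations out of your handlers and makes the narrowing reusable across crawlers.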
Conclusion
Crawlee's first-class TypeScript support makes it an excellent choice for building type-safe, maintainable web scraping applications. The combination of Crawlee's powerful features with TypeScript's type system helps you catch errors early, improve code quality, and build more robust scrapers. Whether you're navigating complex page structures or processing large-scale data extraction tasks, TypeScript ensures your code remains reliable and easy to maintain.
By following the examples and best practices outlined in this guide, you can leverage the full power of TypeScript in your Crawlee projects and build production-ready web scrapers with confidence.