Can I use Crawlee with TypeScript for type-safe web scraping?
Yes. Crawlee is written in TypeScript and treats it as a first-class citizen, making it an excellent choice for developers who want type safety, better IDE support, and fewer runtime errors in their web scraping projects.
Why Use TypeScript with Crawlee?
TypeScript offers several advantages for web scraping projects:
- Type Safety: Catch errors at compile time rather than at runtime (see the sketch after this list)
- Better IDE Support: Enhanced autocomplete, intellisense, and refactoring tools
- Self-Documenting Code: Type definitions serve as inline documentation
- Easier Maintenance: Types make it easier to understand and modify code over time
- Reduced Bugs: Type checking prevents many common programming errors
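As a quick illustration of the first point, here is a minimal sketch (the `Product` interface and values are made up for this example) of the kind of mistake TypeScript rejects before your crawler ever runs:

```typescript
interface Product {
    title: string;
    price: number;
}

function formatPrice(product: Product): string {
    return `$${product.price.toFixed(2)}`;
}

// OK: the object matches the interface
console.log(formatPrice({ title: 'Laptop', price: 999.99 }));

// Compile-time error if uncommented: Type 'string' is not assignable
// to type 'number'. In plain JavaScript, this typo would only surface
// at runtime, if at all.
// console.log(formatPrice({ title: 'Laptop', price: '999.99' }));
```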
Installing Crawlee for TypeScript
To get started with Crawlee and TypeScript, you'll need Node.js installed. Then create a new TypeScript project:
```bash
# Create a new directory for your project
mkdir my-crawler
cd my-crawler

# Initialize a new npm project
npm init -y

# Install Crawlee and TypeScript dependencies
npm install crawlee
npm install -D typescript tsx @types/node

# Initialize TypeScript configuration
npx tsc --init
```

This guide uses tsx as the TypeScript runner: the examples below rely on top-level await, which requires ES modules, and tsx supports that out of the box.
Update your `tsconfig.json` with recommended settings. Because the examples use top-level await, `module` is set to `"ES2022"` rather than `"commonjs"`:
```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "lib": ["ES2022", "DOM"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "moduleResolution": "node"
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}
```

The `"DOM"` entry in `lib` lets TypeScript type-check the `page.$eval` callbacks in the Puppeteer examples below, since those callbacks run in the browser.
Basic TypeScript Crawler Example
Here's a complete example of a type-safe crawler using Crawlee with TypeScript:
```typescript
// src/crawler.ts
import { CheerioCrawler } from 'crawlee';

interface ProductData {
    title: string;
    price: number;
    url: string;
    inStock: boolean;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, pushData, log }) {
        const title = $('h1.product-title').text().trim();

        // TypeScript ensures we handle the price correctly
        const priceText = $('.price').text().replace(/[^0-9.]/g, '');
        const price = parseFloat(priceText);

        const inStock = $('.availability').text().includes('In Stock');

        // Type-safe data structure
        const product: ProductData = {
            title,
            price,
            url: request.url,
            inStock,
        };

        log.info(`Scraped product: ${product.title} - $${product.price}`);

        // Store the data with type safety
        await pushData(product);

        // Enqueue more links
        await enqueueLinks({
            globs: ['https://example.com/products/*'],
            label: 'PRODUCT',
        });
    },
    maxRequestsPerCrawl: 100,
    maxConcurrency: 5,
});

await crawler.run(['https://example.com/products']);
```
Using Crawlee with Puppeteer and TypeScript
For handling browser automation tasks, you can use Crawlee's PuppeteerCrawler with full TypeScript support:
```typescript
import { PuppeteerCrawler, Dataset } from 'crawlee';

interface ScrapedArticle {
    heading: string;
    author: string;
    publishDate: Date;
    content: string;
    tags: string[];
}

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, enqueueLinks, log }) {
        log.info(`Processing: ${request.url}`);

        // Type-safe page interactions
        const heading = await page.$eval(
            'h1.article-title',
            (el) => el.textContent?.trim() ?? ''
        );
        const author = await page.$eval(
            '.author-name',
            (el) => el.textContent?.trim() ?? 'Unknown'
        );
        const dateText = await page.$eval(
            'time[datetime]',
            (el) => el.getAttribute('datetime') ?? ''
        );
        const content = await page.$eval(
            '.article-content',
            (el) => el.textContent?.trim() ?? ''
        );
        const tags = await page.$$eval(
            '.tag',
            (elements) => elements.map((el) => el.textContent?.trim() ?? '')
        );

        // Create a type-safe article object
        const article: ScrapedArticle = {
            heading,
            author,
            publishDate: new Date(dateText),
            content,
            tags: tags.filter(Boolean),
        };

        await Dataset.pushData(article);

        // Enqueue pagination links
        await enqueueLinks({
            selector: '.pagination a',
            label: 'LIST',
        });
    },
    headless: true,
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com/blog']);
```
Advanced TypeScript Features with Crawlee
Custom Request Context Types
Crawlee's `Request` class is generic over its `userData` payload, so you can keep custom per-request data type-safe throughout your crawler:
```typescript
import { CheerioCrawler, Request } from 'crawlee';

interface CustomUserData {
    category: string;
    depth: number;
    parentUrl?: string;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        // Cast userData to the interface for type-safe access
        const { category, depth } = request.userData as CustomUserData;
        log.info(`Scraping ${category} at depth ${depth}`);

        if (depth < 3) {
            await enqueueLinks({
                globs: [`https://example.com/${category}/*`],
                userData: {
                    category,
                    depth: depth + 1,
                    parentUrl: request.url,
                } satisfies CustomUserData,
            });
        }
    },
});

const initialRequest = new Request<CustomUserData>({
    url: 'https://example.com/electronics',
    userData: {
        category: 'electronics',
        depth: 0,
    },
});

await crawler.run([initialRequest]);
```
Type-Safe Router Pattern
Crawlee's router pattern works seamlessly with TypeScript:
```typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

interface CategoryData {
    name: string;
    productCount: number;
}

interface ProductDetailData {
    name: string;
    price: number;
    description: string;
    images: string[];
}

const router = createCheerioRouter();

router.addHandler('CATEGORY', async ({ $, enqueueLinks, pushData }) => {
    const categoryName = $('h1.category-name').text().trim();
    const productCount = $('.product-item').length;

    const categoryData: CategoryData = {
        name: categoryName,
        productCount,
    };
    await pushData(categoryData);

    await enqueueLinks({
        selector: '.product-item a',
        label: 'PRODUCT',
    });
});

router.addHandler('PRODUCT', async ({ $, pushData }) => {
    const name = $('h1.product-name').text().trim();
    const priceText = $('.price').text().replace(/[^0-9.]/g, '');
    const description = $('.description').text().trim();
    // Use a type guard to narrow string | undefined to string
    const images = $('.product-image img')
        .map((_, el) => $(el).attr('src'))
        .get()
        .filter((src): src is string => typeof src === 'string');

    const productData: ProductDetailData = {
        name,
        price: parseFloat(priceText),
        description,
        images,
    };
    await pushData(productData);
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    maxRequestsPerCrawl: 100,
});

await crawler.run([{
    url: 'https://example.com/categories/electronics',
    label: 'CATEGORY',
}]);
```
Working with Datasets and Type Safety
Crawlee's Dataset API maintains type safety when storing and retrieving data:
```typescript
import { Dataset } from 'crawlee';

interface Product {
    id: string;
    name: string;
    price: number;
    rating: number;
}

// Open (or create) a typed dataset
const dataset = await Dataset.open<Product>('products');

// Push data with type checking
await dataset.pushData({
    id: 'prod-123',
    name: 'Laptop',
    price: 999.99,
    rating: 4.5,
});

// Get data with proper typing
const data = await dataset.getData();
data.items.forEach((product: Product) => {
    console.log(`${product.name}: $${product.price} (${product.rating}★)`);
});

// Map over items with type safety; map() returns the results as an array
const discounted = await dataset.map((item) => ({
    ...item,
    discountedPrice: item.price * 0.9,
}));
console.log(`Computed ${discounted.length} discounted prices`);
```
Error Handling with TypeScript
TypeScript helps you write more robust error handling for browser automation:
```typescript
import { PuppeteerCrawler } from 'crawlee';

class ScrapingError extends Error {
    constructor(
        message: string,
        public url: string,
        public statusCode?: number
    ) {
        super(message);
        this.name = 'ScrapingError';
    }
}

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, response, log }) {
        try {
            // The crawler has already navigated the page, so inspect the
            // navigation response from the context instead of calling
            // page.goto() again
            if (!response) {
                throw new ScrapingError('No response received', request.url);
            }
            if (response.status() !== 200) {
                throw new ScrapingError(
                    `HTTP ${response.status()}`,
                    request.url,
                    response.status()
                );
            }
            // Your scraping logic here
        } catch (error) {
            if (error instanceof ScrapingError) {
                log.error(`Scraping error for ${error.url}: ${error.message}`);
                if (error.statusCode && error.statusCode >= 500) {
                    // Rethrow so Crawlee retries server errors
                    throw error;
                }
            } else if (error instanceof Error) {
                log.error(`Unexpected error: ${error.message}`);
                // Rethrow so the request is retried rather than
                // silently marked as handled
                throw error;
            }
        }
    },
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
});
```
Configuration with TypeScript
Define your crawler configuration in a type-safe way:
```typescript
import { PuppeteerCrawler, type PuppeteerCrawlerOptions } from 'crawlee';

interface CrawlerConfig {
    maxConcurrency: number;
    maxRequestsPerCrawl: number;
    headless: boolean;
    userAgent: string;
}

const config: CrawlerConfig = {
    maxConcurrency: 10,
    maxRequestsPerCrawl: 1000,
    headless: true,
    userAgent: 'Mozilla/5.0 (compatible; MyCrawler/1.0)',
};

const crawlerOptions: PuppeteerCrawlerOptions = {
    maxConcurrency: config.maxConcurrency,
    maxRequestsPerCrawl: config.maxRequestsPerCrawl,
    launchContext: {
        launchOptions: {
            headless: config.headless,
        },
    },
    preNavigationHooks: [
        async ({ page }) => {
            await page.setUserAgent(config.userAgent);
        },
    ],
    requestHandler: async ({ page, request, log }) => {
        // Your handler logic
    },
};

const crawler = new PuppeteerCrawler(crawlerOptions);
```
Running Your TypeScript Crawler
To run your TypeScript crawler, add these scripts to your `package.json`, and set `"type": "module"` so Node treats the project as ES modules (required for the top-level await used throughout this guide):
```json
{
  "type": "module",
  "scripts": {
    "start": "tsx src/crawler.ts",
    "build": "tsc",
    "dev": "tsx watch src/crawler.ts"
  }
}
```
Then run your crawler:
```bash
# Run directly with tsx
npm start

# Or build with tsc and run the compiled JavaScript
npm run build
node dist/crawler.js

# Development mode with auto-restart on file changes
npm run dev
```
Best Practices for TypeScript and Crawlee
- Define Clear Interfaces: Always define interfaces for your scraped data structures
- Use Strict Mode: Enable `"strict": true` in your `tsconfig.json`
- Type Your Selectors: Use type guards when working with nullable DOM selections (see the sketch after this list)
- Leverage Generics: Use Crawlee's generic types for request handlers and datasets
- Error Types: Create custom error classes for different scraping scenarios
- Config Objects: Define configuration interfaces for reusable crawler setups
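As an example of the type-guard practice, here is a minimal sketch. The data is hypothetical; in a real crawler the raw values would come from calls like `$(el).attr('href')`, which return `string | undefined`:

```typescript
// A reusable type guard that narrows out undefined and empty values
function isNonEmptyString(value: string | undefined): value is string {
    return typeof value === 'string' && value.length > 0;
}

// Hypothetical raw values, e.g. from $('a').map((_, el) => $(el).attr('href')).get()
const rawHrefs: (string | undefined)[] = ['/products/1', undefined, '', '/products/2'];

// After filtering with the guard, TypeScript knows this is string[]
const hrefs: string[] = rawHrefs.filter(isNonEmptyString);

console.log(hrefs); // ['/products/1', '/products/2']
```

Extracting the guard into a named helper keeps inline `(src): src is string` annotations out of your handlers and makes the narrowing reusable across crawlers.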
Conclusion
Crawlee's first-class TypeScript support makes it an excellent choice for building type-safe, maintainable web scraping applications. The combination of Crawlee's powerful features with TypeScript's type system helps you catch errors early, improve code quality, and build more robust scrapers. Whether you're navigating complex page structures or processing large-scale data extraction tasks, TypeScript ensures your code remains reliable and easy to maintain.
By following the examples and best practices outlined in this guide, you can leverage the full power of TypeScript in your Crawlee projects and build production-ready web scrapers with confidence.