Can I use Crawlee with TypeScript for type-safe web scraping?

Yes. Crawlee is written in TypeScript and treats it as a first-class citizen: the package ships with its own type definitions, making it an excellent choice for developers who want type safety, better IDE support, and fewer runtime errors in their web scraping projects.

Why Use TypeScript with Crawlee?

TypeScript offers several advantages for web scraping projects:

  • Type Safety: Catch errors at compile time rather than at runtime (see the short sketch after this list)
  • Better IDE Support: Enhanced autocomplete, intellisense, and refactoring tools
  • Self-Documenting Code: Type definitions serve as inline documentation
  • Easier Maintenance: Types make it easier to understand and modify code over time
  • Reduced Bugs: Type checking prevents many common programming errors
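
As a quick illustration of the first point, here is a minimal, framework-free sketch (the ProductData shape here is just an example) of the kind of mistake the compiler rejects before a crawler ever runs:

interface ProductData {
  title: string;
  price: number;
}

// The object literal is checked against ProductData at compile time.
const product: ProductData = {
  title: 'Laptop',
  price: 999.99,
};

// A typo such as prise: 999.99, or a string value like '999.99' for price,
// would fail to compile instead of silently producing bad data at runtime.
console.log(`${product.title} costs $${product.price}`);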

Installing Crawlee for TypeScript

To get started with Crawlee and TypeScript, you'll need Node.js installed. Then create a new TypeScript project:

# Create a new directory for your project
mkdir my-crawler
cd my-crawler

# Initialize a new npm project
npm init -y

# Install Crawlee and TypeScript dependencies
npm install crawlee
npm install -D typescript @types/node ts-node

# Initialize TypeScript configuration
npx tsc --init

Update your tsconfig.json with recommended settings:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "commonjs",
    "lib": ["ES2022"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "moduleResolution": "node"
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}
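
The examples in this guide use top-level await (await crawler.run(...) at module scope), which Node.js only allows in ES modules. With the NodeNext settings above, one way to opt in, assuming a standard npm project layout, is to mark the package as an ES module in package.json:

{
  "name": "my-crawler",
  "type": "module"
}

If you prefer CommonJS instead, set "module": "commonjs" in tsconfig.json and wrap each await crawler.run(...) call in an async main() function.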

Basic TypeScript Crawler Example

Here's a complete example of a type-safe crawler using Crawlee with TypeScript:

// src/crawler.ts
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

interface ProductData {
  title: string;
  price: number;
  url: string;
  inStock: boolean;
}

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, enqueueLinks, log }) {
    const title = $('h1.product-title').text().trim();

    // TypeScript ensures we handle the price correctly
    const priceText = $('.price').text().replace(/[^0-9.]/g, '');
    const price = parseFloat(priceText);

    const inStock = $('.availability').text().includes('In Stock');

    // Type-safe data structure
    const product: ProductData = {
      title,
      price,
      url: request.url,
      inStock,
    };

    log.info(`Scraped product: ${product.title} - $${product.price}`);

    // Store the data with type safety
    await crawler.pushData(product);

    // Enqueue more links
    await enqueueLinks({
      globs: ['https://example.com/products/*'],
      label: 'PRODUCT',
    });
  },

  maxRequestsPerCrawl: 100,
  maxConcurrency: 5,
});

await crawler.run(['https://example.com/products']);

Using Crawlee with Puppeteer and TypeScript

For handling browser automation tasks, you can use Crawlee's PuppeteerCrawler with full TypeScript support:

import { PuppeteerCrawler, Dataset } from 'crawlee';

interface ScrapedArticle {
  heading: string;
  author: string;
  publishDate: Date;
  content: string;
  tags: string[];
}

const crawler = new PuppeteerCrawler({
  async requestHandler({ page, request, enqueueLinks, log }) {
    log.info(`Processing: ${request.url}`);

    // Type-safe page interactions
    const heading = await page.$eval(
      'h1.article-title',
      (el) => el.textContent?.trim() ?? ''
    );

    const author = await page.$eval(
      '.author-name',
      (el) => el.textContent?.trim() ?? 'Unknown'
    );

    const dateText = await page.$eval(
      'time[datetime]',
      (el) => el.getAttribute('datetime') ?? ''
    );

    const content = await page.$eval(
      '.article-content',
      (el) => el.textContent?.trim() ?? ''
    );

    const tags = await page.$$eval(
      '.tag',
      (elements) => elements.map(el => el.textContent?.trim() ?? '')
    );

    // Create type-safe article object
    const article: ScrapedArticle = {
      heading,
      author,
      publishDate: new Date(dateText),
      content,
      tags: tags.filter(Boolean),
    };

    await Dataset.pushData(article);

    // Enqueue pagination links
    await enqueueLinks({
      selector: '.pagination a',
      label: 'LIST',
    });
  },

  headless: true,
  maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com/blog']);

Advanced TypeScript Features with Crawlee

Custom Request Context Types

You can define custom types for your request context to maintain type safety throughout your crawler:

import { CheerioCrawler, Request } from 'crawlee';

interface CustomUserData {
  category: string;
  depth: number;
  parentUrl?: string;
}

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, enqueueLinks, log }) {
    // request.userData is typed as a loose dictionary on the context,
    // so narrow it to our interface before destructuring
    const { category, depth, parentUrl } = request.userData as CustomUserData;

    log.info(`Scraping ${category} at depth ${depth}`);

    if (depth < 3) {
      await enqueueLinks({
        globs: [`https://example.com/${category}/*`],
        userData: {
          category,
          depth: depth + 1,
          parentUrl: request.url,
        },
      });
    }
  },
});

const initialRequest = new Request<CustomUserData>({
  url: 'https://example.com/electronics',
  userData: {
    category: 'electronics',
    depth: 0,
  },
});

await crawler.run([initialRequest]);

Type-Safe Router Pattern

Crawlee's router pattern works seamlessly with TypeScript:

import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

interface CategoryData {
  name: string;
  productCount: number;
}

interface ProductDetailData {
  name: string;
  price: number;
  description: string;
  images: string[];
}

const router = createCheerioRouter();

router.addHandler('CATEGORY', async ({ request, $, enqueueLinks }) => {
  const categoryName = $('h1.category-name').text().trim();
  const productCount = $('.product-item').length;

  const categoryData: CategoryData = {
    name: categoryName,
    productCount,
  };

  await Dataset.pushData(categoryData);

  await enqueueLinks({
    selector: '.product-item a',
    label: 'PRODUCT',
  });
});

router.addHandler('PRODUCT', async ({ request, $, crawler }) => {
  const name = $('h1.product-name').text().trim();
  const priceText = $('.price').text().replace(/[^0-9.]/g, '');
  const description = $('.description').text().trim();
  const images = $('.product-image img')
    .map((_, el) => $(el).attr('src'))
    .get()
    .filter((src): src is string => typeof src === 'string');

  const productData: ProductDetailData = {
    name,
    price: parseFloat(priceText),
    description,
    images,
  };

  await crawler.pushData(productData);
});

const crawler = new CheerioCrawler({
  requestHandler: router,
  maxRequestsPerCrawl: 100,
});

await crawler.run([{
  url: 'https://example.com/categories/electronics',
  label: 'CATEGORY',
}]);

Working with Datasets and Type Safety

Crawlee's Dataset API maintains type safety when storing and retrieving data:

import { Dataset } from 'crawlee';

interface Product {
  id: string;
  name: string;
  price: number;
  rating: number;
}

// Create a typed dataset
const dataset = await Dataset.open<Product>('products');

// Push data with type checking
await dataset.pushData({
  id: 'prod-123',
  name: 'Laptop',
  price: 999.99,
  rating: 4.5,
});

// Get data with proper typing
const data = await dataset.getData();
data.items.forEach((product: Product) => {
  console.log(`${product.name}: $${product.price} (${product.rating}★)`);
});

// Map over items with type safety (map returns a new array; the stored dataset is unchanged)
const discounted = await dataset.map((item) => ({
  ...item,
  discountedPrice: item.price * 0.9,
}));
console.log(`Computed ${discounted.length} discounted prices`);

Error Handling with TypeScript

TypeScript helps you write more robust error handling for managing browser automation challenges:

import { PuppeteerCrawler } from 'crawlee';

class ScrapingError extends Error {
  constructor(
    message: string,
    public url: string,
    public statusCode?: number
  ) {
    super(message);
    this.name = 'ScrapingError';
  }
}

const crawler = new PuppeteerCrawler({
  async requestHandler({ page, request, response, log }) {
    try {
      // PuppeteerCrawler has already navigated to request.url;
      // the navigation response is exposed on the crawling context.
      if (!response) {
        throw new ScrapingError('No response received', request.url);
      }

      if (response.status() !== 200) {
        throw new ScrapingError(
          `HTTP ${response.status()}`,
          request.url,
          response.status()
        );
      }

      // Your scraping logic here

    } catch (error) {
      if (error instanceof ScrapingError) {
        log.error(`Scraping error for ${error.url}: ${error.message}`);
        if (error.statusCode && error.statusCode >= 500) {
          // Retry server errors
          throw error;
        }
      } else if (error instanceof Error) {
        log.error(`Unexpected error: ${error.message}`);
      }
    }
  },

  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 60,
});

Configuration with TypeScript

Define your crawler configuration in a type-safe way:

import { PuppeteerCrawler, type PuppeteerCrawlerOptions } from 'crawlee';

interface CrawlerConfig {
  maxConcurrency: number;
  maxRequestsPerCrawl: number;
  headless: boolean;
  userAgent: string;
}

const config: CrawlerConfig = {
  maxConcurrency: 10,
  maxRequestsPerCrawl: 1000,
  headless: true,
  userAgent: 'Mozilla/5.0 (compatible; MyCrawler/1.0)',
};

const crawlerOptions: PuppeteerCrawlerOptions = {
  maxConcurrency: config.maxConcurrency,
  maxRequestsPerCrawl: config.maxRequestsPerCrawl,
  launchContext: {
    launchOptions: {
      headless: config.headless,
    },
  },
  preNavigationHooks: [
    async ({ page }) => {
      await page.setUserAgent(config.userAgent);
    },
  ],
  requestHandler: async ({ page, request, log }) => {
    // Your handler logic
  },
};

// Pass the typed options object straight to the crawler constructor
const crawler = new PuppeteerCrawler(crawlerOptions);

Running Your TypeScript Crawler

To run your TypeScript crawler, add these scripts to your package.json:

{
  "scripts": {
    "start": "ts-node --esm src/crawler.ts",
    "build": "tsc",
    "dev": "node --watch --loader ts-node/esm src/crawler.ts"
  }
}

Then run your crawler:

# Run directly with ts-node
npm start

# Or build and run
npm run build
node dist/crawler.js

# Development mode with auto-restart
npm run dev

Best Practices for TypeScript and Crawlee

  1. Define Clear Interfaces: Always define interfaces for your scraped data structures
  2. Use Strict Mode: Enable "strict": true in your tsconfig.json
  3. Type Your Selectors: Use type guards when working with nullable DOM selections (see the sketch after this list)
  4. Leverage Generics: Use Crawlee's generic types for request handlers and datasets
  5. Error Types: Create custom error classes for different scraping scenarios
  6. Config Objects: Define configuration interfaces for reusable crawler setups
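
To illustrate the third point, here is a minimal sketch of a type guard for nullable text extracted with Cheerio. The helper names textOrNull and isNonEmpty are illustrative only, not part of Crawlee's API, and the example assumes the cheerio package is installed in your project:

import type { CheerioAPI } from 'cheerio';

// Returns trimmed text for a selector, or null when the element is missing or empty.
function textOrNull($: CheerioAPI, selector: string): string | null {
  const text = $(selector).text().trim();
  return text.length > 0 ? text : null;
}

// Type guard that narrows string | null down to string.
function isNonEmpty(value: string | null): value is string {
  return value !== null;
}

// Inside a requestHandler you might write:
// const maybeTitle = textOrNull($, 'h1.product-title');
// if (isNonEmpty(maybeTitle)) {
//   // maybeTitle is narrowed to string here
// }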

Conclusion

Crawlee's first-class TypeScript support makes it an excellent choice for building type-safe, maintainable web scraping applications. The combination of Crawlee's powerful features with TypeScript's type system helps you catch errors early, improve code quality, and build more robust scrapers. Whether you're navigating complex page structures or processing large-scale data extraction tasks, TypeScript ensures your code remains reliable and easy to maintain.

By following the examples and best practices outlined in this guide, you can leverage the full power of TypeScript in your Crawlee projects and build production-ready web scrapers with confidence.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
