How do you use Cheerio with TypeScript?

Cheerio is a powerful server-side HTML parsing library that brings jQuery-like functionality to Node.js environments. When combined with TypeScript, it provides excellent type safety and developer experience for web scraping and HTML manipulation tasks. This guide will show you how to properly set up and use Cheerio with TypeScript.

Installing Cheerio with TypeScript Support

First, install Cheerio and TypeScript:

npm install cheerio
npm install --save-dev typescript

Current versions of Cheerio (1.x) ship their own type definitions, so no separate @types package is needed. The @types/cheerio package only covers the legacy 0.22.x API; add it only if your project is pinned to that older release:

npm install --save-dev @types/cheerio

Basic TypeScript Configuration

Create or update your tsconfig.json file:

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}

Basic Cheerio Usage with TypeScript

Here's how to get started with Cheerio in TypeScript:

import * as cheerio from 'cheerio';
import axios from 'axios';

interface ScrapedData {
  title: string;
  links: string[];
  paragraphs: string[];
}

async function scrapeWebsite(url: string): Promise<ScrapedData> {
  try {
    // Fetch the HTML content
    const response = await axios.get(url);
    const html: string = response.data;

    // Load HTML into Cheerio
    const $: cheerio.CheerioAPI = cheerio.load(html);

    // Extract data with type safety
    const title: string = $('title').text().trim();

    const links: string[] = [];
    $('a[href]').each((index: number, element: cheerio.Element) => {
      const href = $(element).attr('href');
      if (href) {
        links.push(href);
      }
    });

    const paragraphs: string[] = [];
    $('p').each((index: number, element: cheerio.Element) => {
      paragraphs.push($(element).text().trim());
    });

    return { title, links, paragraphs };
  } catch (error) {
    throw new Error(`Failed to scrape website: ${error}`);
  }
}

// Usage example
scrapeWebsite('https://example.com')
  .then((data: ScrapedData) => {
    console.log('Title:', data.title);
    console.log('Links found:', data.links.length);
    console.log('Paragraphs:', data.paragraphs.length);
  })
  .catch((error: Error) => {
    console.error('Scraping failed:', error.message);
  });

Advanced TypeScript Features with Cheerio

Custom Type Definitions

Create custom interfaces for structured data extraction:

interface Product {
  name: string;
  price: number;
  description: string;
  imageUrl?: string;
  inStock: boolean;
}

interface ProductPage {
  products: Product[];
  totalCount: number;
  currentPage: number;
}

class ProductScraper {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html);
  }

  public extractProducts(): Product[] {
    const products: Product[] = [];

    this.$('.product-item').each((index: number, element: cheerio.Element) => {
      const $product = this.$(element);

      const name: string = $product.find('.product-name').text().trim();
      const priceText: string = $product.find('.price').text().trim();
      const price: number = parseFloat(priceText.replace(/[^0-9.]/g, '')) || 0;
      const description: string = $product.find('.description').text().trim();
      const imageUrl: string | undefined = $product.find('img').attr('src');
      const inStock: boolean = !$product.hasClass('out-of-stock');

      if (name && price > 0) {
        products.push({
          name,
          price,
          description,
          imageUrl,
          inStock
        });
      }
    });

    return products;
  }

  public getPageInfo(): { currentPage: number; totalPages: number } {
    const currentPage: number = parseInt(this.$('.pagination .current').text(), 10) || 1;
    const totalPages: number = parseInt(this.$('.pagination .page').last().text(), 10) || 1;

    return { currentPage, totalPages };
  }
}
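
For illustration, the class above could be used like this; the sample markup is made up to match the selectors in extractProducts:

// Hypothetical usage of ProductScraper with inline sample markup
const sampleHtml = `
  <div class="product-item">
    <span class="product-name">Sample Widget</span>
    <span class="price">$19.99</span>
    <p class="description">A small example product</p>
    <img src="/images/widget.jpg" />
  </div>`;

const productScraper = new ProductScraper(sampleHtml);
const sampleProducts: Product[] = productScraper.extractProducts();
console.log(sampleProducts); // [{ name: 'Sample Widget', price: 19.99, ... }]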

Generic Helper Functions

Create reusable, type-safe helper functions:

function extractTextArray($: cheerio.CheerioAPI, selector: string): string[] {
  const results: string[] = [];
  $(selector).each((index: number, element: cheerio.Element) => {
    const text = $(element).text().trim();
    if (text) {
      results.push(text);
    }
  });
  return results;
}

function extractAttributes<T extends string>(
  $: cheerio.CheerioAPI, 
  selector: string, 
  attribute: T
): string[] {
  const results: string[] = [];
  $(selector).each((index: number, element: cheerio.Element) => {
    const attr = $(element).attr(attribute);
    if (attr) {
      results.push(attr);
    }
  });
  return results;
}

// Usage examples
const $ = cheerio.load('<div><p>Text 1</p><p>Text 2</p><a href="/link1">Link</a></div>');

const paragraphTexts: string[] = extractTextArray($, 'p');
const linkHrefs: string[] = extractAttributes($, 'a', 'href');

Error Handling and Type Safety

Implement robust error handling with TypeScript:

type ScrapeResult<T> = {
  success: true;
  data: T;
} | {
  success: false;
  error: string;
};

async function safeScrape<T>(
  url: string, 
  extractor: (html: string) => T
): Promise<ScrapeResult<T>> {
  try {
    const response = await axios.get(url, {
      timeout: 10000,
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; TypeScript-Scraper/1.0)'
      }
    });

    // Note: axios rejects non-2xx responses by default, so this branch only
    // runs if validateStatus is customized; it is kept here as a safety net.
    if (response.status !== 200) {
      return {
        success: false,
        error: `HTTP ${response.status}: ${response.statusText}`
      };
    }

    const data = extractor(response.data);
    return { success: true, data };

  } catch (error) {
    return {
      success: false,
      error: error instanceof Error ? error.message : 'Unknown error occurred'
    };
  }
}

// Usage with type safety (run inside an async function; top-level await is not available with CommonJS modules)
const result = await safeScrape('https://example.com', (html: string) => {
  const $ = cheerio.load(html);
  return {
    title: $('title').text(),
    headings: extractTextArray($, 'h1, h2, h3')
  };
});

if (result.success) {
  console.log('Title:', result.data.title);
  console.log('Headings:', result.data.headings);
} else {
  console.error('Scraping failed:', result.error);
}

Working with Forms and Complex Structures

Handle complex HTML structures with type safety:

interface FormField {
  name: string;
  type: string;
  value?: string;
  required: boolean;
  options?: string[];
}

interface FormData {
  action: string;
  method: string;
  fields: FormField[];
}

function extractFormData($: cheerio.CheerioAPI, formSelector: string): FormData | null {
  const $form = $(formSelector).first();

  if ($form.length === 0) {
    return null;
  }

  const action: string = $form.attr('action') || '';
  const method: string = $form.attr('method')?.toUpperCase() || 'GET';
  const fields: FormField[] = [];

  $form.find('input, select, textarea').each((index: number, element: cheerio.Element) => {
    const $field = $(element);
    const name: string = $field.attr('name') || '';
    const type: string = $field.attr('type') || element.tagName.toLowerCase();
    const value: string | undefined = $field.attr('value') || $field.text();
    const required: boolean = $field.attr('required') !== undefined;

    let options: string[] | undefined;
    if (element.tagName.toLowerCase() === 'select') {
      options = [];
      $field.find('option').each((i: number, opt: cheerio.Element) => {
        const optionText = $(opt).text().trim();
        if (optionText) {
          options!.push(optionText);
        }
      });
    }

    if (name) {
      fields.push({ name, type, value, required, options });
    }
  });

  return { action, method, fields };
}
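
As a quick illustration, the extractor above can be pointed at a small form; the markup here is invented for the example:

const formHtml = `
  <form action="/login" method="post">
    <input type="email" name="email" required />
    <input type="password" name="password" required />
    <select name="remember">
      <option>Yes</option>
      <option>No</option>
    </select>
  </form>`;

const $page = cheerio.load(formHtml);
const loginForm = extractFormData($page, 'form');
// loginForm.action === '/login', loginForm.method === 'POST',
// loginForm.fields includes the email, password and remember fields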

Integration with Modern TypeScript Patterns

Use modern TypeScript features for better code organization:

// Using async/await with proper typing
class WebScraper {
  private readonly baseUrl: string;
  private readonly timeout: number;

  constructor(baseUrl: string, timeout: number = 5000) {
    this.baseUrl = baseUrl;
    this.timeout = timeout;
  }

  async scrapeMultiplePages<T>(
    paths: string[], 
    extractor: (html: string, url: string) => T
  ): Promise<T[]> {
    const promises = paths.map(async (path: string): Promise<T> => {
      const url = new URL(path, this.baseUrl).toString();
      const response = await axios.get(url, { timeout: this.timeout });
      return extractor(response.data, url);
    });

    return Promise.all(promises);
  }
}

// Usage example (run inside an async function)
const scraper = new WebScraper('https://example.com');

const results = await scraper.scrapeMultiplePages(
  ['/page1', '/page2', '/page3'],
  (html: string, url: string) => {
    const $ = cheerio.load(html);
    return {
      url,
      title: $('title').text(),
      contentLength: $.text().length
    };
  }
);

Best Practices for TypeScript and Cheerio

1. Type Your Selectors

// Create constants for commonly used selectors
const SELECTORS = {
  TITLE: 'title',
  LINKS: 'a[href]',
  IMAGES: 'img[src]',
  PARAGRAPHS: 'p'
} as const;

// Use them consistently
const title: string = $(SELECTORS.TITLE).text();

2. Validate Data Types

function parseNumber(text: string): number {
  const num = parseFloat(text.replace(/[^0-9.-]/g, ''));
  return isNaN(num) ? 0 : num;
}

function parseBoolean(text: string): boolean {
  return /^(true|yes|1|on)$/i.test(text.trim());
}
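
For example, these helpers can normalize raw scraped text before it lands in a typed interface (the markup and selectors here are hypothetical):

const $doc = cheerio.load('<span class="price">$1,299.50</span><span class="stock">Yes</span>');
const itemPrice: number = parseNumber($doc('.price').text());   // 1299.5
const available: boolean = parseBoolean($doc('.stock').text()); // true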

3. Use Strict Type Checking

Enable strict mode in your TypeScript configuration and handle null/undefined cases:

function safeText($element: cheerio.Cheerio<cheerio.Element>): string {
  const text = $element.text();
  return text ? text.trim() : '';
}

function safeAttr($element: cheerio.Cheerio<cheerio.Element>, attr: string): string | null {
  return $element.attr(attr) || null;
}
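
A couple of hypothetical calls showing how these wrappers keep null and whitespace handling in one place:

const $article = cheerio.load('<article><h1> Hello </h1><a>No href here</a></article>');
const heading: string = safeText($article('h1').first());                 // 'Hello'
const firstHref: string | null = safeAttr($article('a').first(), 'href'); // null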

Testing Cheerio with TypeScript

Set up proper testing with Jest and TypeScript:

// scraper.test.ts
import * as cheerio from 'cheerio';
import { extractProductData } from './scraper';

describe('Product Scraper', () => {
  const mockHtml = `
    <div class="product">
      <h2 class="name">Test Product</h2>
      <span class="price">$29.99</span>
      <p class="description">Great product</p>
    </div>
  `;

  test('should extract product data correctly', () => {
    const $ = cheerio.load(mockHtml);
    const product = extractProductData($, '.product');

    expect(product).toEqual({
      name: 'Test Product',
      price: 29.99,
      description: 'Great product'
    });
  });

  test('should handle missing elements gracefully', () => {
    const $ = cheerio.load('<div></div>');
    const product = extractProductData($, '.product');

    expect(product).toBeNull();
  });
});
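
The test file imports extractProductData from ./scraper, which is not shown above. A minimal sketch of an implementation that would satisfy these tests (one possible shape, not the only one) could look like this:

// scraper.ts (hypothetical module under test)
import * as cheerio from 'cheerio';

export interface ProductData {
  name: string;
  price: number;
  description: string;
}

export function extractProductData(
  $: cheerio.CheerioAPI,
  selector: string
): ProductData | null {
  const $product = $(selector).first();
  if ($product.length === 0) {
    return null;
  }

  const name = $product.find('.name').text().trim();
  const priceText = $product.find('.price').text().trim();
  const price = parseFloat(priceText.replace(/[^0-9.]/g, '')) || 0;
  const description = $product.find('.description').text().trim();

  return { name, price, description };
}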

Conclusion

Using Cheerio with TypeScript provides excellent type safety and developer experience for server-side HTML parsing and web scraping tasks. The combination allows you to catch errors at compile time, get better IntelliSense support, and write more maintainable code.

Key benefits include:

  • Type Safety: Catch errors before runtime
  • Better IDE Support: Auto-completion and refactoring tools
  • Code Documentation: Interfaces serve as documentation
  • Maintainability: Easier to refactor and extend

For more complex scraping scenarios involving JavaScript-heavy websites, consider combining Cheerio with a browser automation framework so you can handle content that only appears after client-side JavaScript runs.

When working with large-scale scraping projects, you might also want to implement proper error handling patterns and monitoring to ensure reliable data extraction.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
