What Data Export Formats Does Crawlee Support?

Crawlee provides comprehensive support for exporting scraped data in multiple formats, making it easy to integrate your web scraping results with various downstream tools and applications. The framework includes built-in export methods for JSON, CSV, Excel (XLSX), HTML, XML, and RSS formats, all accessible through the Dataset API.

Overview of Crawlee Export Formats

Crawlee's dataset system supports six primary export formats:

  • JSON: Standard JSON format for programmatic processing
  • CSV: Comma-separated values for spreadsheet applications
  • XLSX: Microsoft Excel format with full spreadsheet support
  • HTML: Human-readable HTML tables
  • XML: Structured XML documents
  • RSS: RSS feed format for syndication

Each format can be exported with a single method call, and Crawlee handles the conversion automatically based on your dataset structure.

Exporting Data to JSON

JSON is the most commonly used format for web scraping data. Crawlee provides multiple ways to work with JSON exports:

Basic JSON Export

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const data = {
            url: request.loadedUrl,
            title: $('title').text(),
            heading: $('h1').first().text(),
            timestamp: new Date().toISOString(),
        };

        await crawler.pushData(data);
        log.info(`Scraped: ${data.title}`);
    },
});

await crawler.run(['https://example.com', 'https://example.com/about']);

// Export to JSON file
const dataset = await Dataset.open();
await dataset.exportToJSON('output');

// This creates: ./storage/key_value_stores/default/output.json

Retrieving JSON Data Programmatically

You can also retrieve data as JSON objects without writing to a file:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const { items } = await dataset.getData();

// items is an array of JavaScript objects
console.log(`Total records: ${items.length}`);
console.log(JSON.stringify(items, null, 2));

// Process data further
const processed = items.map(item => ({
    ...item,
    titleLength: item.title?.length || 0,
    domain: new URL(item.url).hostname,
}));

Python JSON Export

When using Crawlee for Python, JSON export works similarly:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup

        data = {
            'url': context.request.url,
            'title': soup.find('title').get_text() if soup.find('title') else '',
            'heading': soup.find('h1').get_text() if soup.find('h1') else '',
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

    # Export to JSON
    dataset = await Dataset.open()
    await dataset.export_to_json('output')

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Exporting Data to CSV

CSV format is ideal for importing data into spreadsheet applications like Excel, Google Sheets, or for data analysis with tools like pandas:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Extract structured product data and push it as a single batch,
        // so the pushData call is properly awaited
        const products = $('.product')
            .map((index, element) => ({
                productName: $(element).find('.name').text(),
                price: $(element).find('.price').text(),
                category: $(element).find('.category').text(),
                rating: $(element).find('.rating').text(),
                inStock: $(element).find('.in-stock').length > 0,
            }))
            .get();

        await crawler.pushData(products);
    },
});

await crawler.run(['https://example-shop.com/products']);

// Export to CSV
const dataset = await Dataset.open();
await dataset.exportToCSV('products');

// This creates: ./storage/key_value_stores/default/products.csv

The CSV export automatically:

  • Converts object keys to column headers
  • Flattens nested objects (up to a certain depth)
  • Handles boolean values and arrays
  • Escapes special characters
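
If you want predictable column names no matter how your records are nested, one option is to flatten each record yourself before pushing it. The sketch below uses our own convention: the flattenRecord helper and the dot-separated column names are illustrative, not part of Crawlee's API:

import { CheerioCrawler } from 'crawlee';

// Hypothetical helper: turns { specs: { weight: '1kg' } } into { 'specs.weight': '1kg' }
// so the CSV columns are explicit and stable.
function flattenRecord(record, prefix = '') {
    const flat = {};
    for (const [key, value] of Object.entries(record)) {
        const column = prefix ? `${prefix}.${key}` : key;
        if (value && typeof value === 'object' && !Array.isArray(value)) {
            Object.assign(flat, flattenRecord(value, column));
        } else {
            flat[column] = Array.isArray(value) ? value.join('; ') : value;
        }
    }
    return flat;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const product = {
            url: request.loadedUrl,
            name: $('.name').first().text(),
            specs: {
                weight: $('.weight').first().text(),
                color: $('.color').first().text(),
            },
        };

        // Pushed as { url, name, 'specs.weight', 'specs.color' }
        await crawler.pushData(flattenRecord(product));
    },
});

await crawler.run(['https://example-shop.com/products']);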

CSV Export with Custom Options

You can customize CSV export behavior:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Export with custom delimiter and encoding
await dataset.exportToCSV('output', {
    delimiter: ';',      // Use semicolon instead of comma
    encoding: 'utf-8',   // Specify encoding
});

Exporting Data to Excel (XLSX)

Excel format is useful for sharing data with non-technical stakeholders or for advanced spreadsheet analysis:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content
        await page.waitForSelector('.data-table');

        // Extract tabular data
        const tableData = await page.$$eval('table.data-table tr', rows =>
            rows.map(row => {
                const cells = Array.from(row.querySelectorAll('td, th'));
                return {
                    column1: cells[0]?.textContent?.trim(),
                    column2: cells[1]?.textContent?.trim(),
                    column3: cells[2]?.textContent?.trim(),
                };
            })
        );

        await crawler.pushData(tableData);
        log.info(`Extracted ${tableData.length} rows`);
    },
});

await crawler.run(['https://example.com/data-table']);

// Export to Excel
const dataset = await Dataset.open();
await dataset.exportToXLSX('report');

// This creates: ./storage/datasets/default/report.xlsx

The XLSX export:

  • Preserves data types (numbers, dates, booleans)
  • Supports multiple sheets (for complex data structures)
  • Includes proper column formatting
  • Can be opened directly in Microsoft Excel or Google Sheets

Exporting Data to HTML

HTML export creates a formatted table view of your data, perfect for quick visual inspection or embedding in web pages:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
            metaDescription: $('meta[name="description"]').attr('content'),
            h1Count: $('h1').length,
            h2Count: $('h2').length,
        });
    },
});

await crawler.run(['https://example.com']);

// Export to HTML
const dataset = await Dataset.open();
await dataset.exportToHTML('report');

// This creates: ./storage/datasets/default/report.html

The generated HTML file includes:

  • A formatted table with headers
  • Proper HTML structure
  • Basic CSS styling
  • Sortable columns (in some versions)

Exporting Data to XML

XML format is useful for integrating with enterprise systems or APIs that expect XML input:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const article = {
            url: request.loadedUrl,
            title: $('article h1').text(),
            author: $('article .author').text(),
            publishDate: $('article .date').text(),
            content: $('article .content').text(),
            tags: $('article .tag').map((i, el) => $(el).text()).get(),
        };

        await crawler.pushData(article);
    },
});

await crawler.run(['https://example-blog.com/article']);

// Export to XML
const dataset = await Dataset.open();
await dataset.exportToXML('articles');

// This creates: ./storage/datasets/default/articles.xml

The XML export:

  • Converts object structure to XML elements
  • Handles nested objects and arrays
  • Uses proper XML escaping
  • Includes XML declaration and root element
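
If a downstream system expects a specific schema or element naming, you may prefer not to rely on the automatic conversion at all. A minimal alternative, sketched below with a hand-rolled escapeXml helper that is not part of Crawlee, is to read the items back with getData() and assemble the XML yourself:

import { writeFile } from 'node:fs/promises';
import { Dataset } from 'crawlee';

// Hypothetical helper for basic XML escaping.
const escapeXml = (value = '') => String(value)
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;');

const dataset = await Dataset.open();
const { items } = await dataset.getData();

const body = items
    .map((item) => [
        '  <article>',
        `    <title>${escapeXml(item.title)}</title>`,
        `    <author>${escapeXml(item.author)}</author>`,
        '  </article>',
    ].join('\n'))
    .join('\n');

const xml = `<?xml version="1.0" encoding="UTF-8"?>\n<articles>\n${body}\n</articles>\n`;
await writeFile('articles-custom.xml', xml);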

Exporting Data to RSS

RSS export is specifically designed for creating news feeds or content syndication:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Extract blog post data in RSS-compatible format
        const post = {
            title: $('article h1').text(),
            link: request.loadedUrl,
            description: $('article .excerpt').text(),
            pubDate: new Date($('article .date').text()).toISOString(),
            author: $('article .author').text(),
        };

        await crawler.pushData(post);
    },
});

await crawler.run(['https://example-blog.com/posts']);

// Export to RSS feed
const dataset = await Dataset.open();
await dataset.exportToRSS('feed', {
    title: 'Example Blog Feed',
    description: 'Latest posts from Example Blog',
    link: 'https://example-blog.com',
});

// This creates: ./storage/datasets/default/feed.rss

Working with Multiple Export Formats

You can export the same dataset to multiple formats for different use cases:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
            wordCount: $('body').text().split(/\s+/).length,
            scrapedAt: new Date().toISOString(),
        });
    },
});

await crawler.run(['https://example.com']);

// Export to multiple formats
const dataset = await Dataset.open();

await Promise.all([
    dataset.exportToJSON('data'),      // For programmatic access
    dataset.exportToCSV('data'),       // For spreadsheet analysis
    dataset.exportToXLSX('report'),    // For stakeholder reporting
    dataset.exportToHTML('preview'),   // For quick browser preview
]);

console.log('Data exported to all formats');

Custom Storage Locations

You can specify custom storage directories for your exports:

import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

const config = new Configuration({
    storageDir: './my-scraping-results',
});

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            data: $('body').text(),
        });
    },
}, config);

await crawler.run(['https://example.com']);

const dataset = await Dataset.open(undefined, { config });
await dataset.exportToJSON('results');

// Creates: ./my-scraping-results/key_value_stores/default/results.json
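
Another way to relocate storage, which can be simpler for deployment, is the CRAWLEE_STORAGE_DIR environment variable that Crawlee reads on startup. A minimal sketch, assuming the variable is set before any storage is opened:

// Option A: set it outside the process, e.g. in your shell or .env file:
//   CRAWLEE_STORAGE_DIR=./my-scraping-results node main.js
//
// Option B: set it at the very top of your entry script, before Crawlee touches storage.
process.env.CRAWLEE_STORAGE_DIR = './my-scraping-results';

const { CheerioCrawler, Dataset } = await import('crawlee');
// ...configure and run the crawler as usual; exports then land under ./my-scraping-results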

Best Practices for Data Export

1. Structure Data Consistently

Ensure all records have the same structure for clean exports:

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Always include all fields, even if empty
        const data = {
            url: request.loadedUrl || '',
            title: $('title').text() || 'N/A',
            description: $('meta[name="description"]').attr('content') || 'N/A',
            imageCount: $('img').length,
            scrapedAt: new Date().toISOString(),
        };

        await crawler.pushData(data);
    },
});

2. Choose the Right Format

Select export formats based on your use case:

  • JSON: Use for APIs, databases, or programmatic processing
  • CSV: Use for simple tabular data and spreadsheet analysis
  • XLSX: Use for complex reports with formatting requirements
  • HTML: Use for quick visual inspection or embedding in web pages
  • XML: Use for enterprise integrations or systems requiring XML
  • RSS: Use for content syndication or feed creation

3. Handle Large Datasets

For large datasets, consider processing in chunks or streaming through the items one at a time (a streaming sketch follows the batch example below):

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const info = await dataset.getInfo();
const totalCount = info?.itemCount ?? 0;

if (totalCount > 100000) {
    console.log('Large dataset detected, consider processing in chunks');

    // Process in batches
    for (let offset = 0; offset < totalCount; offset += 10000) {
        const { items } = await dataset.getData({ offset, limit: 10000 });
        // Process this batch
        console.log(`Processed ${offset} to ${offset + items.length} items`);
    }
}
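
For a streaming-style pass that never materializes the whole dataset as one array, the Dataset API also provides forEach(). The sketch below assumes you only need one item at a time, for example to write newline-delimited JSON:

import { createWriteStream } from 'node:fs';
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const out = createWriteStream('output.ndjson');

// Visits every item without requiring you to build the full array yourself.
await dataset.forEach(async (item) => {
    out.write(`${JSON.stringify(item)}\n`);
});

out.end();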

4. Validate Before Export

Validate your data structure before exporting:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate data before storing
        if (title && price) {
            await crawler.pushData({
                url: request.loadedUrl,
                title,
                price,
                valid: true,
            });
        } else {
            log.warning(`Invalid data on ${request.loadedUrl}`);
            await crawler.pushData({
                url: request.loadedUrl,
                title: title || 'MISSING',
                price: price || 'MISSING',
                valid: false,
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);

// Filter valid records before export
const dataset = await Dataset.open();
const { items } = await dataset.getData();
const validItems = items.filter(item => item.valid);

console.log(`Valid records: ${validItems.length}/${items.length}`);
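
Since exportToJSON() writes the entire dataset, one way to export only the valid subset is to copy the filtered items into a separate, named dataset and export that one instead. A short sketch continuing from the validItems array above (the 'valid-products' name is arbitrary):

// Copy the filtered records into a named dataset and export from there.
const validDataset = await Dataset.open('valid-products');
await validDataset.pushData(validItems);
await validDataset.exportToJSON('valid-products');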

Integration with Browser Automation

Export functionality works seamlessly with all Crawlee crawler types, including browser-based crawlers:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-content');

        // Extract data after JavaScript execution
        const data = await page.evaluate(() => ({
            title: document.title,
            dynamicData: document.querySelector('.dynamic-content')?.textContent,
            timestamp: Date.now(),
        }));

        await crawler.pushData(data);
    },
});

await crawler.run(['https://example.com']);

// Export works the same regardless of crawler type
const dataset = await Dataset.open();
await dataset.exportToJSON('scraped-data');
await dataset.exportToCSV('scraped-data');

Conclusion

Crawlee's comprehensive export functionality makes it easy to store and export scraped data in multiple formats without additional libraries or complex conversion code. Whether you need JSON for API integration, CSV for data analysis, Excel for reporting, or HTML for quick previews, Crawlee handles the conversion automatically.

The unified Dataset API works consistently across all crawler types, from lightweight HTTP crawlers to full browser automation with PuppeteerCrawler, making it simple to build flexible web scraping solutions that can output data in the format that best suits your downstream processing needs.

By following best practices like maintaining consistent data structures, validating before export, and choosing appropriate formats for your use case, you can create robust web scraping pipelines that seamlessly integrate with your data analysis and business intelligence workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
