Can I Export Crawlee Data to CSV or Excel?

Yes. Crawlee has built-in support for exporting scraped data to structured formats such as CSV and JSON, and with the help of common libraries you can also produce Excel files. The framework makes it straightforward to save your data in formats that are easy to analyze, share, and integrate with other tools.

In this guide, we'll explore various methods to export your Crawlee data to CSV and Excel formats, covering both JavaScript and Python implementations.

Using Crawlee's Built-in Dataset Export

Crawlee's Dataset class provides an exportToCSV() method that makes exporting data simple. Here's how to use it:

JavaScript/TypeScript Example

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text();
        const heading = $('h1').text();
        const description = $('meta[name="description"]').attr('content');

        // Push data to the default dataset
        await Dataset.pushData({
            url: request.url,
            title,
            heading,
            description,
            scrapedAt: new Date().toISOString(),
        });
    },
});

await crawler.run(['https://example.com']);

// Export to CSV
const dataset = await Dataset.open();
await dataset.exportToCSV('output');

This creates a file named output.csv containing all the scraped data in the default key-value store (with local storage, typically storage/key_value_stores/default/output.csv).
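
With the fields pushed above, the exported file will look roughly like this (values are illustrative):

url,title,heading,description,scrapedAt
https://example.com,Example Domain,Example Domain,,2024-01-01T12:00:00.000Z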

Python Example

import asyncio
from datetime import datetime

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Extract data
        title = await page.title()
        heading = await page.locator('h1').text_content()

        # Push to dataset
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'heading': heading,
            'scraped_at': datetime.now().isoformat(),
        })

    await crawler.run(['https://example.com'])

    # Export to CSV (the format is inferred from the file extension)
    await crawler.export_data('output.csv')

if __name__ == '__main__':
    asyncio.run(main())

Exporting to Excel Format

While Crawlee doesn't have built-in Excel export functionality, you can easily convert your scraped data to Excel format using popular libraries.

JavaScript Excel Export

import { CheerioCrawler, Dataset } from 'crawlee';
import * as XLSX from 'xlsx';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
            price: $('.price').text(),
            availability: $('.stock-status').text(),
        });
    },
});

await crawler.run(['https://example.com/products']);

// Get all data from dataset
const dataset = await Dataset.open();
const data = await dataset.getData();

// Convert to Excel
const worksheet = XLSX.utils.json_to_sheet(data.items);
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, worksheet, 'Scraped Data');

// Write to file
XLSX.writeFile(workbook, 'output.xlsx');

To run this example, first install the xlsx package:

npm install xlsx

Python Excel Export with Pandas

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset
import pandas as pd

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup

        await context.push_data({
            'url': context.request.url,
            'title': soup.find('title').text if soup.find('title') else '',
            'price': soup.select_one('.price').text if soup.select_one('.price') else '',
        })

    await crawler.run(['https://example.com'])

    # Export to Excel
    dataset = await Dataset.open()
    data = await dataset.get_data()

    df = pd.DataFrame(data.items)
    df.to_excel('output.xlsx', index=False, engine='openpyxl')

if __name__ == '__main__':
    asyncio.run(main())

Install the required packages:

pip install pandas openpyxl

Advanced CSV Export Options

If you need full control over headers and formatting, you can bypass the built-in export and write the CSV yourself, for example with the csv-writer package:

Custom CSV Formatting

import { CheerioCrawler, Dataset } from 'crawlee';
import { createObjectCsvWriter } from 'csv-writer';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
            timestamp: Date.now(),
        });
    },
});

await crawler.run(['https://example.com']);

// Custom CSV writer with specific headers and formatting
const dataset = await Dataset.open();
const data = await dataset.getData();

const csvWriter = createObjectCsvWriter({
    path: 'output.csv',
    header: [
        { id: 'url', title: 'URL' },
        { id: 'title', title: 'Page Title' },
        { id: 'timestamp', title: 'Scraped At' },
    ],
    append: false,
});

await csvWriter.writeRecords(data.items);
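
The example above relies on the csv-writer package, so install it before running:

npm install csv-writer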

Handling Large Datasets

When working with large amounts of scraped data, it's important to export data in batches to avoid memory issues:

import { CheerioCrawler, Dataset } from 'crawlee';
import { createWriteStream } from 'fs';
import { stringify } from 'csv-stringify';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10000,
    async requestHandler({ request, $, enqueueLinks }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
            content: $('article').text().substring(0, 500),
        });
    },
});

await crawler.run(['https://example.com']);

// Stream large dataset to CSV
const dataset = await Dataset.open();
const writeStream = createWriteStream('output.csv');
const stringifier = stringify({ header: true });

stringifier.pipe(writeStream);

// Process in batches
let offset = 0;
const limit = 1000;

while (true) {
    const batch = await dataset.getData({ offset, limit });

    if (batch.items.length === 0) break;

    batch.items.forEach(item => stringifier.write(item));
    offset += limit;
}

stringifier.end();
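
This streaming approach uses the csv-stringify package:

npm install csv-stringify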

Export Formats Comparison

Crawlee's Dataset exports CSV and JSON out of the box; additional formats are available when your dataset is stored on the Apify platform and downloaded through its export options:

| Format | How to export | Best For |
|--------|---------------|----------|
| CSV | exportToCSV() | Tabular data, spreadsheet analysis, database imports |
| JSON | exportToJSON() | Hierarchical data, API integration, web applications |
| JSONL | Apify platform dataset export | Large datasets, streaming processing, log analysis |
| RSS | Apify platform dataset export | Content feeds, syndication |
| XML | Apify platform dataset export | Legacy systems, specific enterprise requirements |
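
For example, the JSON export mirrors the CSV export:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
await dataset.exportToJSON('output');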

Exporting Named Datasets

If you're using multiple datasets in your crawler, you can export specific ones:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        // Push to different named datasets based on content type
        if (request.url.includes('/products/')) {
            const products = await Dataset.open('products');
            await products.pushData({
                type: 'product',
                name: $('.product-name').text(),
                price: $('.price').text(),
            });
        } else {
            const pages = await Dataset.open('pages');
            await pages.pushData({
                type: 'page',
                title: $('title').text(),
            });
        }
    },
});

await crawler.run(['https://example.com']);

// Export each dataset separately
const productsDataset = await Dataset.open('products');
await productsDataset.exportToCSV('products');

const pagesDataset = await Dataset.open('pages');
await pagesDataset.exportToCSV('pages');

Best Practices for Data Export

1. Data Validation Before Export

Always validate your data before exporting to ensure quality:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate data before pushing
        if (title && price) {
            await Dataset.pushData({
                url: request.url,
                title,
                price: parseFloat(price.replace(/[^0-9.]/g, '')),
                isValid: true,
            });
        } else {
            // Log invalid data
            console.log(`Invalid data at ${request.url}`);
        }
    },
});

2. Handle Special Characters

Crawlee writes CSV exports as UTF-8, but Excel often needs a byte-order mark (BOM) to detect the encoding, so accented characters and other non-ASCII text can appear garbled. If that happens, one option is to write the CSV yourself with the BOM enabled, for example using csv-stringify:

import { Dataset } from 'crawlee';
import { writeFileSync } from 'fs';
import { stringify } from 'csv-stringify/sync';

const dataset = await Dataset.open();
const { items } = await dataset.getData();

// bom: true prepends a UTF-8 byte-order mark so Excel detects the encoding
const csv = stringify(items, { header: true, bom: true });
writeFileSync('output.csv', csv);

3. Add Timestamps

Include timestamps to track when data was scraped:

await Dataset.pushData({
    url: request.url,
    title: $('title').text(),
    scrapedAt: new Date().toISOString(),
    unixTimestamp: Math.floor(Date.now() / 1000),
});

Integrating with Data Analysis Tools

Once exported, your CSV or Excel files can be easily integrated with various data analysis tools:

  • Excel/Google Sheets: Open CSV/Excel files directly for manual analysis
  • Python Pandas: Load CSV files for advanced data analysis
  • SQL Databases: Import CSV files into databases for querying
  • Power BI/Tableau: Connect to exported files for visualization
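
For the SQL route, here's a minimal sketch that loads the exported CSV into SQLite; it assumes the csv-parse and better-sqlite3 packages and the url, title, and price columns from the earlier examples, so adjust it to your own data:

import { readFileSync } from 'fs';
import { parse } from 'csv-parse/sync';
import Database from 'better-sqlite3';

// Parse the exported CSV into an array of objects keyed by the header row
const rows = parse(readFileSync('output.csv', 'utf8'), { columns: true });

// Create a local SQLite database with a table matching the CSV columns
const db = new Database('scraped.db');
db.exec('CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, price TEXT)');

// Insert each row, then run a quick sanity-check query
const insert = db.prepare('INSERT INTO pages (url, title, price) VALUES (?, ?, ?)');
for (const row of rows) insert.run(row.url, row.title, row.price);

console.log(db.prepare('SELECT COUNT(*) AS total FROM pages').get());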

Loading Crawlee CSV in Pandas

import pandas as pd

# Load exported CSV
df = pd.read_csv('output.csv')

# Analyze data
print(df.describe())
print(df.groupby('category').count())

# Clean and transform
df['price'] = df['price'].str.replace('$', '').astype(float)
df['date'] = pd.to_datetime(df['scraped_at'])

# Export to Excel with formatting
with pd.ExcelWriter('analyzed_output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Data', index=False)

Troubleshooting Common Export Issues

CSV File Not Created

Ensure the crawler has completed before exporting:

// run() resolves only after all requests have been processed
await crawler.run(['https://example.com']);

// Optionally release crawler resources before exporting
await crawler.teardown();

// Now export
const dataset = await Dataset.open();
await dataset.exportToCSV('output');

Empty Export Files

Check if data is being pushed to the dataset correctly:

const dataset = await Dataset.open();
const info = await dataset.getInfo();
console.log(`Dataset contains ${info.itemCount} items`);

if (info.itemCount > 0) {
    await dataset.exportToCSV('output');
} else {
    console.log('No data to export');
}

Memory Issues with Large Exports

For very large datasets, consider streaming the export or using batch processing techniques to handle data more efficiently.
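
As a sketch, you can also iterate the dataset item by item with Dataset.forEach instead of loading everything into memory at once (again using csv-stringify):

import { Dataset } from 'crawlee';
import { createWriteStream } from 'fs';
import { stringify } from 'csv-stringify';

const dataset = await Dataset.open();
const stringifier = stringify({ header: true });
stringifier.pipe(createWriteStream('output.csv'));

// forEach visits items one at a time, so memory usage stays bounded
await dataset.forEach(async (item) => {
    stringifier.write(item);
});

stringifier.end();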

Conclusion

Crawlee makes it simple to export scraped data to CSV and Excel formats, whether you're working with JavaScript or Python. The built-in exportToCSV() method handles most use cases, while third-party libraries like xlsx and pandas provide additional flexibility for Excel exports and advanced data manipulation.

By following the best practices outlined in this guide—including data validation, proper encoding, and batch processing for large datasets—you can ensure your exported data is clean, properly formatted, and ready for analysis.

Whether you're building a web scraping pipeline with browser automation or processing static HTML content, Crawlee's flexible export options make it easy to get your data into the format you need.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
