Is Firecrawl Open Source and Can I Self-Host It?

Yes, Firecrawl is an open source web scraping and crawling tool available under the AGPL-3.0 license. The complete source code is publicly available on GitHub, and you can self-host your own instance for free. This makes Firecrawl an attractive option for developers who want full control over their web scraping infrastructure, need to comply with data privacy regulations, or want to customize the scraping behavior to meet specific requirements.

Understanding Firecrawl's Open Source Model

Firecrawl uses a dual licensing model that allows both open source self-hosting and commercial cloud usage:

Open Source License (AGPL-3.0)

The AGPL-3.0 (GNU Affero General Public License) is a copyleft license that requires you to share your modifications if you deploy Firecrawl as a network service. Key points about this license:

  • Free to use: You can download, modify, and deploy Firecrawl at no cost
  • Source code access: Full access to the codebase for customization
  • Copyleft requirement: If you modify Firecrawl and offer it as a service, you must share your changes
  • Community contributions: You can contribute improvements back to the project

Commercial Cloud Service

Firecrawl also offers a managed cloud service with additional features, support, and no licensing obligations. This is ideal for teams that prefer managed infrastructure over self-hosting.

Self-Hosting Firecrawl: Getting Started

Self-hosting Firecrawl gives you complete control over your web scraping infrastructure. Here's how to get started with different deployment methods.

Prerequisites

Before self-hosting Firecrawl, ensure you have:

  • Docker and Docker Compose installed
  • Node.js 18+ (for local development)
  • PostgreSQL database
  • Redis instance
  • Sufficient server resources (minimum 2GB RAM recommended)

Quick Start with Docker

The easiest way to self-host Firecrawl is using Docker Compose. This method handles all dependencies automatically:

# Clone the Firecrawl repository
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl

# Copy the example environment file
cp .env.example .env

# Edit the .env file with your configuration
nano .env

# Start all services with Docker Compose
docker-compose up -d

The Docker setup includes:

  • Firecrawl API server
  • PostgreSQL database
  • Redis for queue management
  • Playwright for browser automation

After starting the services, Firecrawl will be available at http://localhost:3002 by default.

Environment Configuration

Configure your .env file with essential settings:

# API Configuration
PORT=3002
HOST=0.0.0.0

# Database
DATABASE_URL=postgresql://user:password@postgres:5432/firecrawl

# Redis
REDIS_URL=redis://redis:6379

# API Keys (generate secure random strings)
API_KEY=your-secure-api-key-here

# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX_REQUESTS=100

# Scraping Configuration
MAX_CONCURRENT_SCRAPERS=5
SCRAPE_TIMEOUT=30000
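
The API_KEY value should be a long random string. A minimal Python sketch for generating one with the standard library secrets module (the 32-byte length is a reasonable choice, not a Firecrawl requirement):

import secrets

# 256-bit hex token suitable for use as an API key
print(f"API_KEY={secrets.token_hex(32)}")

Paste the generated value into your .env file and keep it out of version control.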

Using Your Self-Hosted Firecrawl Instance

Once deployed, you can interact with your self-hosted Firecrawl instance using the API. Here are examples in both Python and JavaScript.

Python Example

import requests

# Configure your self-hosted instance
FIRECRAWL_URL = "http://localhost:3002"
API_KEY = "your-secure-api-key-here"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Scrape a single page
def scrape_page(url):
    response = requests.post(
        f"{FIRECRAWL_URL}/v0/scrape",
        headers=headers,
        json={
            "url": url,
            "formats": ["markdown", "html"],
            "onlyMainContent": True
        }
    )
    return response.json()

# Crawl multiple pages
def crawl_website(url, max_pages=10):
    response = requests.post(
        f"{FIRECRAWL_URL}/v0/crawl",
        headers=headers,
        json={
            "url": url,
            "limit": max_pages,
            "scrapeOptions": {
                "formats": ["markdown"]
            }
        }
    )
    return response.json()

# Example usage
result = scrape_page("https://example.com")
print(result["data"]["markdown"])
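
Note that crawling runs asynchronously: the /v0/crawl endpoint returns a job ID rather than the crawled pages, and you poll for results. Below is a minimal polling sketch that continues the example above (it reuses requests, FIRECRAWL_URL, and headers), assuming the response exposes the ID as jobId and that your instance provides the /v0/crawl/status/{jobId} endpoint from the Firecrawl v0 API; check the documentation for your version:

import time

def wait_for_crawl(job_id, poll_interval=5):
    # Poll the crawl status endpoint until the job completes or fails
    while True:
        response = requests.get(
            f"{FIRECRAWL_URL}/v0/crawl/status/{job_id}",
            headers=headers
        )
        status = response.json()
        if status.get("status") == "completed":
            return status.get("data", [])
        if status.get("status") == "failed":
            raise RuntimeError(f"Crawl failed: {status}")
        time.sleep(poll_interval)

# Example usage
job = crawl_website("https://example.com", max_pages=10)
pages = wait_for_crawl(job["jobId"])
print(f"Crawled {len(pages)} pages")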

JavaScript/Node.js Example

const axios = require('axios');

const FIRECRAWL_URL = 'http://localhost:3002';
const API_KEY = 'your-secure-api-key-here';

const client = axios.create({
  baseURL: FIRECRAWL_URL,
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  }
});

// Scrape a single page
async function scrapePage(url) {
  try {
    const response = await client.post('/v0/scrape', {
      url: url,
      formats: ['markdown', 'html'],
      onlyMainContent: true
    });
    return response.data;
  } catch (error) {
    console.error('Scraping error:', error.message);
    throw error;
  }
}

// Crawl multiple pages
async function crawlWebsite(url, maxPages = 10) {
  try {
    const response = await client.post('/v0/crawl', {
      url: url,
      limit: maxPages,
      scrapeOptions: {
        formats: ['markdown']
      }
    });
    return response.data;
  } catch (error) {
    console.error('Crawling error:', error.message);
    throw error;
  }
}

// Example usage
(async () => {
  const result = await scrapePage('https://example.com');
  console.log(result.data.markdown);
})();

Advanced Self-Hosting Configurations

Scaling Your Self-Hosted Instance

For production deployments, you'll want to scale Firecrawl horizontally. Similar to how you can use Puppeteer with Docker for browser automation, you can run multiple Firecrawl workers:

version: '3.8'

services:
  api:
    image: firecrawl:latest
    environment:
      - WORKER_MODE=api
    ports:
      - "3002:3002"
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 2

  worker:
    image: firecrawl:latest
    environment:
      - WORKER_MODE=worker
      - MAX_CONCURRENT_SCRAPERS=10
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 5

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: firecrawl
      POSTGRES_USER: firecrawl
      POSTGRES_PASSWORD: secure_password
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

Monitoring and Logging

Add monitoring to your self-hosted instance:

# View logs from all services
docker-compose logs -f

# View logs from specific service
docker-compose logs -f api

# Check worker status
docker-compose exec api npm run worker:status

Custom Scraping Configuration

Modify the scraping behavior by adjusting environment variables:

# Browser configuration
BROWSER_HEADLESS=true
BROWSER_ARGS=--no-sandbox,--disable-setuid-sandbox

# Timeout settings
SCRAPE_TIMEOUT=60000
NAVIGATION_TIMEOUT=30000

# Concurrency limits
MAX_CONCURRENT_SCRAPERS=10
MAX_PAGES_PER_CRAWL=100

# Proxy configuration
PROXY_URL=http://proxy.example.com:8080
PROXY_USERNAME=user
PROXY_PASSWORD=pass

Benefits of Self-Hosting Firecrawl

Data Privacy and Compliance

Self-hosting ensures your scraped data never leaves your infrastructure, which is critical for:

  • GDPR compliance
  • Healthcare data (HIPAA)
  • Financial services regulations
  • Internal company data policies

Cost Control

For high-volume scraping operations, self-hosting can be more cost-effective than cloud API usage:

  • No per-request costs
  • Pay only for infrastructure
  • Predictable monthly expenses
  • No rate limit constraints
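
To make the trade-off concrete, here is a rough break-even sketch in Python with purely illustrative placeholder numbers; substitute your actual cloud pricing and infrastructure costs:

# Back-of-envelope comparison -- all numbers are hypothetical placeholders
requests_per_month = 1_000_000        # assumed scraping volume
cloud_cost_per_request = 0.001        # assumed $ per cloud API request
self_hosted_cost_per_month = 200.0    # assumed $ for servers, storage, bandwidth

cloud_cost = requests_per_month * cloud_cost_per_request
print(f"Cloud API:   ${cloud_cost:,.0f}/month")
print(f"Self-hosted: ${self_hosted_cost_per_month:,.0f}/month (plus maintenance effort)")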

Customization

With access to the source code, you can:

  • Add custom extractors for specific websites
  • Implement specialized authentication methods
  • Integrate with internal tools and databases
  • Optimize performance for your use cases

No Rate Limits

Self-hosted instances aren't subject to the cloud service's API rate limits, allowing you to:

  • Scrape at your own pace
  • Handle burst traffic
  • Process large crawl jobs
  • Scale based on your infrastructure

Challenges and Considerations

Maintenance Responsibility

Self-hosting means you're responsible for:

  • Security updates and patches
  • Database backups and recovery
  • Infrastructure monitoring
  • Performance optimization
  • Dependency updates

Infrastructure Costs

Consider the costs of:

  • Server hosting (cloud or on-premise)
  • Database storage for crawl results
  • Network bandwidth for scraping
  • Backup storage
  • Monitoring tools

Technical Expertise Required

Successful self-hosting requires knowledge of:

  • Docker and containerization
  • PostgreSQL database administration
  • Redis configuration
  • Load balancing and scaling
  • Browser automation challenges

When handling browser sessions or dealing with complex JavaScript-heavy sites, you'll need to understand the underlying Playwright automation that Firecrawl uses.
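
Firecrawl drives Playwright for you, but it helps to understand what that layer does. The following standalone sketch uses Playwright's Python API to render a JavaScript-heavy page headlessly; it is an illustration of the technique, not Firecrawl's internal code (requires pip install playwright and playwright install chromium):

from playwright.sync_api import sync_playwright

def render_page(url: str) -> str:
    # Launch headless Chromium, wait for network activity to settle,
    # and return the fully rendered HTML -- roughly what a browser-based
    # scraper does before converting a page to markdown.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        html = page.content()
        browser.close()
        return html

print(render_page("https://example.com")[:500])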

Production Deployment Best Practices

Use a Reverse Proxy

Deploy Firecrawl behind Nginx or Traefik for:

  • SSL/TLS termination
  • Load balancing
  • Rate limiting
  • Caching

Example Nginx configuration:

upstream firecrawl {
    server localhost:3002;
    server localhost:3003;
}

server {
    listen 443 ssl http2;
    server_name firecrawl.example.com;

    ssl_certificate /etc/ssl/certs/firecrawl.crt;
    ssl_certificate_key /etc/ssl/private/firecrawl.key;

    location / {
        proxy_pass http://firecrawl;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 300s;
    }
}

Implement Health Checks

Monitor your instance health:

# Health check endpoint
curl http://localhost:3002/health

# Check worker queue status
curl -H "Authorization: Bearer $API_KEY" \
  http://localhost:3002/v0/admin/queue-status
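
For automated monitoring (for example, from cron or an uptime checker), the same check can be scripted. A small Python sketch assuming the /health endpoint above returns HTTP 200 when the instance is healthy:

import sys
import requests

FIRECRAWL_URL = "http://localhost:3002"

def check_health(timeout=10):
    # Return 0 if the API responds with 200, otherwise 1,
    # so cron or an alerting system can act on the exit code.
    try:
        response = requests.get(f"{FIRECRAWL_URL}/health", timeout=timeout)
        if response.status_code == 200:
            print("Firecrawl is healthy")
            return 0
        print(f"Unexpected status: {response.status_code}")
        return 1
    except requests.RequestException as exc:
        print(f"Health check failed: {exc}")
        return 1

sys.exit(check_health())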

Database Backup Strategy

Implement regular backups:

# Automated PostgreSQL backup
pg_dump -U firecrawl firecrawl > backup_$(date +%Y%m%d).sql

# Restore from backup
psql -U firecrawl firecrawl < backup_20240101.sql

When to Choose Self-Hosting vs. Cloud Service

Choose Self-Hosting If:

  • You need complete data privacy and control
  • You have high-volume scraping requirements
  • You have DevOps resources for maintenance
  • You need custom modifications
  • You're subject to strict compliance requirements

Choose Cloud Service If:

  • You want zero infrastructure management
  • You need quick setup without DevOps expertise
  • You have variable or unpredictable scraping volumes
  • You want enterprise support and SLAs
  • You prefer predictable per-use pricing

Community and Support

As an open source project, Firecrawl has an active community:

  • GitHub Repository: Report issues and contribute code
  • Discord Community: Get help from other users
  • Documentation: Comprehensive guides and API references
  • Examples Repository: Sample implementations and use cases

Conclusion

Firecrawl's open source nature and self-hosting capabilities make it a powerful option for developers who need control over their web scraping infrastructure. While self-hosting requires technical expertise and ongoing maintenance, it offers significant benefits in terms of data privacy, cost control, and customization. Whether you choose to self-host or use the managed cloud service depends on your specific requirements, resources, and technical capabilities.

For production deployments, ensure you follow best practices for security, monitoring, and scaling. Start with the Docker Compose setup for development, then graduate to a more robust production configuration as your needs grow. The flexibility of open source combined with the option of commercial support makes Firecrawl suitable for projects of all sizes.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
