Is Firecrawl Open Source and Can I Self-Host It?
Yes, Firecrawl is an open source web scraping and crawling tool available under the AGPL-3.0 license. The complete source code is publicly available on GitHub, and you can self-host your own instance for free. This makes Firecrawl an attractive option for developers who want full control over their web scraping infrastructure, need to comply with data privacy regulations, or want to customize the scraping behavior to meet specific requirements.
Understanding Firecrawl's Open Source Model
Firecrawl uses a dual licensing model that allows both open source self-hosting and commercial cloud usage:
Open Source License (AGPL-3.0)
The AGPL-3.0 (GNU Affero General Public License) is a copyleft license that requires you to share your modifications if you deploy Firecrawl as a network service. Key points about this license:
- Free to use: You can download, modify, and deploy Firecrawl at no cost
- Source code access: Full access to the codebase for customization
- Copyleft requirement: If you modify Firecrawl and offer it as a service, you must share your changes
- Community contributions: You can contribute improvements back to the project
Commercial Cloud Service
Firecrawl also offers a managed cloud service with additional features, support, and no licensing obligations. This is ideal for teams that prefer managed infrastructure over self-hosting.
Self-Hosting Firecrawl: Getting Started
Self-hosting Firecrawl gives you complete control over your web scraping infrastructure. Here's how to get started with different deployment methods.
Prerequisites
Before self-hosting Firecrawl, ensure you have:
- Docker and Docker Compose installed
- Node.js 18+ (for local development)
- PostgreSQL database
- Redis instance
- Sufficient server resources (minimum 2GB RAM recommended)
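If you want a quick sanity check before you start, the short script below (an illustrative helper, not part of the Firecrawl repository) looks for the required tools on your PATH and prints their versions:
# check_prereqs.py - verify the self-hosting prerequisites are installed
import shutil
import subprocess

def check(name, command):
    # Print the tool's version, or a warning if it isn't on the PATH
    if shutil.which(command[0]) is None:
        print(f"MISSING: {name}")
        return
    result = subprocess.run(command, capture_output=True, text=True)
    print(f"OK: {name} -> {result.stdout.strip() or result.stderr.strip()}")

check("Docker", ["docker", "--version"])
# Newer Docker installs ship Compose as a plugin ("docker compose" instead of "docker-compose")
check("Docker Compose", ["docker-compose", "--version"])
check("Node.js", ["node", "--version"])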
Quick Start with Docker
The easiest way to self-host Firecrawl is using Docker Compose. This method handles all dependencies automatically:
# Clone the Firecrawl repository
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
# Copy the example environment file
cp .env.example .env
# Edit the .env file with your configuration
nano .env
# Start all services with Docker Compose
docker-compose up -d
The Docker setup includes:
- Firecrawl API server
- PostgreSQL database
- Redis for queue management
- Playwright for browser automation
After starting the services, Firecrawl will be available at http://localhost:3002 by default.
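You can confirm the API is reachable with a quick request (a minimal check; it assumes the default port and the /health endpoint used in the health-check examples later in this article):
# Verify the self-hosted instance is responding
import requests

response = requests.get("http://localhost:3002/health", timeout=5)
print(response.status_code, response.text)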
Environment Configuration
Configure your .env file with essential settings:
# API Configuration
PORT=3002
HOST=0.0.0.0
# Database
DATABASE_URL=postgresql://user:password@postgres:5432/firecrawl
# Redis
REDIS_URL=redis://redis:6379
# API Keys (generate secure random strings)
API_KEY=your-secure-api-key-here
# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX_REQUESTS=100
# Scraping Configuration
MAX_CONCURRENT_SCRAPERS=5
SCRAPE_TIMEOUT=30000
Using Your Self-Hosted Firecrawl Instance
Once deployed, you can interact with your self-hosted Firecrawl instance using the API. Here are examples in both Python and JavaScript.
Python Example
import requests
# Configure your self-hosted instance
FIRECRAWL_URL = "http://localhost:3002"
API_KEY = "your-secure-api-key-here"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Scrape a single page
def scrape_page(url):
    response = requests.post(
        f"{FIRECRAWL_URL}/v0/scrape",
        headers=headers,
        json={
            "url": url,
            "formats": ["markdown", "html"],
            "onlyMainContent": True
        }
    )
    return response.json()

# Crawl multiple pages
def crawl_website(url, max_pages=10):
    response = requests.post(
        f"{FIRECRAWL_URL}/v0/crawl",
        headers=headers,
        json={
            "url": url,
            "limit": max_pages,
            "scrapeOptions": {
                "formats": ["markdown"]
            }
        }
    )
    return response.json()
# Example usage
result = scrape_page("https://example.com")
print(result["data"]["markdown"])
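Depending on your Firecrawl version, the crawl endpoint may run asynchronously and return a job ID instead of finished results. Continuing the example above, here is a hedged sketch of polling for completion; it assumes a /v0/crawl/status/{job_id} route and a "status" field, so check your instance's API reference for the exact names:
import time

# Poll a crawl job until it completes or fails.
# Reuses FIRECRAWL_URL and headers from the example above.
def wait_for_crawl(job_id, poll_interval=5):
    while True:
        response = requests.get(
            f"{FIRECRAWL_URL}/v0/crawl/status/{job_id}",  # assumed route; verify for your version
            headers=headers
        )
        status = response.json()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_interval)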
JavaScript/Node.js Example
const axios = require('axios');
const FIRECRAWL_URL = 'http://localhost:3002';
const API_KEY = 'your-secure-api-key-here';
const client = axios.create({
  baseURL: FIRECRAWL_URL,
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  }
});

// Scrape a single page
async function scrapePage(url) {
  try {
    const response = await client.post('/v0/scrape', {
      url: url,
      formats: ['markdown', 'html'],
      onlyMainContent: true
    });
    return response.data;
  } catch (error) {
    console.error('Scraping error:', error.message);
    throw error;
  }
}

// Crawl multiple pages
async function crawlWebsite(url, maxPages = 10) {
  try {
    const response = await client.post('/v0/crawl', {
      url: url,
      limit: maxPages,
      scrapeOptions: {
        formats: ['markdown']
      }
    });
    return response.data;
  } catch (error) {
    console.error('Crawling error:', error.message);
    throw error;
  }
}

// Example usage
(async () => {
  const result = await scrapePage('https://example.com');
  console.log(result.data.markdown);
})();
Advanced Self-Hosting Configurations
Scaling Your Self-Hosted Instance
For production deployments, you'll want to scale Firecrawl horizontally. Similar to how you can use Puppeteer with Docker for browser automation, you can run multiple Firecrawl workers:
version: '3.8'
services:
  api:
    image: firecrawl:latest
    environment:
      - WORKER_MODE=api
    ports:
      - "3002:3002"
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 2
  worker:
    image: firecrawl:latest
    environment:
      - WORKER_MODE=worker
      - MAX_CONCURRENT_SCRAPERS=10
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 5
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: firecrawl
      POSTGRES_USER: firecrawl
      POSTGRES_PASSWORD: secure_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
Monitoring and Logging
Add monitoring to your self-hosted instance:
# View logs from all services
docker-compose logs -f
# View logs from specific service
docker-compose logs -f api
# Check worker status
docker-compose exec api npm run worker:status
Custom Scraping Configuration
Modify the scraping behavior by adjusting environment variables:
# Browser configuration
BROWSER_HEADLESS=true
BROWSER_ARGS=--no-sandbox,--disable-setuid-sandbox
# Timeout settings
SCRAPE_TIMEOUT=60000
NAVIGATION_TIMEOUT=30000
# Concurrency limits
MAX_CONCURRENT_SCRAPERS=10
MAX_PAGES_PER_CRAWL=100
# Proxy configuration
PROXY_URL=http://proxy.example.com:8080
PROXY_USERNAME=user
PROXY_PASSWORD=pass
Benefits of Self-Hosting Firecrawl
Data Privacy and Compliance
Self-hosting ensures your scraped data never leaves your infrastructure, which is critical for:
- GDPR compliance
- Healthcare data (HIPAA)
- Financial services regulations
- Internal company data policies
Cost Control
For high-volume scraping operations, self-hosting can be more cost-effective than cloud API usage:
- No per-request costs
- Pay only for infrastructure
- Predictable monthly expenses
- No rate limit constraints
Customization
With access to the source code, you can:
- Add custom extractors for specific websites
- Implement specialized authentication methods
- Integrate with internal tools and databases
- Optimize performance for your use cases
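As a small illustration of the kind of customization self-hosting enables, the sketch below layers a site-specific extractor on top of the scrape endpoint from the earlier Python example. The extraction rule itself is hypothetical; deeper customizations can be made directly in the Firecrawl codebase:
import re
import requests

FIRECRAWL_URL = "http://localhost:3002"
HEADERS = {"Authorization": "Bearer your-secure-api-key-here"}

def scrape_markdown(url):
    # Fetch a page through the self-hosted scrape endpoint and return its markdown
    response = requests.post(
        f"{FIRECRAWL_URL}/v0/scrape",
        headers=HEADERS,
        json={"url": url, "formats": ["markdown"], "onlyMainContent": True},
    )
    return response.json()["data"]["markdown"]

def extract_prices(markdown):
    # Hypothetical site-specific extractor: pull dollar amounts out of the markdown
    return re.findall(r"\$\d+(?:\.\d{2})?", markdown)

print(extract_prices(scrape_markdown("https://example.com/products")))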
No Rate Limits
Self-hosted instances aren't subject to the cloud API's rate limits; any limits are ones you configure yourself (see the RATE_LIMIT_* settings above), allowing you to:
- Scrape at your own pace
- Handle burst traffic
- Process large crawl jobs
- Scale based on your infrastructure
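Because the concurrency ceiling is whatever you configure, you can fan requests out as aggressively as your hardware allows. Here is a minimal sketch of a burst of parallel scrape requests against the endpoint used earlier (size the pool to match your MAX_CONCURRENT_SCRAPERS setting):
from concurrent.futures import ThreadPoolExecutor
import requests

FIRECRAWL_URL = "http://localhost:3002"
HEADERS = {"Authorization": "Bearer your-secure-api-key-here"}

def scrape(url):
    # Single scrape request against the self-hosted /v0/scrape endpoint
    response = requests.post(
        f"{FIRECRAWL_URL}/v0/scrape",
        headers=HEADERS,
        json={"url": url, "formats": ["markdown"]},
    )
    return url, response.status_code

# Hypothetical list of URLs to scrape in a burst
urls = [f"https://example.com/page/{i}" for i in range(50)]

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, status in pool.map(scrape, urls):
        print(status, url)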
Challenges and Considerations
Maintenance Responsibility
Self-hosting means you're responsible for:
- Security updates and patches
- Database backups and recovery
- Infrastructure monitoring
- Performance optimization
- Dependency updates
Infrastructure Costs
Consider the costs of:
- Server hosting (cloud or on-premise)
- Database storage for crawl results
- Network bandwidth for scraping
- Backup storage
- Monitoring tools
Technical Expertise Required
Successful self-hosting requires knowledge of:
- Docker and containerization
- PostgreSQL database administration
- Redis configuration
- Load balancing and scaling
- Browser automation challenges
When handling browser sessions or dealing with complex JavaScript-heavy sites, you'll need to understand the underlying Playwright automation that Firecrawl uses.
Production Deployment Best Practices
Use a Reverse Proxy
Deploy Firecrawl behind Nginx or Traefik for:
- SSL/TLS termination
- Load balancing
- Rate limiting
- Caching
Example Nginx configuration:
upstream firecrawl {
    server localhost:3002;
    server localhost:3003;
}

server {
    listen 443 ssl http2;
    server_name firecrawl.example.com;

    ssl_certificate /etc/ssl/certs/firecrawl.crt;
    ssl_certificate_key /etc/ssl/private/firecrawl.key;

    location / {
        proxy_pass http://firecrawl;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 300s;
    }
}
Implement Health Checks
Monitor your instance health:
# Health check endpoint
curl http://localhost:3002/health
# Check worker queue status
curl -H "Authorization: Bearer $API_KEY" \
http://localhost:3002/v0/admin/queue-status
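If you want lightweight alerting on top of these endpoints, a small watchdog can poll /health on a schedule (a minimal sketch; wire the failure branch into whatever alerting you already use):
import time
import requests

HEALTH_URL = "http://localhost:3002/health"

def watch(interval_seconds=60):
    # Poll the health endpoint and report failures
    while True:
        try:
            healthy = requests.get(HEALTH_URL, timeout=10).status_code == 200
        except requests.RequestException:
            healthy = False
        if not healthy:
            # Replace with your alerting of choice (email, Slack webhook, PagerDuty, ...)
            print("Firecrawl health check failed")
        time.sleep(interval_seconds)

watch()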
Database Backup Strategy
Implement regular backups:
# Automated PostgreSQL backup
pg_dump -U firecrawl firecrawl > backup_$(date +%Y%m%d).sql
# Restore from backup
psql -U firecrawl firecrawl < backup_20240101.sql
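To run backups on a schedule and prune old files, you can wrap pg_dump in a small script and trigger it with cron or a systemd timer (a sketch assuming the database name and user from the Docker Compose setup; adapt host, port, and password handling to your deployment):
import subprocess
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/firecrawl")  # hypothetical location
KEEP_DAYS = 14

def backup():
    # Dump the firecrawl database to a timestamped file, then prune old backups
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    target = BACKUP_DIR / f"backup_{time.strftime('%Y%m%d')}.sql"
    with target.open("w") as outfile:
        subprocess.run(["pg_dump", "-U", "firecrawl", "firecrawl"], stdout=outfile, check=True)
    cutoff = time.time() - KEEP_DAYS * 86400
    for old in BACKUP_DIR.glob("backup_*.sql"):
        if old.stat().st_mtime < cutoff:
            old.unlink()

backup()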
When to Choose Self-Hosting vs. Cloud Service
Choose Self-Hosting If:
- You need complete data privacy and control
- You have high-volume scraping requirements
- You have DevOps resources for maintenance
- You need custom modifications
- You're subject to strict compliance requirements
Choose Cloud Service If:
- You want zero infrastructure management
- You need quick setup without DevOps expertise
- You have variable or unpredictable scraping volumes
- You want enterprise support and SLAs
- You prefer predictable per-use pricing
Community and Support
As an open source project, Firecrawl has an active community:
- GitHub Repository: Report issues and contribute code
- Discord Community: Get help from other users
- Documentation: Comprehensive guides and API references
- Examples Repository: Sample implementations and use cases
Conclusion
Firecrawl's open source nature and self-hosting capabilities make it a powerful option for developers who need control over their web scraping infrastructure. While self-hosting requires technical expertise and ongoing maintenance, it offers significant benefits in terms of data privacy, cost control, and customization. Whether you choose to self-host or use the managed cloud service depends on your specific requirements, resources, and technical capabilities.
For production deployments, ensure you follow best practices for security, monitoring, and scaling. Start with the Docker Compose setup for development, then graduate to a more robust production configuration as your needs grow. The flexibility of open source combined with the option of commercial support makes Firecrawl suitable for projects of all sizes.