How do I crawl an entire website with Firecrawl?
Firecrawl provides a powerful /crawl endpoint that allows you to crawl entire websites efficiently. Unlike single-page scraping, the crawl endpoint automatically discovers and follows links, respects robots.txt, and returns structured data from all pages on a domain. This guide covers everything you need to know about crawling entire websites with Firecrawl.
Understanding Firecrawl's Crawl Endpoint
The Firecrawl /crawl endpoint is designed specifically for crawling multiple pages from a website. When you submit a crawl request, Firecrawl:
- Starts from the provided URL
- Automatically discovers links on the page
- Follows internal links within the same domain
- Converts HTML to clean markdown format
- Extracts metadata from each page
- Returns structured data for all discovered pages
This makes it ideal for tasks like documentation scraping, content migration, SEO analysis, and building knowledge bases from entire websites.
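If you prefer to call the REST API directly instead of using an SDK, you can post the same options to the crawl endpoint yourself. The snippet below is a minimal sketch, not a definitive reference: it assumes the hosted API base URL https://api.firecrawl.dev, the v0-style request body used throughout this guide, and a response containing a job ID to poll, so check the current API documentation for the exact paths and fields.

import requests

API_KEY = "your_api_key_here"  # assumption: replace with your Firecrawl API key

# Start a crawl job via the REST API (v0-style body, matching the SDK examples below)
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",  # assumed endpoint path; verify against the docs
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com",
        "crawlerOptions": {"limit": 100},
    },
    timeout=30,
)
response.raise_for_status()
job = response.json()
print("Crawl job started:", job)  # typically includes a job ID you can poll for results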
Basic Website Crawl with Python
Here's how to crawl an entire website using Firecrawl's Python SDK:
from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key_here')

# Start crawling a website
crawl_result = app.crawl_url('https://example.com', params={
    'crawlerOptions': {
        'limit': 100,  # Maximum number of pages to crawl
    }
})

# Process the results
if crawl_result['success']:
    for page in crawl_result['data']:
        print(f"URL: {page['url']}")
        print(f"Title: {page['metadata']['title']}")
        print(f"Content: {page['markdown'][:200]}...")
        print("---")
This basic example will crawl up to 100 pages starting from the specified URL, following internal links automatically.
Crawling Websites with JavaScript/Node.js
For JavaScript developers, Firecrawl offers a Node.js SDK with similar functionality:
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function crawlWebsite() {
  const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

  try {
    const crawlResult = await app.crawlUrl('https://example.com', {
      crawlerOptions: {
        limit: 100,
        excludePaths: ['admin/*', 'api/*'],
        includePaths: ['docs/*', 'blog/*']
      }
    });

    console.log(`Crawled ${crawlResult.data.length} pages`);

    crawlResult.data.forEach(page => {
      console.log(`URL: ${page.url}`);
      console.log(`Title: ${page.metadata.title}`);
      console.log('---');
    });
  } catch (error) {
    console.error('Crawl failed:', error);
  }
}

crawlWebsite();
Advanced Crawl Configuration Options
Firecrawl provides extensive configuration options to control crawl behavior:
Limiting Crawl Scope
crawl_params = {
    'crawlerOptions': {
        'limit': 50,      # Maximum pages to crawl
        'maxDepth': 3,    # Maximum link depth from the starting URL
        'excludePaths': [
            'admin/*',
            'login',
            '*/comments/*'
        ],
        'includePaths': [
            'blog/*',
            'docs/*'
        ]
    }
}

result = app.crawl_url('https://example.com', params=crawl_params)
Controlling Crawl Speed and Concurrency
const crawlOptions = {
  crawlerOptions: {
    limit: 200,
    maxConcurrency: 5,  // Number of concurrent requests
    delay: 1000         // Delay between requests in milliseconds
  }
};

const result = await app.crawlUrl('https://example.com', crawlOptions);
Asynchronous Crawling for Large Websites
For crawling large websites, Firecrawl supports asynchronous crawling where you start a crawl job and poll for results:
from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key_here')

# Start an async crawl
crawl_job = app.async_crawl_url('https://example.com', params={
    'crawlerOptions': {
        'limit': 500
    }
})

job_id = crawl_job['jobId']
print(f"Crawl job started: {job_id}")

# Poll for completion
while True:
    status = app.check_crawl_status(job_id)

    if status['status'] == 'completed':
        print(f"Crawl completed! Found {len(status['data'])} pages")
        # Process results
        for page in status['data']:
            print(f"Processing: {page['url']}")
        break
    elif status['status'] == 'failed':
        print("Crawl failed:", status.get('error'))
        break
    else:
        print(f"Status: {status['status']}, Progress: {status.get('progress', 0)}%")
        time.sleep(10)  # Wait 10 seconds before checking again
As with managing browser sessions in Puppeteer, long-running crawl jobs require careful state management and error handling.
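One practical way to manage that state is to persist the job ID as soon as the crawl starts, so a crashed or restarted worker can resume polling instead of launching a duplicate crawl. The sketch below is one possible approach built on the SDK calls shown above (async_crawl_url and check_crawl_status); the file-based bookkeeping is just an illustration, not part of Firecrawl itself.

import json
import os
import time
from firecrawl import FirecrawlApp

STATE_FILE = "crawl_job.json"  # illustrative local bookkeeping, not a Firecrawl feature

def get_or_start_job(app, url):
    """Reuse a previously recorded crawl job if one exists, otherwise start a new one."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["jobId"]
    job = app.async_crawl_url(url, params={'crawlerOptions': {'limit': 500}})
    with open(STATE_FILE, "w") as f:
        json.dump({"jobId": job["jobId"]}, f)
    return job["jobId"]

app = FirecrawlApp(api_key='your_api_key_here')
job_id = get_or_start_job(app, 'https://example.com')

# Poll until the job finishes; safe to interrupt and rerun, since the job ID is on disk
while True:
    status = app.check_crawl_status(job_id)
    if status['status'] in ('completed', 'failed'):
        os.remove(STATE_FILE)  # clear the bookkeeping once the job is finished
        break
    time.sleep(10)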
JavaScript Async Crawl Example
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function asyncCrawl() {
  const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

  // Start the crawl
  const crawlJob = await app.asyncCrawlUrl('https://example.com', {
    crawlerOptions: {
      limit: 500,
      includePaths: ['docs/*']
    }
  });

  console.log(`Job ID: ${crawlJob.jobId}`);

  // Poll for results
  const checkStatus = async () => {
    const status = await app.checkCrawlStatus(crawlJob.jobId);

    if (status.status === 'completed') {
      console.log(`Crawl completed with ${status.data.length} pages`);
      return status.data;
    } else if (status.status === 'failed') {
      throw new Error(`Crawl failed: ${status.error}`);
    } else {
      console.log(`Status: ${status.status}`);
      await new Promise(resolve => setTimeout(resolve, 10000));
      return checkStatus();
    }
  };

  const results = await checkStatus();
  return results;
}

asyncCrawl().catch(console.error);
Filtering and Targeting Specific Content
You can use path patterns to crawl only specific sections of a website:
# Crawl only blog posts and documentation
params = {
    'crawlerOptions': {
        'includePaths': [
            'blog/*/posts/*',
            'docs/**'
        ],
        'excludePaths': [
            '*/draft/*',
            '*/preview/*'
        ],
        'limit': 300
    }
}

result = app.crawl_url('https://example.com', params=params)
Extracting Structured Data During Crawl
Firecrawl can extract structured data from pages during the crawl process:
const crawlOptions = {
  crawlerOptions: {
    limit: 100
  },
  pageOptions: {
    onlyMainContent: true,
    includeHtml: false,
    screenshot: false
  },
  extractorOptions: {
    mode: 'llm-extraction',
    extractionSchema: {
      type: 'object',
      properties: {
        title: { type: 'string' },
        author: { type: 'string' },
        publishDate: { type: 'string' },
        tags: {
          type: 'array',
          items: { type: 'string' }
        }
      }
    }
  }
};

const result = await app.crawlUrl('https://blog.example.com', crawlOptions);

result.data.forEach(page => {
  console.log('Extracted data:', page.extractedData);
});
Handling JavaScript-Rendered Content
Many modern websites rely heavily on JavaScript to render content. When crawling single-page applications, you need to ensure JavaScript executes before extracting content:
params = {
    'crawlerOptions': {
        'limit': 50
    },
    'pageOptions': {
        'waitFor': 2000,  # Wait 2 seconds for JavaScript to execute
        'screenshot': False
    }
}

result = app.crawl_url('https://spa-example.com', params=params)
Respecting Robots.txt and Rate Limiting
Firecrawl respects robots.txt directives by default. You can also configure rate limiting to be a good web citizen:
const crawlOptions = {
  crawlerOptions: {
    limit: 200,
    respectRobotsTxt: true,
    delay: 2000,        // 2 second delay between requests
    maxConcurrency: 3   // Limit concurrent requests
  }
};

const result = await app.crawlUrl('https://example.com', crawlOptions);
Monitoring Crawl Progress
For large crawls, monitoring progress is essential:
import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Start crawl
job = app.async_crawl_url('https://example.com', params={
    'crawlerOptions': {'limit': 1000}
})

# Monitor progress
while True:
    status = app.check_crawl_status(job['jobId'])

    completed = status.get('completed', 0)
    total = status.get('total', 0)

    if total > 0:
        progress = (completed / total) * 100
        print(f"Progress: {completed}/{total} ({progress:.1f}%)")

    if status['status'] == 'completed':
        print("Crawl finished successfully!")
        break
    elif status['status'] == 'failed':
        print(f"Crawl failed: {status.get('error')}")
        break

    time.sleep(5)
Error Handling and Retry Logic
Implement robust error handling for production crawls:
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function robustCrawl(url, maxRetries = 3) {
  const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await app.crawlUrl(url, {
        crawlerOptions: {
          limit: 200,
          timeout: 30000
        }
      });

      console.log(`Successfully crawled ${result.data.length} pages`);
      return result;
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);

      if (attempt < maxRetries) {
        // Exponential backoff: 2s, 4s, 8s, ...
        const delay = Math.pow(2, attempt) * 1000;
        console.log(`Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
      } else {
        throw error;
      }
    }
  }
}
As with handling timeouts in Puppeteer, proper timeout and retry configuration keeps crawling operations reliable.
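For asynchronous jobs, you can apply the same idea in Python by putting an overall deadline around the polling loop, so a stalled crawl eventually fails instead of hanging forever. This is a rough sketch using the async_crawl_url and check_crawl_status calls from earlier; the one-hour deadline and the limit of 200 pages are arbitrary example values.

import time
from firecrawl import FirecrawlApp

def crawl_with_deadline(app, url, max_wait_seconds=3600, poll_interval=10):
    """Start an async crawl and poll until it finishes or an overall deadline passes."""
    job = app.async_crawl_url(url, params={'crawlerOptions': {'limit': 200}})
    deadline = time.monotonic() + max_wait_seconds

    while time.monotonic() < deadline:
        status = app.check_crawl_status(job['jobId'])
        if status['status'] == 'completed':
            return status['data']
        if status['status'] == 'failed':
            raise RuntimeError(f"Crawl failed: {status.get('error')}")
        time.sleep(poll_interval)

    raise TimeoutError(f"Crawl did not finish within {max_wait_seconds} seconds")

app = FirecrawlApp(api_key='your_api_key_here')
pages = crawl_with_deadline(app, 'https://example.com')
print(f"Crawled {len(pages)} pages")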
Saving Crawl Results
Store crawl results for later processing:
import json
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

result = app.crawl_url('https://example.com', params={
    'crawlerOptions': {'limit': 100}
})

# Save to JSON file
with open('crawl_results.json', 'w', encoding='utf-8') as f:
    json.dump(result['data'], f, indent=2, ensure_ascii=False)

# Save individual markdown files
os.makedirs('crawled_pages', exist_ok=True)

for i, page in enumerate(result['data']):
    filename = f"crawled_pages/page_{i:04d}.md"
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"# {page['metadata'].get('title', 'Untitled')}\n\n")
        f.write(f"URL: {page['url']}\n\n")
        f.write(page['markdown'])
Best Practices for Website Crawling
- Start Small: Test with a small limit first, then scale up (see the sketch after this list)
- Use Path Filters: Exclude unnecessary sections like admin panels, login pages, and APIs
- Respect Rate Limits: Configure appropriate delays to avoid overwhelming servers
- Monitor Costs: Each page crawled consumes API credits
- Handle Errors Gracefully: Implement retry logic and proper error handling
- Store Results Efficiently: Save data in structured formats for easy processing
- Check Robots.txt: Ensure you're respecting website crawling policies
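Several of these practices fit naturally into one workflow: run a small test crawl first, inspect the results, and only then launch the full crawl with path filters and a polite delay. The sketch below is one way to structure that using the same params shape as the earlier examples; the limits, paths, and delay are placeholder values, not recommendations.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# 1. Small test crawl: a cheap way to confirm the site and filters behave as expected
test = app.crawl_url('https://example.com', params={
    'crawlerOptions': {'limit': 10, 'includePaths': ['docs/*']}
})
print(f"Test crawl returned {len(test['data'])} pages")

# 2. Full crawl: scale up only after the test run looks right
full = app.crawl_url('https://example.com', params={
    'crawlerOptions': {
        'limit': 500,                        # placeholder budget; each page consumes credits
        'includePaths': ['docs/*'],
        'excludePaths': ['admin/*', 'login'],
        'delay': 1000,                       # be polite: 1 second between requests
    }
})
print(f"Full crawl returned {len(full['data'])} pages")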
Conclusion
Firecrawl's crawl endpoint provides a powerful, developer-friendly way to crawl entire websites without managing complex infrastructure. By leveraging its automatic link discovery, JavaScript rendering, and structured data extraction, you can build robust web scraping solutions that scale from small blogs to large enterprise websites.
Whether you're building a search engine, migrating content, or conducting SEO analysis, Firecrawl's crawling capabilities offer a reliable foundation for your web data extraction needs.