What are the differences between Firecrawl's crawl and scrape endpoints?
Firecrawl offers two primary API endpoints for web data extraction: the scrape endpoint for single-page extraction and the crawl endpoint for multi-page website crawling. Understanding the differences between these endpoints is crucial for choosing the right tool for your web scraping project.
Overview of Firecrawl's Endpoints
The Scrape Endpoint
The scrape endpoint is designed for single-page data extraction. It fetches and processes one URL at a time, making it ideal when you need to extract data from specific pages without following links to other pages.
Key characteristics:
- Processes a single URL per request
- Returns data immediately (synchronous operation)
- Lower API credit consumption
- Perfect for targeted data extraction
- Supports JavaScript rendering
- Converts HTML to clean Markdown format
The Crawl Endpoint
The crawl endpoint is built for multi-page website crawling. It automatically discovers and extracts data from multiple pages on a website by following links, respecting crawl depth limits, and managing the entire crawling process.
Key characteristics:
- Processes multiple URLs automatically
- Asynchronous operation (returns a job ID)
- Higher API credit consumption
- Ideal for extracting data from entire websites or sections
- Supports sitemap-based crawling
- Includes link discovery and URL filtering
When to Use Each Endpoint
Use the Scrape Endpoint When:
- Extracting data from a known URL: You have a specific page URL and need its content
- Real-time data needs: You require immediate results for a single page
- Low-volume scraping: Processing individual pages on demand
- API integration testing: Testing your integration with a single endpoint
- Monitoring specific pages: Tracking changes on particular URLs
Use the Crawl Endpoint When:
- Scraping entire websites: Extracting data from all product pages, blog posts, or articles
- Discovering content: You don't know all URLs in advance
- Batch processing: Processing large volumes of related pages
- Site archiving: Creating snapshots of entire website sections
- Competitive analysis: Gathering data across competitor websites
Code Examples
Using the Scrape Endpoint
Python Example
import requests

API_KEY = 'your_api_key_here'
url = 'https://api.firecrawl.dev/v0/scrape'

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
}

data = {
    'url': 'https://example.com/product/123',
    'formats': ['markdown', 'html'],
    'onlyMainContent': True
}

response = requests.post(url, json=data, headers=headers)
result = response.json()

print("Extracted content:")
print(result['data']['markdown'])
JavaScript Example
const axios = require('axios');

const API_KEY = 'your_api_key_here';
const url = 'https://api.firecrawl.dev/v0/scrape';

const scrapeData = async () => {
  try {
    const response = await axios.post(url, {
      url: 'https://example.com/product/123',
      formats: ['markdown', 'html'],
      onlyMainContent: true
    }, {
      headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'Content-Type': 'application/json'
      }
    });

    console.log('Extracted content:');
    console.log(response.data.data.markdown);
  } catch (error) {
    console.error('Scraping error:', error.message);
  }
};

scrapeData();
Using the Crawl Endpoint
Python Example
import requests
import time

API_KEY = 'your_api_key_here'
crawl_url = 'https://api.firecrawl.dev/v0/crawl'
status_url = 'https://api.firecrawl.dev/v0/crawl/status'

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
}

# Start crawl job
data = {
    'url': 'https://example.com',
    'crawlerOptions': {
        'maxDepth': 3,
        'limit': 100,
        'includePaths': ['/products/*']
    },
    'pageOptions': {
        'onlyMainContent': True
    }
}

response = requests.post(crawl_url, json=data, headers=headers)
job_id = response.json()['jobId']
print(f"Crawl job started: {job_id}")

# Poll for results
while True:
    status_response = requests.get(
        f"{status_url}/{job_id}",
        headers=headers
    )
    status_data = status_response.json()

    if status_data['status'] == 'completed':
        print(f"Crawl completed! Pages found: {len(status_data['data'])}")
        for page in status_data['data']:
            print(f"URL: {page['url']}")
            print(f"Content: {page['markdown'][:100]}...")
        break
    elif status_data['status'] == 'failed':
        print("Crawl failed!")
        break

    print(f"Status: {status_data['status']}")
    time.sleep(5)
JavaScript Example
const axios = require('axios');

const API_KEY = 'your_api_key_here';
const crawlUrl = 'https://api.firecrawl.dev/v0/crawl';
const statusUrl = 'https://api.firecrawl.dev/v0/crawl/status';

const headers = {
  'Authorization': `Bearer ${API_KEY}`,
  'Content-Type': 'application/json'
};

const crawlWebsite = async () => {
  try {
    // Start crawl job
    const crawlResponse = await axios.post(crawlUrl, {
      url: 'https://example.com',
      crawlerOptions: {
        maxDepth: 3,
        limit: 100,
        includePaths: ['/products/*']
      },
      pageOptions: {
        onlyMainContent: true
      }
    }, { headers });

    const jobId = crawlResponse.data.jobId;
    console.log(`Crawl job started: ${jobId}`);

    // Poll for results
    while (true) {
      const statusResponse = await axios.get(
        `${statusUrl}/${jobId}`,
        { headers }
      );
      const status = statusResponse.data;

      if (status.status === 'completed') {
        console.log(`Crawl completed! Pages found: ${status.data.length}`);
        status.data.forEach(page => {
          console.log(`URL: ${page.url}`);
          console.log(`Content: ${page.markdown.substring(0, 100)}...`);
        });
        break;
      } else if (status.status === 'failed') {
        console.log('Crawl failed!');
        break;
      }

      console.log(`Status: ${status.status}`);
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  } catch (error) {
    console.error('Crawl error:', error.message);
  }
};

crawlWebsite();
Technical Differences
Response Format
Scrape Endpoint:
- Returns a synchronous response
- Immediate data availability
- A single page object in the response
{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nContent...",
    "html": "<html>...</html>",
    "metadata": {
      "title": "Page Title",
      "description": "Page description"
    }
  }
}
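The fields above can be read straight from the parsed JSON. Below is a minimal sketch, assuming the response shape shown here and the result variable from the Python scrape example earlier:

# Assumes `result` is the parsed JSON from the Python scrape example above
# and that it follows the sample response shape shown here.
page = result.get('data', {})
title = page.get('metadata', {}).get('title', 'untitled')
markdown = page.get('markdown', '')

print(f"Title: {title}")
print(f"Markdown length: {len(markdown)} characters")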
Crawl Endpoint:
- Returns an asynchronous job ID
- Requires polling for completion
- An array of page objects when complete
{
  "success": true,
  "jobId": "crawl-job-123",
  "status": "processing"
}
Performance Considerations
Scrape Endpoint:
- Response time: typically 2-10 seconds per page (a quick latency-measurement sketch follows these lists)
- No queuing delays
- Suitable for real-time applications
- Can be integrated with Puppeteer for browser automation
Crawl Endpoint:
- Total time varies by website size
- Queue-based processing
- Better for batch operations
- Automatically handles page navigation across the entire site
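If the latency figures above matter for your application, it is worth measuring them against your own target pages rather than relying on typical values. A minimal sketch, reusing the url, data, and headers variables from the Python scrape example earlier:

import time
import requests

# url, data and headers as defined in the Python scrape example above
start = time.perf_counter()
response = requests.post(url, json=data, headers=headers, timeout=60)
elapsed = time.perf_counter() - start

print(f"Scrape returned HTTP {response.status_code} in {elapsed:.1f} seconds")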
Credit Consumption
Scrape Endpoint:
- 1 credit per page request
- Predictable cost per operation
- No overhead for link discovery
Crawl Endpoint:
- Credits based on the number of pages crawled
- Additional overhead for link discovery
- More cost-effective for large-scale scraping (a rough comparison sketch follows below)
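As a rough illustration of the cost trade-off, assuming the one-credit-per-page model described above (actual pricing and any link-discovery overhead depend on your plan), you can estimate both approaches for a known set of pages. The discovery_overhead figure below is purely hypothetical:

def estimated_credits(page_count, discovery_overhead=0):
    # Rough estimate: 1 credit per page plus any extra credits a crawl
    # may spend on link discovery (hypothetical figure, plan-dependent).
    return page_count + discovery_overhead

known_pages = 25
print("Scrape endpoint:", estimated_credits(known_pages), "credits")
print("Crawl endpoint: ", estimated_credits(known_pages, discovery_overhead=5), "credits")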
Advanced Configuration Options
Scrape Endpoint Options
scrape_options = {
    'url': 'https://example.com/page',
    'formats': ['markdown', 'html', 'links'],  # Output formats
    'onlyMainContent': True,                   # Extract main content only
    'includeTags': ['article', 'main'],        # Specific tags to include
    'excludeTags': ['nav', 'footer'],          # Tags to exclude
    'waitFor': 3000                            # Wait time in milliseconds
}
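These options are sent as the request body, just like the simpler scrape example earlier. A minimal usage sketch, reusing the headers defined in the Python scrape example above:

import requests

# headers as defined in the Python scrape example above
response = requests.post('https://api.firecrawl.dev/v0/scrape',
                         json=scrape_options, headers=headers)
result = response.json()

if result.get('success'):
    print(result['data']['markdown'][:200])
else:
    print("Scrape failed:", result.get('error'))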
Crawl Endpoint Options
crawl_options = {
    'url': 'https://example.com',
    'crawlerOptions': {
        'maxDepth': 3,                                # Maximum crawl depth
        'limit': 500,                                 # Maximum pages to crawl
        'includePaths': ['/blog/*', '/products/*'],   # URL patterns to include
        'excludePaths': ['/admin/*'],                 # URL patterns to exclude
        'allowBackwardLinks': False,                  # Whether to follow links to parent pages
        'allowExternalLinks': False                   # Whether to follow external links
    },
    'pageOptions': {
        'onlyMainContent': True,
        'formats': ['markdown']
    }
}
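As with the basic crawl example, posting this payload starts an asynchronous job whose ID you then poll against the status endpoint. A minimal sketch, reusing the headers defined in the Python crawl example above:

import requests

# headers as defined in the Python crawl example above
response = requests.post('https://api.firecrawl.dev/v0/crawl',
                         json=crawl_options, headers=headers)
job_id = response.json()['jobId']
print(f"Crawl job started: {job_id}")
# Poll f"https://api.firecrawl.dev/v0/crawl/status/{job_id}" as shown earlier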
Error Handling
Scrape Endpoint Error Handling
# scrape_url is the scrape endpoint URL; data and headers as in the scrape example above
try:
    response = requests.post(scrape_url, json=data, headers=headers)
    response.raise_for_status()
    result = response.json()
    if not result.get('success'):
        print(f"Scraping failed: {result.get('error')}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
Crawl Endpoint Error Handling
# Check crawl status for errors
status = requests.get(f"{status_url}/{job_id}", headers=headers).json()

if status['status'] == 'failed':
    print(f"Crawl failed: {status.get('error')}")
    print(f"Failed pages: {status.get('failedPages', [])}")
elif status['status'] == 'completed':
    # Check for partial failures
    if 'failedPages' in status and len(status['failedPages']) > 0:
        print(f"Completed with {len(status['failedPages'])} failed pages")
Best Practices
For Scrape Endpoint:
- Implement retry logic for transient failures (see the sketch after this list)
- Cache results to minimize repeated requests
- Use appropriate wait times for JavaScript-heavy pages
- Handle rate limits by spacing requests appropriately
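The retry and rate-limit points above can be combined in a small wrapper. The sketch below is one possible approach rather than anything Firecrawl-specific: it retries on network errors, rate limits (HTTP 429), and server errors with exponential backoff, reusing the headers from the scrape examples.

import time
import requests

TRANSIENT_STATUSES = (429, 500, 502, 503, 504)

def scrape_with_retries(payload, headers, max_attempts=3):
    # Post to the scrape endpoint, retrying transient failures with backoff.
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post('https://api.firecrawl.dev/v0/scrape',
                                     json=payload, headers=headers, timeout=60)
            if response.status_code not in TRANSIENT_STATUSES:
                response.raise_for_status()  # non-transient errors are raised immediately
                return response.json()
            error = f"transient HTTP {response.status_code}"
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
            error = str(e)
        if attempt == max_attempts:
            raise RuntimeError(f"Scrape failed after {max_attempts} attempts: {error}")
        wait = 2 ** attempt  # 2s, 4s, 8s, ...
        print(f"Attempt {attempt} failed ({error}); retrying in {wait}s")
        time.sleep(wait)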
For Crawl Endpoint:
- Set appropriate depth limits to control scope
- Use URL patterns to filter relevant pages
- Monitor job status regularly for large crawls (a polling sketch follows this list)
- Handle partial failures gracefully
- Consider using sitemaps for more efficient crawling
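A simple way to combine the status-monitoring and partial-failure points above is to wrap the polling loop in an overall timeout. The sketch below assumes the same v0 status endpoint and the failedPages field used in the error-handling section; treat both as illustrative and check them against your API version.

import time
import requests

def wait_for_crawl(job_id, headers, poll_interval=10, max_wait=1800):
    # Poll a crawl job until it finishes, fails, or max_wait seconds elapse.
    status_url = f"https://api.firecrawl.dev/v0/crawl/status/{job_id}"
    deadline = time.time() + max_wait

    while time.time() < deadline:
        status = requests.get(status_url, headers=headers, timeout=30).json()
        if status['status'] == 'completed':
            failed = status.get('failedPages', [])
            if failed:
                print(f"Completed with {len(failed)} failed pages")
            return status['data']
        if status['status'] == 'failed':
            raise RuntimeError(f"Crawl failed: {status.get('error')}")
        time.sleep(poll_interval)

    raise TimeoutError(f"Crawl {job_id} did not finish within {max_wait} seconds")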
Conclusion
The choice between Firecrawl's crawl and scrape endpoints depends on your specific needs. Use the scrape endpoint for targeted, single-page extraction with immediate results, and the crawl endpoint for comprehensive, multi-page data collection across entire websites or sections. Understanding these differences helps optimize both performance and cost for your web scraping projects.
For dynamic websites that rely on AJAX requests or heavy JavaScript rendering, both endpoints support JavaScript execution, so you can extract data from modern single-page applications as well.