Can Firecrawl Extract Data from Multiple Pages Simultaneously?
Yes, Firecrawl can extract data from multiple pages simultaneously through its batch crawling capabilities and asynchronous job processing. This powerful feature allows developers to efficiently scrape large websites by crawling multiple URLs in parallel, significantly reducing the total time required for data extraction.
Understanding Firecrawl's Batch Crawling
Firecrawl provides a dedicated `/crawl` endpoint that automatically discovers and processes multiple pages from a website concurrently. Unlike single-page scraping, batch crawling handles the entire workflow of page discovery, parallel processing, and data aggregation.
How Concurrent Processing Works
When you initiate a crawl job, Firecrawl:
- Discovers URLs by following links from the starting URL
- Processes multiple pages in parallel using distributed workers
- Aggregates results into a single response
- Manages rate limiting to avoid overwhelming target servers
This architecture allows Firecrawl to handle dynamic content and SPAs efficiently across multiple pages without requiring you to manage the complexity of parallel browser instances.
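Under the hood this is a job-based API: you submit a crawl, receive a job identifier, and poll until the aggregated results are ready. The SDK examples below hide these details, but here is a rough sketch of the HTTP flow, assuming the hosted API's crawl endpoints (`POST /v1/crawl` to start a job and `GET /v1/crawl/{id}` to poll it) and a response body containing the job id; treat the exact paths and fields as illustrative rather than authoritative.

```python
import time

import requests

API_KEY = 'your_api_key'
HEADERS = {'Authorization': f'Bearer {API_KEY}', 'Content-Type': 'application/json'}

# Start a crawl job (endpoint path and response shape assumed for illustration)
start = requests.post(
    'https://api.firecrawl.dev/v1/crawl',
    headers=HEADERS,
    json={'url': 'https://example.com', 'limit': 20},
)
job_id = start.json()['id']

# Poll until the job finishes, then read the aggregated page data.
# Production code should also handle failure and cancellation states.
while True:
    status = requests.get(
        f'https://api.firecrawl.dev/v1/crawl/{job_id}', headers=HEADERS
    ).json()
    if status['status'] == 'completed':
        print(f"Crawled {len(status['data'])} pages")
        break
    time.sleep(2)
```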
Batch Crawling with Python
Here's how to crawl multiple pages simultaneously using the Firecrawl Python SDK:
```python
from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key')

# Start a batch crawl job
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 100,  # Maximum number of pages to crawl
        'scrapeOptions': {
            'formats': ['markdown', 'html'],
            'onlyMainContent': True
        }
    },
    wait_until_done=True  # Wait for all pages to complete
)

# Access results from all crawled pages
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['url']}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")
```
Asynchronous Crawling for Better Performance
For improved performance and resource management, use asynchronous crawling:
```python
import asyncio

from firecrawl import FirecrawlApp

async def crawl_website():
    app = FirecrawlApp(api_key='your_api_key')

    # Start crawl without waiting
    crawl_job = app.crawl_url(
        url='https://example.com',
        params={'limit': 50},
        wait_until_done=False
    )

    job_id = crawl_job['jobId']
    print(f"Crawl job started: {job_id}")

    # Poll for status
    while True:
        status = app.check_crawl_status(job_id)

        if status['status'] == 'completed':
            print(f"Crawled {status['total']} pages")
            return status['data']
        elif status['status'] == 'failed':
            raise Exception("Crawl failed")

        await asyncio.sleep(2)  # Check every 2 seconds

# Run the async function
results = asyncio.run(crawl_website())
```
Batch Crawling with JavaScript/Node.js
The JavaScript SDK provides similar functionality for concurrent page extraction:
```javascript
const FirecrawlApp = require('@mendable/firecrawl-js').default;

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function crawlWebsite() {
  // Start batch crawl
  const crawlResult = await app.crawlUrl('https://example.com', {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown', 'html'],
      onlyMainContent: true
    }
  }, true); // Wait for completion

  // Process all pages
  crawlResult.data.forEach(page => {
    console.log(`URL: ${page.metadata.url}`);
    console.log(`Title: ${page.metadata.title}`);
    console.log(`Content length: ${page.markdown.length}`);
    console.log('---');
  });

  return crawlResult;
}

crawlWebsite()
  .then(result => console.log(`Total pages crawled: ${result.data.length}`))
  .catch(error => console.error('Crawl error:', error));
```
Processing Results with Promise.all
For even more control, you can combine Firecrawl's crawl endpoint with JavaScript's parallel processing capabilities:
```javascript
async function crawlAndProcess() {
  const app = new FirecrawlApp({ apiKey: 'your_api_key' });

  // Start crawl job
  const crawlJob = await app.crawlUrl('https://example.com', {
    limit: 50
  }, false); // Don't wait

  const jobId = crawlJob.jobId;
  console.log(`Started job: ${jobId}`);

  // Poll until complete
  let status;
  do {
    await new Promise(resolve => setTimeout(resolve, 2000));
    status = await app.checkCrawlStatus(jobId);
    console.log(`Status: ${status.status} - ${status.completed}/${status.total}`);
  } while (status.status === 'scraping');

  if (status.status === 'completed') {
    // Process all results in parallel
    const processedData = await Promise.all(
      status.data.map(async (page) => {
        // Custom processing for each page
        return {
          url: page.metadata.url,
          wordCount: page.markdown.split(/\s+/).length,
          links: page.links || []
        };
      })
    );

    return processedData;
  }
}
```
Controlling Concurrency and Performance
Setting Crawl Limits
Control how many pages a crawl job processes in total and how deep it follows links:
```python
# Limit total pages
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 50,    # Stop after 50 pages
        'maxDepth': 3   # Only crawl 3 levels deep
    }
)
```
URL Filtering and Patterns
Focus on specific pages using include/exclude patterns:
```javascript
const crawlResult = await app.crawlUrl('https://example.com', {
  limit: 100,
  includePaths: ['/blog/*', '/products/*'],  // Only these paths
  excludePaths: ['/admin/*', '/private/*'],  // Skip these paths
  maxDepth: 2
});
```
Managing Rate Limits
Firecrawl manages rate limiting automatically, but you can slow the crawl's pace by giving each page extra time to load before its content is captured:
```python
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 100,
        'scrapeOptions': {
            'waitFor': 1000  # Wait 1 second for each page to load before capturing it
        }
    }
)
```
Advanced: Custom Parallel Processing
For maximum control over parallel page processing, combine Firecrawl's single-page scrape endpoint with your own parallelization:
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from firecrawl import FirecrawlApp

def scrape_single_page(app, url):
    """Scrape a single page"""
    return app.scrape_url(url, params={'formats': ['markdown']})

async def scrape_multiple_urls(urls):
    """Scrape multiple URLs in parallel"""
    app = FirecrawlApp(api_key='your_api_key')

    # Use ThreadPoolExecutor for parallel requests
    with ThreadPoolExecutor(max_workers=5) as executor:
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(executor, scrape_single_page, app, url)
            for url in urls
        ]
        results = await asyncio.gather(*tasks)

    return results

# Scrape multiple specific URLs
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    'https://example.com/page4',
    'https://example.com/page5'
]

results = asyncio.run(scrape_multiple_urls(urls))
print(f"Scraped {len(results)} pages successfully")
```
Monitoring Crawl Progress
Track the progress of large batch crawls:
```javascript
async function monitorCrawl(jobId) {
  const app = new FirecrawlApp({ apiKey: 'your_api_key' });

  const checkInterval = setInterval(async () => {
    const status = await app.checkCrawlStatus(jobId);

    console.log(`Progress: ${status.completed}/${status.total} pages`);
    console.log(`Status: ${status.status}`);

    if (status.status === 'completed') {
      clearInterval(checkInterval);
      console.log('Crawl finished!');
      console.log(`Total pages: ${status.data.length}`);
    } else if (status.status === 'failed') {
      clearInterval(checkInterval);
      console.error('Crawl failed:', status.error);
    }
  }, 3000); // Check every 3 seconds
}
```
Best Practices for Multi-Page Extraction
- Use appropriate limits: Set realistic `limit` values to avoid unnecessary API usage
- Filter URLs strategically: Use `includePaths` and `excludePaths` to target relevant pages
- Monitor job status: For large crawls, poll the status endpoint rather than waiting synchronously
- Handle errors gracefully: Implement retry logic for failed pages (see the sketch after this list)
- Respect rate limits: Use Firecrawl's built-in rate limiting rather than implementing your own
- Process results incrementally: For very large crawls, process results as they become available
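As a starting point for the retry recommendation above, here is a minimal sketch that wraps the single-page `scrape_url` call from the custom parallel processing example with simple exponential backoff. The retry count and delays are illustrative choices, not Firecrawl recommendations.

```python
import time

from firecrawl import FirecrawlApp

def scrape_with_retry(app, url, max_retries=3, base_delay=2):
    """Try to scrape a URL, backing off and retrying on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return app.scrape_url(url, params={'formats': ['markdown']})
        except Exception:
            if attempt == max_retries:
                raise  # Give up after the final attempt
            # Exponential backoff between attempts: 2s, then 4s, ...
            time.sleep(base_delay * (2 ** (attempt - 1)))

app = FirecrawlApp(api_key='your_api_key')
page = scrape_with_retry(app, 'https://example.com/page1')
```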
Comparison with Traditional Approaches
Firecrawl's concurrent processing offers significant advantages over traditional sequential scraping:
| Feature | Firecrawl Batch | Sequential Scraping |
|---------|-----------------|---------------------|
| Speed | Parallel processing | One page at a time |
| Infrastructure | Managed by Firecrawl | Self-hosted browsers |
| Rate Limiting | Automatic | Manual implementation |
| Error Handling | Built-in retries | Custom logic required |
| Scalability | Unlimited | Limited by resources |
Conclusion
Firecrawl excels at extracting data from multiple pages simultaneously through its robust crawling infrastructure. Whether you're using the batch `/crawl` endpoint for automatic page discovery or implementing custom parallel processing with the `/scrape` endpoint, Firecrawl provides the tools needed for efficient, large-scale web data extraction.
The combination of asynchronous job processing, automatic rate limiting, and built-in error handling makes Firecrawl an excellent choice for projects requiring concurrent multi-page scraping without the complexity of managing browser instances and parallel workers yourself.