How do I configure crawl depth with Firecrawl?
Configuring crawl depth in Firecrawl is essential for controlling how extensively your crawler traverses a website. The crawl depth determines how many levels of links the crawler will follow from the starting URL, letting you balance comprehensive data collection against crawl time and resource usage.
Understanding Crawl Depth
Crawl depth refers to the number of "hops" or link levels the crawler will follow from the initial URL. For example:
- Depth 0: Only crawl the starting URL
- Depth 1: Crawl the starting URL and all pages directly linked from it
- Depth 2: Crawl depth 1 pages plus all pages linked from those pages
- Depth 3 and beyond: Each additional level follows links one more hop from the previous level
Understanding the appropriate depth for your use case is crucial. Shallow depths (1-2) are ideal for focused scraping tasks, while deeper crawls (3-5) are better for comprehensive site mapping or large-scale data collection.
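To make the idea of "hops" concrete, here is a small, framework-agnostic sketch of depth-limited crawling as a breadth-first traversal. It is purely illustrative; the link graph and the get_links callback are made up, and this says nothing about how Firecrawl implements depth internally.

from collections import deque

def depth_limited_crawl(start_url, max_depth, get_links):
    """Breadth-first traversal that stops following links beyond max_depth."""
    visited = {start_url}
    queue = deque([(start_url, 0)])  # (url, number of hops from the start URL)
    while queue:
        url, depth = queue.popleft()
        print(f"depth {depth}: {url}")
        if depth == max_depth:
            continue  # do not enqueue links discovered beyond the configured depth
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))

# Tiny hypothetical link graph standing in for "fetch the page and extract its links"
links = {
    "https://example.com": ["https://example.com/blog", "https://example.com/docs"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/docs": [],
    "https://example.com/blog/post-1": [],
}
depth_limited_crawl("https://example.com", max_depth=1, get_links=lambda u: links.get(u, []))

With max_depth=1 this visits the starting URL and its directly linked pages, but never reaches the depth-2 blog post, which matches the definitions in the list above.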
Configuring Crawl Depth in Firecrawl
Firecrawl provides a straightforward way to configure crawl depth through the maxDepth parameter in the crawl options. This parameter is available in both the API and SDK implementations.
Using the Firecrawl API
When making a POST request to the /crawl endpoint, you can specify the maxDepth parameter in the request body:
curl -X POST https://api.firecrawl.dev/v1/crawl \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"maxDepth": 2,
"limit": 100
}'
This configuration will:
- Start crawling from https://example.com
- Follow links up to 2 levels deep
- Stop after collecting 100 pages (as specified by the limit parameter)
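If you would rather issue the request from a script than from curl, the same call translates directly to Python's requests library (YOUR_API_KEY is a placeholder, as above):

import requests

response = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com",
        "maxDepth": 2,
        "limit": 100,
    },
)
response.raise_for_status()
print(response.json())  # typically includes an id you can use to check crawl status later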
Using the Python SDK
The Firecrawl Python SDK provides a clean interface for configuring crawl depth:
from firecrawl import FirecrawlApp
# Initialize the Firecrawl client
app = FirecrawlApp(api_key='YOUR_API_KEY')
# Configure crawl parameters including depth
crawl_params = {
    'maxDepth': 3,
    'limit': 200,
    'includePaths': ['blog/*'],   # only follow URLs whose path matches these patterns
    'excludePaths': ['admin/*']   # skip URLs whose path matches these patterns
}

# Start the crawl
crawl_result = app.crawl_url(
    url='https://example.com',
    params=crawl_params,
    wait_until_done=True
)

# Process the results
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Title: {page['metadata'].get('title', '')}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")
Using the JavaScript/Node.js SDK
For JavaScript developers, Firecrawl's Node.js SDK offers similar functionality:
import FirecrawlApp from '@mendable/firecrawl-js';
// Initialize the client
const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });
async function crawlWebsite() {
try {
const crawlResult = await app.crawlUrl('https://example.com', {
maxDepth: 2,
limit: 150,
scrapeOptions: {
formats: ['markdown', 'html'],
onlyMainContent: true
}
});
    // crawlUrl resolves once the crawl has finished (or failed)
    if (crawlResult.success) {
      console.log(`Crawled ${crawlResult.data.length} pages`);
      // Process each page
      crawlResult.data.forEach(page => {
        console.log(`URL: ${page.metadata?.sourceURL}`);
        console.log(`Depth: ${page.metadata?.depth ?? 'N/A'}`); // depth may not be reported in every version
        console.log(`Content: ${(page.markdown || '').substring(0, 200)}...`);
console.log('---');
});
}
} catch (error) {
console.error('Crawl failed:', error);
}
}
crawlWebsite();
Advanced Crawl Depth Configuration
Combining Depth with URL Patterns
You can make your crawls more efficient by combining depth limits with URL include/exclude patterns:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='YOUR_API_KEY')
# Crawl only blog posts up to 2 levels deep
crawl_params = {
    'maxDepth': 2,
    'limit': 500,
    'includePaths': [        # path patterns matched against each URL's path
        'blog/*',
        'articles/*'
    ],
    'excludePaths': [        # skip comment and share pages within those sections
        '*/comments/*',
        '*/share/*'
    ]
}

result = app.crawl_url(
    url='https://example.com/blog',
    params=crawl_params,
    wait_until_done=True
)
This approach is particularly useful when you want to perform focused crawling on specific sections of a website, similar to how you might crawl a single page application with targeted navigation.
Dynamic Depth Adjustment
For more sophisticated crawling strategies, you might want to adjust depth based on the content you're finding:
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });
async function adaptiveCrawl(baseUrl, initialDepth = 2) {
let currentDepth = initialDepth;
let allPages = [];
while (currentDepth <= 4) {
console.log(`Crawling with depth ${currentDepth}...`);
const result = await app.crawlUrl(baseUrl, {
maxDepth: currentDepth,
limit: 100
});
    if (result.success) {
      // Note: each pass re-crawls from the start, so allPages may contain
      // duplicate URLs across depths; dedupe by URL if that matters.
      allPages = allPages.concat(result.data);
      // Check if we found enough pages
      if (result.data.length >= 80) {
console.log(`Found sufficient pages at depth ${currentDepth}`);
break;
}
// Increase depth if we need more pages
currentDepth++;
} else {
break;
}
}
return allPages;
}
// Use the adaptive crawl
adaptiveCrawl('https://example.com')
.then(pages => {
console.log(`Total pages collected: ${pages.length}`);
})
.catch(error => {
console.error('Adaptive crawl failed:', error);
});
Monitoring Crawl Progress by Depth
When working with deeper crawls, monitoring progress becomes important:
from firecrawl import FirecrawlApp
import time
app = FirecrawlApp(api_key='YOUR_API_KEY')
# Start an asynchronous crawl
crawl_params = {
'maxDepth': 4,
'limit': 1000
}
# Initiate crawl without waiting
crawl_job = app.crawl_url(
url='https://example.com',
params=crawl_params,
wait_until_done=False
)
job_id = crawl_job['id']
print(f"Crawl job started: {job_id}")
# Poll for status
while True:
    status = app.check_crawl_status(job_id)

    if status['status'] == 'completed':
        print("\nCrawl completed!")
        print(f"Total pages: {status['total']}")
        print(f"Completed: {status['completed']}")
        # Retrieve results
        results = status['data']
        break
    elif status['status'] == 'failed':
        print("Crawl failed!")
        break
    else:
        print(f"Progress: {status['completed']}/{status['total']} pages")
        time.sleep(5)
This pattern is useful when dealing with large websites where handling timeouts and monitoring progress is crucial for successful data collection.
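A common refinement is to cap how long you are willing to poll. The helper below is a sketch built around the same check_crawl_status call; the max_wait_seconds budget and the TimeoutError are this example's additions, not Firecrawl settings.

import time

def wait_for_crawl(app, job_id, max_wait_seconds=1800, poll_interval=5):
    """Poll a crawl job until it finishes or a wall-clock budget runs out."""
    deadline = time.time() + max_wait_seconds
    while time.time() < deadline:
        status = app.check_crawl_status(job_id)
        if status['status'] in ('completed', 'failed'):
            return status
        print(f"Progress: {status.get('completed', '?')}/{status.get('total', '?')} pages")
        time.sleep(poll_interval)
    raise TimeoutError(f"Crawl {job_id} did not finish within {max_wait_seconds} seconds")

# Usage: reuse the app and job_id from the example above
final_status = wait_for_crawl(app, job_id)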
Best Practices for Crawl Depth Configuration
1. Start Conservative
Begin with a shallow depth (1-2) to understand the site structure and estimate the total number of pages you'll encounter:
# Initial exploration crawl
exploration = app.crawl_url(
url='https://example.com',
params={'maxDepth': 1, 'limit': 50},
wait_until_done=True
)
print(f"Found {len(exploration['data'])} pages at depth 1")
print("Sample URLs:")
for page in exploration['data'][:5]:
    print(f" - {page['metadata']['sourceURL']}")
2. Consider Site Architecture
Different types of websites require different depth strategies (see the sketch after this list for one way to encode them):
- Blogs: Depth 2-3 (home → category → article)
- E-commerce: Depth 3-4 (home → category → subcategory → product)
- Documentation: Depth 2-3 (home → section → page)
- News sites: Depth 2 (home/section → article)
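One way to put these rules of thumb to work is a small lookup table that feeds your crawl parameters. The mapping below simply restates the upper end of each range from the list above; adjust it to your own sites.

# Rough maxDepth guidelines by site type (upper end of the ranges listed above)
DEPTH_BY_SITE_TYPE = {
    'blog': 3,
    'ecommerce': 4,
    'documentation': 3,
    'news': 2,
}

def crawl_params_for(site_type, limit=200):
    """Build crawl parameters using the depth guideline for a given site type."""
    return {'maxDepth': DEPTH_BY_SITE_TYPE.get(site_type, 2), 'limit': limit}

print(crawl_params_for('documentation'))  # {'maxDepth': 3, 'limit': 200}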
3. Combine with Rate Limiting
When increasing depth, always be mindful of the load you place on the target site and of how long individual pages can take. The scrapeOptions below give each page time to load and cap how long a single page may run:
const crawlResult = await app.crawlUrl('https://example.com', {
  maxDepth: 3,
  limit: 500,
  scrapeOptions: {
    waitFor: 1000,  // wait 1 second for each page to load before scraping (not a delay between requests)
    timeout: 30000  // 30-second timeout per page
  }
});
4. Use Depth Metadata
If your crawl results include depth information in the page metadata, track it to see how pages are distributed across levels:
from collections import defaultdict
# Organize results by depth
pages_by_depth = defaultdict(list)
for page in crawl_result['data']:
    depth = page.get('metadata', {}).get('depth', 0)
    pages_by_depth[depth].append(page.get('metadata', {}).get('sourceURL'))

# Analyze distribution
for depth, urls in sorted(pages_by_depth.items()):
    print(f"Depth {depth}: {len(urls)} pages")
Common Pitfalls and Solutions
Issue: Crawling Too Many Pages
Problem: Setting depth too high results in thousands of unwanted pages.
Solution: Combine maxDepth with strict URL patterns and a lower limit:
crawl_params = {
    'maxDepth': 3,
    'limit': 200,                 # hard limit on total pages
    'includePaths': ['docs/*']    # only follow URLs under /docs
}
Issue: Missing Important Pages
Problem: Important pages are beyond your configured depth.
Solution: Use multiple targeted crawls with different starting points:
important_sections = [
'https://example.com/products',
'https://example.com/blog',
'https://example.com/docs'
]
all_results = []
for section in important_sections:
    result = app.crawl_url(
        url=section,
        params={'maxDepth': 2, 'limit': 100},
        wait_until_done=True
    )
    all_results.extend(result['data'])
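Because site sections often link to one another, the same page can show up in more than one of these crawls. A simple dedupe by source URL (using the metadata field shown in the earlier examples) keeps the combined results clean:

# Deduplicate combined results by source URL
seen = set()
unique_results = []
for page in all_results:
    url = page.get('metadata', {}).get('sourceURL')
    if url and url not in seen:
        seen.add(url)
        unique_results.append(page)
print(f"{len(unique_results)} unique pages out of {len(all_results)} collected")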
Issue: Crawl Takes Too Long
Problem: Deep crawls with high limits take hours to complete.
Solution: Use asynchronous crawling and process results incrementally, similar to how you might handle browser sessions for long-running operations.
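Here is a rough sketch of that pattern, reusing the asynchronous start and check_crawl_status polling from the monitoring example above. It assumes the status response exposes already-scraped pages in its data field while the crawl is still running (worth verifying against your SDK version), and process_page is a placeholder for your own handling.

import time

processed_urls = set()

# Start the crawl without blocking, as in the monitoring example
crawl_job = app.crawl_url(
    url='https://example.com',
    params={'maxDepth': 4, 'limit': 1000},
    wait_until_done=False
)

while True:
    status = app.check_crawl_status(crawl_job['id'])
    # Handle pages as they become available instead of waiting for the whole crawl
    for page in status.get('data') or []:
        url = page.get('metadata', {}).get('sourceURL')
        if url and url not in processed_urls:
            processed_urls.add(url)
            process_page(page)  # placeholder for your own processing logic
    if status['status'] in ('completed', 'failed'):
        break
    time.sleep(10)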
Conclusion
Configuring crawl depth in Firecrawl is a balancing act between comprehensive coverage and efficient resource usage. By starting with conservative depth settings, understanding your target website's structure, and combining depth limits with URL patterns and page limits, you can create efficient and effective web crawling workflows.
Remember to always respect website terms of service, implement appropriate rate limiting, and monitor your crawls to ensure they're performing as expected. With proper configuration, Firecrawl's depth control features enable you to extract exactly the data you need without unnecessary overhead.