How do I configure crawl depth with Firecrawl?

Configuring crawl depth in Firecrawl is essential for controlling how extensively your crawler traverses a website. The crawl depth determines how many levels of links the crawler will follow from the starting URL, allowing you to balance between comprehensive data collection and resource efficiency.

Understanding Crawl Depth

Crawl depth refers to the number of "hops" or link levels the crawler will follow from the initial URL. For example:

  • Depth 0: Only crawl the starting URL
  • Depth 1: Crawl the starting URL and all pages directly linked from it
  • Depth 2: Crawl depth 1 pages plus all pages linked from those pages
  • Depth 3 and beyond: each additional level follows links one more hop out from the starting URL

Understanding the appropriate depth for your use case is crucial. Shallow depths (1-2) are ideal for focused scraping tasks, while deeper crawls (3-5) are better for comprehensive site mapping or large-scale data collection.
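
To make the idea concrete, here is a minimal, standalone sketch (plain Python, not Firecrawl code) that walks a hypothetical link graph breadth-first and shows which depth each page would land at for a given maximum depth:

from collections import deque

# Hypothetical link graph: each page maps to the pages it links to
links = {
    "https://example.com": ["https://example.com/blog", "https://example.com/docs"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/docs": [],
    "https://example.com/blog/post-1": [],
}

def pages_by_depth(start, max_depth):
    """Breadth-first walk that mirrors how a crawler assigns depth levels."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not follow links past max_depth
        for linked in links.get(url, []):
            if linked not in depths:
                depths[linked] = depths[url] + 1
                queue.append(linked)
    return depths

print(pages_by_depth("https://example.com", max_depth=1))
# {'https://example.com': 0, 'https://example.com/blog': 1, 'https://example.com/docs': 1}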

Configuring Crawl Depth in Firecrawl

Firecrawl provides a straightforward way to configure crawl depth through the maxDepth parameter in the crawl options. This parameter is available in both the API and SDK implementations.

Using the Firecrawl API

When making a POST request to the /crawl endpoint, you can specify the maxDepth parameter in the request body:

curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "maxDepth": 2,
    "limit": 100
  }'

This configuration will:

  • Start crawling from https://example.com
  • Follow links up to 2 levels deep
  • Stop after collecting 100 pages (as specified by the limit parameter)
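
Because the crawl endpoint returns a job rather than the pages themselves, you typically start the crawl and then poll for results. Here is a minimal Python sketch using the requests library; it assumes the response carries the job id in an id field and that the job can be polled at /v1/crawl/{id}, as in Firecrawl's v1 API:

import requests

API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Start the crawl (same payload as the curl example above)
resp = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers=headers,
    json={"url": "https://example.com", "maxDepth": 2, "limit": 100},
)
resp.raise_for_status()
job = resp.json()
print("Job started:", job.get("id"))

# Check on the job later (assumes the v1 status endpoint /v1/crawl/{id})
status = requests.get(
    f"https://api.firecrawl.dev/v1/crawl/{job['id']}", headers=headers
).json()
print("Status:", status.get("status"))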

Using the Python SDK

The Firecrawl Python SDK provides a clean interface for configuring crawl depth:

from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='YOUR_API_KEY')

# Configure crawl parameters including depth
crawl_params = {
    'maxDepth': 3,
    'limit': 200,
    'includeUrls': ['https://example.com/blog/*'],
    'excludeUrls': ['https://example.com/admin/*']
}

# Start the crawl
crawl_result = app.crawl_url(
    url='https://example.com',
    params=crawl_params,
    wait_until_done=True
)

# Process the results
for page in crawl_result['data']:
    print(f"URL: {page['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content: {page['content'][:200]}...")
    print("---")

Using the JavaScript/Node.js SDK

For JavaScript developers, Firecrawl's Node.js SDK offers similar functionality:

import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize the client
const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });

async function crawlWebsite() {
  try {
    const crawlResult = await app.crawlUrl('https://example.com', {
      maxDepth: 2,
      limit: 150,
      scrapeOptions: {
        formats: ['markdown', 'html'],
        onlyMainContent: true
      }
    });

    // Wait for crawl to complete
    if (crawlResult.success) {
      console.log(`Crawled ${crawlResult.data.length} pages`);

      // Process each page
      crawlResult.data.forEach(page => {
        console.log(`URL: ${page.url}`);
        console.log(`Depth: ${page.metadata.depth || 'N/A'}`);
        console.log(`Content: ${page.markdown.substring(0, 200)}...`);
        console.log('---');
      });
    }
  } catch (error) {
    console.error('Crawl failed:', error);
  }
}

crawlWebsite();

Advanced Crawl Depth Configuration

Combining Depth with URL Patterns

You can make your crawls more efficient by combining depth limits with URL include/exclude patterns:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='YOUR_API_KEY')

# Crawl only blog posts up to 2 levels deep
crawl_params = {
    'maxDepth': 2,
    'limit': 500,
    'includeUrls': [
        'https://example.com/blog/*',
        'https://example.com/articles/*'
    ],
    'excludeUrls': [
        'https://example.com/*/comments/*',
        'https://example.com/*/share/*'
    ]
}

result = app.crawl_url(
    url='https://example.com/blog',
    params=crawl_params,
    wait_until_done=True
)

This approach is particularly useful when you want to perform focused crawling on specific sections of a website, similar to how you might crawl a single page application with targeted navigation.

Dynamic Depth Adjustment

For more sophisticated crawling strategies, you might want to adjust depth based on the content you're finding:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });

async function adaptiveCrawl(baseUrl, initialDepth = 2) {
  let currentDepth = initialDepth;
  let allPages = [];

  while (currentDepth <= 4) {
    console.log(`Crawling with depth ${currentDepth}...`);

    const result = await app.crawlUrl(baseUrl, {
      maxDepth: currentDepth,
      limit: 100
    });

    if (result.success) {
      allPages = allPages.concat(result.data);

      // Check if we found enough pages
      if (result.data.length >= 80) {
        console.log(`Found sufficient pages at depth ${currentDepth}`);
        break;
      }

      // Increase depth if we need more pages
      currentDepth++;
    } else {
      break;
    }
  }

  return allPages;
}

// Use the adaptive crawl
adaptiveCrawl('https://example.com')
  .then(pages => {
    console.log(`Total pages collected: ${pages.length}`);
  })
  .catch(error => {
    console.error('Adaptive crawl failed:', error);
  });

Monitoring Crawl Progress by Depth

When working with deeper crawls, monitoring progress becomes important:

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='YOUR_API_KEY')

# Start an asynchronous crawl
crawl_params = {
    'maxDepth': 4,
    'limit': 1000
}

# Initiate crawl without waiting
crawl_job = app.crawl_url(
    url='https://example.com',
    params=crawl_params,
    wait_until_done=False
)

job_id = crawl_job['id']
print(f"Crawl job started: {job_id}")

# Poll for status
while True:
    status = app.check_crawl_status(job_id)

    if status['status'] == 'completed':
        print(f"\nCrawl completed!")
        print(f"Total pages: {status['total']}")
        print(f"Completed: {status['completed']}")

        # Retrieve results
        results = status['data']
        break
    elif status['status'] == 'failed':
        print("Crawl failed!")
        break
    else:
        print(f"Progress: {status['completed']}/{status['total']} pages")
        time.sleep(5)

This pattern is useful when dealing with large websites where handling timeouts and monitoring progress is crucial for successful data collection.

Best Practices for Crawl Depth Configuration

1. Start Conservative

Begin with a shallow depth (1-2) to understand the site structure and estimate the total number of pages you'll encounter:

# Initial exploration crawl
exploration = app.crawl_url(
    url='https://example.com',
    params={'maxDepth': 1, 'limit': 50},
    wait_until_done=True
)

print(f"Found {len(exploration['data'])} pages at depth 1")
print("Sample URLs:")
for page in exploration['data'][:5]:
    print(f"  - {page['url']}")

2. Consider Site Architecture

Different types of websites require different depth strategies; a small helper that encodes these heuristics as defaults is sketched after the list:

  • Blogs: Depth 2-3 (home → category → article)
  • E-commerce: Depth 3-4 (home → category → subcategory → product)
  • Documentation: Depth 2-3 (home → section → page)
  • News sites: Depth 2 (home/section → article)
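
If you crawl many kinds of sites, a small lookup table keeps these heuristics in one place. The depth_for helper below is a hypothetical convenience function, not part of the Firecrawl SDK:

# Rough depth heuristics by site type (hypothetical helper, not part of Firecrawl)
RECOMMENDED_DEPTH = {
    "blog": 3,
    "ecommerce": 4,
    "documentation": 3,
    "news": 2,
}

def depth_for(site_type, default=2):
    """Return a starting maxDepth for the given site type."""
    return RECOMMENDED_DEPTH.get(site_type, default)

crawl_params = {"maxDepth": depth_for("documentation"), "limit": 300}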

3. Combine with Rate Limiting

When increasing depth, always be mindful of server load and implement appropriate rate limiting:

const crawlResult = await app.crawlUrl('https://example.com', {
  maxDepth: 3,
  limit: 500,
  scrapeOptions: {
    waitFor: 1000, // Give each page 1 second to settle before scraping
    timeout: 30000 // 30 second timeout per page
  }
});

4. Use Depth Metadata

Track the depth of each page in your results for better analytics:

from collections import defaultdict

# Organize results by depth
pages_by_depth = defaultdict(list)

for page in crawl_result['data']:
    depth = page.get('metadata', {}).get('depth', 0)
    pages_by_depth[depth].append(page['url'])

# Analyze distribution
for depth, urls in sorted(pages_by_depth.items()):
    print(f"Depth {depth}: {len(urls)} pages")

Common Pitfalls and Solutions

Issue: Crawling Too Many Pages

Problem: Setting depth too high results in thousands of unwanted pages.

Solution: Combine maxDepth with strict URL patterns and lower limit:

crawl_params = {
    'maxDepth': 3,
    'limit': 200,  # Hard limit on total pages
    'includeUrls': ['https://example.com/docs/*']
}

Issue: Missing Important Pages

Problem: Important pages are beyond your configured depth.

Solution: Use multiple targeted crawls with different starting points:

important_sections = [
    'https://example.com/products',
    'https://example.com/blog',
    'https://example.com/docs'
]

all_results = []
for section in important_sections:
    result = app.crawl_url(
        url=section,
        params={'maxDepth': 2, 'limit': 100},
        wait_until_done=True
    )
    all_results.extend(result['data'])

Issue: Crawl Takes Too Long

Problem: Deep crawls with high limits take hours to complete.

Solution: Use asynchronous crawling and process results incrementally, similar to how you might handle browser sessions for long-running operations.
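
One way to do this is to reuse the asynchronous pattern from the monitoring example above and hand pages off as soon as they appear in the status payload. This is a sketch rather than a definitive recipe: it assumes check_crawl_status exposes partial results under data while the job is still running, and handle_page stands in for your own processing.

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='YOUR_API_KEY')

# Kick off a long crawl without blocking
job = app.crawl_url(
    url='https://example.com',
    params={'maxDepth': 4, 'limit': 2000},
    wait_until_done=False
)

processed = set()
while True:
    status = app.check_crawl_status(job['id'])

    # Hand off pages as soon as they appear (assumes partial results are
    # exposed under 'data' while the job is still running)
    for page in status.get('data') or []:
        url = page.get('url')
        if url and url not in processed:
            processed.add(url)
            # handle_page(page)  # hypothetical downstream processing

    if status['status'] in ('completed', 'failed'):
        print(f"Crawl {status['status']}, {len(processed)} pages processed")
        break

    time.sleep(10)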

Conclusion

Configuring crawl depth in Firecrawl is a balancing act between comprehensive coverage and efficient resource usage. By starting with conservative depth settings, understanding your target website's structure, and combining depth limits with URL patterns and page limits, you can create efficient and effective web crawling workflows.

Remember to always respect website terms of service, implement appropriate rate limiting, and monitor your crawls to ensure they're performing as expected. With proper configuration, Firecrawl's depth control features enable you to extract exactly the data you need without unnecessary overhead.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
