Can I Use Firecrawl to Crawl Websites with a Sitemap?

Yes, Firecrawl provides native support for crawling websites using sitemaps, making it one of the most efficient ways to discover and scrape pages at scale. Sitemap-based crawling allows you to leverage the structured list of URLs that websites provide, ensuring comprehensive coverage while respecting the site's intended structure.

Understanding Sitemap-Based Crawling in Firecrawl

Sitemaps are XML files that websites publish to help search engines and crawlers discover their content. When you use Firecrawl with a sitemap, the crawler automatically:

  • Parses the sitemap XML to extract all listed URLs
  • Respects priority and change frequency hints
  • Handles nested sitemaps and sitemap index files
  • Processes large sitemaps efficiently with automatic pagination

This approach is particularly useful when you need to crawl large websites systematically or when you want to ensure you're capturing all important pages without missing content hidden behind complex navigation.
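
To make those steps concrete, here is a minimal, illustrative sketch of what sitemap parsing involves: fetch the XML, detect whether it is a sitemap index, and collect the listed URLs, recursing into nested sitemaps. This uses plain requests and xml.etree.ElementTree and is not Firecrawl's implementation; with useSitemap enabled, Firecrawl performs this discovery for you.

import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def collect_sitemap_urls(sitemap_url):
    """Recursively collect page URLs from a sitemap or sitemap index file."""
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ET.fromstring(response.content)

    urls = []
    if root.tag.endswith('sitemapindex'):
        # Sitemap index: recurse into each child sitemap it references
        for loc in root.findall('sm:sitemap/sm:loc', SITEMAP_NS):
            urls.extend(collect_sitemap_urls(loc.text.strip()))
    else:
        # Regular sitemap: collect the listed page URLs
        for loc in root.findall('sm:url/sm:loc', SITEMAP_NS):
            urls.append(loc.text.strip())
    return urls

urls = collect_sitemap_urls('https://example.com/sitemap.xml')
print(f"Discovered {len(urls)} URLs via the sitemap")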

Basic Sitemap Crawling with Firecrawl

Python Implementation

Here's how to use Firecrawl's sitemap support in Python:

from firecrawl import FirecrawlApp

# Initialize Firecrawl
app = FirecrawlApp(api_key='your_api_key_here')

# Crawl using sitemap
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'useSitemap': True,
        'limit': 100,
        'scrapeOptions': {
            'formats': ['markdown', 'html']
        }
    }
)

# Process the results
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content length: {len(page['markdown'])}")
    print("---")

JavaScript/Node.js Implementation

import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize Firecrawl
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

// Crawl using sitemap
async function crawlWithSitemap() {
    const crawlResult = await app.crawlUrl('https://example.com', {
        useSitemap: true,
        limit: 100,
        scrapeOptions: {
            formats: ['markdown', 'html']
        }
    });

    // Process results
    for (const page of crawlResult.data) {
        console.log(`URL: ${page.metadata.url}`);
        console.log(`Title: ${page.metadata.title}`);
        console.log(`Content length: ${page.markdown.length}`);
        console.log('---');
    }
}

crawlWithSitemap().catch(console.error);

Advanced Sitemap Configuration Options

Specifying Custom Sitemap URLs

If the sitemap isn't at the standard /sitemap.xml path, you can specify a custom sitemap URL:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Use a custom sitemap location
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'useSitemap': True,
        'sitemapUrl': 'https://example.com/custom-sitemap.xml',
        'limit': 200
    }
)

Filtering Sitemap URLs

You can apply filters to crawl only specific pages from the sitemap:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

const crawlResult = await app.crawlUrl('https://example.com', {
    useSitemap: true,
    includePaths: ['/blog/**', '/docs/**'],
    excludePaths: ['/blog/archive/**'],
    limit: 150,
    scrapeOptions: {
        formats: ['markdown'],
        onlyMainContent: true
    }
});

Handling Large Sitemaps

For websites with extensive sitemaps containing thousands of URLs, Firecrawl provides several strategies to manage the crawl efficiently:

Pagination and Batching

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key_here')

# Crawl in batches
def crawl_large_sitemap(base_url, batch_size=100):
    offset = 0
    all_results = []

    while True:
        result = app.crawl_url(
            url=base_url,
            params={
                'useSitemap': True,
                'limit': batch_size,
                'offset': offset,
                'scrapeOptions': {
                    'formats': ['markdown']
                }
            }
        )

        all_results.extend(result['data'])

        if len(result['data']) < batch_size:
            break

        offset += batch_size
        time.sleep(1)  # Rate limiting

    return all_results

# Usage
results = crawl_large_sitemap('https://example.com', batch_size=50)
print(f"Total pages crawled: {len(results)}")

Asynchronous Crawling

For better performance with large sitemaps, use Firecrawl's async crawl feature:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function asyncCrawlWithSitemap() {
    // Start async crawl
    const crawlId = await app.asyncCrawlUrl('https://example.com', {
        useSitemap: true,
        limit: 1000,
        scrapeOptions: {
            formats: ['markdown', 'html']
        }
    });

    console.log(`Crawl started with ID: ${crawlId}`);

    // Poll for completion
    let status;
    do {
        await new Promise(resolve => setTimeout(resolve, 5000));
        status = await app.checkCrawlStatus(crawlId);
        console.log(`Progress: ${status.completed}/${status.total}`);
    } while (status.status === 'scraping');

    // Get results
    const results = await app.getCrawlResults(crawlId);
    return results.data;
}

asyncCrawlWithSitemap().then(data => {
    console.log(`Successfully crawled ${data.length} pages`);
}).catch(console.error);

Combining Sitemap with Traditional Crawling

Sometimes you'll want to combine sitemap discovery with traditional link-following crawling to ensure comprehensive coverage:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Hybrid approach: sitemap + link following
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'useSitemap': True,
        'mode': 'hybrid',  # Use both sitemap and link discovery
        'maxDepth': 3,
        'limit': 500,
        'scrapeOptions': {
            'formats': ['markdown'],
            'waitFor': 2000  # Wait for dynamic content
        }
    }
)

This hybrid approach is particularly useful when the sitemap might not cover every page, or when you want to pick up recently added pages that haven't yet made it into the sitemap.
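
If you want to see how much the link-following side contributed, one simple check (sketched below, assuming a plain, non-index sitemap and that crawl_result comes from the hybrid call above) is to diff the crawled URLs against the URLs listed in the sitemap:

import requests
import xml.etree.ElementTree as ET

# URLs listed in the sitemap (a simple, non-index sitemap is assumed here)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(requests.get('https://example.com/sitemap.xml', timeout=10).content)
sitemap_urls = {loc.text.strip() for loc in root.findall('sm:url/sm:loc', ns)}

# URLs the hybrid crawl actually returned (shape follows the earlier examples)
crawled_urls = {page['metadata']['url'] for page in crawl_result['data']}

# Pages discovered only by following links, i.e. missing from the sitemap
link_only = crawled_urls - sitemap_urls
print(f"{len(link_only)} crawled pages are not listed in the sitemap")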

Working with Sitemap Index Files

Many large websites use sitemap index files that reference multiple individual sitemaps. Firecrawl handles these automatically:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Firecrawl automatically processes sitemap index files
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'useSitemap': True,
        'sitemapUrl': 'https://example.com/sitemap_index.xml',
        'limit': 1000,
        'scrapeOptions': {
            'formats': ['markdown']
        }
    }
)

# Access metadata about which sitemap each URL came from
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['url']}")
    print(f"Source sitemap: {page['metadata'].get('sourceSitemap', 'N/A')}")

Extracting Structured Data from Sitemap Crawls

Once you've crawled pages using a sitemap, you can combine this with Firecrawl's AI-powered extraction capabilities:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function crawlAndExtractFromSitemap() {
    const crawlResult = await app.crawlUrl('https://blog.example.com', {
        useSitemap: true,
        limit: 100,
        scrapeOptions: {
            formats: ['markdown'],
            extractorOptions: {
                mode: 'llm-extraction',
                extractionPrompt: 'Extract the article title, author, publication date, and main topics',
                extractionSchema: {
                    type: 'object',
                    properties: {
                        title: { type: 'string' },
                        author: { type: 'string' },
                        date: { type: 'string' },
                        topics: { type: 'array', items: { type: 'string' } }
                    }
                }
            }
        }
    });

    // Process extracted data
    for (const page of crawlResult.data) {
        console.log(`Article: ${page.extract.title}`);
        console.log(`Author: ${page.extract.author}`);
        console.log(`Topics: ${page.extract.topics.join(', ')}`);
    }
}

crawlAndExtractFromSitemap().catch(console.error);

Best Practices for Sitemap-Based Crawling

1. Verify Sitemap Availability

Before starting a large crawl, verify that the sitemap exists and is accessible:

import requests

def check_sitemap(base_url):
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-index.xml"
    ]

    for url in sitemap_urls:
        try:
            response = requests.head(url, timeout=5)
            if response.status_code == 200:
                print(f"Sitemap found at: {url}")
                return url
        except requests.RequestException:
            continue

    print("No sitemap found")
    return None

# Check before crawling
sitemap_url = check_sitemap('https://example.com')

2. Implement Rate Limiting

Respect the target website by implementing appropriate rate limiting:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function crawlWithRateLimit(url, requestsPerMinute = 60) {
    const delayMs = (60 / requestsPerMinute) * 1000;

    const crawlResult = await app.crawlUrl(url, {
        useSitemap: true,
        limit: 500,
        crawlOptions: {
            delay: delayMs,
            respectRobotsTxt: true
        },
        scrapeOptions: {
            formats: ['markdown']
        }
    });

    return crawlResult;
}

3. Handle Errors Gracefully

Implement robust error handling for failed pages:

from firecrawl import FirecrawlApp
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FirecrawlApp(api_key='your_api_key_here')

def safe_crawl_with_sitemap(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = app.crawl_url(
                url=url,
                params={
                    'useSitemap': True,
                    'limit': 200,
                    'scrapeOptions': {
                        'formats': ['markdown']
                    }
                }
            )

            # Filter out failed pages
            successful_pages = [
                page for page in result['data']
                if page.get('metadata', {}).get('statusCode') == 200
            ]

            failed_pages = len(result['data']) - len(successful_pages)
            if failed_pages > 0:
                logger.warning(f"{failed_pages} pages failed to scrape")

            return successful_pages

        except Exception as e:
            logger.error(f"Attempt {attempt + 1} failed: {str(e)}")
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

Performance Comparison: Sitemap vs. Traditional Crawling

Sitemap-based crawling offers several advantages over traditional link-following approaches:

| Aspect | Sitemap Crawling | Traditional Crawling |
|----------------|-----------------------------------|---------------------------------------|
| Speed | Faster - direct URL access | Slower - must follow links |
| Coverage | Complete if sitemap is up-to-date | May miss pages |
| Resource Usage | Lower - no link parsing needed | Higher - must parse each page |
| Reliability | High - structured data source | Variable - depends on site structure |

Integration with Web Scraping Workflows

Sitemap-based crawling with Firecrawl integrates well with other web scraping tools and techniques. For instance, when handling AJAX requests with Puppeteer or crawling single-page applications, you can use the sitemap to identify which pages need special handling for JavaScript rendering, as sketched below.
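
As a sketch of that idea, the snippet below splits crawled URLs into pages the crawl output already covers and pages to queue for a browser-based scraper. The /app/ and /dashboard/ path prefixes are hypothetical placeholders, and crawl_result is assumed to come from one of the crawls above:

from urllib.parse import urlparse

# Hypothetical path prefixes marking JavaScript-heavy sections of the site
JS_HEAVY_PREFIXES = ('/app/', '/dashboard/')

static_pages = []    # crawl output (markdown/HTML) is already sufficient
js_render_urls = []  # hand these off to Puppeteer/Playwright for rendering

for page in crawl_result['data']:
    url = page['metadata']['url']
    if urlparse(url).path.startswith(JS_HEAVY_PREFIXES):
        js_render_urls.append(url)
    else:
        static_pages.append(page)

print(f"{len(static_pages)} pages handled directly, {len(js_render_urls)} queued for JS rendering")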

Monitoring Crawl Progress

For large sitemap crawls, monitoring progress is essential:

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key_here')

def crawl_with_progress(url):
    # Start async crawl
    crawl_id = app.async_crawl_url(
        url=url,
        params={
            'useSitemap': True,
            'limit': 1000
        }
    )

    print(f"Started crawl {crawl_id}")

    # Monitor progress
    while True:
        status = app.check_crawl_status(crawl_id)

        completed = status.get('completed', 0)
        total = status.get('total', 0)
        percentage = (completed / total * 100) if total > 0 else 0

        print(f"Progress: {completed}/{total} ({percentage:.1f}%)")

        if status['status'] == 'completed':
            print("Crawl completed!")
            break
        elif status['status'] == 'failed':
            print("Crawl failed!")
            break

        time.sleep(5)

    # Retrieve results
    return app.get_crawl_results(crawl_id)

results = crawl_with_progress('https://example.com')

Conclusion

Firecrawl's sitemap support provides a powerful and efficient way to crawl websites systematically. By leveraging sitemaps, you can ensure comprehensive coverage, reduce resource usage, and speed up your web scraping projects. Whether you're crawling a small blog or a massive e-commerce site, sitemap-based crawling offers a reliable foundation for your data extraction needs.

The combination of sitemap discovery with Firecrawl's advanced features like AI-powered extraction, markdown conversion, and async processing makes it an excellent choice for developers who need to scrape web content at scale while maintaining code simplicity and reliability.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
