How Do I Use Firecrawl with Python?

Firecrawl is a powerful web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. It handles the complexity of modern web pages, including JavaScript rendering, dynamic content, and browser sessions, making it an excellent choice for developers who need reliable web data extraction.

In this guide, you'll learn how to use Firecrawl with Python, from basic setup to advanced scraping techniques.

What is Firecrawl?

Firecrawl is a managed web scraping service that provides:

  • JavaScript rendering - Executes JavaScript to capture dynamically loaded content
  • Clean markdown output - Converts HTML to clean, structured markdown
  • Smart crawling - Automatically discovers and crawls related pages
  • LLM-ready data - Outputs data optimized for AI/ML applications
  • Managed infrastructure - No need to manage proxies, browsers, or anti-bot systems

Installing the Firecrawl Python SDK

The easiest way to use Firecrawl with Python is through the official Python SDK. Install it using pip:

pip install firecrawl-py

For projects using Poetry:

poetry add firecrawl-py

For Pipenv:

pipenv install firecrawl-py

Getting Your API Key

Before you can use Firecrawl, you need an API key:

  1. Sign up at firecrawl.dev
  2. Navigate to your dashboard
  3. Copy your API key from the API Keys section

Store your API key securely using environment variables:

export FIRECRAWL_API_KEY='your_api_key_here'
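
If you prefer to keep the key in a local .env file rather than your shell profile, the python-dotenv package can load it into the environment for you. A minimal sketch, assuming you have run pip install python-dotenv and created a .env file containing FIRECRAWL_API_KEY:

import os
from dotenv import load_dotenv

# Read variables from a local .env file into the environment (a no-op if the file is absent)
load_dotenv()

api_key = os.getenv('FIRECRAWL_API_KEY')
if not api_key:
    raise RuntimeError('FIRECRAWL_API_KEY is not set')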

Basic Usage: Scraping a Single Page

Here's a simple example of scraping a single web page with Firecrawl:

from firecrawl import FirecrawlApp
import os

# Initialize the Firecrawl client
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape a single page
url = 'https://example.com'
scraped_data = app.scrape_url(url)

# Access the content
print(scraped_data['markdown'])  # Clean markdown content
print(scraped_data['html'])      # Original HTML
print(scraped_data['metadata'])  # Page metadata

Scraping with Options

You can customize the scraping behavior with various options:

from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape with custom options
scraped_data = app.scrape_url(
    url='https://example.com',
    params={
        'formats': ['markdown', 'html', 'screenshot'],
        'onlyMainContent': True,  # Extract only main content
        'waitFor': 2000,          # Wait 2 seconds for JavaScript
        'includeTags': ['article', 'main'],
        'excludeTags': ['nav', 'footer'],
    }
)

print(scraped_data['markdown'])

Crawling Multiple Pages

Firecrawl can automatically discover and crawl multiple pages from a website by following the links it finds on each page:

from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Start a crawl job
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 100,              # Maximum pages to crawl
        'scrapeOptions': {
            'formats': ['markdown'],
            'onlyMainContent': True
        }
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process all crawled pages
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")

Crawling with URL Patterns

You can control which pages to crawl using include and exclude patterns:

crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 50,
        'includePaths': ['/blog/*', '/articles/*'],
        'excludePaths': ['/admin/*', '/login'],
        'maxDepth': 3,  # Maximum crawl depth
        'scrapeOptions': {
            'formats': ['markdown'],
            'waitFor': 1000
        }
    }
)

Extracting Structured Data with an LLM

One of Firecrawl's most powerful features is structured data extraction using AI:

from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Define the schema for extracted data
schema = {
    'type': 'object',
    'properties': {
        'title': {'type': 'string'},
        'price': {'type': 'number'},
        'description': {'type': 'string'},
        'features': {
            'type': 'array',
            'items': {'type': 'string'}
        },
        'inStock': {'type': 'boolean'}
    },
    'required': ['title', 'price']
}

# Extract structured data
result = app.scrape_url(
    url='https://example.com/product',
    params={
        'formats': ['extract'],
        'extract': {
            'schema': schema,
            'systemPrompt': 'Extract product information from the page',
            'prompt': 'Extract the product title, price, description, features, and stock status'
        }
    }
)

# Access structured data
product_data = result['extract']
print(f"Product: {product_data['title']}")
print(f"Price: ${product_data['price']}")
print(f"In Stock: {product_data['inStock']}")

Handling Authentication

For scraping pages that require authentication, you can pass cookies or headers:

from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape with authentication
scraped_data = app.scrape_url(
    url='https://example.com/protected',
    params={
        'headers': {
            'Authorization': 'Bearer your_token_here',
            'Cookie': 'session_id=your_session_id'
        }
    }
)

This is useful when the target pages sit behind a login and you already have a valid token or session cookie; keep those credentials out of source control, just like your API key.

Batch Processing with Async Operations

For high-performance scraping, you can use Python's async capabilities:

import asyncio
from firecrawl import FirecrawlApp
import os

async def scrape_multiple_urls(urls):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

    async def scrape_url(url):
        # Note: The SDK doesn't natively support async,
        # but you can use asyncio.to_thread for I/O operations
        return await asyncio.to_thread(
            app.scrape_url,
            url,
            params={'formats': ['markdown']}
        )

    # Scrape all URLs concurrently
    tasks = [scrape_url(url) for url in urls]
    results = await asyncio.gather(*tasks)
    return results

# Usage
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
]

results = asyncio.run(scrape_multiple_urls(urls))
for result in results:
    print(result['markdown'][:100])
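
If you scrape many URLs at once, unbounded concurrency can quickly run into API rate limits. One way to cap it is an asyncio.Semaphore. A minimal sketch building on the pattern above; the limit of 5 concurrent requests is an arbitrary assumption, not a documented Firecrawl limit:

import asyncio
import os
from firecrawl import FirecrawlApp

async def scrape_with_limit(urls, max_concurrent=5):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
    semaphore = asyncio.Semaphore(max_concurrent)

    async def scrape_one(url):
        # Only max_concurrent requests run at the same time
        async with semaphore:
            return await asyncio.to_thread(
                app.scrape_url,
                url,
                params={'formats': ['markdown']}
            )

    return await asyncio.gather(*(scrape_one(url) for url in urls))

# Usage
# results = asyncio.run(scrape_with_limit(urls, max_concurrent=3))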

Error Handling and Retries

Implement robust error handling for production applications:

from firecrawl import FirecrawlApp
import os
import time

def scrape_with_retry(url, max_retries=3, retry_delay=5):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'formats': ['markdown'],
                    'timeout': 30000  # 30 second timeout
                }
            )
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(retry_delay)
            else:
                raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    print(data['markdown'])
except Exception as e:
    print(f"Failed to scrape after all retries: {e}")

Monitoring Crawl Progress

For long-running crawl jobs, you can monitor progress:

from firecrawl import FirecrawlApp
import os
import time

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Start crawl without polling
crawl_id = app.crawl_url(
    url='https://example.com',
    params={'limit': 100},
    poll_interval=None  # Don't auto-poll
)['id']

# Manually check status
while True:
    status = app.check_crawl_status(crawl_id)

    print(f"Status: {status['status']}")
    print(f"Completed: {status['completed']}/{status['total']}")

    if status['status'] == 'completed':
        # Retrieve all data
        for page in status['data']:
            print(f"Scraped: {page['metadata']['sourceURL']}")
        break

    time.sleep(5)

Saving Results to Files

Save scraped data to various formats:

import json
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape and save as markdown
result = app.scrape_url('https://example.com')

# Save markdown
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(result['markdown'])

# Save as JSON
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2)

# Save screenshot (if requested)
if 'screenshot' in result:
    import base64
    screenshot_data = base64.b64decode(result['screenshot'])
    with open('screenshot.png', 'wb') as f:
        f.write(screenshot_data)

Best Practices

  1. Use Environment Variables - Never hardcode API keys in your source code
  2. Implement Rate Limiting - Respect API rate limits to avoid throttling
  3. Handle Errors Gracefully - Always implement try-catch blocks and retries
  4. Cache Results - Store scraped data to avoid redundant API calls (a combined caching and rate-limiting sketch follows this list)
  5. Use Specific Selectors - When possible, use includeTags and excludeTags to reduce processing time
  6. Monitor Usage - Track your API usage to stay within plan limits
  7. Test with Small Batches - Test your scraping logic on a few URLs before scaling up
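
For points 2 and 4, a small on-disk cache combined with a short pause between fresh requests goes a long way. A minimal sketch, assuming a local cache/ directory and reusing the scrape_with_retry helper from the error-handling section; the one-second delay is an arbitrary choice, not a documented Firecrawl limit:

import hashlib
import json
import os
import time

CACHE_DIR = 'cache'
os.makedirs(CACHE_DIR, exist_ok=True)

def scrape_cached(url, delay=1.0):
    # Derive a stable filename from the URL
    cache_file = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + '.json')

    # Reuse the cached result if this URL was already scraped
    if os.path.exists(cache_file):
        with open(cache_file, 'r', encoding='utf-8') as f:
            return json.load(f)

    # Simple rate limiting: pause before every fresh request
    time.sleep(delay)
    result = scrape_with_retry(url)  # defined in the error-handling section above

    with open(cache_file, 'w', encoding='utf-8') as f:
        json.dump(result, f)
    return result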

Comparison with Other Tools

Firecrawl offers several advantages over traditional scraping libraries:

  • No Infrastructure Management - Unlike self-hosted solutions with Puppeteer or Selenium
  • Built-in JavaScript Rendering - No need to manage headless browsers
  • LLM-Optimized Output - Perfect for AI/ML applications
  • Automatic Retries - Built-in resilience and error handling
  • Scalable - Handles high-volume scraping without managing proxies

Conclusion

Firecrawl provides a powerful, managed solution for web scraping with Python. Its combination of JavaScript rendering, clean markdown output, and structured data extraction makes it ideal for modern web scraping needs, especially when building AI-powered applications.

Whether you're scraping a single page or crawling an entire website, Firecrawl's Python SDK offers a straightforward API that handles the complexity of modern web scraping, allowing you to focus on extracting value from the data.

For production use, remember to implement proper error handling, respect rate limits, and monitor your API usage to ensure reliable, sustainable scraping operations.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
