How Do I Use Firecrawl with Python?
Firecrawl is a powerful web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. It handles the complexity of modern web pages, including JavaScript rendering, dynamic content, and browser sessions, making it an excellent choice for developers who need reliable web data extraction.
In this guide, you'll learn how to use Firecrawl with Python, from basic setup to advanced scraping techniques.
What is Firecrawl?
Firecrawl is a managed web scraping service that provides:
- JavaScript rendering - Executes JavaScript to capture dynamically loaded content
- Clean markdown output - Converts HTML to clean, structured markdown
- Smart crawling - Automatically discovers and crawls related pages
- LLM-ready data - Outputs data optimized for AI/ML applications
- Managed infrastructure - No need to manage proxies, browsers, or anti-bot systems
Installing the Firecrawl Python SDK
The easiest way to use Firecrawl with Python is through the official Python SDK. Install it using pip:
```bash
pip install firecrawl-py
```
For projects using Poetry:
```bash
poetry add firecrawl-py
```
For Pipenv:
```bash
pipenv install firecrawl-py
```
Getting Your API Key
Before you can use Firecrawl, you need an API key:
- Sign up at firecrawl.dev
- Navigate to your dashboard
- Copy your API key from the API Keys section
Store your API key securely using environment variables:
```bash
export FIRECRAWL_API_KEY='your_api_key_here'
```
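If you prefer keeping the key in a local `.env` file rather than your shell profile, here is a minimal sketch using the python-dotenv package (a separate dependency, not part of the Firecrawl SDK):

```python
# pip install python-dotenv  (separate package, not part of firecrawl-py)
import os
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

load_dotenv()  # reads FIRECRAWL_API_KEY from a local .env file into the environment

api_key = os.getenv('FIRECRAWL_API_KEY')
if not api_key:
    raise RuntimeError('FIRECRAWL_API_KEY is not set')

app = FirecrawlApp(api_key=api_key)
```

Either way, the key never appears in your source code or version control.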
Basic Usage: Scraping a Single Page
Here's a simple example of scraping a single web page with Firecrawl:
```python
from firecrawl import FirecrawlApp
import os

# Initialize the Firecrawl client
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape a single page
url = 'https://example.com'
scraped_data = app.scrape_url(url)

# Access the content
print(scraped_data['markdown'])  # Clean markdown content
print(scraped_data['html'])      # Original HTML
print(scraped_data['metadata'])  # Page metadata
```
Scraping with Options
You can customize the scraping behavior with various options:
```python
from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape with custom options
scraped_data = app.scrape_url(
    url='https://example.com',
    params={
        'formats': ['markdown', 'html', 'screenshot'],
        'onlyMainContent': True,  # Extract only main content
        'waitFor': 2000,          # Wait 2 seconds for JavaScript
        'includeTags': ['article', 'main'],
        'excludeTags': ['nav', 'footer'],
    }
)

print(scraped_data['markdown'])
```
Crawling Multiple Pages
Firecrawl can automatically discover and crawl multiple pages from a website, following internal links in much the same way it follows page redirections, just at a larger scale:
```python
from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Start a crawl job
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 100,  # Maximum pages to crawl
        'scrapeOptions': {
            'formats': ['markdown'],
            'onlyMainContent': True
        }
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process all crawled pages
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")
```
Crawling with URL Patterns
You can control which pages to crawl using include and exclude patterns:
```python
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 50,
        'includePaths': ['/blog/*', '/articles/*'],
        'excludePaths': ['/admin/*', '/login'],
        'maxDepth': 3,  # Maximum crawl depth
        'scrapeOptions': {
            'formats': ['markdown'],
            'waitFor': 1000
        }
    }
)
```
Extracting Structured Data with an LLM
One of Firecrawl's most powerful features is structured data extraction using AI:
```python
from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Define the schema for extracted data
schema = {
    'type': 'object',
    'properties': {
        'title': {'type': 'string'},
        'price': {'type': 'number'},
        'description': {'type': 'string'},
        'features': {
            'type': 'array',
            'items': {'type': 'string'}
        },
        'inStock': {'type': 'boolean'}
    },
    'required': ['title', 'price']
}

# Extract structured data
result = app.scrape_url(
    url='https://example.com/product',
    params={
        'formats': ['extract'],
        'extract': {
            'schema': schema,
            'systemPrompt': 'Extract product information from the page',
            'prompt': 'Extract the product title, price, description, features, and stock status'
        }
    }
)

# Access structured data
product_data = result['extract']
print(f"Product: {product_data['title']}")
print(f"Price: ${product_data['price']}")
print(f"In Stock: {product_data['inStock']}")
```
Handling Authentication
For scraping pages that require authentication, you can pass cookies or headers:
```python
from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape with authentication
scraped_data = app.scrape_url(
    url='https://example.com/protected',
    params={
        'headers': {
            'Authorization': 'Bearer your_token_here',
            'Cookie': 'session_id=your_session_id'
        }
    }
)
```
This is particularly useful when you need to handle authentication for protected resources.
Batch Processing with Async Operations
For high-performance scraping, you can use Python's async capabilities:
```python
import asyncio
import os
from firecrawl import FirecrawlApp

async def scrape_multiple_urls(urls):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

    async def scrape_url(url):
        # Note: The SDK doesn't natively support async,
        # but you can use asyncio.to_thread for I/O operations
        return await asyncio.to_thread(
            app.scrape_url,
            url,
            params={'formats': ['markdown']}
        )

    # Scrape all URLs concurrently
    tasks = [scrape_url(url) for url in urls]
    results = await asyncio.gather(*tasks)
    return results

# Usage
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
]

results = asyncio.run(scrape_multiple_urls(urls))
for result in results:
    print(result['markdown'][:100])
```
Error Handling and Retries
Implement robust error handling for production applications:
```python
from firecrawl import FirecrawlApp
import os
import time

def scrape_with_retry(url, max_retries=3, retry_delay=5):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'formats': ['markdown'],
                    'timeout': 30000  # 30 second timeout
                }
            )
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(retry_delay)
            else:
                raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    print(data['markdown'])
except Exception as e:
    print(f"Failed to scrape after all retries: {e}")
```
Monitoring Crawl Progress
For long-running crawl jobs, you can monitor progress:
```python
from firecrawl import FirecrawlApp
import os
import time

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Start crawl without polling
crawl_id = app.crawl_url(
    url='https://example.com',
    params={'limit': 100},
    poll_interval=None  # Don't auto-poll
)['id']

# Manually check status
while True:
    status = app.check_crawl_status(crawl_id)
    print(f"Status: {status['status']}")
    print(f"Completed: {status['completed']}/{status['total']}")

    if status['status'] == 'completed':
        # Retrieve all data
        for page in status['data']:
            print(f"Scraped: {page['metadata']['sourceURL']}")
        break

    time.sleep(5)
```
Saving Results to Files
Save scraped data to various formats:
```python
import json
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape and save as markdown
result = app.scrape_url('https://example.com')

# Save markdown
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(result['markdown'])

# Save as JSON
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2)

# Save screenshot (if requested)
if 'screenshot' in result:
    import base64
    screenshot_data = base64.b64decode(result['screenshot'])
    with open('screenshot.png', 'wb') as f:
        f.write(screenshot_data)
```
Best Practices
- Use Environment Variables - Never hardcode API keys in your source code
- Implement Rate Limiting - Respect API rate limits to avoid throttling (see the sketch after this list)
- Handle Errors Gracefully - Always implement try/except blocks and retries
- Cache Results - Store scraped data to avoid redundant API calls (also covered in the sketch below)
- Use Specific Selectors - When possible, use `includeTags` and `excludeTags` to reduce processing time
- Monitor Usage - Track your API usage to stay within plan limits
- Test with Small Batches - Test your scraping logic on a few URLs before scaling up
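To make the rate-limiting and caching advice concrete, here is a minimal sketch. The on-disk cache directory, the fixed delay, and the `cached_scrape` helper are illustrative choices, not part of the Firecrawl SDK, and it assumes `scrape_url` returns a JSON-serializable dict as in the examples above:

```python
import hashlib
import json
import os
import time
from pathlib import Path
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Hypothetical on-disk cache directory (not part of the SDK)
CACHE_DIR = Path('scrape_cache')
CACHE_DIR.mkdir(exist_ok=True)

def cached_scrape(url, delay_seconds=1.0):
    """Scrape a URL, reusing a cached result when available and
    sleeping between live requests as naive rate limiting."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + '.json')

    # Serve from cache to avoid redundant API calls
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding='utf-8'))

    # Naive rate limiting: wait a fixed delay before each live request
    time.sleep(delay_seconds)
    result = app.scrape_url(url, params={'formats': ['markdown']})

    # Assumes the result is a JSON-serializable dict, as in the examples above
    cache_file.write_text(json.dumps(result), encoding='utf-8')
    return result

# Usage: the second call is served from the local cache, not the API
first = cached_scrape('https://example.com')
second = cached_scrape('https://example.com')
```

In production you would likely add cache expiry and a token-bucket limiter rather than a fixed sleep, but the overall structure stays the same.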
Comparison with Other Tools
Firecrawl offers several advantages over traditional scraping libraries:
- No Infrastructure Management - Unlike self-hosted solutions with Puppeteer or Selenium
- Built-in JavaScript Rendering - No need to manage headless browsers
- LLM-Optimized Output - Perfect for AI/ML applications
- Automatic Retries - Built-in resilience and error handling
- Scalable - Handles high-volume scraping without managing proxies
Conclusion
Firecrawl provides a powerful, managed solution for web scraping with Python. Its combination of JavaScript rendering, clean markdown output, and structured data extraction makes it ideal for modern web scraping needs, especially when building AI-powered applications.
Whether you're scraping a single page or crawling an entire website, Firecrawl's Python SDK offers a straightforward API that handles the complexity of modern web scraping, allowing you to focus on extracting value from the data.
For production use, remember to implement proper error handling, respect rate limits, and monitor your API usage to ensure reliable, sustainable scraping operations.