Firecrawl Tutorial for Beginners
Firecrawl is a powerful web scraping and crawling API that simplifies the process of extracting data from websites. Unlike traditional scraping tools that require complex setup and maintenance, Firecrawl handles JavaScript rendering, anti-bot measures, and infrastructure management automatically. This tutorial will guide you through everything you need to know to start using Firecrawl effectively.
What is Firecrawl?
Firecrawl is a managed web scraping service that converts websites into clean, structured data formats. It's designed specifically for developers who need reliable data extraction without the hassle of managing proxies, headless browsers, or constantly updating selectors.
Key Features
- JavaScript Rendering - Automatically executes JavaScript to capture dynamic content
- Clean Markdown Output - Converts HTML into LLM-ready markdown format
- Intelligent Crawling - Discovers and navigates through website pages automatically
- Structured Data Extraction - Uses AI to extract specific fields based on your schema
- Built-in Anti-Bot Bypass - Handles CAPTCHAs and bot detection mechanisms
- No Infrastructure Required - Fully managed service, no servers to maintain
Getting Started with Firecrawl
Step 1: Create an Account
- Visit firecrawl.dev and sign up for an account
- Navigate to your dashboard
- Generate and copy your API key from the API Keys section
- Start with the free tier to test the service (typically includes 500 credits)
Step 2: Choose Your Language
Firecrawl provides official SDKs for both Python and Node.js. This tutorial covers both languages so you can choose based on your preference.
Installing Firecrawl
Python Installation
# Using pip
pip install firecrawl-py
# Using Poetry
poetry add firecrawl-py
# Using Pipenv
pipenv install firecrawl-py
Node.js Installation
# Using npm
npm install @mendable/firecrawl-js
# Using yarn
yarn add @mendable/firecrawl-js
Your First Scraping Project
Setting Up Your API Key
Always store your API key securely using environment variables:
Python (shell export):
export FIRECRAWL_API_KEY='your_api_key_here'
Node.js (.env file):
FIRECRAWL_API_KEY=your_api_key_here
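If you prefer to keep the key in a .env file on the Python side as well, the python-dotenv package can load it before you create the client. This is an optional sketch, not part of the Firecrawl SDK; it assumes python-dotenv is installed (pip install python-dotenv) and that a .env file sits in your project root:
Python:
# a minimal sketch using python-dotenv (an optional extra, not required by Firecrawl)
from dotenv import load_dotenv
import os

load_dotenv()  # reads .env and populates environment variables

api_key = os.getenv('FIRECRAWL_API_KEY')
if not api_key:
    raise RuntimeError('FIRECRAWL_API_KEY is not set')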
Basic Scraping Example
Let's start with the simplest use case: scraping a single web page.
Python:
from firecrawl import FirecrawlApp
import os
# Initialize the client
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
# Scrape a single page
result = app.scrape_url('https://example.com')
# Access the content
print("Markdown Content:")
print(result['markdown'])
print("\nMetadata:")
print(result['metadata'])
JavaScript:
import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';
dotenv.config();
// Initialize the client
const app = new FirecrawlApp({
apiKey: process.env.FIRECRAWL_API_KEY
});
// Scrape a single page
async function scrapePage() {
const result = await app.scrapeUrl('https://example.com', {
formats: ['markdown']
});
console.log("Markdown Content:");
console.log(result.markdown);
console.log("\nMetadata:");
console.log(result.metadata);
}
scrapePage();
Understanding the Response
The scraping response includes several useful fields:
- markdown - Clean, formatted text content
- html - Original HTML (if requested)
- metadata - Page title, description, language, and more
- links - All links found on the page (if requested)
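As a quick illustration of these fields, the sketch below reuses the app client from the basic example, requests the optional html and links formats alongside markdown, and prints a few values. It assumes the response is a dictionary shaped as described above; example.com is a placeholder URL:
Python:
# a minimal sketch: request optional formats and inspect the fields listed above
result = app.scrape_url(
    'https://example.com',
    params={'formats': ['markdown', 'html', 'links']}
)

print(result['metadata'].get('title'))  # page title from the metadata field
print(result['markdown'][:200])         # preview of the markdown content
print(result.get('links', [])[:5])      # first few links, if the format was requested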
Customizing Scraping Options
Firecrawl offers numerous options to customize how pages are scraped, which is particularly useful for complex layouts and dynamically loaded content.
Focusing on Main Content
Extract only the main article content, excluding navigation, ads, and footers:
Python:
result = app.scrape_url(
'https://example.com/article',
params={
'formats': ['markdown'],
'onlyMainContent': True,
'includeTags': ['article', 'main', '.content'],
'excludeTags': ['nav', 'footer', '.ads', '.sidebar']
}
)
JavaScript:
const result = await app.scrapeUrl('https://example.com/article', {
formats: ['markdown'],
onlyMainContent: true,
includeTags: ['article', 'main', '.content'],
excludeTags: ['nav', 'footer', '.ads', '.sidebar']
});
Waiting for JavaScript
Some websites load content dynamically. Use the waitFor parameter to ensure all content is loaded:
Python:
result = app.scrape_url(
'https://example.com/dynamic-page',
params={
'formats': ['markdown'],
'waitFor': 3000 # Wait 3 seconds for JavaScript to execute
}
)
JavaScript:
const result = await app.scrapeUrl('https://example.com/dynamic-page', {
formats: ['markdown'],
waitFor: 3000 // Wait 3 seconds
});
This approach is similar to using Puppeteer's wait helpers (such as waitForSelector), but the browser is managed entirely by Firecrawl.
Crawling Multiple Pages
One of Firecrawl's most powerful features is its ability to automatically discover and scrape multiple pages from a website.
Basic Crawling
Python:
from firecrawl import FirecrawlApp
import os
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
# Crawl up to 50 pages
crawl_result = app.crawl_url(
'https://example.com',
params={
'limit': 50,
'scrapeOptions': {
'formats': ['markdown'],
'onlyMainContent': True
}
},
poll_interval=5 # Check status every 5 seconds
)
# Process all pages
for page in crawl_result['data']:
    url = page['metadata']['sourceURL']
    content = page['markdown']
    print(f"Scraped: {url}")
    print(f"Content preview: {content[:100]}...\n")
JavaScript:
async function crawlWebsite() {
const crawlResult = await app.crawlUrl('https://example.com', {
limit: 50,
scrapeOptions: {
formats: ['markdown'],
onlyMainContent: true
}
});
console.log(`Successfully crawled ${crawlResult.data.length} pages`);
crawlResult.data.forEach(page => {
console.log(`URL: ${page.metadata.sourceURL}`);
console.log(`Content: ${page.markdown.substring(0, 100)}...\n`);
});
}
crawlWebsite();
Advanced Crawling with Filters
Control which pages to crawl using path patterns:
Python:
crawl_result = app.crawl_url(
'https://example.com',
params={
'limit': 100,
'maxDepth': 3, # Maximum depth from starting URL
'includePaths': ['/blog/*', '/articles/*', '/docs/*'],
'excludePaths': ['/admin/*', '/login', '/signup'],
'allowBackwardLinks': False, # Don't crawl parent directories
'allowExternalLinks': False, # Stay on the same domain
'scrapeOptions': {
'formats': ['markdown'],
'waitFor': 1000
}
}
)
JavaScript:
const crawlResult = await app.crawlUrl('https://example.com', {
limit: 100,
maxDepth: 3,
includePaths: ['/blog/*', '/articles/*', '/docs/*'],
excludePaths: ['/admin/*', '/login', '/signup'],
allowBackwardLinks: false,
allowExternalLinks: false,
scrapeOptions: {
formats: ['markdown'],
waitFor: 1000
}
});
Extracting Structured Data
Firecrawl can extract specific data fields using AI-powered extraction. This is incredibly useful for scraping product information, job listings, or any structured content.
Defining a Schema
Python:
from firecrawl import FirecrawlApp
import os
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
# Define the data structure you want to extract
schema = {
'type': 'object',
'properties': {
'productName': {'type': 'string'},
'price': {'type': 'number'},
'currency': {'type': 'string'},
'description': {'type': 'string'},
'features': {
'type': 'array',
'items': {'type': 'string'}
},
'inStock': {'type': 'boolean'},
'rating': {'type': 'number'}
},
'required': ['productName', 'price']
}
# Extract structured data
result = app.scrape_url(
'https://example.com/product/laptop',
params={
'formats': ['extract'],
'extract': {
'schema': schema,
'systemPrompt': 'Extract product information from this e-commerce page',
'prompt': 'Extract all product details including name, price, features, and availability'
}
}
)
# Access the extracted data
product = result['extract']
print(f"Product: {product['productName']}")
print(f"Price: {product['price']} {product['currency']}")
print(f"In Stock: {product['inStock']}")
print(f"Features: {product['features']}")
JavaScript:
const schema = {
type: 'object',
properties: {
productName: { type: 'string' },
price: { type: 'number' },
currency: { type: 'string' },
description: { type: 'string' },
features: {
type: 'array',
items: { type: 'string' }
},
inStock: { type: 'boolean' },
rating: { type: 'number' }
},
required: ['productName', 'price']
};
const result = await app.scrapeUrl('https://example.com/product/laptop', {
formats: ['extract'],
extract: {
schema: schema,
systemPrompt: 'Extract product information from this e-commerce page',
prompt: 'Extract all product details including name, price, features, and availability'
}
});
const product = result.extract;
console.log(`Product: ${product.productName}`);
console.log(`Price: ${product.price} ${product.currency}`);
console.log(`In Stock: ${product.inStock}`);
Batch Extraction
Extract data from multiple pages efficiently:
Python:
def scrape_multiple_products(urls):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
    schema = {
        'type': 'object',
        'properties': {
            'productName': {'type': 'string'},
            'price': {'type': 'number'}
        }
    }
    products = []
    # Requests run sequentially here; see the JavaScript version below
    # for a concurrent approach using Promise.all
    for url in urls:
        result = app.scrape_url(
            url,
            params={
                'formats': ['extract'],
                'extract': {'schema': schema}
            }
        )
        products.append(result['extract'])
    return products
# Usage
urls = [
    'https://example.com/product1',
    'https://example.com/product2',
    'https://example.com/product3'
]
products = scrape_multiple_products(urls)
for product in products:
    print(f"{product['productName']}: ${product['price']}")
JavaScript:
async function scrapeMultipleProducts(urls) {
const schema = {
type: 'object',
properties: {
productName: { type: 'string' },
price: { type: 'number' }
}
};
const promises = urls.map(url =>
app.scrapeUrl(url, {
formats: ['extract'],
extract: { schema }
})
);
const results = await Promise.all(promises);
return results.map(r => r.extract);
}
// Usage
const urls = [
'https://example.com/product1',
'https://example.com/product2',
'https://example.com/product3'
];
const products = await scrapeMultipleProducts(urls);
products.forEach(product => {
console.log(`${product.productName}: $${product.price}`);
});
Handling Authentication
For pages that require login or authentication, you can pass custom headers and cookies:
Python:
result = app.scrape_url(
'https://example.com/protected',
params={
'formats': ['markdown'],
'headers': {
'Authorization': 'Bearer your_token_here',
'Cookie': 'session_id=abc123; user_pref=xyz'
}
}
)
JavaScript:
const result = await app.scrapeUrl('https://example.com/protected', {
formats: ['markdown'],
headers: {
'Authorization': 'Bearer your_token_here',
'Cookie': 'session_id=abc123; user_pref=xyz'
}
});
This technique is similar to how you handle authentication in Puppeteer, but Firecrawl manages the browser session automatically.
Working with Screenshots
Firecrawl can capture screenshots of pages, which is useful for visual verification or archiving:
Python:
import base64
result = app.scrape_url(
'https://example.com',
params={
'formats': ['markdown', 'screenshot']
}
)
# Save the screenshot
if 'screenshot' in result:
    screenshot_data = base64.b64decode(result['screenshot'])
    with open('page_screenshot.png', 'wb') as f:
        f.write(screenshot_data)
    print("Screenshot saved!")
JavaScript:
import fs from 'fs';
const result = await app.scrapeUrl('https://example.com', {
formats: ['markdown', 'screenshot']
});
// Save the screenshot
if (result.screenshot) {
const buffer = Buffer.from(result.screenshot, 'base64');
fs.writeFileSync('page_screenshot.png', buffer);
console.log("Screenshot saved!");
}
Error Handling and Best Practices
Implementing Retry Logic
Python:
import time
def scrape_with_retry(url, max_retries=3):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'formats': ['markdown'],
                    'timeout': 30000  # 30 seconds
                }
            )
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
# Usage
try:
    data = scrape_with_retry('https://example.com')
    print(data['markdown'])
except Exception as e:
    print(f"Failed after all retries: {e}")
JavaScript:
async function scrapeWithRetry(url, maxRetries = 3) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const result = await app.scrapeUrl(url, {
formats: ['markdown'],
timeout: 30000
});
return result;
} catch (error) {
console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
if (attempt < maxRetries - 1) {
await new Promise(resolve =>
setTimeout(resolve, Math.pow(2, attempt) * 1000)
);
} else {
throw error;
}
}
}
}
// Usage
try {
const data = await scrapeWithRetry('https://example.com');
console.log(data.markdown);
} catch (error) {
console.log(`Failed after all retries: ${error.message}`);
}
Best Practices for Beginners
- Start Small - Test with a single page before crawling entire websites
- Use Environment Variables - Never hardcode API keys in your source code
- Implement Rate Limiting - Respect API limits and avoid throttling (a simple sketch follows this list)
- Handle Errors Gracefully - Always use try-catch blocks and implement retries
- Cache Results - Store scraped data to avoid redundant API calls
- Monitor Usage - Track your credit consumption through the dashboard
- Test Selectors - Use includeTags and excludeTags to improve accuracy
- Set Appropriate Timeouts - Adjust based on page complexity
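For the rate-limiting practice above, a simple client-side delay between requests is often enough. The sketch below is one possible approach, not an official Firecrawl utility; the six-second delay is an assumption you should tune to your plan's limits:
Python:
# a minimal sketch of client-side rate limiting (the delay value is an assumption)
import time

def scrape_politely(app, urls, delay_seconds=6):
    results = []
    for url in urls:
        results.append(app.scrape_url(url, params={'formats': ['markdown']}))
        time.sleep(delay_seconds)  # fixed pause between requests
    return results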
Saving and Exporting Data
Saving to JSON
Python:
import json
result = app.scrape_url('https://example.com')
# Save as JSON
with open('scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
print("Data saved to scraped_data.json")
JavaScript:
import fs from 'fs';
const result = await app.scrapeUrl('https://example.com');
// Save as JSON
fs.writeFileSync(
'scraped_data.json',
JSON.stringify(result, null, 2)
);
console.log("Data saved to scraped_data.json");
Saving to CSV
Python:
import csv
# Assuming you've extracted structured data
products = [
{'name': 'Product 1', 'price': 29.99},
{'name': 'Product 2', 'price': 39.99}
]
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(products)
JavaScript:
import fs from 'fs';
import { stringify } from 'csv-stringify/sync';
const products = [
{ name: 'Product 1', price: 29.99 },
{ name: 'Product 2', price: 39.99 }
];
const csv = stringify(products, { header: true });
fs.writeFileSync('products.csv', csv);
Common Use Cases
1. Blog Content Extraction
Perfect for content aggregation or analysis:
result = app.scrape_url(
'https://example.com/blog/article',
params={
'formats': ['markdown'],
'onlyMainContent': True,
'includeTags': ['article', '.post-content']
}
)
2. E-commerce Price Monitoring
Extract product prices for comparison:
schema = {
'type': 'object',
'properties': {
'productName': {'type': 'string'},
'currentPrice': {'type': 'number'},
'originalPrice': {'type': 'number'},
'inStock': {'type': 'boolean'}
}
}
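As a sketch of how this schema might be applied (the product URL and the discount check are illustrative additions, not part of the schema), you can pass it to the extract format exactly as in the structured-extraction section:
Python:
# a minimal sketch: run the price-monitoring schema against a single product page
result = app.scrape_url(
    'https://example.com/product/laptop',
    params={'formats': ['extract'], 'extract': {'schema': schema}}
)

item = result['extract']
current = item.get('currentPrice')
original = item.get('originalPrice')
if current is not None and original is not None and current < original:
    print(f"{item.get('productName')} is on sale: {current} (was {original})")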
3. Job Listing Aggregation
Collect job postings from multiple sites:
schema = {
'type': 'object',
'properties': {
'jobTitle': {'type': 'string'},
'company': {'type': 'string'},
'location': {'type': 'string'},
'salary': {'type': 'string'},
'description': {'type': 'string'}
}
}
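To aggregate listings across a careers section, this schema can be combined with the crawl feature shown earlier. The sketch below assumes the /jobs/* path pattern exists on the target site and that the extract format is accepted inside crawl's scrapeOptions, as the scrape examples above suggest:
Python:
# a minimal sketch: crawl a careers section and extract each listing
# (the path pattern and page limit are assumptions)
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 20,
        'includePaths': ['/jobs/*'],
        'scrapeOptions': {
            'formats': ['extract'],
            'extract': {'schema': schema}
        }
    }
)

for page in crawl_result['data']:
    job = page.get('extract', {})
    print(job.get('jobTitle'), '-', job.get('company'), '-', job.get('location'))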
Next Steps
Now that you understand the basics of Firecrawl, you can:
- Learn how to use Firecrawl with Python for more Python-specific examples
- Learn how to use Firecrawl with Node.js for advanced JavaScript patterns
- Explore handling JavaScript-rendered websites for complex scraping scenarios
- Review the official Firecrawl documentation for advanced features
- Join the Firecrawl community for support and best practices
Conclusion
Firecrawl simplifies web scraping by handling the complex infrastructure, browser automation, and anti-bot measures automatically. This tutorial covered the essential concepts: basic scraping, crawling multiple pages, extracting structured data, and implementing best practices.
As a beginner, start with simple single-page scraping, gradually move to crawling, and then experiment with structured data extraction. With Firecrawl's managed approach, you can focus on extracting value from data rather than maintaining scraping infrastructure.
Remember to always respect websites' terms of service, implement proper error handling, and monitor your API usage to build reliable, sustainable scraping solutions.