How Do I Extract Structured Data Using Firecrawl?
Firecrawl provides powerful tools for extracting structured data from websites through its Extract API endpoint. Unlike traditional web scraping that requires manual parsing of HTML, Firecrawl uses large language models (LLMs) to intelligently extract data based on schemas you define. This approach makes it easy to convert unstructured web content into clean, structured JSON data.
Understanding Firecrawl's Extract Endpoint
The Extract endpoint (/extract) is specifically designed for structured data extraction. It takes a URL and a schema definition, then returns data that matches your schema. This is particularly useful when you need to extract specific fields from web pages without writing complex parsing logic.
Key Features
- Schema-based extraction: Define the structure you want using JSON schemas
- LLM-powered parsing: Uses AI to understand page content and extract relevant data
- Automatic type conversion: Converts extracted data to the appropriate types (strings, numbers, booleans, arrays)
- Handles dynamic content: Works with JavaScript-rendered pages
- Multiple format support: Extracts from HTML, markdown, and PDF files
Basic Structured Data Extraction
Python Example
Here's how to extract structured data using Firecrawl's Python SDK:
from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key')

# Define the schema for the data you want to extract
schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "The main title of the article"
        },
        "author": {
            "type": "string",
            "description": "The author's name"
        },
        "publishDate": {
            "type": "string",
            "description": "Publication date in YYYY-MM-DD format"
        },
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Article tags or categories"
        }
    },
    "required": ["title", "author"]
}

# Extract data from a URL
result = app.extract_url(
    url='https://example.com/article',
    schema=schema
)

# Access the extracted data
print(result['data'])
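Because you already have the schema in hand, you can also verify programmatically that the returned data conforms to it. The third-party jsonschema package used below is an assumption for illustration, not part of the Firecrawl SDK; a minimal sketch:

from jsonschema import validate, ValidationError  # pip install jsonschema

# Validate the extracted payload against the same schema sent to Firecrawl
try:
    validate(instance=result['data'], schema=schema)
    print("Extracted data matches the schema")
except ValidationError as e:
    print(f"Schema mismatch: {e.message}")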
JavaScript/Node.js Example
The same extraction using Firecrawl's JavaScript SDK:
import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize the client
const app = new FirecrawlApp({ apiKey: 'your_api_key' });

// Define your schema
const schema = {
  type: 'object',
  properties: {
    title: {
      type: 'string',
      description: 'The main title of the article'
    },
    author: {
      type: 'string',
      description: "The author's name"
    },
    publishDate: {
      type: 'string',
      description: 'Publication date in YYYY-MM-DD format'
    },
    tags: {
      type: 'array',
      items: { type: 'string' },
      description: 'Article tags or categories'
    }
  },
  required: ['title', 'author']
};

// Extract data
const result = await app.extractUrl({
  url: 'https://example.com/article',
  schema: schema
});

console.log(result.data);
Advanced Schema Definitions
Nested Objects
You can define complex nested structures for hierarchical data:
schema = {
    "type": "object",
    "properties": {
        "product": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {
                    "type": "object",
                    "properties": {
                        "amount": {"type": "number"},
                        "currency": {"type": "string"}
                    }
                },
                "specifications": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "key": {"type": "string"},
                            "value": {"type": "string"}
                        }
                    }
                }
            }
        }
    }
}

result = app.extract_url(
    url='https://example.com/product',
    schema=schema
)
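Nested fields come back as nested dictionaries, so chaining .get() calls avoids a KeyError when the model cannot find a field on the page. A defensive sketch, assuming the response shape shown above:

# Safely read a deeply nested field; missing keys fall back to None
product = result['data'].get('product', {})
amount = product.get('price', {}).get('amount')
currency = product.get('price', {}).get('currency')
if amount is not None:
    print(f"Price: {amount} {currency}")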
Extracting Lists of Items
To extract multiple items (like search results or product listings):
const schema = {
  type: 'object',
  properties: {
    products: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          price: { type: 'number' },
          rating: { type: 'number' },
          url: { type: 'string' }
        }
      },
      description: 'List of all products on the page'
    }
  }
};

const result = await app.extractUrl({
  url: 'https://example.com/products',
  schema: schema
});

// Access the array of products
result.data.products.forEach(product => {
  console.log(`${product.name}: $${product.price}`);
});
Extracting Data from Multiple Pages
When you need to extract structured data from multiple pages, combine Firecrawl's crawl functionality with extraction:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Schema for extraction
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"}
    }
}

# Crawl with extraction
result = app.crawl_url(
    url='https://example.com/shop',
    params={
        'limit': 10,
        'scrapeOptions': {
            'formats': ['extract'],
            'extract': {
                'schema': product_schema
            }
        }
    }
)

# Process extracted data from all pages
for page in result['data']:
    if 'extract' in page:
        print(f"Product: {page['extract']['name']}")
        print(f"Price: ${page['extract']['price']}")
Using API Directly
If you prefer to use the REST API directly without an SDK:
curl -X POST https://api.firecrawl.dev/v1/extract \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/article",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "author": {"type": "string"}
      }
    }
  }'
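The same request can be made from Python without the SDK using the requests library. This sketch simply mirrors the curl call above; the exact response shape may differ from what the SDK returns:

import requests

# POST the URL and schema to the Extract endpoint directly
response = requests.post(
    'https://api.firecrawl.dev/v1/extract',
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_API_KEY'
    },
    json={
        'url': 'https://example.com/article',
        'schema': {
            'type': 'object',
            'properties': {
                'title': {'type': 'string'},
                'content': {'type': 'string'},
                'author': {'type': 'string'}
            }
        }
    }
)
response.raise_for_status()
print(response.json())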
Best Practices for Structured Extraction
1. Provide Clear Descriptions
Adding descriptions to your schema properties helps the LLM understand what data to extract:
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "description": "The current selling price, not the original or discounted price"
        },
        "availability": {
            "type": "string",
            "description": "Whether the product is in stock, out of stock, or available for pre-order"
        }
    }
}
2. Use Appropriate Data Types
Match your schema types to the expected data format:
- string for text, dates, and URLs
- number for prices, quantities, and ratings
- boolean for yes/no or true/false values
- array for lists of items
- object for nested structures
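A single schema can mix all of these types. The sketch below is purely illustrative and not tied to any specific page:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},       # text
        "price": {"type": "number"},       # numeric value
        "inStock": {"type": "boolean"},    # yes/no
        "tags": {                          # list of strings
            "type": "array",
            "items": {"type": "string"}
        },
        "dimensions": {                    # nested structure
            "type": "object",
            "properties": {
                "width": {"type": "number"},
                "height": {"type": "number"}
            }
        }
    }
}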
3. Specify Required Fields
Mark essential fields as required to ensure they're always extracted:
const schema = {
  type: 'object',
  properties: {
    productId: { type: 'string' },
    name: { type: 'string' },
    price: { type: 'number' },
    description: { type: 'string' }
  },
  required: ['productId', 'name', 'price']
};
4. Handle Dynamic Content
For pages that load content via JavaScript (similar to handling AJAX requests using Puppeteer), Firecrawl automatically waits for content to load. You can also specify wait conditions:
result = app.extract_url(
    url='https://example.com/dynamic-page',
    schema=schema,
    params={
        'waitFor': 2000  # Wait 2 seconds for content to load
    }
)
Error Handling and Validation
Always implement proper error handling when extracting structured data:
try:
    result = app.extract_url(
        url='https://example.com/page',
        schema=schema
    )

    # Validate the extracted data
    if result.get('success'):
        data = result['data']
        # Check if required fields are present
        if 'title' in data and 'content' in data:
            process_data(data)
        else:
            print("Missing required fields in extracted data")
    else:
        print(f"Extraction failed: {result.get('error')}")
except Exception as e:
    print(f"Error during extraction: {str(e)}")
try {
  const result = await app.extractUrl({
    url: 'https://example.com/page',
    schema: schema
  });

  if (result.success && result.data) {
    // Process the extracted data
    console.log('Extracted data:', result.data);
  } else {
    console.error('Extraction failed:', result.error);
  }
} catch (error) {
  console.error('Error during extraction:', error.message);
}
Real-World Use Cases
E-commerce Product Extraction
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sku": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "inStock": {"type": "boolean"},
        "images": {
            "type": "array",
            "items": {"type": "string"}
        },
        "reviews": {
            "type": "object",
            "properties": {
                "averageRating": {"type": "number"},
                "totalCount": {"type": "number"}
            }
        }
    }
}

result = app.extract_url(
    url='https://store.example.com/product/12345',
    schema=product_schema
)
Article Metadata Extraction
const articleSchema = {
  type: 'object',
  properties: {
    headline: { type: 'string' },
    subheadline: { type: 'string' },
    author: { type: 'string' },
    publishedDate: { type: 'string' },
    modifiedDate: { type: 'string' },
    wordCount: { type: 'number' },
    readingTime: { type: 'number' },
    categories: {
      type: 'array',
      items: { type: 'string' }
    }
  }
};

const result = await app.extractUrl({
  url: 'https://blog.example.com/article',
  schema: articleSchema
});
Job Listing Extraction
job_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "remote": {"type": "boolean"},
        "salary": {
            "type": "object",
            "properties": {
                "min": {"type": "number"},
                "max": {"type": "number"},
                "currency": {"type": "string"}
            }
        },
        "requirements": {
            "type": "array",
            "items": {"type": "string"}
        },
        "benefits": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

result = app.extract_url(
    url='https://jobs.example.com/posting/123',
    schema=job_schema
)
Combining with Crawling
For large-scale data extraction across multiple pages, use Firecrawl's crawl endpoint with extraction schemas:
# Crawl an entire website section and extract structured data
# (assumes article_schema is defined as in the earlier article example)
crawl_result = app.crawl_url(
    url='https://example.com/blog',
    params={
        'limit': 100,
        'scrapeOptions': {
            'formats': ['extract'],
            'extract': {
                'schema': article_schema
            }
        }
    }
)

# Save all extracted articles to a database
for page in crawl_result['data']:
    if 'extract' in page:
        save_to_database(page['extract'])
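The save_to_database helper above is a placeholder. A minimal sketch using Python's built-in sqlite3 module could look like the following; the table layout here is an assumption for illustration:

import json
import sqlite3

conn = sqlite3.connect('articles.db')
conn.execute(
    'CREATE TABLE IF NOT EXISTS articles (headline TEXT, author TEXT, raw_json TEXT)'
)

def save_to_database(extract):
    # Store the full payload as JSON alongside a couple of queryable fields
    conn.execute(
        'INSERT INTO articles VALUES (?, ?, ?)',
        (extract.get('headline'), extract.get('author'), json.dumps(extract))
    )
    conn.commit()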
Performance Considerations
- Rate Limiting: Be mindful of API rate limits when extracting data from multiple pages
- Schema Complexity: Simpler schemas generally extract faster and more accurately
- Page Size: Large pages may take longer to process; consider using the onlyMainContent option
- Caching: Implement caching for frequently accessed pages to reduce API calls (see the sketch after this list)
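A simple in-memory cache plus a fixed delay between fresh requests covers the rate-limiting and caching points above. This is an illustrative sketch; the delay value and cache policy are assumptions, not Firecrawl recommendations:

import time

_cache = {}

def cached_extract(app, url, schema, delay_seconds=1.0):
    # Reuse previous results for the same URL to avoid repeated API calls
    if url in _cache:
        return _cache[url]
    result = app.extract_url(url=url, schema=schema)
    _cache[url] = result
    time.sleep(delay_seconds)  # crude rate limiting between fresh requests
    return result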
Troubleshooting Common Issues
Missing Data
If extracted data is incomplete:
- Add more detailed descriptions to your schema properties
- Check if the data exists on the page
- Verify the page has fully loaded (increase the waitFor time if needed)
Incorrect Data Types
If data types don't match:
- Ensure your schema types match the actual data format
- Use string type for mixed content, then convert in your application (see the sketch below)
- Check for special formatting (dates, currencies, etc.)
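For example, a price often arrives as mixed text like "$1,299.99". Extracting it as a string and converting afterwards in your own code is usually more reliable than asking the LLM to produce a number; a sketch:

import re

def parse_price(raw):
    # Strip currency symbols and thousands separators, then convert to float
    match = re.search(r'[\d.,]+', raw)
    if not match:
        return None
    return float(match.group().replace(',', ''))

print(parse_price('$1,299.99'))  # 1299.99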
Extraction Timeouts
For pages that take long to load:
- Increase the timeout parameter
- Use the onlyMainContent option to focus on the main content area
- See handling timeouts in Puppeteer for similar timeout management strategies
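If the SDK accepts a per-request timeout the same way it accepts waitFor (an assumption here; check the current Firecrawl docs for the exact parameter name), the call might look like:

result = app.extract_url(
    url='https://example.com/slow-page',
    schema=schema,
    params={
        'timeout': 60000,         # allow up to 60 seconds (assumed, in ms)
        'onlyMainContent': True   # skip navigation, footers, and sidebars
    }
)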
Conclusion
Firecrawl's structured data extraction provides a powerful, AI-driven approach to web scraping that eliminates the need for complex HTML parsing. By defining clear schemas and leveraging LLM capabilities, you can extract clean, structured data from any website with minimal code. Whether you're building a product aggregator, content management system, or data analysis pipeline, Firecrawl's Extract API simplifies the process of converting unstructured web content into usable data.
Remember to always respect website terms of service and robots.txt files when scraping, and implement proper error handling and rate limiting in your production applications.