# How do I Extract Metadata from Websites Using Firecrawl?
Metadata extraction is a crucial aspect of web scraping, providing essential information about web pages such as titles, descriptions, social media tags, and structured data. Firecrawl simplifies this process by automatically extracting metadata alongside your web content, making it an excellent choice for developers who need comprehensive page information without complex parsing logic.
## Understanding Metadata Extraction with Firecrawl
Firecrawl is a web scraping API that converts websites into clean, structured data. One of its key features is automatic metadata extraction, which captures important page information including:
- **Standard Meta Tags**: Title, description, keywords, author
- **Open Graph Tags**: Social media sharing metadata (`og:title`, `og:description`, `og:image`)
- **Twitter Card Tags**: Twitter-specific metadata (`twitter:card`, `twitter:title`, `twitter:image`)
- **Canonical URLs**: The preferred version of a web page
- **Language Information**: Page language and locale settings
- **Favicon URLs**: Website icons and branding elements
Unlike traditional web scraping approaches where you need to manually parse HTML and extract specific meta tags, Firecrawl handles this automatically and returns structured metadata in a clean JSON format.
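To see what Firecrawl is saving you from, here is a minimal sketch of the manual approach using only Python's standard library. It handles just `<title>` and `<meta>` tags on a sample document; real pages would also need fetching, encoding detection, and far more defensive parsing:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects <meta> name/property -> content pairs and the <title> text."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.meta[key] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

sample_html = """<html><head>
<title>Example Domain</title>
<meta name="description" content="An example page">
<meta property="og:image" content="https://example.com/og.png">
</head><body></body></html>"""

parser = MetaTagParser()
parser.feed(sample_html)
print(parser.title)                # Example Domain
print(parser.meta["description"])  # An example page
```

Every tag variant (Open Graph, Twitter Cards, JSON-LD) adds more cases to code like this, which is exactly the parsing logic Firecrawl replaces.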
## Basic Metadata Extraction with Firecrawl

### Python Implementation
Here's how to extract metadata from a website using Firecrawl's Python SDK:
```python
from firecrawl import FirecrawlApp

# Initialize Firecrawl with your API key
app = FirecrawlApp(api_key='your_api_key_here')

# Scrape a URL and extract metadata
result = app.scrape_url('https://example.com')

# Access metadata
metadata = result.get('metadata', {})
print(f"Title: {metadata.get('title')}")
print(f"Description: {metadata.get('description')}")
print(f"Language: {metadata.get('language')}")
print(f"Source URL: {metadata.get('sourceURL')}")  # the URL that was scraped

# Access Open Graph metadata
og_title = metadata.get('ogTitle')
og_description = metadata.get('ogDescription')
og_image = metadata.get('ogImage')

print("\nOpen Graph Data:")
print(f"OG Title: {og_title}")
print(f"OG Description: {og_description}")
print(f"OG Image: {og_image}")
```
### JavaScript/Node.js Implementation
For Node.js applications, you can use Firecrawl's JavaScript SDK:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize Firecrawl
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function extractMetadata(url) {
  try {
    // Scrape the URL
    const result = await app.scrapeUrl(url);

    // Access metadata
    const metadata = result.metadata || {};
    console.log('Title:', metadata.title);
    console.log('Description:', metadata.description);
    console.log('Language:', metadata.language);
    console.log('Keywords:', metadata.keywords);

    // Access social media metadata
    console.log('\nSocial Media Metadata:');
    console.log('OG Title:', metadata.ogTitle);
    console.log('OG Description:', metadata.ogDescription);
    console.log('OG Image:', metadata.ogImage);
    console.log('Twitter Card:', metadata.twitterCard);
    console.log('Twitter Title:', metadata.twitterTitle);

    return metadata;
  } catch (error) {
    console.error('Error extracting metadata:', error);
    throw error;
  }
}

// Usage
extractMetadata('https://example.com')
  .then(metadata => console.log('Extraction complete:', metadata))
  .catch(error => console.error('Extraction failed:', error));
```
## Advanced Metadata Extraction Techniques

### Extracting Metadata from Multiple Pages
When you need to extract metadata from multiple pages, you can use Firecrawl's crawling functionality:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Crawl multiple pages and extract metadata
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 10,  # maximum number of pages to crawl
        'scrapeOptions': {
            'formats': ['markdown', 'html']
        }
    }
)

# Process metadata from each crawled page
for page in crawl_result.get('data', []):
    metadata = page.get('metadata', {})
    print(f"\nURL: {metadata.get('sourceURL')}")
    print(f"Title: {metadata.get('title')}")
    print(f"Description: {metadata.get('description')}")
```
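When a crawl returns many pages, it is often handy to persist the per-page metadata. A small helper (assuming the `data` list shape shown above) can serialize it to JSON Lines for later analysis:

```python
import json

def metadata_to_jsonl(pages):
    """Serialize the metadata of each crawled page as one JSON object per line.

    `pages` is the list found under crawl_result['data'].
    """
    lines = []
    for page in pages:
        meta = page.get('metadata', {})
        lines.append(json.dumps({
            'url': meta.get('sourceURL'),
            'title': meta.get('title'),
            'description': meta.get('description'),
        }, ensure_ascii=False))
    return '\n'.join(lines)

# Works on any list shaped like crawl_result['data']
sample_pages = [{'metadata': {'sourceURL': 'https://example.com', 'title': 'Example'}}]
print(metadata_to_jsonl(sample_pages))
```

One JSON object per line keeps the output appendable and streamable, which matters once crawls grow beyond a few hundred pages.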
### Filtering and Processing Metadata
You can process and filter extracted metadata to focus on specific information:
```javascript
async function extractAndFilterMetadata(url, requiredFields) {
  const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
  const result = await app.scrapeUrl(url);
  const metadata = result.metadata || {};

  // Keep only the requested fields
  const filteredMetadata = {};
  requiredFields.forEach(field => {
    if (metadata[field]) {
      filteredMetadata[field] = metadata[field];
    }
  });

  // Warn about any required fields that were missing
  const missingFields = requiredFields.filter(
    field => !filteredMetadata[field]
  );
  if (missingFields.length > 0) {
    console.warn('Missing metadata fields:', missingFields);
  }

  return filteredMetadata;
}

// Usage: extract only specific metadata fields
const requiredFields = ['title', 'description', 'ogImage', 'language'];
extractAndFilterMetadata('https://example.com', requiredFields)
  .then(metadata => console.log('Filtered metadata:', metadata));
```
## Working with Structured Data
Firecrawl can also extract structured data from web pages, which is particularly useful for SEO analysis and content understanding:
```python
from firecrawl import FirecrawlApp
import json

app = FirecrawlApp(api_key='your_api_key_here')

# Scrape with all available formats
result = app.scrape_url(
    'https://example.com',
    params={
        'formats': ['markdown', 'html', 'rawHtml']
    }
)

metadata = result.get('metadata', {})

# Extract structured data if the page (and your Firecrawl version) exposes it
structured_data = metadata.get('structuredData', {})
if structured_data:
    print("Structured Data Found:")
    print(json.dumps(structured_data, indent=2))

    # Process specific schema types
    if structured_data.get('@type') == 'Article':
        print(f"\nArticle Title: {structured_data.get('headline')}")
        # In JSON-LD, 'author' may be an object, a string, or a list
        author = structured_data.get('author') or {}
        print(f"Author: {author.get('name') if isinstance(author, dict) else author}")
        print(f"Published: {structured_data.get('datePublished')}")
```
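If your Firecrawl plan or version does not surface structured data directly, a fallback is to request `rawHtml` and pull the JSON-LD blocks out yourself. The regex below is a sketch that works for well-formed pages, not a general HTML parser:

```python
import json
import re

# JSON-LD lives in <script type="application/ld+json"> tags
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_json_ld(raw_html):
    """Return every parseable JSON-LD object found in the page."""
    blocks = []
    for match in JSON_LD_RE.findall(raw_html):
        try:
            blocks.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
    return blocks

sample = '<script type="application/ld+json">{"@type": "Article", "headline": "Hi"}</script>'
print(extract_json_ld(sample))
```

In practice you would pass `result.get('rawHtml', '')` from the scrape above as `raw_html`.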
## Using the REST API Directly
If you prefer to use HTTP requests directly without an SDK, here's how to extract metadata using cURL:
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer your_api_key_here' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html"]
  }'
```
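The same call can be made from Python's standard library without any SDK. The endpoint path and payload shape used here are assumptions to verify against the current Firecrawl API docs:

```python
import json
import urllib.request

API_URL = 'https://api.firecrawl.dev/v1/scrape'  # assumed endpoint; check current docs

def build_scrape_request(url, api_key, formats=('markdown', 'html')):
    """Build the POST request for Firecrawl's scrape endpoint."""
    payload = json.dumps({'url': url, 'formats': list(formats)}).encode('utf-8')
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {api_key}',
        },
        method='POST',
    )

req = build_scrape_request('https://example.com', 'your_api_key_here')
# response = urllib.request.urlopen(req)              # actual network call
# metadata = json.load(response)['data']['metadata']  # expected response shape
print(req.full_url)
```

Keeping request construction in a small pure function makes it easy to test the payload without hitting the network.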
The response will include a comprehensive metadata object:
```json
{
  "success": true,
  "data": {
    "markdown": "...",
    "html": "...",
    "metadata": {
      "title": "Example Domain",
      "description": "Example domain description",
      "language": "en",
      "sourceURL": "https://example.com",
      "ogTitle": "Example Domain - Social Title",
      "ogDescription": "Social media description",
      "ogImage": "https://example.com/og-image.jpg",
      "twitterCard": "summary_large_image",
      "twitterTitle": "Example Domain",
      "twitterDescription": "Twitter description",
      "favicon": "https://example.com/favicon.ico"
    }
  }
}
```
## Handling JavaScript-Rendered Metadata

Many modern websites render metadata dynamically with JavaScript. Firecrawl automatically executes JavaScript and waits for the page to load before extracting metadata:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Firecrawl renders JavaScript before extracting metadata
result = app.scrape_url(
    'https://spa-example.com',  # single-page application
    params={
        'waitFor': 2000  # wait 2 seconds for dynamic content
    }
)

metadata = result.get('metadata', {})
print(f"Dynamically loaded title: {metadata.get('title')}")
```
## Best Practices for Metadata Extraction

### 1. Validate Extracted Metadata
Always validate that the extracted metadata contains the information you need:
```python
def validate_metadata(metadata, required_fields):
    """Validate that all required metadata fields are present."""
    missing = [field for field in required_fields if not metadata.get(field)]
    if missing:
        raise ValueError(f"Missing required metadata: {', '.join(missing)}")
    return True

# Usage
metadata = result.get('metadata', {})
try:
    validate_metadata(metadata, ['title', 'description', 'sourceURL'])
    print("Metadata validation passed")
except ValueError as e:
    print(f"Validation error: {e}")
```
### 2. Handle Missing Metadata Gracefully
Not all websites include comprehensive metadata. Use fallback values:
```javascript
function getMetadataWithDefaults(metadata) {
  return {
    title: metadata.title || 'Untitled Page',
    description: metadata.description || metadata.ogDescription || '',
    image: metadata.ogImage || metadata.twitterImage || '',
    url: metadata.sourceURL || '',
    language: metadata.language || 'en'
  };
}
```
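The same fallback pattern in Python, for projects using the Python SDK:

```python
def metadata_with_defaults(metadata):
    """Fill missing metadata fields with sensible fallbacks."""
    return {
        'title': metadata.get('title') or 'Untitled Page',
        'description': metadata.get('description') or metadata.get('ogDescription') or '',
        'image': metadata.get('ogImage') or metadata.get('twitterImage') or '',
        'url': metadata.get('sourceURL') or '',
        'language': metadata.get('language') or 'en',
    }

print(metadata_with_defaults({'ogDescription': 'From OG tags'}))
```

Note the fallback order mirrors the JavaScript version: the plain description wins, then the Open Graph description, then an empty string.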
### 3. Rate Limiting and Error Handling
When extracting metadata from multiple pages, implement proper error handling and respect rate limits:
```python
import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

def extract_metadata_batch(urls, delay=1):
    """Extract metadata from multiple URLs with rate limiting."""
    results = []
    for url in urls:
        try:
            result = app.scrape_url(url)
            metadata = result.get('metadata', {})
            results.append({
                'url': url,
                'metadata': metadata,
                'success': True
            })
        except Exception as e:
            results.append({
                'url': url,
                'error': str(e),
                'success': False
            })
        # Rate limiting: pause between requests
        time.sleep(delay)
    return results

# Usage
urls = ['https://example1.com', 'https://example2.com', 'https://example3.com']
metadata_results = extract_metadata_batch(urls, delay=1)
```
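A fixed delay handles steady-state rate limits, but transient failures (timeouts, 429 responses) are usually worth retrying with exponential backoff. A generic sketch; the `sleep` function is injectable so the backoff can be exercised without real waiting:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a flaky function that succeeds on its third call
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('transient error')
    return 'ok'

print(with_retries(flaky, attempts=4, sleep=lambda s: None))  # ok
```

In real use you would pass `lambda: app.scrape_url(url)` as `fn` and keep the default `time.sleep`.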
## Common Use Cases

### SEO Analysis
Extract metadata to analyze SEO optimization:
```python
def analyze_seo_metadata(url):
    """Analyze SEO-related metadata."""
    app = FirecrawlApp(api_key='your_api_key_here')
    result = app.scrape_url(url)
    metadata = result.get('metadata', {})

    analysis = {
        'title_length': len(metadata.get('title', '')),
        'description_length': len(metadata.get('description', '')),
        'has_og_tags': bool(metadata.get('ogTitle')),
        'has_twitter_tags': bool(metadata.get('twitterCard')),
        # Note: 'sourceURL' is the URL that was scraped, not the page's
        # canonical tag, so it is not a reliable canonical-URL check
        'language': metadata.get('language')
    }

    # SEO recommendations
    if analysis['title_length'] < 30 or analysis['title_length'] > 60:
        analysis['title_warning'] = 'Title should be 30-60 characters'
    if analysis['description_length'] < 120 or analysis['description_length'] > 160:
        analysis['description_warning'] = 'Description should be 120-160 characters'

    return analysis
```
### Content Aggregation
Collect metadata from multiple sources for content aggregation platforms:
```javascript
async function aggregateContent(urls) {
  const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
  const articles = [];

  for (const url of urls) {
    try {
      const result = await app.scrapeUrl(url);
      const metadata = result.metadata || {};
      articles.push({
        title: metadata.title,
        description: metadata.description,
        image: metadata.ogImage,
        url: metadata.sourceURL,
        // These fields are often absent; expect undefined for many pages
        publishedDate: metadata.publishedDate,
        author: metadata.author
      });
    } catch (error) {
      console.error(`Failed to process ${url}:`, error.message);
    }
  }

  return articles;
}
```
## Comparing Firecrawl to Traditional Approaches

Traditional metadata extraction requires manually parsing HTML and handling each meta tag format separately. With browser automation tools, you would write custom code to locate and extract every metadata field yourself. Firecrawl simplifies this by providing a unified API that extracts all common metadata formats automatically.
## Conclusion
Firecrawl provides a powerful and straightforward approach to extracting metadata from websites. By automatically handling JavaScript rendering, parsing multiple metadata formats, and returning structured JSON data, it significantly reduces the complexity of metadata extraction compared to traditional web scraping methods. Whether you're building an SEO analysis tool, content aggregation platform, or social media sharing application, Firecrawl's metadata extraction capabilities can streamline your development process and ensure you capture all the essential information from web pages.
The key advantages include automatic handling of Open Graph and Twitter Card tags, support for JavaScript-rendered content, and a clean API that works consistently across different types of websites. By following the best practices outlined in this guide, you can build robust metadata extraction workflows that handle edge cases gracefully and scale to process thousands of pages efficiently.