# How do I Extract Metadata from Websites Using Firecrawl?
Metadata extraction is a crucial aspect of web scraping, providing essential information about web pages such as titles, descriptions, social media tags, and structured data. Firecrawl simplifies this process by automatically extracting metadata alongside your web content, making it an excellent choice for developers who need comprehensive page information without complex parsing logic.
## Understanding Metadata Extraction with Firecrawl
Firecrawl is a web scraping API that converts websites into clean, structured data. One of its key features is automatic metadata extraction, which captures important page information including:
- **Standard Meta Tags**: Title, description, keywords, author
- **Open Graph Tags**: Social media sharing metadata (`og:title`, `og:description`, `og:image`)
- **Twitter Card Tags**: Twitter-specific metadata (`twitter:card`, `twitter:title`, `twitter:image`)
- **Canonical URLs**: The preferred version of a web page
- **Language Information**: Page language and locale settings
- **Favicon URLs**: Website icons and branding elements
Unlike traditional web scraping approaches where you need to manually parse HTML and extract specific meta tags, Firecrawl handles this automatically and returns structured metadata in a clean JSON format.
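To see what Firecrawl is saving you from, here is a minimal sketch of the manual approach using only Python's standard library. It handles just `<title>` and `<meta>` tags on a sample document; real pages would also need fetching, encoding detection, and far more defensive parsing:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects <meta> name/property -> content pairs and the <title> text."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.meta[key] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

sample_html = """<html><head>
<title>Example Domain</title>
<meta name="description" content="An example page">
<meta property="og:image" content="https://example.com/og.png">
</head><body></body></html>"""

parser = MetaTagParser()
parser.feed(sample_html)
print(parser.title)                # Example Domain
print(parser.meta["description"])  # An example page
```

Every tag variant (Open Graph, Twitter Cards, JSON-LD) adds more cases to code like this, which is exactly the parsing logic Firecrawl replaces.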
## Basic Metadata Extraction with Firecrawl

### Python Implementation
Here's how to extract metadata from a website using Firecrawl's Python SDK:
```python
from firecrawl import FirecrawlApp

# Initialize Firecrawl with your API key
app = FirecrawlApp(api_key='your_api_key_here')

# Scrape a URL and extract metadata
result = app.scrape_url('https://example.com')

# Access metadata
metadata = result.get('metadata', {})
print(f"Title: {metadata.get('title')}")
print(f"Description: {metadata.get('description')}")
print(f"Language: {metadata.get('language')}")
print(f"Source URL: {metadata.get('sourceURL')}")  # the URL that was scraped

# Access Open Graph metadata
og_title = metadata.get('ogTitle')
og_description = metadata.get('ogDescription')
og_image = metadata.get('ogImage')

print("\nOpen Graph Data:")
print(f"OG Title: {og_title}")
print(f"OG Description: {og_description}")
print(f"OG Image: {og_image}")
```
### JavaScript/Node.js Implementation
For Node.js applications, you can use Firecrawl's JavaScript SDK:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize Firecrawl
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function extractMetadata(url) {
  try {
    // Scrape the URL
    const result = await app.scrapeUrl(url);

    // Access metadata
    const metadata = result.metadata || {};
    console.log('Title:', metadata.title);
    console.log('Description:', metadata.description);
    console.log('Language:', metadata.language);
    console.log('Keywords:', metadata.keywords);

    // Access social media metadata
    console.log('\nSocial Media Metadata:');
    console.log('OG Title:', metadata.ogTitle);
    console.log('OG Description:', metadata.ogDescription);
    console.log('OG Image:', metadata.ogImage);
    console.log('Twitter Card:', metadata.twitterCard);
    console.log('Twitter Title:', metadata.twitterTitle);

    return metadata;
  } catch (error) {
    console.error('Error extracting metadata:', error);
    throw error;
  }
}

// Usage
extractMetadata('https://example.com')
  .then(metadata => console.log('Extraction complete:', metadata))
  .catch(error => console.error('Extraction failed:', error));
```
## Advanced Metadata Extraction Techniques

### Extracting Metadata from Multiple Pages
When you need to extract metadata from multiple pages, you can use Firecrawl's crawling functionality:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Crawl multiple pages and extract metadata
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 10,  # maximum number of pages to crawl
        'scrapeOptions': {
            'formats': ['markdown', 'html']
        }
    }
)

# Process metadata from each crawled page
for page in crawl_result.get('data', []):
    metadata = page.get('metadata', {})
    print(f"\nURL: {metadata.get('sourceURL')}")
    print(f"Title: {metadata.get('title')}")
    print(f"Description: {metadata.get('description')}")
```
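When a crawl returns many pages, it is often handy to persist the per-page metadata. A small helper (assuming the `data` list shape shown above) can serialize it to JSON Lines for later analysis:

```python
import json

def metadata_to_jsonl(pages):
    """Serialize the metadata of each crawled page as one JSON object per line.

    `pages` is the list found under crawl_result['data'].
    """
    lines = []
    for page in pages:
        meta = page.get('metadata', {})
        lines.append(json.dumps({
            'url': meta.get('sourceURL'),
            'title': meta.get('title'),
            'description': meta.get('description'),
        }, ensure_ascii=False))
    return '\n'.join(lines)

# Works on any list shaped like crawl_result['data']
sample_pages = [{'metadata': {'sourceURL': 'https://example.com', 'title': 'Example'}}]
print(metadata_to_jsonl(sample_pages))
```

One JSON object per line keeps the output appendable and streamable, which matters once crawls grow beyond a few hundred pages.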
### Filtering and Processing Metadata
You can process and filter extracted metadata to focus on specific information:
```javascript
async function extractAndFilterMetadata(url, requiredFields) {
  const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
  const result = await app.scrapeUrl(url);
  const metadata = result.metadata || {};

  // Keep only the requested fields
  const filteredMetadata = {};
  requiredFields.forEach(field => {
    if (metadata[field]) {
      filteredMetadata[field] = metadata[field];
    }
  });

  // Warn about any required fields that were missing
  const missingFields = requiredFields.filter(
    field => !filteredMetadata[field]
  );
  if (missingFields.length > 0) {
    console.warn('Missing metadata fields:', missingFields);
  }

  return filteredMetadata;
}

// Usage: extract only specific metadata fields
const requiredFields = ['title', 'description', 'ogImage', 'language'];
extractAndFilterMetadata('https://example.com', requiredFields)
  .then(metadata => console.log('Filtered metadata:', metadata));
```
## Working with Structured Data
Firecrawl can also extract structured data from web pages, which is particularly useful for SEO analysis and content understanding:
```python
from firecrawl import FirecrawlApp
import json

app = FirecrawlApp(api_key='your_api_key_here')

# Scrape with all available formats
result = app.scrape_url(
    'https://example.com',
    params={
        'formats': ['markdown', 'html', 'rawHtml']
    }
)

metadata = result.get('metadata', {})

# Extract structured data if the page (and your Firecrawl version) exposes it
structured_data = metadata.get('structuredData', {})
if structured_data:
    print("Structured Data Found:")
    print(json.dumps(structured_data, indent=2))

    # Process specific schema types
    if structured_data.get('@type') == 'Article':
        print(f"\nArticle Title: {structured_data.get('headline')}")
        # In JSON-LD, 'author' may be an object, a string, or a list
        author = structured_data.get('author') or {}
        print(f"Author: {author.get('name') if isinstance(author, dict) else author}")
        print(f"Published: {structured_data.get('datePublished')}")
```
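If your Firecrawl plan or version does not surface structured data directly, a fallback is to request `rawHtml` and pull the JSON-LD blocks out yourself. The regex below is a sketch that works for well-formed pages, not a general HTML parser:

```python
import json
import re

# JSON-LD lives in <script type="application/ld+json"> tags
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_json_ld(raw_html):
    """Return every parseable JSON-LD object found in the page."""
    blocks = []
    for match in JSON_LD_RE.findall(raw_html):
        try:
            blocks.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
    return blocks

sample = '<script type="application/ld+json">{"@type": "Article", "headline": "Hi"}</script>'
print(extract_json_ld(sample))
```

In practice you would pass `result.get('rawHtml', '')` from the scrape above as `raw_html`.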
## Using the REST API Directly
If you prefer to use HTTP requests directly without an SDK, here's how to extract metadata using cURL:
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer your_api_key_here' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html"]
  }'
```
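The same call can be made from Python's standard library without any SDK. The endpoint path and payload shape used here are assumptions to verify against the current Firecrawl API docs:

```python
import json
import urllib.request

API_URL = 'https://api.firecrawl.dev/v1/scrape'  # assumed endpoint; check current docs

def build_scrape_request(url, api_key, formats=('markdown', 'html')):
    """Build the POST request for Firecrawl's scrape endpoint."""
    payload = json.dumps({'url': url, 'formats': list(formats)}).encode('utf-8')
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {api_key}',
        },
        method='POST',
    )

req = build_scrape_request('https://example.com', 'your_api_key_here')
# response = urllib.request.urlopen(req)              # actual network call
# metadata = json.load(response)['data']['metadata']  # expected response shape
print(req.full_url)
```

Keeping request construction in a small pure function makes it easy to test the payload without hitting the network.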
The response will include a comprehensive metadata object:
```json
{
  "success": true,
  "data": {
    "markdown": "...",
    "html": "...",
    "metadata": {
      "title": "Example Domain",
      "description": "Example domain description",
      "language": "en",
      "sourceURL": "https://example.com",
      "ogTitle": "Example Domain - Social Title",
      "ogDescription": "Social media description",
      "ogImage": "https://example.com/og-image.jpg",
      "twitterCard": "summary_large_image",
      "twitterTitle": "Example Domain",
      "twitterDescription": "Twitter description",
      "favicon": "https://example.com/favicon.ico"
    }
  }
}
```
## Handling JavaScript-Rendered Metadata

Many modern websites render metadata dynamically with JavaScript. Firecrawl automatically executes JavaScript and waits for the page to load before extracting metadata:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Firecrawl renders JavaScript before extracting metadata
result = app.scrape_url(
    'https://spa-example.com',  # single-page application
    params={
        'waitFor': 2000  # wait 2 seconds for dynamic content
    }
)

metadata = result.get('metadata', {})
print(f"Dynamically loaded title: {metadata.get('title')}")
```
## Best Practices for Metadata Extraction

### 1. Validate Extracted Metadata
Always validate that the extracted metadata contains the information you need:
```python
def validate_metadata(metadata, required_fields):
    """Validate that all required metadata fields are present."""
    missing = [field for field in required_fields if not metadata.get(field)]
    if missing:
        raise ValueError(f"Missing required metadata: {', '.join(missing)}")
    return True

# Usage
metadata = result.get('metadata', {})
try:
    validate_metadata(metadata, ['title', 'description', 'sourceURL'])
    print("Metadata validation passed")
except ValueError as e:
    print(f"Validation error: {e}")
```
### 2. Handle Missing Metadata Gracefully
Not all websites include comprehensive metadata. Use fallback values:
```javascript
function getMetadataWithDefaults(metadata) {
  return {
    title: metadata.title || 'Untitled Page',
    description: metadata.description || metadata.ogDescription || '',
    image: metadata.ogImage || metadata.twitterImage || '',
    url: metadata.sourceURL || '',
    language: metadata.language || 'en'
  };
}
```
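The same fallback pattern in Python, for projects using the Python SDK:

```python
def metadata_with_defaults(metadata):
    """Fill missing metadata fields with sensible fallbacks."""
    return {
        'title': metadata.get('title') or 'Untitled Page',
        'description': metadata.get('description') or metadata.get('ogDescription') or '',
        'image': metadata.get('ogImage') or metadata.get('twitterImage') or '',
        'url': metadata.get('sourceURL') or '',
        'language': metadata.get('language') or 'en',
    }

print(metadata_with_defaults({'ogDescription': 'From OG tags'}))
```

Note the fallback order mirrors the JavaScript version: the plain description wins, then the Open Graph description, then an empty string.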
### 3. Rate Limiting and Error Handling
When extracting metadata from multiple pages, implement proper error handling and respect rate limits:
```python
import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

def extract_metadata_batch(urls, delay=1):
    """Extract metadata from multiple URLs with rate limiting."""
    results = []
    for url in urls:
        try:
            result = app.scrape_url(url)
            metadata = result.get('metadata', {})
            results.append({
                'url': url,
                'metadata': metadata,
                'success': True
            })
        except Exception as e:
            results.append({
                'url': url,
                'error': str(e),
                'success': False
            })
        # Rate limiting: pause between requests
        time.sleep(delay)
    return results

# Usage
urls = ['https://example1.com', 'https://example2.com', 'https://example3.com']
metadata_results = extract_metadata_batch(urls, delay=1)
```
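A fixed delay handles steady-state rate limits, but transient failures (timeouts, 429 responses) are usually worth retrying with exponential backoff. A generic sketch; the `sleep` function is injectable so the backoff can be exercised without real waiting:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a flaky function that succeeds on its third call
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('transient error')
    return 'ok'

print(with_retries(flaky, attempts=4, sleep=lambda s: None))  # ok
```

In real use you would pass `lambda: app.scrape_url(url)` as `fn` and keep the default `time.sleep`.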
## Common Use Cases

### SEO Analysis
Extract metadata to analyze SEO optimization:
```python
def analyze_seo_metadata(url):
    """Analyze SEO-related metadata."""
    app = FirecrawlApp(api_key='your_api_key_here')
    result = app.scrape_url(url)
    metadata = result.get('metadata', {})

    analysis = {
        'title_length': len(metadata.get('title', '')),
        'description_length': len(metadata.get('description', '')),
        'has_og_tags': bool(metadata.get('ogTitle')),
        'has_twitter_tags': bool(metadata.get('twitterCard')),
        # Note: 'sourceURL' is the URL that was scraped, not the page's
        # canonical tag, so it is not a reliable canonical-URL check
        'language': metadata.get('language')
    }

    # SEO recommendations
    if analysis['title_length'] < 30 or analysis['title_length'] > 60:
        analysis['title_warning'] = 'Title should be 30-60 characters'
    if analysis['description_length'] < 120 or analysis['description_length'] > 160:
        analysis['description_warning'] = 'Description should be 120-160 characters'

    return analysis
```
### Content Aggregation
Collect metadata from multiple sources for content aggregation platforms:
```javascript
async function aggregateContent(urls) {
  const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
  const articles = [];

  for (const url of urls) {
    try {
      const result = await app.scrapeUrl(url);
      const metadata = result.metadata || {};
      articles.push({
        title: metadata.title,
        description: metadata.description,
        image: metadata.ogImage,
        url: metadata.sourceURL,
        // These fields are often absent; expect undefined for many pages
        publishedDate: metadata.publishedDate,
        author: metadata.author
      });
    } catch (error) {
      console.error(`Failed to process ${url}:`, error.message);
    }
  }

  return articles;
}
```
## Comparing Firecrawl to Traditional Approaches

Traditional metadata extraction requires manually parsing HTML and handling each meta tag format separately. With browser automation tools, you would write custom code to locate and extract every metadata field yourself. Firecrawl simplifies this by providing a unified API that extracts all common metadata formats automatically.
## Conclusion
Firecrawl provides a powerful and straightforward approach to extracting metadata from websites. By automatically handling JavaScript rendering, parsing multiple metadata formats, and returning structured JSON data, it significantly reduces the complexity of metadata extraction compared to traditional web scraping methods. Whether you're building an SEO analysis tool, content aggregation platform, or social media sharing application, Firecrawl's metadata extraction capabilities can streamline your development process and ensure you capture all the essential information from web pages.
The key advantages include automatic handling of Open Graph and Twitter Card tags, support for JavaScript-rendered content, and a clean API that works consistently across different types of websites. By following the best practices outlined in this guide, you can build robust metadata extraction workflows that handle edge cases gracefully and scale to process thousands of pages efficiently.