How Do I Extract Structured Data Using Firecrawl?
Firecrawl provides powerful tools for extracting structured data from websites through its Extract API endpoint. Unlike traditional web scraping that requires manual parsing of HTML, Firecrawl uses large language models (LLMs) to intelligently extract data based on schemas you define. This approach makes it easy to convert unstructured web content into clean, structured JSON data.
Understanding Firecrawl's Extract Endpoint
The Extract endpoint (/extract) is specifically designed for structured data extraction. It takes a URL and a schema definition, then returns data that matches your schema. This is particularly useful when you need to extract specific fields from web pages without writing complex parsing logic.
Key Features
- Schema-based extraction: Define the structure you want using JSON schemas
- LLM-powered parsing: Uses AI to understand page content and extract relevant data
- Automatic type conversion: Converts extracted data to the appropriate types (strings, numbers, booleans, arrays)
- Handles dynamic content: Works with JavaScript-rendered pages
- Multiple format support: Extracts from HTML, markdown, and PDF files
Basic Structured Data Extraction
Python Example
Here's how to extract structured data using Firecrawl's Python SDK:
from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key')

# Define the schema for the data you want to extract
schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "The main title of the article"
        },
        "author": {
            "type": "string",
            "description": "The author's name"
        },
        "publishDate": {
            "type": "string",
            "description": "Publication date in YYYY-MM-DD format"
        },
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Article tags or categories"
        }
    },
    "required": ["title", "author"]
}

# Extract data from a URL
result = app.extract_url(
    url='https://example.com/article',
    schema=schema
)

# Access the extracted data
print(result['data'])
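Because you already have the schema in hand, you can also verify programmatically that the returned data conforms to it. The third-party jsonschema package used below is an assumption for illustration, not part of the Firecrawl SDK; a minimal sketch:

from jsonschema import validate, ValidationError  # pip install jsonschema

# Validate the extracted payload against the same schema sent to Firecrawl
try:
    validate(instance=result['data'], schema=schema)
    print("Extracted data matches the schema")
except ValidationError as e:
    print(f"Schema mismatch: {e.message}")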
JavaScript/Node.js Example
The same extraction using Firecrawl's JavaScript SDK:
import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize the client
const app = new FirecrawlApp({ apiKey: 'your_api_key' });

// Define your schema
const schema = {
  type: 'object',
  properties: {
    title: {
      type: 'string',
      description: 'The main title of the article'
    },
    author: {
      type: 'string',
      description: "The author's name"
    },
    publishDate: {
      type: 'string',
      description: 'Publication date in YYYY-MM-DD format'
    },
    tags: {
      type: 'array',
      items: { type: 'string' },
      description: 'Article tags or categories'
    }
  },
  required: ['title', 'author']
};

// Extract data
const result = await app.extractUrl({
  url: 'https://example.com/article',
  schema: schema
});

console.log(result.data);
Advanced Schema Definitions
Nested Objects
You can define complex nested structures for hierarchical data:
schema = {
    "type": "object",
    "properties": {
        "product": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {
                    "type": "object",
                    "properties": {
                        "amount": {"type": "number"},
                        "currency": {"type": "string"}
                    }
                },
                "specifications": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "key": {"type": "string"},
                            "value": {"type": "string"}
                        }
                    }
                }
            }
        }
    }
}

result = app.extract_url(
    url='https://example.com/product',
    schema=schema
)
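Nested fields come back as nested dictionaries, so chaining .get() calls avoids a KeyError when the model cannot find a field on the page. A defensive sketch, assuming the response shape shown above:

# Safely read a deeply nested field; missing keys fall back to None
product = result['data'].get('product', {})
amount = product.get('price', {}).get('amount')
currency = product.get('price', {}).get('currency')
if amount is not None:
    print(f"Price: {amount} {currency}")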
Extracting Lists of Items
To extract multiple items (like search results or product listings):
const schema = {
  type: 'object',
  properties: {
    products: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          price: { type: 'number' },
          rating: { type: 'number' },
          url: { type: 'string' }
        }
      },
      description: 'List of all products on the page'
    }
  }
};

const result = await app.extractUrl({
  url: 'https://example.com/products',
  schema: schema
});

// Access the array of products
result.data.products.forEach(product => {
  console.log(`${product.name}: $${product.price}`);
});
Extracting Data from Multiple Pages
When you need to extract structured data from multiple pages, combine Firecrawl's crawl functionality with extraction:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Schema for extraction
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"}
    }
}

# Crawl with extraction
result = app.crawl_url(
    url='https://example.com/shop',
    params={
        'limit': 10,
        'scrapeOptions': {
            'formats': ['extract'],
            'extract': {
                'schema': product_schema
            }
        }
    }
)

# Process extracted data from all pages
for page in result['data']:
    if 'extract' in page:
        print(f"Product: {page['extract']['name']}")
        print(f"Price: ${page['extract']['price']}")
Using API Directly
If you prefer to use the REST API directly without an SDK:
curl -X POST https://api.firecrawl.dev/v1/extract \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/article",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "author": {"type": "string"}
      }
    }
  }'
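The same request can be made from Python without the SDK using the requests library. This sketch simply mirrors the curl call above; the exact response shape may differ from what the SDK returns:

import requests

# POST the URL and schema to the Extract endpoint directly
response = requests.post(
    'https://api.firecrawl.dev/v1/extract',
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_API_KEY'
    },
    json={
        'url': 'https://example.com/article',
        'schema': {
            'type': 'object',
            'properties': {
                'title': {'type': 'string'},
                'content': {'type': 'string'},
                'author': {'type': 'string'}
            }
        }
    }
)
response.raise_for_status()
print(response.json())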
Best Practices for Structured Extraction
1. Provide Clear Descriptions
Adding descriptions to your schema properties helps the LLM understand what data to extract:
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "description": "The current selling price, not the original or discounted price"
        },
        "availability": {
            "type": "string",
            "description": "Whether the product is in stock, out of stock, or available for pre-order"
        }
    }
}
2. Use Appropriate Data Types
Match your schema types to the expected data format:
- string for text, dates, and URLs
- number for prices, quantities, and ratings
- boolean for yes/no or true/false values
- array for lists of items
- object for nested structures
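A single schema can mix all of these types. The sketch below is purely illustrative and not tied to any specific page:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},       # text
        "price": {"type": "number"},       # numeric value
        "inStock": {"type": "boolean"},    # yes/no
        "tags": {                          # list of strings
            "type": "array",
            "items": {"type": "string"}
        },
        "dimensions": {                    # nested structure
            "type": "object",
            "properties": {
                "width": {"type": "number"},
                "height": {"type": "number"}
            }
        }
    }
}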
3. Specify Required Fields
Mark essential fields as required to ensure they're always extracted:
const schema = {
  type: 'object',
  properties: {
    productId: { type: 'string' },
    name: { type: 'string' },
    price: { type: 'number' },
    description: { type: 'string' }
  },
  required: ['productId', 'name', 'price']
};
4. Handle Dynamic Content
For pages that load content via JavaScript (similar to handling AJAX requests using Puppeteer), Firecrawl automatically waits for content to load. You can also specify wait conditions:
result = app.extract_url(
    url='https://example.com/dynamic-page',
    schema=schema,
    params={
        'waitFor': 2000  # Wait 2 seconds for content to load
    }
)
Error Handling and Validation
Always implement proper error handling when extracting structured data:
try:
    result = app.extract_url(
        url='https://example.com/page',
        schema=schema
    )

    # Validate the extracted data
    if result.get('success'):
        data = result['data']
        # Check if required fields are present
        if 'title' in data and 'content' in data:
            process_data(data)
        else:
            print("Missing required fields in extracted data")
    else:
        print(f"Extraction failed: {result.get('error')}")
except Exception as e:
    print(f"Error during extraction: {str(e)}")
try {
  const result = await app.extractUrl({
    url: 'https://example.com/page',
    schema: schema
  });

  if (result.success && result.data) {
    // Process the extracted data
    console.log('Extracted data:', result.data);
  } else {
    console.error('Extraction failed:', result.error);
  }
} catch (error) {
  console.error('Error during extraction:', error.message);
}
Real-World Use Cases
E-commerce Product Extraction
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sku": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "inStock": {"type": "boolean"},
        "images": {
            "type": "array",
            "items": {"type": "string"}
        },
        "reviews": {
            "type": "object",
            "properties": {
                "averageRating": {"type": "number"},
                "totalCount": {"type": "number"}
            }
        }
    }
}

result = app.extract_url(
    url='https://store.example.com/product/12345',
    schema=product_schema
)
Article Metadata Extraction
const articleSchema = {
  type: 'object',
  properties: {
    headline: { type: 'string' },
    subheadline: { type: 'string' },
    author: { type: 'string' },
    publishedDate: { type: 'string' },
    modifiedDate: { type: 'string' },
    wordCount: { type: 'number' },
    readingTime: { type: 'number' },
    categories: {
      type: 'array',
      items: { type: 'string' }
    }
  }
};

const result = await app.extractUrl({
  url: 'https://blog.example.com/article',
  schema: articleSchema
});
Job Listing Extraction
job_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "remote": {"type": "boolean"},
        "salary": {
            "type": "object",
            "properties": {
                "min": {"type": "number"},
                "max": {"type": "number"},
                "currency": {"type": "string"}
            }
        },
        "requirements": {
            "type": "array",
            "items": {"type": "string"}
        },
        "benefits": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

result = app.extract_url(
    url='https://jobs.example.com/posting/123',
    schema=job_schema
)
Combining with Crawling
For large-scale data extraction across multiple pages, use Firecrawl's crawl endpoint with extraction schemas:
# Crawl an entire website section and extract structured data
# (assumes article_schema is defined as in the earlier article example)
crawl_result = app.crawl_url(
    url='https://example.com/blog',
    params={
        'limit': 100,
        'scrapeOptions': {
            'formats': ['extract'],
            'extract': {
                'schema': article_schema
            }
        }
    }
)

# Save all extracted articles to a database
for page in crawl_result['data']:
    if 'extract' in page:
        save_to_database(page['extract'])
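The save_to_database helper above is a placeholder. A minimal sketch using Python's built-in sqlite3 module could look like the following; the table layout here is an assumption for illustration:

import json
import sqlite3

conn = sqlite3.connect('articles.db')
conn.execute(
    'CREATE TABLE IF NOT EXISTS articles (headline TEXT, author TEXT, raw_json TEXT)'
)

def save_to_database(extract):
    # Store the full payload as JSON alongside a couple of queryable fields
    conn.execute(
        'INSERT INTO articles VALUES (?, ?, ?)',
        (extract.get('headline'), extract.get('author'), json.dumps(extract))
    )
    conn.commit()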
Performance Considerations
- Rate Limiting: Be mindful of API rate limits when extracting data from multiple pages
- Schema Complexity: Simpler schemas generally extract faster and more accurately
- Page Size: Large pages may take longer to process; consider using the onlyMainContent option
- Caching: Implement caching for frequently accessed pages to reduce API calls (see the sketch after this list)
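A simple in-memory cache plus a fixed delay between fresh requests covers the rate-limiting and caching points above. This is an illustrative sketch; the delay value and cache policy are assumptions, not Firecrawl recommendations:

import time

_cache = {}

def cached_extract(app, url, schema, delay_seconds=1.0):
    # Reuse previous results for the same URL to avoid repeated API calls
    if url in _cache:
        return _cache[url]
    result = app.extract_url(url=url, schema=schema)
    _cache[url] = result
    time.sleep(delay_seconds)  # crude rate limiting between fresh requests
    return result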
Troubleshooting Common Issues
Missing Data
If extracted data is incomplete:
- Add more detailed descriptions to your schema properties
- Check if the data exists on the page
- Verify the page has fully loaded (increase the waitFor time if needed)
Incorrect Data Types
If data types don't match:
- Ensure your schema types match the actual data format
- Use string type for mixed content, then convert in your application (see the sketch below)
- Check for special formatting (dates, currencies, etc.)
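For example, a price often arrives as mixed text like "$1,299.99". Extracting it as a string and converting afterwards in your own code is usually more reliable than asking the LLM to produce a number; a sketch:

import re

def parse_price(raw):
    # Strip currency symbols and thousands separators, then convert to float
    match = re.search(r'[\d.,]+', raw)
    if not match:
        return None
    return float(match.group().replace(',', ''))

print(parse_price('$1,299.99'))  # 1299.99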
Extraction Timeouts
For pages that take long to load:
- Increase the timeout parameter
- Use the onlyMainContent option to focus on the main content area
- See handling timeouts in Puppeteer for similar timeout management strategies
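If the SDK accepts a per-request timeout the same way it accepts waitFor (an assumption here; check the current Firecrawl docs for the exact parameter name), the call might look like:

result = app.extract_url(
    url='https://example.com/slow-page',
    schema=schema,
    params={
        'timeout': 60000,         # allow up to 60 seconds (assumed, in ms)
        'onlyMainContent': True   # skip navigation, footers, and sidebars
    }
)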
Conclusion
Firecrawl's structured data extraction provides a powerful, AI-driven approach to web scraping that eliminates the need for complex HTML parsing. By defining clear schemas and leveraging LLM capabilities, you can extract clean, structured data from any website with minimal code. Whether you're building a product aggregator, content management system, or data analysis pipeline, Firecrawl's Extract API simplifies the process of converting unstructured web content into usable data.
Remember to always respect website terms of service and robots.txt files when scraping, and implement proper error handling and rate limiting in your production applications.