Firecrawl Tutorial for Beginners
Firecrawl is a powerful web scraping and crawling API that simplifies the process of extracting data from websites. Unlike traditional scraping tools that require complex setup and maintenance, Firecrawl handles JavaScript rendering, anti-bot measures, and infrastructure management automatically. This tutorial will guide you through everything you need to know to start using Firecrawl effectively.
What is Firecrawl?
Firecrawl is a managed web scraping service that converts websites into clean, structured data formats. It's designed specifically for developers who need reliable data extraction without the hassle of managing proxies, headless browsers, or constantly updating selectors.
Key Features
- JavaScript Rendering - Automatically executes JavaScript to capture dynamic content
- Clean Markdown Output - Converts HTML into LLM-ready markdown format
- Intelligent Crawling - Discovers and navigates through website pages automatically
- Structured Data Extraction - Uses AI to extract specific fields based on your schema
- Built-in Anti-Bot Bypass - Handles CAPTCHAs and bot detection mechanisms
- No Infrastructure Required - Fully managed service, no servers to maintain
Getting Started with Firecrawl
Step 1: Create an Account
- Visit firecrawl.dev and sign up for an account
- Navigate to your dashboard
- Generate and copy your API key from the API Keys section
- Start with the free tier to test the service (typically includes 500 credits)
Step 2: Choose Your Language
Firecrawl provides official SDKs for both Python and Node.js. This tutorial covers both languages so you can choose based on your preference.
Installing Firecrawl
Python Installation
# Using pip
pip install firecrawl-py
# Using Poetry
poetry add firecrawl-py
# Using Pipenv
pipenv install firecrawl-py
Node.js Installation
# Using npm
npm install @mendable/firecrawl-js
# Using yarn
yarn add @mendable/firecrawl-js
Your First Scraping Project
Setting Up Your API Key
Always store your API key securely using environment variables:
Python (shell export):
export FIRECRAWL_API_KEY='your_api_key_here'
Node.js (.env file):
FIRECRAWL_API_KEY=your_api_key_here
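If you prefer to keep the key in a .env file on the Python side as well, the python-dotenv package can load it before you create the client. This is an optional sketch, not part of the Firecrawl SDK; it assumes python-dotenv is installed (pip install python-dotenv) and that a .env file sits in your project root:
Python:
# a minimal sketch using python-dotenv (an optional extra, not required by Firecrawl)
from dotenv import load_dotenv
import os

load_dotenv()  # reads .env and populates environment variables

api_key = os.getenv('FIRECRAWL_API_KEY')
if not api_key:
    raise RuntimeError('FIRECRAWL_API_KEY is not set')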
Basic Scraping Example
Let's start with the simplest use case: scraping a single web page.
Python:
from firecrawl import FirecrawlApp
import os
# Initialize the client
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
# Scrape a single page
result = app.scrape_url('https://example.com')
# Access the content
print("Markdown Content:")
print(result['markdown'])
print("\nMetadata:")
print(result['metadata'])
JavaScript:
import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';
dotenv.config();
// Initialize the client
const app = new FirecrawlApp({
apiKey: process.env.FIRECRAWL_API_KEY
});
// Scrape a single page
async function scrapePage() {
const result = await app.scrapeUrl('https://example.com', {
formats: ['markdown']
});
console.log("Markdown Content:");
console.log(result.markdown);
console.log("\nMetadata:");
console.log(result.metadata);
}
scrapePage();
Understanding the Response
The scraping response includes several useful fields:
- markdown - Clean, formatted text content
- html - Original HTML (if requested)
- metadata - Page title, description, language, and more
- links - All links found on the page (if requested)
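As a quick illustration of these fields, the sketch below reuses the app client from the basic example, requests the optional html and links formats alongside markdown, and prints a few values. It assumes the response is a dictionary shaped as described above; example.com is a placeholder URL:
Python:
# a minimal sketch: request optional formats and inspect the fields listed above
result = app.scrape_url(
    'https://example.com',
    params={'formats': ['markdown', 'html', 'links']}
)

print(result['metadata'].get('title'))  # page title from the metadata field
print(result['markdown'][:200])         # preview of the markdown content
print(result.get('links', [])[:5])      # first few links, if the format was requested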
Customizing Scraping Options
Firecrawl offers numerous options to customize how pages are scraped, which is particularly useful for complex layouts and dynamically loaded content.
Focusing on Main Content
Extract only the main article content, excluding navigation, ads, and footers:
Python:
result = app.scrape_url(
'https://example.com/article',
params={
'formats': ['markdown'],
'onlyMainContent': True,
'includeTags': ['article', 'main', '.content'],
'excludeTags': ['nav', 'footer', '.ads', '.sidebar']
}
)
JavaScript:
const result = await app.scrapeUrl('https://example.com/article', {
formats: ['markdown'],
onlyMainContent: true,
includeTags: ['article', 'main', '.content'],
excludeTags: ['nav', 'footer', '.ads', '.sidebar']
});
Waiting for JavaScript
Some websites load content dynamically. Use the waitFor parameter to ensure all content is loaded:
Python:
result = app.scrape_url(
'https://example.com/dynamic-page',
params={
'formats': ['markdown'],
'waitFor': 3000 # Wait 3 seconds for JavaScript to execute
}
)
JavaScript:
const result = await app.scrapeUrl('https://example.com/dynamic-page', {
formats: ['markdown'],
waitFor: 3000 // Wait 3 seconds
});
This approach is similar to using Puppeteer's wait helpers (such as waitForSelector), but the browser is managed entirely by Firecrawl.
Crawling Multiple Pages
One of Firecrawl's most powerful features is its ability to automatically discover and scrape multiple pages from a website.
Basic Crawling
Python:
from firecrawl import FirecrawlApp
import os
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
# Crawl up to 50 pages
crawl_result = app.crawl_url(
'https://example.com',
params={
'limit': 50,
'scrapeOptions': {
'formats': ['markdown'],
'onlyMainContent': True
}
},
poll_interval=5 # Check status every 5 seconds
)
# Process all pages
for page in crawl_result['data']:
    url = page['metadata']['sourceURL']
    content = page['markdown']
    print(f"Scraped: {url}")
    print(f"Content preview: {content[:100]}...\n")
JavaScript:
async function crawlWebsite() {
const crawlResult = await app.crawlUrl('https://example.com', {
limit: 50,
scrapeOptions: {
formats: ['markdown'],
onlyMainContent: true
}
});
console.log(`Successfully crawled ${crawlResult.data.length} pages`);
crawlResult.data.forEach(page => {
console.log(`URL: ${page.metadata.sourceURL}`);
console.log(`Content: ${page.markdown.substring(0, 100)}...\n`);
});
}
crawlWebsite();
Advanced Crawling with Filters
Control which pages to crawl using path patterns:
Python:
crawl_result = app.crawl_url(
'https://example.com',
params={
'limit': 100,
'maxDepth': 3, # Maximum depth from starting URL
'includePaths': ['/blog/*', '/articles/*', '/docs/*'],
'excludePaths': ['/admin/*', '/login', '/signup'],
'allowBackwardLinks': False, # Don't crawl parent directories
'allowExternalLinks': False, # Stay on the same domain
'scrapeOptions': {
'formats': ['markdown'],
'waitFor': 1000
}
}
)
JavaScript:
const crawlResult = await app.crawlUrl('https://example.com', {
limit: 100,
maxDepth: 3,
includePaths: ['/blog/*', '/articles/*', '/docs/*'],
excludePaths: ['/admin/*', '/login', '/signup'],
allowBackwardLinks: false,
allowExternalLinks: false,
scrapeOptions: {
formats: ['markdown'],
waitFor: 1000
}
});
Extracting Structured Data
Firecrawl can extract specific data fields using AI-powered extraction. This is incredibly useful for scraping product information, job listings, or any structured content.
Defining a Schema
Python:
from firecrawl import FirecrawlApp
import os
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
# Define the data structure you want to extract
schema = {
'type': 'object',
'properties': {
'productName': {'type': 'string'},
'price': {'type': 'number'},
'currency': {'type': 'string'},
'description': {'type': 'string'},
'features': {
'type': 'array',
'items': {'type': 'string'}
},
'inStock': {'type': 'boolean'},
'rating': {'type': 'number'}
},
'required': ['productName', 'price']
}
# Extract structured data
result = app.scrape_url(
'https://example.com/product/laptop',
params={
'formats': ['extract'],
'extract': {
'schema': schema,
'systemPrompt': 'Extract product information from this e-commerce page',
'prompt': 'Extract all product details including name, price, features, and availability'
}
}
)
# Access the extracted data
product = result['extract']
print(f"Product: {product['productName']}")
print(f"Price: {product['price']} {product['currency']}")
print(f"In Stock: {product['inStock']}")
print(f"Features: {product['features']}")
JavaScript:
const schema = {
type: 'object',
properties: {
productName: { type: 'string' },
price: { type: 'number' },
currency: { type: 'string' },
description: { type: 'string' },
features: {
type: 'array',
items: { type: 'string' }
},
inStock: { type: 'boolean' },
rating: { type: 'number' }
},
required: ['productName', 'price']
};
const result = await app.scrapeUrl('https://example.com/product/laptop', {
formats: ['extract'],
extract: {
schema: schema,
systemPrompt: 'Extract product information from this e-commerce page',
prompt: 'Extract all product details including name, price, features, and availability'
}
});
const product = result.extract;
console.log(`Product: ${product.productName}`);
console.log(`Price: ${product.price} ${product.currency}`);
console.log(`In Stock: ${product.inStock}`);
Batch Extraction
Extract data from multiple pages efficiently:
Python:
def scrape_multiple_products(urls):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
    schema = {
        'type': 'object',
        'properties': {
            'productName': {'type': 'string'},
            'price': {'type': 'number'}
        }
    }
    products = []
    # Requests run sequentially here; see the JavaScript version below
    # for a concurrent approach using Promise.all
    for url in urls:
        result = app.scrape_url(
            url,
            params={
                'formats': ['extract'],
                'extract': {'schema': schema}
            }
        )
        products.append(result['extract'])
    return products
# Usage
urls = [
    'https://example.com/product1',
    'https://example.com/product2',
    'https://example.com/product3'
]
products = scrape_multiple_products(urls)
for product in products:
    print(f"{product['productName']}: ${product['price']}")
JavaScript:
async function scrapeMultipleProducts(urls) {
const schema = {
type: 'object',
properties: {
productName: { type: 'string' },
price: { type: 'number' }
}
};
const promises = urls.map(url =>
app.scrapeUrl(url, {
formats: ['extract'],
extract: { schema }
})
);
const results = await Promise.all(promises);
return results.map(r => r.extract);
}
// Usage
const urls = [
'https://example.com/product1',
'https://example.com/product2',
'https://example.com/product3'
];
const products = await scrapeMultipleProducts(urls);
products.forEach(product => {
console.log(`${product.productName}: $${product.price}`);
});
Handling Authentication
For pages that require login or authentication, you can pass custom headers and cookies:
Python:
result = app.scrape_url(
'https://example.com/protected',
params={
'formats': ['markdown'],
'headers': {
'Authorization': 'Bearer your_token_here',
'Cookie': 'session_id=abc123; user_pref=xyz'
}
}
)
JavaScript:
const result = await app.scrapeUrl('https://example.com/protected', {
formats: ['markdown'],
headers: {
'Authorization': 'Bearer your_token_here',
'Cookie': 'session_id=abc123; user_pref=xyz'
}
});
This technique is similar to how you handle authentication in Puppeteer, but Firecrawl manages the browser session automatically.
Working with Screenshots
Firecrawl can capture screenshots of pages, which is useful for visual verification or archiving:
Python:
import base64
result = app.scrape_url(
'https://example.com',
params={
'formats': ['markdown', 'screenshot']
}
)
# Save the screenshot
if 'screenshot' in result:
    screenshot_data = base64.b64decode(result['screenshot'])
    with open('page_screenshot.png', 'wb') as f:
        f.write(screenshot_data)
    print("Screenshot saved!")
JavaScript:
import fs from 'fs';
const result = await app.scrapeUrl('https://example.com', {
formats: ['markdown', 'screenshot']
});
// Save the screenshot
if (result.screenshot) {
const buffer = Buffer.from(result.screenshot, 'base64');
fs.writeFileSync('page_screenshot.png', buffer);
console.log("Screenshot saved!");
}
Error Handling and Best Practices
Implementing Retry Logic
Python:
import time
def scrape_with_retry(url, max_retries=3):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'formats': ['markdown'],
                    'timeout': 30000  # 30 seconds
                }
            )
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
# Usage
try:
    data = scrape_with_retry('https://example.com')
    print(data['markdown'])
except Exception as e:
    print(f"Failed after all retries: {e}")
JavaScript:
async function scrapeWithRetry(url, maxRetries = 3) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const result = await app.scrapeUrl(url, {
formats: ['markdown'],
timeout: 30000
});
return result;
} catch (error) {
console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
if (attempt < maxRetries - 1) {
await new Promise(resolve =>
setTimeout(resolve, Math.pow(2, attempt) * 1000)
);
} else {
throw error;
}
}
}
}
// Usage
try {
const data = await scrapeWithRetry('https://example.com');
console.log(data.markdown);
} catch (error) {
console.log(`Failed after all retries: ${error.message}`);
}
Best Practices for Beginners
- Start Small - Test with a single page before crawling entire websites
- Use Environment Variables - Never hardcode API keys in your source code
- Implement Rate Limiting - Respect API limits and avoid throttling (a simple sketch follows this list)
- Handle Errors Gracefully - Always use try-catch blocks and implement retries
- Cache Results - Store scraped data to avoid redundant API calls
- Monitor Usage - Track your credit consumption through the dashboard
- Test Selectors - Use includeTags and excludeTags to improve accuracy
- Set Appropriate Timeouts - Adjust based on page complexity
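For the rate-limiting practice above, a simple client-side delay between requests is often enough. The sketch below is one possible approach, not an official Firecrawl utility; the six-second delay is an assumption you should tune to your plan's limits:
Python:
# a minimal sketch of client-side rate limiting (the delay value is an assumption)
import time

def scrape_politely(app, urls, delay_seconds=6):
    results = []
    for url in urls:
        results.append(app.scrape_url(url, params={'formats': ['markdown']}))
        time.sleep(delay_seconds)  # fixed pause between requests
    return results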
Saving and Exporting Data
Saving to JSON
Python:
import json
result = app.scrape_url('https://example.com')
# Save as JSON
with open('scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
print("Data saved to scraped_data.json")
JavaScript:
import fs from 'fs';
const result = await app.scrapeUrl('https://example.com');
// Save as JSON
fs.writeFileSync(
'scraped_data.json',
JSON.stringify(result, null, 2)
);
console.log("Data saved to scraped_data.json");
Saving to CSV
Python:
import csv
# Assuming you've extracted structured data
products = [
{'name': 'Product 1', 'price': 29.99},
{'name': 'Product 2', 'price': 39.99}
]
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(products)
JavaScript:
import fs from 'fs';
import { stringify } from 'csv-stringify/sync';
const products = [
{ name: 'Product 1', price: 29.99 },
{ name: 'Product 2', price: 39.99 }
];
const csv = stringify(products, { header: true });
fs.writeFileSync('products.csv', csv);
Common Use Cases
1. Blog Content Extraction
Perfect for content aggregation or analysis:
result = app.scrape_url(
'https://example.com/blog/article',
params={
'formats': ['markdown'],
'onlyMainContent': True,
'includeTags': ['article', '.post-content']
}
)
2. E-commerce Price Monitoring
Extract product prices for comparison:
schema = {
'type': 'object',
'properties': {
'productName': {'type': 'string'},
'currentPrice': {'type': 'number'},
'originalPrice': {'type': 'number'},
'inStock': {'type': 'boolean'}
}
}
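As a sketch of how this schema might be applied (the product URL and the discount check are illustrative additions, not part of the schema), you can pass it to the extract format exactly as in the structured-extraction section:
Python:
# a minimal sketch: run the price-monitoring schema against a single product page
result = app.scrape_url(
    'https://example.com/product/laptop',
    params={'formats': ['extract'], 'extract': {'schema': schema}}
)

item = result['extract']
current = item.get('currentPrice')
original = item.get('originalPrice')
if current is not None and original is not None and current < original:
    print(f"{item.get('productName')} is on sale: {current} (was {original})")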
3. Job Listing Aggregation
Collect job postings from multiple sites:
schema = {
'type': 'object',
'properties': {
'jobTitle': {'type': 'string'},
'company': {'type': 'string'},
'location': {'type': 'string'},
'salary': {'type': 'string'},
'description': {'type': 'string'}
}
}
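To aggregate listings across a careers section, this schema can be combined with the crawl feature shown earlier. The sketch below assumes the /jobs/* path pattern exists on the target site and that the extract format is accepted inside crawl's scrapeOptions, as the scrape examples above suggest:
Python:
# a minimal sketch: crawl a careers section and extract each listing
# (the path pattern and page limit are assumptions)
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 20,
        'includePaths': ['/jobs/*'],
        'scrapeOptions': {
            'formats': ['extract'],
            'extract': {'schema': schema}
        }
    }
)

for page in crawl_result['data']:
    job = page.get('extract', {})
    print(job.get('jobTitle'), '-', job.get('company'), '-', job.get('location'))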
Next Steps
Now that you understand the basics of Firecrawl, you can:
- Learn how to use Firecrawl with Python for more Python-specific examples
- Learn how to use Firecrawl with Node.js for advanced JavaScript patterns
- Explore handling JavaScript-rendered websites for complex scraping scenarios
- Review the official Firecrawl documentation for advanced features
- Join the Firecrawl community for support and best practices
Conclusion
Firecrawl simplifies web scraping by handling the complex infrastructure, browser automation, and anti-bot measures automatically. This tutorial covered the essential concepts: basic scraping, crawling multiple pages, extracting structured data, and implementing best practices.
As a beginner, start with simple single-page scraping, gradually move to crawling, and then experiment with structured data extraction. With Firecrawl's managed approach, you can focus on extracting value from data rather than maintaining scraping infrastructure.
Remember to always respect websites' terms of service, implement proper error handling, and monitor your API usage to build reliable, sustainable scraping solutions.