Firecrawl Tutorial for Beginners

Firecrawl is a powerful web scraping and crawling API that simplifies the process of extracting data from websites. Unlike traditional scraping tools that require complex setup and maintenance, Firecrawl handles JavaScript rendering, anti-bot measures, and infrastructure management automatically. This tutorial will guide you through everything you need to know to start using Firecrawl effectively.

What is Firecrawl?

Firecrawl is a managed web scraping service that converts websites into clean, structured data formats. It's designed specifically for developers who need reliable data extraction without the hassle of managing proxies, headless browsers, or constantly updating selectors.

Key Features

  • JavaScript Rendering - Automatically executes JavaScript to capture dynamic content
  • Clean Markdown Output - Converts HTML into LLM-ready markdown format
  • Intelligent Crawling - Discovers and navigates through website pages automatically
  • Structured Data Extraction - Uses AI to extract specific fields based on your schema
  • Built-in Anti-Bot Bypass - Handles CAPTCHAs and bot detection mechanisms
  • No Infrastructure Required - Fully managed service, no servers to maintain

Getting Started with Firecrawl

Step 1: Create an Account

  1. Visit firecrawl.dev and sign up for an account
  2. Navigate to your dashboard
  3. Generate and copy your API key from the API Keys section
  4. Start with the free tier to test the service (typically includes 500 credits)

Step 2: Choose Your Language

Firecrawl provides official SDKs for both Python and Node.js. This tutorial covers both languages so you can choose based on your preference.

Installing Firecrawl

Python Installation

# Using pip
pip install firecrawl-py

# Using Poetry
poetry add firecrawl-py

# Using Pipenv
pipenv install firecrawl-py

Node.js Installation

# Using npm
npm install @mendable/firecrawl-js

# Using yarn
yarn add @mendable/firecrawl-js

Your First Scraping Project

Setting Up Your API Key

Always store your API key securely using environment variables:

Python (set the variable in your shell before running your script):

export FIRECRAWL_API_KEY='your_api_key_here'

Node.js (.env file):

FIRECRAWL_API_KEY=your_api_key_here
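
If you prefer to keep the key in a .env file on the Python side as well, python-dotenv (installed separately with pip install python-dotenv) plays the same role the dotenv package does in the Node.js examples. A minimal sketch:

from dotenv import load_dotenv
from firecrawl import FirecrawlApp
import os

load_dotenv()  # Reads FIRECRAWL_API_KEY from a local .env file into the environment

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))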

Basic Scraping Example

Let's start with the simplest use case: scraping a single web page.

Python:

from firecrawl import FirecrawlApp
import os

# Initialize the client
app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Scrape a single page
result = app.scrape_url('https://example.com')

# Access the content
print("Markdown Content:")
print(result['markdown'])

print("\nMetadata:")
print(result['metadata'])

JavaScript:

import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';

dotenv.config();

// Initialize the client
const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Scrape a single page
async function scrapePage() {
  const result = await app.scrapeUrl('https://example.com', {
    formats: ['markdown']
  });

  console.log("Markdown Content:");
  console.log(result.markdown);

  console.log("\nMetadata:");
  console.log(result.metadata);
}

scrapePage();

Understanding the Response

The scraping response includes several useful fields:

  • markdown - Clean, formatted text content
  • html - Original HTML (if requested)
  • metadata - Page title, description, language, and more
  • links - All links found on the page (if requested)
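
The html and links fields are only returned when you ask for them. For example, to request them alongside markdown (a sketch in the same style as the example above):

result = app.scrape_url(
    'https://example.com',
    params={
        'formats': ['markdown', 'html', 'links']
    }
)

print(result['metadata']['title'])  # Page title from the metadata block
print(len(result['links']))         # Number of links discovered on the page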

Customizing Scraping Options

Firecrawl offers numerous options to customize how pages are scraped. This is particularly useful for complex layouts and dynamically loaded content.

Focusing on Main Content

Extract only the main article content, excluding navigation, ads, and footers:

Python:

result = app.scrape_url(
    'https://example.com/article',
    params={
        'formats': ['markdown'],
        'onlyMainContent': True,
        'includeTags': ['article', 'main', '.content'],
        'excludeTags': ['nav', 'footer', '.ads', '.sidebar']
    }
)

JavaScript:

const result = await app.scrapeUrl('https://example.com/article', {
  formats: ['markdown'],
  onlyMainContent: true,
  includeTags: ['article', 'main', '.content'],
  excludeTags: ['nav', 'footer', '.ads', '.sidebar']
});

Waiting for JavaScript

Some websites load content dynamically. Use the waitFor parameter to ensure all content is loaded:

Python:

result = app.scrape_url(
    'https://example.com/dynamic-page',
    params={
        'formats': ['markdown'],
        'waitFor': 3000  # Wait 3 seconds for JavaScript to execute
    }
)

JavaScript:

const result = await app.scrapeUrl('https://example.com/dynamic-page', {
  formats: ['markdown'],
  waitFor: 3000  // Wait 3 seconds
});

This is similar to using Puppeteer's wait helpers (such as waitForSelector or waitForTimeout), except that the browser and the waiting are managed entirely by Firecrawl.

Crawling Multiple Pages

One of Firecrawl's most powerful features is its ability to automatically discover and scrape multiple pages from a website.

Basic Crawling

Python:

from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Crawl up to 50 pages
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 50,
        'scrapeOptions': {
            'formats': ['markdown'],
            'onlyMainContent': True
        }
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process all pages
for page in crawl_result['data']:
    url = page['metadata']['sourceURL']
    content = page['markdown']
    print(f"Scraped: {url}")
    print(f"Content preview: {content[:100]}...\n")

JavaScript:

async function crawlWebsite() {
  const crawlResult = await app.crawlUrl('https://example.com', {
    limit: 50,
    scrapeOptions: {
      formats: ['markdown'],
      onlyMainContent: true
    }
  });

  console.log(`Successfully crawled ${crawlResult.data.length} pages`);

  crawlResult.data.forEach(page => {
    console.log(`URL: ${page.metadata.sourceURL}`);
    console.log(`Content: ${page.markdown.substring(0, 100)}...\n`);
  });
}

crawlWebsite();
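
A common next step is persisting each crawled page to disk. Here is a minimal Python sketch that reuses the crawl_result structure from the Python example above; filenames are simply derived from each page's source URL:

import re
from pathlib import Path

def save_crawl_to_markdown(crawl_result, out_dir='crawl_output'):
    Path(out_dir).mkdir(exist_ok=True)
    for page in crawl_result['data']:
        url = page['metadata']['sourceURL']
        # Turn the URL into a safe filename, e.g. example-com-blog-my-post.md
        name = re.sub(r'[^a-zA-Z0-9]+', '-', url.split('://', 1)[-1]).strip('-')
        Path(out_dir, f'{name}.md').write_text(page['markdown'], encoding='utf-8')

# Usage, after the crawl above has finished:
# save_crawl_to_markdown(crawl_result)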

Advanced Crawling with Filters

Control which pages to crawl using path patterns:

Python:

crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 100,
        'maxDepth': 3,  # Maximum depth from starting URL
        'includePaths': ['/blog/*', '/articles/*', '/docs/*'],
        'excludePaths': ['/admin/*', '/login', '/signup'],
        'allowBackwardLinks': False,  # Don't follow links back to pages outside the starting path
        'allowExternalLinks': False,  # Stay on the same domain
        'scrapeOptions': {
            'formats': ['markdown'],
            'waitFor': 1000
        }
    }
)

JavaScript:

const crawlResult = await app.crawlUrl('https://example.com', {
  limit: 100,
  maxDepth: 3,
  includePaths: ['/blog/*', '/articles/*', '/docs/*'],
  excludePaths: ['/admin/*', '/login', '/signup'],
  allowBackwardLinks: false,
  allowExternalLinks: false,
  scrapeOptions: {
    formats: ['markdown'],
    waitFor: 1000
  }
});

Extracting Structured Data

Firecrawl can extract specific data fields using AI-powered extraction. This is incredibly useful for scraping product information, job listings, or any structured content.

Defining a Schema

Python:

from firecrawl import FirecrawlApp
import os

app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Define the data structure you want to extract
schema = {
    'type': 'object',
    'properties': {
        'productName': {'type': 'string'},
        'price': {'type': 'number'},
        'currency': {'type': 'string'},
        'description': {'type': 'string'},
        'features': {
            'type': 'array',
            'items': {'type': 'string'}
        },
        'inStock': {'type': 'boolean'},
        'rating': {'type': 'number'}
    },
    'required': ['productName', 'price']
}

# Extract structured data
result = app.scrape_url(
    'https://example.com/product/laptop',
    params={
        'formats': ['extract'],
        'extract': {
            'schema': schema,
            'systemPrompt': 'Extract product information from this e-commerce page',
            'prompt': 'Extract all product details including name, price, features, and availability'
        }
    }
)

# Access the extracted data
product = result['extract']
print(f"Product: {product['productName']}")
print(f"Price: {product['price']} {product['currency']}")
print(f"In Stock: {product['inStock']}")
print(f"Features: {product['features']}")

JavaScript:

const schema = {
  type: 'object',
  properties: {
    productName: { type: 'string' },
    price: { type: 'number' },
    currency: { type: 'string' },
    description: { type: 'string' },
    features: {
      type: 'array',
      items: { type: 'string' }
    },
    inStock: { type: 'boolean' },
    rating: { type: 'number' }
  },
  required: ['productName', 'price']
};

const result = await app.scrapeUrl('https://example.com/product/laptop', {
  formats: ['extract'],
  extract: {
    schema: schema,
    systemPrompt: 'Extract product information from this e-commerce page',
    prompt: 'Extract all product details including name, price, features, and availability'
  }
});

const product = result.extract;
console.log(`Product: ${product.productName}`);
console.log(`Price: ${product.price} ${product.currency}`);
console.log(`In Stock: ${product.inStock}`);
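
Note that only productName and price are marked as required, so the other fields may be missing from the extracted object when the page does not contain them. A small defensive variant of the Python access code above:

product = result['extract']

print(f"Product: {product['productName']}")
print(f"Price: {product['price']} {product.get('currency', '')}")
# Optional fields: fall back to safe defaults when the page omits them
print(f"In Stock: {product.get('inStock', 'unknown')}")
print(f"Features: {', '.join(product.get('features', [])) or 'none listed'}")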

Batch Extraction

Extract data from multiple pages efficiently:

Python:

def scrape_multiple_products(urls):
    # The SDK calls are synchronous, so a plain loop is all that's needed here
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

    schema = {
        'type': 'object',
        'properties': {
            'productName': {'type': 'string'},
            'price': {'type': 'number'}
        }
    }

    products = []
    for url in urls:
        result = app.scrape_url(
            url,
            params={
                'formats': ['extract'],
                'extract': {'schema': schema}
            }
        )
        products.append(result['extract'])

    return products

# Usage
urls = [
    'https://example.com/product1',
    'https://example.com/product2',
    'https://example.com/product3'
]

products = scrape_multiple_products(urls)
for product in products:
    print(f"{product['productName']}: ${product['price']}")

JavaScript:

async function scrapeMultipleProducts(urls) {
  const schema = {
    type: 'object',
    properties: {
      productName: { type: 'string' },
      price: { type: 'number' }
    }
  };

  const promises = urls.map(url =>
    app.scrapeUrl(url, {
      formats: ['extract'],
      extract: { schema }
    })
  );

  const results = await Promise.all(promises);
  return results.map(r => r.extract);
}

// Usage
const urls = [
  'https://example.com/product1',
  'https://example.com/product2',
  'https://example.com/product3'
];

const products = await scrapeMultipleProducts(urls);
products.forEach(product => {
  console.log(`${product.productName}: $${product.price}`);
});
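
For larger URL lists, a small thread pool keeps several requests in flight on the Python side, similar to the Promise.all pattern above. This is a sketch using the same schema as before; it assumes FirecrawlApp and os are imported as in the Python example, and max_workers should be tuned to your plan's rate limits:

from concurrent.futures import ThreadPoolExecutor

def scrape_products_concurrently(urls, max_workers=3):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))
    schema = {
        'type': 'object',
        'properties': {
            'productName': {'type': 'string'},
            'price': {'type': 'number'}
        }
    }

    def extract_one(url):
        result = app.scrape_url(
            url,
            params={'formats': ['extract'], 'extract': {'schema': schema}}
        )
        return result['extract']

    # Run up to max_workers scrapes at the same time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, urls))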

Handling Authentication

For pages that require login or authentication, you can pass custom headers and cookies:

Python:

result = app.scrape_url(
    'https://example.com/protected',
    params={
        'formats': ['markdown'],
        'headers': {
            'Authorization': 'Bearer your_token_here',
            'Cookie': 'session_id=abc123; user_pref=xyz'
        }
    }
)

JavaScript:

const result = await app.scrapeUrl('https://example.com/protected', {
  formats: ['markdown'],
  headers: {
    'Authorization': 'Bearer your_token_here',
    'Cookie': 'session_id=abc123; user_pref=xyz'
  }
});

This technique is similar to how you handle authentication in Puppeteer, but Firecrawl manages the browser session automatically.

Working with Screenshots

Firecrawl can capture screenshots of pages, which is useful for visual verification or archiving:

Python:

import base64

result = app.scrape_url(
    'https://example.com',
    params={
        'formats': ['markdown', 'screenshot']
    }
)

# Save the screenshot
if 'screenshot' in result:
    screenshot_data = base64.b64decode(result['screenshot'])
    with open('page_screenshot.png', 'wb') as f:
        f.write(screenshot_data)
    print("Screenshot saved!")

JavaScript:

import fs from 'fs';

const result = await app.scrapeUrl('https://example.com', {
  formats: ['markdown', 'screenshot']
});

// Save the screenshot
if (result.screenshot) {
  const buffer = Buffer.from(result.screenshot, 'base64');
  fs.writeFileSync('page_screenshot.png', buffer);
  console.log("Screenshot saved!");
}

Error Handling and Best Practices

Implementing Retry Logic

Python:

import time

def scrape_with_retry(url, max_retries=3):
    app = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'formats': ['markdown'],
                    'timeout': 30000  # 30 seconds
                }
            )
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    print(data['markdown'])
except Exception as e:
    print(f"Failed after all retries: {e}")

JavaScript:

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await app.scrapeUrl(url, {
        formats: ['markdown'],
        timeout: 30000
      });
      return result;
    } catch (error) {
      console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
      if (attempt < maxRetries - 1) {
        await new Promise(resolve =>
          setTimeout(resolve, Math.pow(2, attempt) * 1000)
        );
      } else {
        throw error;
      }
    }
  }
}

// Usage
try {
  const data = await scrapeWithRetry('https://example.com');
  console.log(data.markdown);
} catch (error) {
  console.log(`Failed after all retries: ${error.message}`);
}

Best Practices for Beginners

  1. Start Small - Test with a single page before crawling entire websites
  2. Use Environment Variables - Never hardcode API keys in your source code
  3. Implement Rate Limiting - Respect API rate limits to avoid being throttled
  4. Handle Errors Gracefully - Always use try-catch blocks and implement retries
  5. Cache Results - Store scraped data to avoid redundant API calls (see the sketch after this list)
  6. Monitor Usage - Track your credit consumption through the dashboard
  7. Test Selectors - Use includeTags and excludeTags to improve accuracy
  8. Set Appropriate Timeouts - Adjust based on page complexity
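
As an illustration of point 5 (caching also cuts the redundant calls that eat into your rate limits), here is a minimal file-based cache for single-page scrapes. It assumes the result is a plain dictionary, as in the examples above, so it can be serialized with json:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path('.firecrawl_cache')
CACHE_DIR.mkdir(exist_ok=True)

def cached_scrape(app, url):
    # Reuse a previous result for this URL if we already have one on disk
    key = hashlib.sha256(url.encode()).hexdigest()
    cache_file = CACHE_DIR / f'{key}.json'
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding='utf-8'))

    result = app.scrape_url(url, params={'formats': ['markdown']})
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding='utf-8')
    return result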

Saving and Exporting Data

Saving to JSON

Python:

import json

result = app.scrape_url('https://example.com')

# Save as JSON
with open('scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

print("Data saved to scraped_data.json")

JavaScript:

import fs from 'fs';

const result = await app.scrapeUrl('https://example.com');

// Save as JSON
fs.writeFileSync(
  'scraped_data.json',
  JSON.stringify(result, null, 2)
);

console.log("Data saved to scraped_data.json");

Saving to CSV

Python:

import csv

# Assuming you've extracted structured data
products = [
    {'name': 'Product 1', 'price': 29.99},
    {'name': 'Product 2', 'price': 39.99}
]

with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(products)

JavaScript:

import fs from 'fs';
import { stringify } from 'csv-stringify/sync';

const products = [
  { name: 'Product 1', price: 29.99 },
  { name: 'Product 2', price: 39.99 }
];

const csv = stringify(products, { header: true });
fs.writeFileSync('products.csv', csv);

Common Use Cases

1. Blog Content Extraction

Perfect for content aggregation or analysis:

result = app.scrape_url(
    'https://example.com/blog/article',
    params={
        'formats': ['markdown'],
        'onlyMainContent': True,
        'includeTags': ['article', '.post-content']
    }
)

2. E-commerce Price Monitoring

Extract product prices for comparison:

schema = {
    'type': 'object',
    'properties': {
        'productName': {'type': 'string'},
        'currentPrice': {'type': 'number'},
        'originalPrice': {'type': 'number'},
        'inStock': {'type': 'boolean'}
    }
}
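
To put this schema to work, pass it to scrape_url exactly as in the extraction example earlier and compare the two prices. A sketch with placeholder product URLs, assuming app is initialized as in the earlier examples:

urls = [
    'https://example.com/product/laptop',
    'https://example.com/product/phone'
]

for url in urls:
    result = app.scrape_url(
        url,
        params={'formats': ['extract'], 'extract': {'schema': schema}}
    )
    item = result['extract']
    # None of the fields are marked required, so read them defensively
    print(f"{item.get('productName')}: {item.get('currentPrice')} "
          f"(was {item.get('originalPrice', 'n/a')}, in stock: {item.get('inStock', 'unknown')})")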

3. Job Listing Aggregation

Collect job postings from multiple sites:

schema = {
    'type': 'object',
    'properties': {
        'jobTitle': {'type': 'string'},
        'company': {'type': 'string'},
        'location': {'type': 'string'},
        'salary': {'type': 'string'},
        'description': {'type': 'string'}
    }
}
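
Combining this schema with the looping pattern from Batch Extraction and the CSV export shown earlier gives a minimal aggregator. The listing URLs below are placeholders, and app is assumed to be initialized as before:

import csv

job_urls = [
    'https://example.com/jobs/123',
    'https://example.com/jobs/456'
]

jobs = []
for url in job_urls:
    result = app.scrape_url(
        url,
        params={'formats': ['extract'], 'extract': {'schema': schema}}
    )
    jobs.append(result['extract'])

with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(
        f,
        fieldnames=['jobTitle', 'company', 'location', 'salary', 'description'],
        extrasaction='ignore'
    )
    writer.writeheader()
    writer.writerows(jobs)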

Next Steps

Now that you understand the basics of Firecrawl, you can:

  1. Learn how to use Firecrawl with Python for more Python-specific examples
  2. Learn how to use Firecrawl with Node.js for advanced JavaScript patterns
  3. Explore handling JavaScript-rendered websites for complex scraping scenarios
  4. Review the official Firecrawl documentation for advanced features
  5. Join the Firecrawl community for support and best practices

Conclusion

Firecrawl simplifies web scraping by handling the complex infrastructure, browser automation, and anti-bot measures automatically. This tutorial covered the essential concepts: basic scraping, crawling multiple pages, extracting structured data, and implementing best practices.

As a beginner, start with simple single-page scraping, gradually move to crawling, and then experiment with structured data extraction. With Firecrawl's managed approach, you can focus on extracting value from data rather than maintaining scraping infrastructure.

Remember to always respect websites' terms of service, implement proper error handling, and monitor your API usage to build reliable, sustainable scraping solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
