Can Firecrawl Extract Images from Web Pages?

Yes, Firecrawl can extract images from web pages, though its image extraction works differently from that of traditional web scraping tools. Firecrawl is designed to convert web pages into clean, LLM-ready formats (primarily Markdown), and it includes image references in that output. Understanding how Firecrawl handles images is essential for developers building web scraping pipelines that need to capture visual content.

How Firecrawl Handles Image Extraction

Firecrawl processes web pages and converts them to Markdown, representing each image with standard Markdown image syntax. When Firecrawl scrapes a page, it extracts image URLs and preserves them in the output along with any available alt text.
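
For example, an HTML image tag such as:

<img src="https://example.com/images/product.jpg" alt="Product Image">

appears in the Markdown output as:

![Product Image](https://example.com/images/product.jpg)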

Basic Image Extraction with Firecrawl

Here's how to extract images using Firecrawl's Python SDK:

from firecrawl import FirecrawlApp

# Initialize Firecrawl
app = FirecrawlApp(api_key='your_api_key')

# Scrape a page
result = app.scrape_url('https://example.com/gallery')

# The markdown content includes image references
print(result['markdown'])

# Page metadata may also include image references (e.g., an og:image URL)
if 'metadata' in result:
    print(result['metadata'])

The Markdown output will contain image references like this:

![Product Image](https://example.com/images/product.jpg)
![Gallery Photo](https://example.com/images/gallery-1.jpg)

JavaScript/Node.js Implementation

Here's how to extract images using Firecrawl's JavaScript SDK:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractImages(url) {
  try {
    const result = await app.scrapeUrl(url, {
      formats: ['markdown', 'html']
    });

    // Extract images from markdown
    const imageRegex = /!\[([^\]]*)\]\(([^\)]+)\)/g;
    const images = [];
    let match;

    while ((match = imageRegex.exec(result.markdown)) !== null) {
      images.push({
        alt: match[1],
        url: match[2]
      });
    }

    console.log('Extracted images:', images);
    return images;
  } catch (error) {
    console.error('Error extracting images:', error);
    return [];  // Return an empty array so callers always get a list
  }
}

extractImages('https://example.com/products');

Advanced Image Extraction Techniques

Extracting Image Metadata

To get more detailed information about images, you can combine Firecrawl's HTML output with custom parsing:

import re
from firecrawl import FirecrawlApp
from bs4 import BeautifulSoup  # requires beautifulsoup4

app = FirecrawlApp(api_key='your_api_key')

def extract_detailed_images(url):
    # Request both markdown and HTML formats
    result = app.scrape_url(url, params={
        'formats': ['markdown', 'html']
    })

    images = []

    # Parse markdown for image references
    markdown = result.get('markdown', '')
    image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'

    for match in re.finditer(image_pattern, markdown):
        images.append({
            'url': match.group(2),
            'alt_text': match.group(1),
            'type': 'content_image'
        })

    # Enrich each image with attributes from the HTML output (width,
    # height, title), matching img tags to Markdown references by src
    soup = BeautifulSoup(result.get('html', ''), 'html.parser')
    attrs_by_src = {img.get('src'): img.attrs for img in soup.find_all('img')}

    for image in images:
        attrs = attrs_by_src.get(image['url'], {})
        image['width'] = attrs.get('width')
        image['height'] = attrs.get('height')
        image['title'] = attrs.get('title')

    return images

# Example usage
images = extract_detailed_images('https://example.com/blog/post')
for img in images:
    print(f"Image URL: {img['url']}")
    print(f"Alt Text: {img['alt_text']}\n")

Crawling Multiple Pages for Images

When you need to extract images from multiple pages, use Firecrawl's crawl functionality:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function crawlAndExtractImages(baseUrl) {
  const crawlResult = await app.crawlUrl(baseUrl, {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  const allImages = [];

  for (const page of crawlResult.data) {
    const imageRegex = /!\[([^\]]*)\]\(([^\)]+)\)/g;
    let match;

    while ((match = imageRegex.exec(page.markdown)) !== null) {
      allImages.push({
        pageUrl: page.metadata.sourceURL,
        imageUrl: match[2],
        altText: match[1]
      });
    }
  }

  return allImages;
}

// Crawl a website for images
crawlAndExtractImages('https://example.com')
  .then(images => {
    console.log(`Found ${images.length} images across all pages`);
    console.log(images);
  });
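
A crawl will often surface the same image on many pages (logos, icons, shared banners). Here is a minimal deduplication sketch in Python, assuming each extracted image is a dict with a 'url' key:

def dedupe_images(images):
    # Keep only the first occurrence of each image URL, preserving order
    seen = set()
    unique = []
    for img in images:
        if img['url'] not in seen:
            seen.add(img['url'])
            unique.append(img)
    return unique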

Filtering and Processing Images

Filter Images by Type or URL Pattern

from firecrawl import FirecrawlApp
from urllib.parse import urlparse
import re

app = FirecrawlApp(api_key='your_api_key')

def extract_filtered_images(url, filter_extensions=('.jpg', '.png', '.webp')):
    result = app.scrape_url(url, params={'formats': ['markdown']})
    markdown = result.get('markdown', '')

    image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'
    filtered_images = []

    for match in re.finditer(image_pattern, markdown):
        image_url = match.group(2)

        # Compare against the URL path so query strings don't break matching
        path = urlparse(image_url).path.lower()

        # Filter by extension
        if any(path.endswith(ext) for ext in filter_extensions):
            filtered_images.append({
                'url': image_url,
                'alt': match.group(1),
                'extension': path.rsplit('.', 1)[-1]
            })

    return filtered_images

# Extract only JPG and PNG images
images = extract_filtered_images('https://example.com', ['.jpg', '.png'])
print(f"Found {len(images)} JPG/PNG images")

Downloading Extracted Images

Once you've extracted image URLs, you can download them:

import os
import re
from urllib.parse import urljoin, urlparse

import requests
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def download_images_from_page(url, download_dir='images'):
    # Create download directory
    os.makedirs(download_dir, exist_ok=True)

    # Scrape page
    result = app.scrape_url(url, params={'formats': ['markdown']})
    markdown = result.get('markdown', '')

    # Extract image URLs
    image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'

    for idx, match in enumerate(re.finditer(image_pattern, markdown)):
        # Resolve relative URLs against the page URL; absolute URLs
        # pass through urljoin unchanged
        image_url = urljoin(url, match.group(2))

        try:
            # Download image
            response = requests.get(image_url, timeout=10)
            response.raise_for_status()

            # Derive the extension from the URL path (ignores query strings)
            ext = os.path.splitext(urlparse(image_url).path)[1] or '.bin'
            filename = f"image_{idx}{ext}"
            filepath = os.path.join(download_dir, filename)

            # Save image
            with open(filepath, 'wb') as f:
                f.write(response.content)

            print(f"Downloaded: {filename}")
        except requests.RequestException as e:
            print(f"Failed to download {image_url}: {e}")

# Download all images from a page
download_images_from_page('https://example.com/gallery')

Handling JavaScript-Rendered Images

Firecrawl renders pages in a headless browser, which means it can extract images that are loaded dynamically via JavaScript. This is a significant advantage over simple HTML parsers, which only see the initial markup.

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractDynamicImages(url) {
  // Firecrawl waits for JavaScript to load
  const result = await app.scrapeUrl(url, {
    formats: ['markdown'],
    waitFor: 2000  // Wait additional time for lazy-loaded images
  });

  const imageRegex = /!\[([^\]]*)\]\(([^\)]+)\)/g;
  const images = [];
  let match;

  while ((match = imageRegex.exec(result.markdown)) !== null) {
    images.push({
      url: match[2],
      alt: match[1]
    });
  }

  return images;
}

// Extract images from a dynamic page
extractDynamicImages('https://example.com/spa-gallery')
  .then(images => console.log('Dynamic images:', images));

Extracting Images from Specific Sections

You can use Firecrawl's LLM extraction features to target specific image content:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Use LLM extraction to get structured image data
result = app.scrape_url('https://example.com/products', params={
    'formats': ['markdown', 'extract'],
    'extract': {
        'schema': {
            'type': 'object',
            'properties': {
                'product_images': {
                    'type': 'array',
                    'items': {
                        'type': 'object',
                        'properties': {
                            'url': {'type': 'string'},
                            'caption': {'type': 'string'},
                            'is_primary': {'type': 'boolean'}
                        }
                    }
                }
            }
        }
    }
})

if 'extract' in result:
    product_images = result['extract'].get('product_images', [])
    for img in product_images:
        # Use .get() since the LLM may omit optional fields
        print(f"Product Image: {img.get('url')}")
        print(f"Caption: {img.get('caption')}")
        print(f"Primary: {img.get('is_primary')}\n")

Comparing Firecrawl to Traditional Image Extraction

Compared to traditional web scrapers that rely on CSS selectors or XPath, Firecrawl's approach has several advantages:

  1. JavaScript Support: Automatically handles dynamically loaded images
  2. Clean Output: Provides images in a structured Markdown format
  3. Alt Text Preservation: Maintains accessibility information
  4. LLM Integration: Easy to feed extracted data into AI models

However, for scenarios that require precise DOM manipulation or interaction with specific elements, traditional tools may offer more control.
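
For contrast, here is a minimal sketch of the selector-based approach using the requests and BeautifulSoup packages. It only sees the initial HTML, so it will miss images injected by JavaScript:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_images_with_selectors(url):
    # Fetch the raw HTML; no JavaScript is executed
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')

    # Select every img tag that has a src attribute
    return [
        {'url': urljoin(url, img['src']), 'alt': img.get('alt', '')}
        for img in soup.select('img[src]')
    ]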

Best Practices for Image Extraction

1. Handle Relative URLs

from urllib.parse import urljoin

def normalize_image_url(image_url, base_url):
    """Convert relative image URLs to absolute URLs"""
    # urljoin leaves absolute URLs unchanged, so no special-casing is needed
    return urljoin(base_url, image_url)
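
For example, given an image reference found on https://example.com/blog/post:

normalize_image_url('/images/hero.png', 'https://example.com/blog/post')
# -> 'https://example.com/images/hero.png'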

2. Implement Rate Limiting

async function extractImagesWithRateLimit(urls, delayMs = 1000) {
  const allImages = [];

  for (const url of urls) {
    const images = await extractImages(url);
    allImages.push(...images);

    // Wait before next request
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  return allImages;
}

3. Error Handling

import re
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def safe_extract_images(url):
    try:
        result = app.scrape_url(url, params={'formats': ['markdown']})
        markdown = result.get('markdown', '')
        image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'
        return [
            {'alt': m.group(1), 'url': m.group(2)}
            for m in re.finditer(image_pattern, markdown)
        ]
    except Exception as e:
        print(f"Error extracting images from {url}: {e}")
        return []
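
Transient failures (timeouts, rate limits) are common at scale, so a simple retry with exponential backoff is worth adding. A minimal sketch built on safe_extract_images above; note that an empty result may also mean the page simply has no images:

import time

def extract_with_retries(url, attempts=3, base_delay=2):
    for attempt in range(attempts):
        images = safe_extract_images(url)
        if images:
            return images
        # Back off before retrying: 2s, 4s, 8s, ...
        time.sleep(base_delay * (2 ** attempt))
    return []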

Conclusion

Firecrawl can effectively extract images from web pages by converting HTML to Markdown and preserving image references with their URLs and alt text. While it doesn't provide pixel-level image analysis, it excels at capturing image metadata and URLs from both static and JavaScript-rendered pages. For developers building web scraping pipelines, Firecrawl offers a clean, LLM-friendly approach to image extraction that integrates well with modern AI workflows.

For more complex scenarios involving dynamic content, consider exploring how to handle AJAX requests in browser automation to ensure all images are fully loaded before extraction.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
