Can Firecrawl Extract Images from Web Pages?
Yes, Firecrawl can extract images from web pages, though its image extraction works differently from that of traditional web scraping tools. Firecrawl is designed to convert web pages into clean, LLM-ready formats (primarily Markdown), and it includes image references in this output. Understanding how Firecrawl handles images is essential for developers building web scraping pipelines that need to capture visual content.
How Firecrawl Handles Image Extraction
Firecrawl processes web pages and converts them to Markdown format, which includes image references as Markdown image syntax. When Firecrawl scrapes a page, it extracts image URLs and preserves them in the output along with alt text when available.
Basic Image Extraction with Firecrawl
Here's how to extract images using Firecrawl's Python SDK:
```python
from firecrawl import FirecrawlApp

# Initialize Firecrawl
app = FirecrawlApp(api_key='your_api_key')

# Scrape a page
result = app.scrape_url('https://example.com/gallery')

# The markdown content includes image references
print(result['markdown'])

# Access metadata, which may include images
if 'metadata' in result:
    print(result['metadata'])
```
The Markdown output will contain image references in standard Markdown image syntax, for example:

```markdown
![Product photo](https://example.com/images/product.jpg)
![](https://example.com/images/banner.png)
```
JavaScript/Node.js Implementation
Here's how to extract images using Firecrawl's JavaScript SDK:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractImages(url) {
  try {
    const result = await app.scrapeUrl(url, {
      formats: ['markdown', 'html']
    });

    // Extract images from markdown
    const imageRegex = /!\[([^\]]*)\]\(([^\)]+)\)/g;
    const images = [];
    let match;

    while ((match = imageRegex.exec(result.markdown)) !== null) {
      images.push({
        alt: match[1],
        url: match[2]
      });
    }

    console.log('Extracted images:', images);
    return images;
  } catch (error) {
    console.error('Error extracting images:', error);
  }
}

extractImages('https://example.com/products');
```
Advanced Image Extraction Techniques
Extracting Image Metadata
To get more detailed information about images, you can request Firecrawl's HTML output alongside the Markdown and apply custom parsing:
```python
import re
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def extract_detailed_images(url):
    # Request both Markdown and HTML formats
    result = app.scrape_url(url, params={
        'formats': ['markdown', 'html']
    })

    images = []

    # Parse the Markdown for image references
    markdown = result.get('markdown', '')
    image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'

    for match in re.finditer(image_pattern, markdown):
        alt_text = match.group(1)
        image_url = match.group(2)
        images.append({
            'url': image_url,
            'alt_text': alt_text,
            'type': 'content_image'
        })

    return images

# Example usage
images = extract_detailed_images('https://example.com/blog/post')
for img in images:
    print(f"Image URL: {img['url']}")
    print(f"Alt Text: {img['alt_text']}\n")
```
Crawling Multiple Pages for Images
When you need to extract images from multiple pages, use Firecrawl's crawl functionality:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function crawlAndExtractImages(baseUrl) {
  const crawlResult = await app.crawlUrl(baseUrl, {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  const allImages = [];

  for (const page of crawlResult.data) {
    const imageRegex = /!\[([^\]]*)\]\(([^\)]+)\)/g;
    let match;

    while ((match = imageRegex.exec(page.markdown)) !== null) {
      allImages.push({
        pageUrl: page.metadata.sourceURL,
        imageUrl: match[2],
        altText: match[1]
      });
    }
  }

  return allImages;
}

// Crawl a website for images
crawlAndExtractImages('https://example.com')
  .then(images => {
    console.log(`Found ${images.length} images across all pages`);
    console.log(images);
  });
```
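Crawls often surface the same asset (a site logo or icon) on many pages. If you collect the results into a list of records like the one above, a small helper can deduplicate by image URL before further processing. Here is a minimal Python sketch; the `imageUrl` key mirrors the structure built in the JavaScript example and the function name is just illustrative:

```python
def dedupe_images(images):
    """Keep the first occurrence of each image URL (sketch)."""
    seen = set()
    unique = []
    for img in images:
        url = img.get('imageUrl')
        if url and url not in seen:
            seen.add(url)
            unique.append(img)
    return unique
```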
Filtering and Processing Images
Filter Images by Type or URL Pattern
```python
import re
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def extract_filtered_images(url, filter_extensions=('.jpg', '.png', '.webp')):
    result = app.scrape_url(url, params={'formats': ['markdown']})
    markdown = result.get('markdown', '')

    image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'
    filtered_images = []

    for match in re.finditer(image_pattern, markdown):
        image_url = match.group(2)

        # Filter by extension
        if any(image_url.lower().endswith(ext) for ext in filter_extensions):
            filtered_images.append({
                'url': image_url,
                'alt': match.group(1),
                'extension': image_url.split('.')[-1].lower()
            })

    return filtered_images

# Extract only JPG and PNG images
images = extract_filtered_images('https://example.com', ('.jpg', '.png'))
print(f"Found {len(images)} JPG/PNG images")
```
Downloading Extracted Images
Once you've extracted image URLs, you can download them:
```python
import os
import re
from urllib.parse import urljoin

import requests
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def download_images_from_page(url, download_dir='images'):
    # Create the download directory
    os.makedirs(download_dir, exist_ok=True)

    # Scrape the page
    result = app.scrape_url(url, params={'formats': ['markdown']})
    markdown = result.get('markdown', '')

    # Extract image URLs
    image_pattern = r'!\[([^\]]*)\]\(([^\)]+)\)'

    for idx, match in enumerate(re.finditer(image_pattern, markdown)):
        image_url = match.group(2)

        # Resolve relative URLs against the page URL
        if not image_url.startswith('http'):
            image_url = urljoin(url, image_url)

        try:
            # Download the image
            response = requests.get(image_url, timeout=10)
            response.raise_for_status()

            # Generate a filename from the index and the URL's extension
            filename = f"image_{idx}.{image_url.split('.')[-1]}"
            filepath = os.path.join(download_dir, filename)

            # Save the image
            with open(filepath, 'wb') as f:
                f.write(response.content)

            print(f"Downloaded: {filename}")
        except Exception as e:
            print(f"Failed to download {image_url}: {e}")

# Download all images from a page
download_images_from_page('https://example.com/gallery')
```
Handling JavaScript-Rendered Images
Firecrawl renders pages in a headless browser, much like the setups used for monitoring network requests in browser automation, which means it can extract images that are loaded dynamically via JavaScript. This is a significant advantage over simple HTML parsers.
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractDynamicImages(url) {
  // Firecrawl waits for JavaScript to load
  const result = await app.scrapeUrl(url, {
    formats: ['markdown'],
    waitFor: 2000 // Wait additional time for lazy-loaded images
  });

  const imageRegex = /!\[([^\]]*)\]\(([^\)]+)\)/g;
  const images = [];
  let match;

  while ((match = imageRegex.exec(result.markdown)) !== null) {
    images.push({
      url: match[2],
      alt: match[1]
    });
  }

  return images;
}

// Extract images from a dynamic page
extractDynamicImages('https://example.com/spa-gallery')
  .then(images => console.log('Dynamic images:', images));
```
Extracting Images from Specific Sections
You can use Firecrawl's LLM extraction features to target specific image content:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Use LLM extraction to get structured image data
result = app.scrape_url('https://example.com/products', params={
    'formats': ['markdown', 'extract'],
    'extract': {
        'schema': {
            'type': 'object',
            'properties': {
                'product_images': {
                    'type': 'array',
                    'items': {
                        'type': 'object',
                        'properties': {
                            'url': {'type': 'string'},
                            'caption': {'type': 'string'},
                            'is_primary': {'type': 'boolean'}
                        }
                    }
                }
            }
        }
    }
})

if 'extract' in result:
    product_images = result['extract'].get('product_images', [])
    for img in product_images:
        print(f"Product Image: {img['url']}")
        print(f"Caption: {img['caption']}")
        print(f"Primary: {img['is_primary']}\n")
```
Comparing Firecrawl to Traditional Image Extraction
Unlike traditional web scrapers that use CSS selectors or XPath, Firecrawl's approach has several advantages:
- JavaScript Support: Automatically handles dynamically loaded images
- Clean Output: Provides images in a structured Markdown format
- Alt Text Preservation: Maintains accessibility information
- LLM Integration: Easy to feed extracted data into AI models
However, for scenarios requiring precise DOM manipulation or interacting with specific DOM elements, traditional tools may offer more control.
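For contrast, here is a minimal sketch of the traditional selector-based approach using requests and BeautifulSoup. It gives direct access to every img tag's attributes but only sees the server-rendered HTML, so JavaScript-rendered and lazy-loaded images are typically missed; the function name and URL handling are illustrative:

```python
import requests
from bs4 import BeautifulSoup

def extract_images_traditional(url):
    """Selector-based extraction from static HTML (misses JS-rendered images)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return [
        {'url': img.get('src'), 'alt': img.get('alt', '')}
        for img in soup.find_all('img')
        if img.get('src')
    ]
```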
Best Practices for Image Extraction
1. Handle Relative URLs
```python
from urllib.parse import urljoin

def normalize_image_url(image_url, base_url):
    """Convert relative URLs to absolute URLs"""
    if not image_url.startswith('http'):
        return urljoin(base_url, image_url)
    return image_url
```
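For example, `normalize_image_url('/img/logo.png', 'https://example.com/page')` returns `https://example.com/img/logo.png`, while an already-absolute URL passes through unchanged. Protocol-relative URLs such as `//cdn.example.com/img.png` also fail the `http` check, and `urljoin` resolves them against the base URL's scheme.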
2. Implement Rate Limiting
```javascript
async function extractImagesWithRateLimit(urls, delayMs = 1000) {
  const allImages = [];

  for (const url of urls) {
    const images = await extractImages(url);
    allImages.push(...images);

    // Wait before the next request
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  return allImages;
}
```
3. Error Handling
```python
import re
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def safe_extract_images(url):
    try:
        result = app.scrape_url(url, params={'formats': ['markdown']})
        markdown = result.get('markdown', '')
        # Extract (alt text, URL) pairs from the Markdown
        return re.findall(r'!\[([^\]]*)\]\(([^\)]+)\)', markdown)
    except Exception as e:
        print(f"Error extracting images from {url}: {e}")
        return []
```
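Transient failures such as timeouts or rate limits are often worth retrying rather than skipping. Below is a small sketch with exponential backoff that reuses the scrape call from above; the helper name, attempt count, and delays are arbitrary defaults, not part of Firecrawl's API:

```python
import time

def scrape_with_retries(url, max_attempts=3, base_delay=2):
    """Retry a scrape with exponential backoff before giving up (sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return app.scrape_url(url, params={'formats': ['markdown']})
        except Exception as e:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {delay}s")
            time.sleep(delay)
```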
Conclusion
Firecrawl can effectively extract images from web pages by converting HTML to Markdown and preserving image references with their URLs and alt text. While it doesn't provide pixel-level image analysis, it excels at capturing image metadata and URLs from both static and JavaScript-rendered pages. For developers building web scraping pipelines, Firecrawl offers a clean, LLM-friendly approach to image extraction that integrates well with modern AI workflows.
For more complex scenarios involving dynamic content, consider exploring how to handle AJAX requests in browser automation to ensure all images are fully loaded before extraction.