How do I use Firecrawl to render HTML with JavaScript?
Firecrawl is designed to handle JavaScript-rendered websites out of the box, making it an excellent choice for scraping modern web applications, single-page applications (SPAs), and dynamic content. Unlike traditional web scrapers that only fetch static HTML, Firecrawl uses headless browser technology to execute JavaScript and wait for content to load before extracting data.
Understanding JavaScript Rendering in Firecrawl
Firecrawl automatically renders JavaScript by default when you use its API endpoints. This means that when you make a request to scrape a page, Firecrawl:
- Launches a headless browser (typically Chromium-based)
- Navigates to the target URL
- Executes all JavaScript code on the page
- Waits for dynamic content to load
- Returns the fully rendered HTML
This process is similar to handling AJAX requests with Puppeteer, but Firecrawl abstracts away the complexity of browser automation.
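For comparison, here is roughly what that workflow looks like when you manage the browser yourself — a minimal sketch using Playwright's sync API (this is the kind of code Firecrawl saves you from writing, not Firecrawl itself):
# A rough do-it-yourself equivalent of Firecrawl's rendering pipeline,
# sketched with Playwright (pip install playwright && playwright install chromium)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # launch a headless browser
    page = browser.new_page()
    # Navigate and wait for network activity to settle so JS-loaded content renders
    page.goto('https://example.com/spa-application', wait_until='networkidle')
    html = page.content()  # the fully rendered HTML
    browser.close()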
Basic JavaScript Rendering with Firecrawl
Using Python
Here's how to scrape a JavaScript-rendered page using Firecrawl's Python SDK:
from firecrawl import FirecrawlApp
# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key_here')
# Scrape a JavaScript-heavy website
result = app.scrape_url('https://example.com/spa-application')
# Access the fully rendered HTML
html_content = result['html']
# Access the markdown version (cleaned and formatted)
markdown_content = result['markdown']
# Access extracted metadata
metadata = result['metadata']
print(f"Title: {metadata['title']}")
print(f"Description: {metadata['description']}")
Using Node.js
With the Node.js SDK, the process is equally straightforward:
import FirecrawlApp from '@mendable/firecrawl-js';
// Initialize the Firecrawl client
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
// Scrape a JavaScript-rendered page
async function scrapePage() {
  try {
    const result = await app.scrapeUrl('https://example.com/spa-application');

    // Access the fully rendered HTML
    console.log('HTML:', result.html);

    // Access the markdown version
    console.log('Markdown:', result.markdown);

    // Access metadata
    console.log('Title:', result.metadata.title);
    console.log('Description:', result.metadata.description);
  } catch (error) {
    console.error('Error scraping page:', error);
  }
}

scrapePage();
Advanced JavaScript Rendering Options
Wait for Dynamic Content
Sometimes you need to give a page extra time to finish rendering before it is scraped. Firecrawl provides the waitFor parameter, which delays extraction by a fixed number of milliseconds:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='your_api_key_here')
# Wait 5 seconds for dynamic content to load before scraping
result = app.scrape_url(
    'https://example.com/dynamic-content',
    params={
        'waitFor': 5000  # Delay in milliseconds
    }
)
This is particularly useful when dealing with lazy-loaded content or animations, similar to using the waitFor function in Puppeteer.
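If you don't know in advance how long a page needs, one pragmatic pattern is to retry with a progressively longer delay until an expected marker appears in the output. This is a hypothetical helper built only on scrape_url as used above; the marker string is an assumption about your target page:
# Hypothetical helper: retry with longer waitFor values until a known
# piece of content shows up in the rendered output
def scrape_until_loaded(app, url, marker, delays_ms=(1000, 3000, 8000)):
    for delay in delays_ms:
        result = app.scrape_url(url, params={'waitFor': delay})
        if marker in result.get('markdown', ''):
            return result  # content loaded
    return result  # best effort after the longest wait

result = scrape_until_loaded(app, 'https://example.com/dynamic-content', 'Add to cart')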
Using Direct API Calls
If you prefer to use the REST API directly without an SDK:
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/spa-application",
    "formats": ["html", "markdown"],
    "waitFor": 3000
  }'
The response will include the fully rendered HTML:
{
  "success": true,
  "data": {
    "html": "<html>...</html>",
    "markdown": "# Page Title\n\nContent...",
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "language": "en",
      "sourceURL": "https://example.com/spa-application"
    }
  }
}
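The same call works from Python without the SDK, using the requests library; this is a direct translation of the curl command above:
# Call the scrape endpoint directly and unpack the response shown above
import requests

response = requests.post(
    'https://api.firecrawl.dev/v1/scrape',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'url': 'https://example.com/spa-application',
        'formats': ['html', 'markdown'],
        'waitFor': 3000,
    },
)
payload = response.json()
if payload.get('success'):
    print(payload['data']['metadata']['title'])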
Handling Different Types of JavaScript Content
Single Page Applications (SPAs)
SPAs built with React, Vue, Angular, or similar frameworks require JavaScript execution to render content. Firecrawl handles these automatically:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='your_api_key_here')
# Scrape a React application
react_app = app.scrape_url('https://example.com/react-app')
# Scrape a Vue.js application
vue_app = app.scrape_url('https://example.com/vue-app')
# Scrape an Angular application
angular_app = app.scrape_url('https://example.com/angular-app')
# All will return fully rendered HTML with JavaScript executed
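To see the difference JavaScript execution makes, you can compare the raw server response with Firecrawl's rendered output. A quick sanity check, assuming requests is installed and the target page is client-side rendered:
# Compare the unrendered server HTML with Firecrawl's rendered HTML.
# On a client-side-rendered SPA, the raw response is often just an
# empty shell (e.g. <div id="root"></div>) plus script tags.
import requests

url = 'https://example.com/react-app'
raw_html = requests.get(url).text
rendered_html = app.scrape_url(url)['html']

print(f"Raw HTML length:      {len(raw_html)}")
print(f"Rendered HTML length: {len(rendered_html)}")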
Lazy-Loaded Content
For pages that load content as users scroll or interact:
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
async function scrapeLazyContent() {
  const result = await app.scrapeUrl('https://example.com/lazy-load', {
    waitFor: 5000, // Give extra time for lazy loading
    formats: ['html', 'markdown']
  });
  return result;
}
Infinite Scroll Pages
While Firecrawl renders JavaScript, it doesn't automatically trigger infinite scroll. For such cases, you might need to use the crawl endpoint with pagination:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='your_api_key_here')
# Crawl multiple pages with JavaScript rendering
crawl_result = app.crawl_url(
    'https://example.com/infinite-scroll',
    params={
        'limit': 50,  # Maximum pages to crawl
        'scrapeOptions': {
            'waitFor': 3000,
            'formats': ['html']  # Request HTML for each crawled page
        }
    }
)

# Process each crawled page
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"HTML Length: {len(page['html'])}")
Extracting Data from JavaScript-Rendered Pages
Using Schema-Based Extraction
Firecrawl can extract structured data from JavaScript-rendered pages: define a JSON schema and request the extract format alongside your scrape:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='your_api_key_here')
# Define the schema for data extraction
schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "string"},
        "reviews": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "author": {"type": "string"},
                    "rating": {"type": "number"},
                    "comment": {"type": "string"}
                }
            }
        }
    },
    "required": ["product_name", "price"]
}
# Extract structured data from a JavaScript-heavy e-commerce page
result = app.scrape_url(
    'https://example.com/product/123',
    params={
        'formats': ['extract'],
        'extract': {'schema': schema}
    }
)

print(result['extract'])
Using the Map Endpoint for Site Discovery
Before scraping, you can map out a website's structure:
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
async function mapWebsite() {
  const result = await app.mapUrl('https://example.com');

  // Get all discovered URLs
  console.log('Discovered URLs:', result.links);

  // Now scrape each URL with JavaScript rendering
  for (const url of result.links) {
    const pageData = await app.scrapeUrl(url, {
      formats: ['html', 'markdown']
    });
    console.log(`Scraped: ${url}`);
  }
}
mapWebsite();
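The same discovery-then-scrape flow in Python — a sketch assuming your version of the Python SDK exposes a map_url method mirroring the Node.js mapUrl used above (check your SDK version):
# Map the site, then scrape each discovered URL with JavaScript rendering.
# Assumes map_url exists in the Python SDK and returns a 'links' list.
map_result = app.map_url('https://example.com')

for url in map_result['links']:
    page = app.scrape_url(url, params={'formats': ['markdown']})
    print(f"Scraped: {url}")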
Best Practices for JavaScript Rendering
1. Optimize Wait Times
Don't set unnecessarily long wait times. Monitor your target pages and adjust:
# Too long - wastes time and credits
result = app.scrape_url(url, params={'waitFor': 30000})
# Better - appropriate for most JavaScript apps
result = app.scrape_url(url, params={'waitFor': 3000})
# Best - no wait if page loads quickly
result = app.scrape_url(url) # Default behavior is usually sufficient
2. Handle Errors Gracefully
JavaScript rendering can fail for various reasons. Always implement error handling:
from firecrawl import FirecrawlApp
import logging
app = FirecrawlApp(api_key='your_api_key_here')
def safe_scrape(url):
    try:
        # The SDK raises an exception when a request fails, so a
        # successful call returns the page data directly
        return app.scrape_url(url, params={'waitFor': 5000})
    except Exception as e:
        logging.error(f"Exception while scraping {url}: {str(e)}")
        return None
# Use the safe scraper
data = safe_scrape('https://example.com/spa-page')
if data:
    print(f"Successfully scraped: {data['metadata']['title']}")
3. Respect Rate Limits
Firecrawl has rate limits to ensure service quality. Implement proper throttling:
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });
async function scrapeWithRateLimit(urls, delayMs = 1000) {
  const results = [];

  for (const [index, url] of urls.entries()) {
    try {
      const result = await app.scrapeUrl(url);
      results.push(result);

      // Wait before the next request
      if (index < urls.length - 1) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    } catch (error) {
      console.error(`Error scraping ${url}:`, error);
    }
  }

  return results;
}
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];
scrapeWithRateLimit(urls, 2000); // 2 second delay between requests
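If a request is rejected despite throttling, backing off and retrying usually resolves it. A minimal Python sketch, assuming the SDK raises an exception on a rejected request (the retry policy here is illustrative, not part of Firecrawl):
# Hypothetical retry helper with exponential backoff.
# Assumes the SDK raises an exception when a request is rejected.
import time

def scrape_with_backoff(app, url, max_retries=3):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return app.scrape_url(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2  # double the wait each retry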
4. Choose the Right Output Format
Firecrawl supports multiple output formats. Choose based on your needs:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='your_api_key_here')
# Get HTML for detailed DOM manipulation
html_result = app.scrape_url(url, params={'formats': ['html']})
# Get Markdown for cleaner text extraction
markdown_result = app.scrape_url(url, params={'formats': ['markdown']})
# Get both formats
both_formats = app.scrape_url(url, params={'formats': ['html', 'markdown']})
Troubleshooting JavaScript Rendering
Page Not Fully Loaded
If content is missing from the output, increase the wait time:
# Increase wait time
result = app.scrape_url(url, params={'waitFor': 10000})
# Or fetch the page through the crawl endpoint, limited to one page
result = app.crawl_url(
    url,
    params={
        'limit': 1,
        'scrapeOptions': {'waitFor': 5000}
    }
)
Performance Issues
If scraping is slow:
- Reduce the waitFor parameter
- Use the formats parameter to request only what you need
- Consider using the crawl endpoint for batch operations
# Optimized for performance
result = app.scrape_url(
    url,
    params={
        'formats': ['markdown'],  # Skip HTML if not needed
        'waitFor': 2000  # Minimal wait time
    }
)
Handling Dynamic URLs
For SPAs that use hash routing or query parameters, each route still requires JavaScript execution, since the server returns the same shell document for every hash fragment:
const urls = [
  'https://example.com/#/products/1',
  'https://example.com/#/products/2',
  'https://example.com/?page=1',
  'https://example.com/?page=2'
];

for (const url of urls) {
  const result = await app.scrapeUrl(url, {
    waitFor: 3000,
    formats: ['markdown']
  });
  console.log(`Scraped: ${url}`);
}
Comparison with Other Tools
Firecrawl's JavaScript rendering capability is similar to crawling single-page applications with Puppeteer, but it offers several advantages:
- No infrastructure management: No need to maintain headless browsers
- Built-in retries: Automatic retry logic for failed requests
- Scalability: Handles concurrent requests without managing browser instances
- Simplified API: Clean, consistent interface across all endpoints
Conclusion
Firecrawl makes JavaScript rendering simple and accessible through its API. Whether you're scraping modern SPAs, handling AJAX-loaded content, or extracting data from dynamic websites, Firecrawl's built-in browser automation handles the complexity for you.
Key takeaways:
- JavaScript rendering is enabled by default in Firecrawl
- Use the waitFor parameter for pages with delayed content loading
- Choose appropriate output formats (HTML, Markdown) based on your needs
- Implement error handling and respect rate limits
- Use schema-based extraction for structured data from JavaScript-heavy pages
By following these best practices, you can efficiently scrape JavaScript-rendered websites without the overhead of managing headless browsers or complex automation scripts.