Can Firecrawl Extract URL Lists from Web Pages?
Yes, Firecrawl can extract URL lists from web pages. It provides several ways to discover and extract links: a crawl endpoint that discovers URLs automatically across a site, a scrape endpoint that extracts links from a specific page, and structured data extraction that targets links with precision.
Understanding Firecrawl's URL Extraction Methods
Firecrawl offers three primary methods for extracting URLs from web pages:
- Automatic link discovery during crawling - The crawl endpoint finds and follows links automatically
- Manual link extraction from HTML - Parse links from the markdown or HTML output
- Structured data extraction - Use schemas to extract specific link data with metadata
Each method serves different use cases depending on whether you need to discover links across multiple pages or extract specific URLs from a single page.
Method 1: Using the Crawl Endpoint for URL Discovery
The crawl endpoint is the most straightforward way to extract URL lists from websites. When you initiate a crawl, Firecrawl automatically discovers all links within the specified domain or subdomain.
Python Example
```python
from firecrawl import FirecrawlApp

# Initialize Firecrawl
app = FirecrawlApp(api_key='your_api_key')

# Start a crawl to discover URLs
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 100,
        'scrapeOptions': {
            'formats': ['markdown', 'links']
        }
    }
)

# Extract all discovered URLs
discovered_urls = []
for page in crawl_result['data']:
    discovered_urls.append(page['metadata']['sourceURL'])

print(f"Discovered {len(discovered_urls)} URLs")
for url in discovered_urls:
    print(url)
```
JavaScript/Node.js Example
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractUrls() {
  // Crawl the website
  const crawlResult = await app.crawlUrl('https://example.com', {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown', 'links']
    }
  });

  // Extract discovered URLs
  const discoveredUrls = crawlResult.data.map(page =>
    page.metadata.sourceURL
  );

  console.log(`Discovered ${discoveredUrls.length} URLs`);
  discoveredUrls.forEach(url => console.log(url));
}

extractUrls();
```
Method 2: Extracting Links from a Single Page
If you need to extract links from a specific page without crawling the entire site, use the scrape endpoint with link extraction enabled.
Python Example
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Scrape a single page and extract links
scrape_result = app.scrape_url(
    'https://example.com/page',
    params={
        'formats': ['markdown', 'links', 'html']
    }
)

# Access extracted links
if 'links' in scrape_result:
    links = scrape_result['links']
    print(f"Found {len(links)} links:")
    for link in links:
        print(link)
```
JavaScript Example
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractLinksFromPage() {
  const scrapeResult = await app.scrapeUrl('https://example.com/page', {
    formats: ['markdown', 'links', 'html']
  });

  if (scrapeResult.links) {
    console.log(`Found ${scrapeResult.links.length} links:`);
    scrapeResult.links.forEach(link => console.log(link));
  }
}

extractLinksFromPage();
```
Method 3: Structured Link Extraction with Schemas
For more precise control over which links to extract and their associated metadata, use Firecrawl's structured data extraction feature. This is particularly useful when you need to extract links along with their anchor text, context, or other attributes.
Python Example with Schema
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Define a schema for link extraction
link_schema = {
    'type': 'object',
    'properties': {
        'navigation_links': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'url': {'type': 'string'},
                    'text': {'type': 'string'},
                    'description': {'type': 'string'}
                }
            }
        },
        'article_links': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'url': {'type': 'string'},
                    'title': {'type': 'string'},
                    'category': {'type': 'string'}
                }
            }
        }
    }
}

# Extract structured link data
result = app.scrape_url(
    'https://example.com/blog',
    params={
        'formats': ['extract'],
        'extract': {
            'schema': link_schema
        }
    }
)

# Access structured link data
nav_links = result['extract']['navigation_links']
article_links = result['extract']['article_links']

print(f"Navigation links: {len(nav_links)}")
print(f"Article links: {len(article_links)}")
```
JavaScript Example with Schema
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

const linkSchema = {
  type: 'object',
  properties: {
    navigation_links: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          url: { type: 'string' },
          text: { type: 'string' },
          description: { type: 'string' }
        }
      }
    },
    article_links: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          url: { type: 'string' },
          title: { type: 'string' },
          category: { type: 'string' }
        }
      }
    }
  }
};

async function extractStructuredLinks() {
  const result = await app.scrapeUrl('https://example.com/blog', {
    formats: ['extract'],
    extract: {
      schema: linkSchema
    }
  });

  const navLinks = result.extract.navigation_links;
  const articleLinks = result.extract.article_links;

  console.log(`Navigation links: ${navLinks.length}`);
  console.log(`Article links: ${articleLinks.length}`);
}

extractStructuredLinks();
```
Filtering and Processing Extracted URLs
Once you've extracted URLs, you'll often need to filter and process them. Here are common patterns:
Filtering by URL Pattern
```python
import re

def filter_urls(urls, pattern=None, exclude_pattern=None):
    filtered = urls
    if pattern:
        filtered = [url for url in filtered if re.search(pattern, url)]
    if exclude_pattern:
        filtered = [url for url in filtered if not re.search(exclude_pattern, url)]
    return filtered

# Example: Extract only blog post URLs
all_urls = ['https://example.com/blog/post-1', 'https://example.com/about',
            'https://example.com/blog/post-2']
blog_urls = filter_urls(all_urls, pattern=r'/blog/')
print(blog_urls)  # Only blog URLs
```
Deduplicating URLs
```javascript
function deduplicateUrls(urls) {
  return [...new Set(urls)];
}

// Remove duplicates
const urls = ['https://example.com/page1', 'https://example.com/page2',
              'https://example.com/page1'];
const uniqueUrls = deduplicateUrls(urls);
console.log(uniqueUrls); // ['https://example.com/page1', 'https://example.com/page2']
```
Advanced URL Extraction Techniques
Extracting Links with Specific Attributes
When you need to extract links with specific attributes (like download links, external links, or links with specific CSS classes), you can combine Firecrawl's HTML output with custom parsing:
```python
from bs4 import BeautifulSoup
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Get HTML content
result = app.scrape_url('https://example.com', params={'formats': ['html']})
html = result['html']

# Parse with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Extract external links
external_links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('http') and 'example.com' not in href:
        external_links.append({
            'url': href,
            'text': link.get_text(strip=True),
            'rel': link.get('rel', [])
        })

print(f"Found {len(external_links)} external links")
```
Crawling with URL Patterns
You can control which URLs Firecrawl discovers during a crawl by specifying include and exclude patterns:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Crawl only specific sections
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 500,
        'includePaths': ['/blog/*', '/articles/*'],
        'excludePaths': ['/admin/*', '/private/*'],
        'scrapeOptions': {
            'formats': ['links']
        }
    }
)

# Extract URLs matching the patterns
matching_urls = [page['metadata']['sourceURL'] for page in crawl_result['data']]
```
Handling Dynamic Content and JavaScript-Rendered Links
Firecrawl excels at extracting links from JavaScript-rendered pages, similar to how you would handle AJAX requests using Puppeteer. The platform automatically waits for JavaScript to execute before extracting content:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Scrape a JavaScript-heavy page
result = app.scrape_url(
    'https://example.com/spa-page',
    params={
        'formats': ['links'],
        'waitFor': 5000  # Wait 5 seconds for JavaScript to render
    }
)

# Links will include dynamically loaded content
dynamic_links = result['links']
```
Exporting URL Lists
After extracting URLs, you'll often want to export them for further processing:
Export to CSV
```python
import csv

def export_urls_to_csv(urls, filename='urls.csv'):
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['URL'])
        for url in urls:
            writer.writerow([url])

# Export discovered URLs
export_urls_to_csv(discovered_urls)
```
Export to JSON
```javascript
import fs from 'fs';

function exportUrlsToJson(urls, filename = 'urls.json') {
  const data = {
    urls: urls,
    count: urls.length,
    extracted_at: new Date().toISOString()
  };
  fs.writeFileSync(filename, JSON.stringify(data, null, 2));
}

// Export discovered URLs
exportUrlsToJson(discoveredUrls);
```
Best Practices for URL Extraction
- Set appropriate limits: When crawling large sites, use the `limit` parameter to control the number of pages crawled and avoid excessive API usage.
- Use include/exclude patterns: Narrow down your crawl to specific sections of a website to improve efficiency and reduce noise.
- Handle relative URLs: Convert relative URLs to absolute URLs for consistency:
```python
from urllib.parse import urljoin

base_url = 'https://example.com'
relative_url = '/page/about'
absolute_url = urljoin(base_url, relative_url)  # 'https://example.com/page/about'
```
- Respect rate limits: Firecrawl handles rate limiting automatically, but be mindful of your API quota when processing large URL lists.
- Monitor crawl progress: For large crawls, use the async crawl endpoint to avoid timeouts (see the sketch below), similar to techniques used when monitoring network requests in Puppeteer.
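The polling workflow looks roughly like the following. This is a minimal sketch rather than the official recipe: it assumes a firecrawl-py release whose crawl_url accepts wait_until_done=False and which exposes check_crawl_status, and that the status payload includes status and data fields. Newer SDK versions name these differently (for example async_crawl_url), so check the docs for your version.

```python
import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Start the crawl without blocking until it finishes.
# NOTE: method and field names vary between firecrawl-py releases.
job = app.crawl_url(
    'https://example.com',
    params={'limit': 500, 'scrapeOptions': {'formats': ['links']}},
    wait_until_done=False
)
# Depending on the SDK release, the response may be a job id or a dict
job_id = job['jobId'] if isinstance(job, dict) else job

# Poll for completion, then collect the discovered URLs
while True:
    status = app.check_crawl_status(job_id)
    print(f"Crawl status: {status['status']}")
    if status['status'] == 'completed':
        urls = [page['metadata']['sourceURL'] for page in status['data']]
        print(f"Discovered {len(urls)} URLs")
        break
    if status['status'] == 'failed':
        raise RuntimeError('Crawl failed')
    time.sleep(10)  # wait before polling again
```

Polling every few seconds keeps long crawls from tying up a single blocking request and lets you report progress as pages come in.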
Comparing Firecrawl to Traditional Link Extraction
Unlike traditional web scraping tools that require you to manually configure browser automation or parse HTML with CSS selectors, Firecrawl provides a simplified API that handles:
- JavaScript rendering: Automatically executes JavaScript before extracting links
- Link normalization: Converts relative URLs to absolute URLs
- Duplicate detection: Identifies and handles duplicate URLs during crawling
- Sitemap support: Can crawl websites using their sitemap for comprehensive URL discovery
This makes Firecrawl particularly efficient for URL extraction tasks compared to manually navigating to different pages using Puppeteer or other browser automation tools.
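For a sense of what that saves you, here is a rough sketch of the manual equivalent using requests and BeautifulSoup rather than Firecrawl. It only sees static HTML (no JavaScript execution) and has to normalize and deduplicate links itself, which is exactly the work described in the bullets above.

```python
import requests
from urllib.parse import urljoin, urldefrag
from bs4 import BeautifulSoup

def extract_links_manually(page_url):
    # Fetch the raw HTML -- no JavaScript execution, unlike Firecrawl
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, 'html.parser')

    links = set()  # deduplication has to be handled by hand
    for anchor in soup.find_all('a', href=True):
        # Normalize: resolve relative URLs and strip fragments
        absolute, _fragment = urldefrag(urljoin(page_url, anchor['href']))
        links.add(absolute)
    return sorted(links)

print(extract_links_manually('https://example.com'))
```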
Conclusion
Firecrawl provides robust capabilities for extracting URL lists from web pages through its crawl and scrape endpoints. Whether you need to discover all links on a website, extract specific URLs from a single page, or capture structured link data with metadata, Firecrawl offers flexible solutions that handle JavaScript rendering and link normalization automatically.
The choice between crawling, scraping, and structured extraction depends on your specific use case: use crawling for site-wide URL discovery, scraping for single-page link extraction, and structured extraction when you need precise control over which links to extract and their associated data.