Can Firecrawl Extract URL Lists from Web Pages?

Yes, Firecrawl can extract URL lists from web pages through multiple approaches. The platform provides several methods to discover and extract links, including its crawl endpoint that automatically discovers URLs, its scrape endpoint that can extract links from specific pages, and structured data extraction capabilities that can target links with precision.

Understanding Firecrawl's URL Extraction Methods

Firecrawl offers three primary methods for extracting URLs from web pages:

  1. Automatic link discovery during crawling - The crawl endpoint finds and follows links automatically
  2. Manual link extraction from HTML - Parse links from the markdown or HTML output
  3. Structured data extraction - Use schemas to extract specific link data with metadata

Each method serves different use cases depending on whether you need to discover links across multiple pages or extract specific URLs from a single page.

Method 1: Using the Crawl Endpoint for URL Discovery

The crawl endpoint is the most straightforward way to extract URL lists from websites. When you initiate a crawl, Firecrawl automatically discovers links within the specified domain or subdomain and follows them, up to the limit you configure.

Python Example

from firecrawl import FirecrawlApp

# Initialize Firecrawl
app = FirecrawlApp(api_key='your_api_key')

# Start a crawl to discover URLs
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 100,
        'scrapeOptions': {
            'formats': ['markdown', 'links']
        }
    }
)

# Extract all discovered URLs
discovered_urls = []
for page in crawl_result['data']:
    discovered_urls.append(page['metadata']['sourceURL'])

print(f"Discovered {len(discovered_urls)} URLs")
for url in discovered_urls:
    print(url)

JavaScript/Node.js Example

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractUrls() {
    // Crawl the website
    const crawlResult = await app.crawlUrl('https://example.com', {
        limit: 100,
        scrapeOptions: {
            formats: ['markdown', 'links']
        }
    });

    // Extract discovered URLs
    const discoveredUrls = crawlResult.data.map(page =>
        page.metadata.sourceURL
    );

    console.log(`Discovered ${discoveredUrls.length} URLs`);
    discoveredUrls.forEach(url => console.log(url));
}

extractUrls();

Method 2: Extracting Links from a Single Page

If you need to extract links from a specific page without crawling the entire site, use the scrape endpoint with link extraction enabled.

Python Example

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Scrape a single page and extract links
scrape_result = app.scrape_url(
    'https://example.com/page',
    params={
        'formats': ['markdown', 'links', 'html']
    }
)

# Access extracted links
if 'links' in scrape_result:
    links = scrape_result['links']
    print(f"Found {len(links)} links:")
    for link in links:
        print(link)

JavaScript Example

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function extractLinksFromPage() {
    const scrapeResult = await app.scrapeUrl('https://example.com/page', {
        formats: ['markdown', 'links', 'html']
    });

    if (scrapeResult.links) {
        console.log(`Found ${scrapeResult.links.length} links:`);
        scrapeResult.links.forEach(link => console.log(link));
    }
}

extractLinksFromPage();

Method 3: Structured Link Extraction with Schemas

For more precise control over which links to extract and their associated metadata, use Firecrawl's structured data extraction feature. This is particularly useful when you need to extract links along with their anchor text, context, or other attributes.

Python Example with Schema

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Define a schema for link extraction
link_schema = {
    'type': 'object',
    'properties': {
        'navigation_links': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'url': {'type': 'string'},
                    'text': {'type': 'string'},
                    'description': {'type': 'string'}
                }
            }
        },
        'article_links': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'url': {'type': 'string'},
                    'title': {'type': 'string'},
                    'category': {'type': 'string'}
                }
            }
        }
    }
}

# Extract structured link data
result = app.scrape_url(
    'https://example.com/blog',
    params={
        'formats': ['extract'],
        'extract': {
            'schema': link_schema
        }
    }
)

# Access structured link data
nav_links = result['extract']['navigation_links']
article_links = result['extract']['article_links']

print(f"Navigation links: {len(nav_links)}")
print(f"Article links: {len(article_links)}")

JavaScript Example with Schema

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

const linkSchema = {
    type: 'object',
    properties: {
        navigation_links: {
            type: 'array',
            items: {
                type: 'object',
                properties: {
                    url: { type: 'string' },
                    text: { type: 'string' },
                    description: { type: 'string' }
                }
            }
        },
        article_links: {
            type: 'array',
            items: {
                type: 'object',
                properties: {
                    url: { type: 'string' },
                    title: { type: 'string' },
                    category: { type: 'string' }
                }
            }
        }
    }
};

async function extractStructuredLinks() {
    const result = await app.scrapeUrl('https://example.com/blog', {
        formats: ['extract'],
        extract: {
            schema: linkSchema
        }
    });

    const navLinks = result.extract.navigation_links;
    const articleLinks = result.extract.article_links;

    console.log(`Navigation links: ${navLinks.length}`);
    console.log(`Article links: ${articleLinks.length}`);
}

extractStructuredLinks();

Filtering and Processing Extracted URLs

Once you've extracted URLs, you'll often need to filter and process them. Here are common patterns:

Filtering by URL Pattern

import re

def filter_urls(urls, pattern=None, exclude_pattern=None):
    filtered = urls

    if pattern:
        filtered = [url for url in filtered if re.search(pattern, url)]

    if exclude_pattern:
        filtered = [url for url in filtered if not re.search(exclude_pattern, url)]

    return filtered

# Example: Extract only blog post URLs
all_urls = ['https://example.com/blog/post-1', 'https://example.com/about',
            'https://example.com/blog/post-2']
blog_urls = filter_urls(all_urls, pattern=r'/blog/')
print(blog_urls)  # Only blog URLs

Deduplicating URLs

function deduplicateUrls(urls) {
    return [...new Set(urls)];
}

// Remove duplicates
const urls = ['https://example.com/page1', 'https://example.com/page2',
              'https://example.com/page1'];
const uniqueUrls = deduplicateUrls(urls);
console.log(uniqueUrls);  // ['https://example.com/page1', 'https://example.com/page2']

Advanced URL Extraction Techniques

Extracting Links with Specific Attributes

When you need to extract links with specific attributes (like download links, external links, or links with specific CSS classes), you can combine Firecrawl's HTML output with custom parsing:

from bs4 import BeautifulSoup
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Get HTML content
result = app.scrape_url('https://example.com', params={'formats': ['html']})
html = result['html']

# Parse with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Extract external links
external_links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('http') and 'example.com' not in href:
        external_links.append({
            'url': href,
            'text': link.get_text(strip=True),
            'rel': link.get('rel', [])
        })

print(f"Found {len(external_links)} external links")

Crawling with URL Patterns

You can control which URLs Firecrawl discovers during a crawl by specifying include and exclude patterns:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Crawl only specific sections
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 500,
        'includePaths': ['/blog/*', '/articles/*'],
        'excludePaths': ['/admin/*', '/private/*'],
        'scrapeOptions': {
            'formats': ['links']
        }
    }
)

# Extract URLs matching the patterns
matching_urls = [page['metadata']['sourceURL'] for page in crawl_result['data']]

Handling Dynamic Content and JavaScript-Rendered Links

Firecrawl excels at extracting links from JavaScript-rendered pages, similar to how you would handle AJAX requests using Puppeteer. The platform automatically waits for JavaScript to execute before extracting content:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Scrape a JavaScript-heavy page
result = app.scrape_url(
    'https://example.com/spa-page',
    params={
        'formats': ['links'],
        'waitFor': 5000  # Wait 5 seconds for JavaScript to render
    }
)

# Links will include dynamically loaded content
dynamic_links = result['links']

Exporting URL Lists

After extracting URLs, you'll often want to export them for further processing:

Export to CSV

import csv

def export_urls_to_csv(urls, filename='urls.csv'):
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['URL'])
        for url in urls:
            writer.writerow([url])

# Export discovered URLs
export_urls_to_csv(discovered_urls)

Export to JSON

import fs from 'fs';

function exportUrlsToJson(urls, filename = 'urls.json') {
    const data = {
        urls: urls,
        count: urls.length,
        extracted_at: new Date().toISOString()
    };

    fs.writeFileSync(filename, JSON.stringify(data, null, 2));
}

// Export discovered URLs
exportUrlsToJson(discoveredUrls);

Best Practices for URL Extraction

  1. Set appropriate limits: When crawling large sites, use the limit parameter to control the number of pages crawled and avoid excessive API usage.

  2. Use include/exclude patterns: Narrow down your crawl to specific sections of a website to improve efficiency and reduce noise.

  3. Handle relative URLs: Convert relative URLs to absolute URLs for consistency:

from urllib.parse import urljoin

base_url = 'https://example.com'
relative_url = '/page/about'
absolute_url = urljoin(base_url, relative_url)

  4. Respect rate limits: Firecrawl handles rate limiting automatically, but be mindful of your API quota when processing large URL lists.

  5. Monitor crawl progress: For large crawls, use the async crawl endpoint to avoid timeouts (a sketch follows this list), similar to techniques used when monitoring network requests in Puppeteer.
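
The sketch below shows one way to start a crawl asynchronously and poll it until it finishes. It assumes the Python SDK exposes async_crawl_url and check_crawl_status and the response fields shown; exact method names and fields vary between SDK versions, so check the documentation for your release.

import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Start the crawl without blocking (method name assumed; varies by SDK version)
crawl_job = app.async_crawl_url(
    'https://example.com',
    params={
        'limit': 500,
        'scrapeOptions': {'formats': ['links']}
    }
)

# Poll the job until it completes, then collect the discovered URLs
# ('id', 'status', 'completed', 'total', and 'data' are assumed response fields)
while True:
    status = app.check_crawl_status(crawl_job['id'])
    if status['status'] == 'completed':
        urls = [page['metadata']['sourceURL'] for page in status['data']]
        print(f"Crawl finished with {len(urls)} URLs")
        break
    print(f"Progress: {status.get('completed', 0)}/{status.get('total', '?')} pages")
    time.sleep(5)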

Comparing Firecrawl to Traditional Link Extraction

Unlike traditional web scraping tools that require you to manually configure browser automation or parse HTML with CSS selectors, Firecrawl provides a simplified API that handles:

  • JavaScript rendering: Automatically executes JavaScript before extracting links
  • Link normalization: Converts relative URLs to absolute URLs
  • Duplicate detection: Identifies and handles duplicate URLs during crawling
  • Sitemap support: Can crawl websites using their sitemap for comprehensive URL discovery

This makes Firecrawl particularly efficient for URL extraction tasks compared to manually navigating to different pages using Puppeteer or other browser automation tools.
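
For comparison, here is a minimal sketch of the normalization and deduplication work you would otherwise handle yourself when pulling links out of raw HTML with BeautifulSoup (the HTML snippet and base URL are purely illustrative):

from urllib.parse import urljoin, urldefrag
from bs4 import BeautifulSoup

def extract_unique_links(html, base_url):
    """Collect absolute, de-duplicated links from raw HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    for anchor in soup.find_all('a', href=True):
        # Resolve relative hrefs against the page URL and drop #fragments
        absolute, _fragment = urldefrag(urljoin(base_url, anchor['href']))
        links.add(absolute)
    return sorted(links)

# Purely illustrative input
html = '<a href="/about">About</a> <a href="/about#team">Team</a> <a href="https://example.com/blog">Blog</a>'
print(extract_unique_links(html, 'https://example.com'))
# ['https://example.com/about', 'https://example.com/blog']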

Conclusion

Firecrawl provides robust capabilities for extracting URL lists from web pages through its crawl and scrape endpoints. Whether you need to discover all links on a website, extract specific URLs from a single page, or capture structured link data with metadata, Firecrawl offers flexible solutions that handle JavaScript rendering and link normalization automatically.

The choice between crawling, scraping, and structured extraction depends on your specific use case: use crawling for site-wide URL discovery, scraping for single-page link extraction, and structured extraction when you need precise control over which links to extract and their associated data.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
