How do I integrate Firecrawl with LangChain?

Firecrawl is a powerful web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. When integrated with LangChain, it becomes an essential tool for building Retrieval-Augmented Generation (RAG) applications, chatbots, and AI agents that need to access and process web content.

LangChain provides official integrations with Firecrawl through both document loaders and tools, making it straightforward to incorporate web scraping capabilities into your AI workflows.

Understanding Firecrawl and LangChain Integration

The integration between Firecrawl and LangChain serves two primary use cases:

  1. Document Loading: Using Firecrawl as a document loader to scrape and ingest web content into your RAG pipeline
  2. Agent Tools: Providing Firecrawl capabilities as tools that AI agents can use to retrieve web information dynamically

Both approaches leverage Firecrawl's ability to handle JavaScript-rendered content, bypass anti-bot measures, and convert HTML into clean markdown format suitable for LLM processing.

Prerequisites

Before integrating Firecrawl with LangChain, you'll need:

  • A Firecrawl API key (get one from firecrawl.dev)
  • Python 3.8+ or Node.js 16+ installed
  • LangChain library installed in your environment

Python Integration

Installation

First, install the required packages (langchain-openai and faiss-cpu are used by the RAG example below):

pip install langchain langchain-community langchain-openai firecrawl-py faiss-cpu

Using Firecrawl as a Document Loader

The FireCrawlLoader class allows you to scrape web pages and load them as LangChain documents:

from langchain_community.document_loaders import FireCrawlLoader
import os

# Set your Firecrawl API key
os.environ["FIRECRAWL_API_KEY"] = "your-api-key-here"

# Initialize the loader with a URL
loader = FireCrawlLoader(
    url="https://example.com/docs",
    mode="scrape"  # Options: "scrape" or "crawl"
)

# Load documents
documents = loader.load()

# Access the content
for doc in documents:
    # Firecrawl returns the page URL as 'sourceURL' in metadata (key names can vary by version)
    print(f"URL: {doc.metadata.get('sourceURL', doc.metadata.get('url'))}")
    print(f"Content: {doc.page_content[:200]}...")

Crawling Multiple Pages

To crawl an entire website or section, use the crawl mode:

from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        'crawlerOptions': {
            'limit': 100,  # Maximum pages to crawl
            'maxDepth': 3   # Maximum crawl depth
        }
    }
)

# This will return documents from all crawled pages
documents = loader.load()
print(f"Loaded {len(documents)} documents")

Building a RAG Application with Firecrawl

Here's a complete example that combines Firecrawl document loading with a vector store and retrieval chain:

from langchain_community.document_loaders import FireCrawlLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Load documents from website
loader = FireCrawlLoader(
    url="https://docs.myapp.com",
    mode="crawl",
    params={'crawlerOptions': {'limit': 50}}
)
documents = loader.load()

# 2. Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)

# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Query the system
response = qa_chain.invoke({"query": "How do I configure authentication?"})
print(response["result"])

Using Firecrawl as an Agent Tool

For dynamic web scraping within an AI agent workflow, you can expose Firecrawl as a tool. At the time of writing, langchain-community does not ship a dedicated Firecrawl tool, so a simple approach is to wrap the firecrawl-py client in a generic Tool:

from firecrawl import FirecrawlApp
from langchain.agents import AgentType, initialize_agent
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

# Initialize the Firecrawl client
app = FirecrawlApp(api_key="your-api-key-here")

def scrape_website(url: str) -> str:
    """Scrape a single URL with Firecrawl and return its markdown content."""
    result = app.scrape_url(url.strip())
    # firecrawl-py has returned both dicts and response objects across versions
    if isinstance(result, dict):
        return result.get("markdown", "")
    return getattr(result, "markdown", "")

firecrawl_tool = Tool(
    name="scrape_website",
    description="Scrape a web page and return its content as markdown. Input must be a single URL.",
    func=scrape_website,
)

# Create agent with the Firecrawl-backed tool
llm = ChatOpenAI(model="gpt-4", temperature=0)
agent = initialize_agent(
    tools=[firecrawl_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# The agent can now scrape websites as needed
response = agent.run(
    "Go to https://example.com/pricing and tell me their monthly subscription cost"
)
print(response)

JavaScript/TypeScript Integration

Installation

Install the required packages (in recent LangChain.js versions the Firecrawl loader lives in @langchain/community; older versions exported it from langchain itself):

npm install langchain @langchain/community @langchain/openai @mendable/firecrawl-js

Using Firecrawl Document Loader in JavaScript

import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "scrape"
});

const documents = await loader.load();

documents.forEach(doc => {
  // Firecrawl returns the page URL as sourceURL in metadata (key names can vary by version)
  console.log(`URL: ${doc.metadata.sourceURL ?? doc.metadata.url}`);
  console.log(`Content: ${doc.pageContent.substring(0, 200)}...`);
});

Crawling with JavaScript

import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "crawl",
  params: {
    crawlerOptions: {
      limit: 100,
      maxDepth: 3
    }
  }
});

const documents = await loader.load();
console.log(`Loaded ${documents.length} documents`);

Complete RAG Implementation in TypeScript

import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RetrievalQAChain } from "langchain/chains";
import { ChatOpenAI } from "@langchain/openai";

// Load documents
const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "crawl",
  params: { crawlerOptions: { limit: 50 } }
});

const docs = await loader.load();

// Split documents
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});

const splits = await textSplitter.splitDocuments(docs);

// Create vector store
const embeddings = new OpenAIEmbeddings();
const vectorStore = await MemoryVectorStore.fromDocuments(
  splits,
  embeddings
);

// Create QA chain
const llm = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 });
const chain = RetrievalQAChain.fromLLM(
  llm,
  vectorStore.asRetriever({ k: 3 })
);

// Query
const response = await chain.call({
  query: "How do I configure authentication?"
});

console.log(response.text);

Advanced Configuration Options

Customizing Scrape Parameters

Firecrawl supports various parameters to customize scraping behavior. The examples in this section use the original (v0) parameter names (pageOptions, crawlerOptions, includes/excludes); newer Firecrawl API versions rename some of these (for example, scrapeOptions and includePaths/excludePaths), so match the names to the API version your loader targets:

loader = FireCrawlLoader(
    url="https://example.com",
    mode="scrape",
    params={
        'pageOptions': {
            'onlyMainContent': True,  # Extract only main content
            'includeHtml': False,     # Return markdown only
            'waitFor': 1000           # Wait time in milliseconds
        }
    }
)

Handling Authentication

For scraping authenticated pages:

loader = FireCrawlLoader(
    url="https://app.example.com/dashboard",
    mode="scrape",
    params={
        'pageOptions': {
            'headers': {
                'Authorization': 'Bearer your-token-here'
            }
        }
    }
)

Selective Crawling with URL Patterns

Control which pages to crawl using include/exclude patterns:

loader = FireCrawlLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        'crawlerOptions': {
            'includes': ['/docs/**'],  # Only crawl docs section
            'excludes': ['/blog/**'],  # Skip blog section
            'limit': 100
        }
    }
)

Best Practices

  1. Rate Limiting: Respect the target website's resources by setting appropriate crawl limits and depths; conservative settings also keep your API usage predictable.
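
For example, when scraping several pages individually, a short pause between requests keeps the load down (a minimal sketch; the URLs and delay are illustrative):

import time
from langchain_community.document_loaders import FireCrawlLoader

urls = ["https://example.com/a", "https://example.com/b"]  # illustrative URLs

documents = []
for url in urls:
    loader = FireCrawlLoader(url=url, mode="scrape")
    documents.extend(loader.load())
    time.sleep(2)  # pause between requests to avoid hammering the site or the API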

  2. Error Handling: Always implement error handling for network issues and API failures:

from langchain_community.document_loaders import FireCrawlLoader

try:
    loader = FireCrawlLoader(url="https://example.com", mode="scrape")
    documents = loader.load()
except Exception as e:
    print(f"Failed to load documents: {e}")
    # Implement fallback logic

  3. Caching: Cache scraped content to avoid redundant API calls:

import pickle
from pathlib import Path

cache_file = Path("firecrawl_cache.pkl")

if cache_file.exists():
    with open(cache_file, "rb") as f:
        documents = pickle.load(f)
else:
    loader = FireCrawlLoader(url="https://example.com", mode="crawl")
    documents = loader.load()
    with open(cache_file, "wb") as f:
        pickle.dump(documents, f)

  4. Content Validation: Verify that scraped content meets quality standards before ingestion:

def is_valid_document(doc):
    # Require non-trivial content and a source URL in metadata
    return len(doc.page_content) > 100 and doc.metadata.get('sourceURL')

valid_documents = [doc for doc in documents if is_valid_document(doc)]

Monitoring and Debugging

Enable verbose logging to debug integration issues:

import logging

logging.basicConfig(level=logging.DEBUG)
# Optionally scope verbosity to the Firecrawl loader module
logging.getLogger("langchain_community.document_loaders.firecrawl").setLevel(logging.DEBUG)

Track crawl progress and results:

loader = FireCrawlLoader(url="https://example.com", mode="crawl")
documents = loader.load()

print(f"Total documents: {len(documents)}")
print(f"Unique URLs: {len(set(doc.metadata['url'] for doc in documents))}")
print(f"Total content length: {sum(len(doc.page_content) for doc in documents)}")

Common Use Cases

Documentation Q&A Bot

Build a chatbot that answers questions about your product documentation by crawling your docs site and building a RAG system, as in the complete example above.

Competitive Intelligence

Monitor competitor websites by periodically crawling them and analyzing changes using LangChain's analysis capabilities.
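
A minimal change-detection sketch (the competitor URL and cache file are hypothetical; a production pipeline would store hashes in a database and diff content semantically):

import hashlib
from pathlib import Path
from langchain_community.document_loaders import FireCrawlLoader

documents = FireCrawlLoader(url="https://competitor.example.com/pricing", mode="scrape").load()
content = documents[0].page_content

digest = hashlib.sha256(content.encode()).hexdigest()
hash_file = Path("pricing_page.sha256")  # hypothetical local cache of the last-seen hash

if hash_file.exists() and hash_file.read_text() == digest:
    print("No change since last check")
else:
    print("Page changed; re-run analysis")
    hash_file.write_text(digest)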

Content Aggregation

Aggregate content from multiple sources for newsletter generation, market research, or trend analysis.
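
A sketch of aggregating several sources into one document list (the source URLs are placeholders):

from langchain_community.document_loaders import FireCrawlLoader

sources = [
    "https://blog.example.com",
    "https://news.example.org",
]  # placeholder source URLs

all_documents = []
for url in sources:
    all_documents.extend(FireCrawlLoader(url=url, mode="scrape").load())

print(f"Aggregated {len(all_documents)} documents from {len(sources)} sources")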

Data Pipeline for AI Training

Create a data pipeline that scrapes web content, processes it with LangChain, and prepares it for fine-tuning language models.
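
A sketch of exporting crawled content to JSONL for a downstream training job (the output path and record schema are illustrative):

import json
from langchain_community.document_loaders import FireCrawlLoader

documents = FireCrawlLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={'crawlerOptions': {'limit': 50}}
).load()

with open("training_data.jsonl", "w") as f:  # illustrative output path
    for doc in documents:
        record = {"text": doc.page_content, "source": doc.metadata.get("sourceURL")}
        f.write(json.dumps(record) + "\n")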

Troubleshooting

Issue: Documents are empty or incomplete - Solution: Increase the waitFor parameter to give JavaScript-rendered content time to load before the page is captured

Issue: API rate limit exceeded - Solution: Implement exponential backoff (see the sketch after this list) or reduce crawl limits

Issue: Metadata missing from documents - Solution: Ensure you're using the latest version of langchain-community

Issue: Unable to scrape authenticated pages - Solution: Verify authentication headers are correctly configured in pageOptions
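
For the rate-limit case, a minimal retry-with-backoff sketch (the retry count and delays are illustrative; tune them to your plan's limits):

import time
from langchain_community.document_loaders import FireCrawlLoader

def load_with_backoff(url, mode="scrape", max_retries=5):
    """Retry loader.load() with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return FireCrawlLoader(url=url, mode=mode).load()
        except Exception as e:
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"Failed to load {url} after {max_retries} attempts")

documents = load_with_backoff("https://example.com")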

Conclusion

Integrating Firecrawl with LangChain provides a powerful combination for building AI applications that need to access and process web content. The official document loaders and tools make it easy to incorporate web scraping into RAG pipelines, agent workflows, and other LLM-powered applications.

By following the examples and best practices outlined above, you can build robust, production-ready systems that leverage the strengths of both Firecrawl's web scraping capabilities and LangChain's AI orchestration framework.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
