How do I integrate Firecrawl with LangChain?
Firecrawl is a powerful web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. When integrated with LangChain, it becomes an essential tool for building Retrieval-Augmented Generation (RAG) applications, chatbots, and AI agents that need to access and process web content.
LangChain ships an official FireCrawlLoader document loader for both Python and JavaScript, and Firecrawl's client libraries are easy to expose as agent tools, making it straightforward to incorporate web scraping capabilities into your AI workflows.
Understanding Firecrawl and LangChain Integration
The integration between Firecrawl and LangChain serves two primary use cases:
- Document Loading: Using Firecrawl as a document loader to scrape and ingest web content into your RAG pipeline
- Agent Tools: Providing Firecrawl capabilities as tools that AI agents can use to retrieve web information dynamically
Both approaches leverage Firecrawl's ability to handle JavaScript-rendered content, bypass anti-bot measures, and convert HTML into clean markdown format suitable for LLM processing.
Prerequisites
Before integrating Firecrawl with LangChain, you'll need:
- A Firecrawl API key (get one from firecrawl.dev)
- Python 3.8+ or Node.js 16+ installed
- LangChain library installed in your environment
Python Integration
Installation
First, install the required packages:
pip install langchain langchain-community langchain-openai firecrawl-py
(The RAG example below also uses FAISS, which requires faiss-cpu or faiss-gpu.)
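Before wiring anything into LangChain, it can help to confirm your API key works with a direct call to the Firecrawl client. A minimal sketch, assuming FIRECRAWL_API_KEY is set in your environment (the exact response shape varies across firecrawl-py versions):
import os
from firecrawl import FirecrawlApp  # installed with firecrawl-py above

# Assumes FIRECRAWL_API_KEY is already set in your environment
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# A quick one-page scrape confirms the key and network path work;
# the response shape differs across firecrawl-py versions, so just print it
result = app.scrape_url("https://example.com")
print(str(result)[:200])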
Using Firecrawl as a Document Loader
The FireCrawlLoader class allows you to scrape web pages and load them as LangChain documents:
from langchain_community.document_loaders import FireCrawlLoader
import os

# Set your Firecrawl API key
os.environ["FIRECRAWL_API_KEY"] = "your-api-key-here"

# Initialize the loader with a URL
loader = FireCrawlLoader(
    url="https://example.com/docs",
    mode="scrape"  # Options: "scrape" or "crawl"
)

# Load documents
documents = loader.load()

# Access the content
for doc in documents:
    print(f"URL: {doc.metadata.get('url')}")  # some Firecrawl versions report 'sourceURL' instead
    print(f"Content: {doc.page_content[:200]}...")
Crawling Multiple Pages
To crawl an entire website or section, use the crawl mode. (The parameter names below follow Firecrawl's v0 API; newer API versions move options such as limit and maxDepth to the top level, so match them to the Firecrawl version your account uses.)
from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        'crawlerOptions': {
            'limit': 100,   # Maximum pages to crawl
            'maxDepth': 3   # Maximum crawl depth
        }
    }
)

# This will return documents from all crawled pages
documents = loader.load()
print(f"Loaded {len(documents)} documents")
Building a RAG Application with Firecrawl
Here's a complete example that combines Firecrawl document loading with a vector store and retrieval chain:
from langchain_community.document_loaders import FireCrawlLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Load documents from website
loader = FireCrawlLoader(
    url="https://docs.myapp.com",
    mode="crawl",
    params={'crawlerOptions': {'limit': 50}}
)
documents = loader.load()

# 2. Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)

# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Query the system
response = qa_chain.invoke({"query": "How do I configure authentication?"})
print(response["result"])
Using Firecrawl as an Agent Tool
For dynamic web scraping within an AI agent workflow, expose Firecrawl as a tool. Dedicated Firecrawl tool classes have moved around between LangChain releases, so a dependable approach is to wrap the firecrawl-py client in a small custom tool:
from langchain.agents import AgentType, initialize_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="your-api-key-here")

@tool
def scrape_website(url: str) -> str:
    """Scrape a web page with Firecrawl and return its content as markdown."""
    result = app.scrape_url(url)
    # Response shape varies by firecrawl-py version; v0 clients return a dict
    return result.get("markdown", "") if isinstance(result, dict) else str(result)

# Create agent with the Firecrawl-backed tool
llm = ChatOpenAI(model="gpt-4", temperature=0)
agent = initialize_agent(
    tools=[scrape_website],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Agent can now scrape websites as needed
response = agent.run(
    "Go to https://example.com/pricing and tell me their monthly subscription cost"
)
print(response)
JavaScript/TypeScript Integration
Installation
Install the required packages:
npm install langchain @langchain/community @langchain/openai @mendable/firecrawl-js
Using Firecrawl Document Loader in JavaScript
import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "scrape"
});

const documents = await loader.load();

documents.forEach(doc => {
  console.log(`URL: ${doc.metadata.url}`); // some Firecrawl versions report sourceURL instead
  console.log(`Content: ${doc.pageContent.substring(0, 200)}...`);
});
Crawling with JavaScript
import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "crawl",
  params: {
    crawlerOptions: {
      limit: 100,
      maxDepth: 3
    }
  }
});

const documents = await loader.load();
console.log(`Loaded ${documents.length} documents`);
Complete RAG Implementation in TypeScript
import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RetrievalQAChain } from "langchain/chains";

// Load documents
const loader = new FireCrawlLoader({
  url: "https://docs.example.com",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "crawl",
  params: { crawlerOptions: { limit: 50 } }
});
const docs = await loader.load();

// Split documents
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const splits = await textSplitter.splitDocuments(docs);

// Create vector store
const embeddings = new OpenAIEmbeddings();
const vectorStore = await MemoryVectorStore.fromDocuments(splits, embeddings);

// Create QA chain
const llm = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 });
const chain = RetrievalQAChain.fromLLM(llm, vectorStore.asRetriever({ k: 3 }));

// Query
const response = await chain.call({
  query: "How do I configure authentication?"
});
console.log(response.text);
Advanced Configuration Options
Customizing Scrape Parameters
Firecrawl supports various parameters to customize scraping behavior:
loader = FireCrawlLoader(
    url="https://example.com",
    mode="scrape",
    params={
        'pageOptions': {
            'onlyMainContent': True,  # Extract only main content
            'includeHtml': False,     # Return markdown only
            'waitFor': 1000           # Wait time in milliseconds
        }
    }
)
Handling Authentication
For scraping authenticated pages:
loader = FireCrawlLoader(
    url="https://app.example.com/dashboard",
    mode="scrape",
    params={
        'pageOptions': {
            'headers': {
                'Authorization': 'Bearer your-token-here'
            }
        }
    }
)
Selective Crawling with URL Patterns
Control which pages to crawl using include/exclude patterns:
loader = FireCrawlLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={
        'crawlerOptions': {
            'includes': ['/docs/**'],  # Only crawl docs section
            'excludes': ['/blog/**'],  # Skip blog section
            'limit': 100
        }
    }
)
Best Practices
Rate Limiting: Respect the target website's resources by setting conservative crawl limits and pacing your requests; well-chosen limits keep you from overloading servers.
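For instance, here is a hedged sketch of a throttled multi-site loader, assuming sequential loads with a fixed pause between sites (the URLs and delay value are illustrative):
import time
from langchain_community.document_loaders import FireCrawlLoader

urls = ["https://docs.example.com", "https://blog.example.com"]  # illustrative targets
all_documents = []

for url in urls:
    # Keep per-site crawl limits low and pause between sites
    loader = FireCrawlLoader(
        url=url,
        mode="crawl",
        params={"crawlerOptions": {"limit": 20}},
    )
    all_documents.extend(loader.load())
    time.sleep(5)  # illustrative pause; tune to the target's tolerance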
Error Handling: Always implement error handling for network issues and API failures:
from langchain_community.document_loaders import FireCrawlLoader

try:
    loader = FireCrawlLoader(url="https://example.com", mode="scrape")
    documents = loader.load()
except Exception as e:
    print(f"Failed to load documents: {e}")
    # Implement fallback logic
Caching: Cache scraped content to avoid redundant API calls:
import pickle
from pathlib import Path
from langchain_community.document_loaders import FireCrawlLoader

cache_file = Path("firecrawl_cache.pkl")

if cache_file.exists():
    with open(cache_file, "rb") as f:
        documents = pickle.load(f)
else:
    loader = FireCrawlLoader(url="https://example.com", mode="crawl")
    documents = loader.load()
    with open(cache_file, "wb") as f:
        pickle.dump(documents, f)
Content Validation: Verify that scraped content meets quality standards before ingestion:
def is_valid_document(doc):
    return len(doc.page_content) > 100 and doc.metadata.get('url')

valid_documents = [doc for doc in documents if is_valid_document(doc)]
Monitoring and Debugging
Enable verbose logging to debug integration issues:
import logging

# Turn on debug output globally, then scope it down if it's too noisy
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("langchain_community.document_loaders").setLevel(logging.DEBUG)
Track crawl progress and results:
loader = FireCrawlLoader(url="https://example.com", mode="crawl")
documents = loader.load()
print(f"Total documents: {len(documents)}")
print(f"Unique URLs: {len(set(doc.metadata['url'] for doc in documents))}")
print(f"Total content length: {sum(len(doc.page_content) for doc in documents)}")
Common Use Cases
Documentation Q&A Bot
Build a chatbot that answers questions about your product documentation by crawling your docs site and creating a RAG system.
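A minimal sketch of the conversational layer, assuming the vectorstore built from crawled docs in the RAG example above:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

# Assumes `vectorstore` was built from crawled docs as in the RAG example above
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chat_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    retriever=vectorstore.as_retriever(),
    memory=memory,
)
print(chat_chain.invoke({"question": "How do I configure authentication?"})["answer"])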
Competitive Intelligence
Monitor competitor websites by periodically crawling them and analyzing changes using LangChain's analysis capabilities.
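One hedged approach to change detection is to hash each page's content between runs and flag differences (the cache path and target URL are illustrative):
import hashlib
import json
from pathlib import Path
from langchain_community.document_loaders import FireCrawlLoader

hash_file = Path("page_hashes.json")  # illustrative cache location
old_hashes = json.loads(hash_file.read_text()) if hash_file.exists() else {}

documents = FireCrawlLoader(url="https://competitor.example.com", mode="crawl").load()

new_hashes = {}
for doc in documents:
    url = doc.metadata.get("url") or doc.metadata.get("sourceURL", "unknown")
    digest = hashlib.sha256(doc.page_content.encode()).hexdigest()
    new_hashes[url] = digest
    if old_hashes.get(url) != digest:
        print(f"Changed since last run: {url}")

hash_file.write_text(json.dumps(new_hashes))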
Content Aggregation
Aggregate content from multiple sources for newsletter generation, market research, or trend analysis.
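A short sketch, assuming a handful of illustrative source URLs scraped into one corpus:
from langchain_community.document_loaders import FireCrawlLoader

sources = [  # illustrative source list
    "https://news.example.com/ai",
    "https://research.example.org/weekly",
]

aggregated = []
for url in sources:
    aggregated.extend(FireCrawlLoader(url=url, mode="scrape").load())

# Combine into one corpus for downstream summarization or analysis
corpus = "\n\n---\n\n".join(doc.page_content for doc in aggregated)
print(f"Aggregated {len(aggregated)} documents, {len(corpus)} characters")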
Data Pipeline for AI Training
Create a data pipeline that scrapes web content, processes it with LangChain, and prepares it for fine-tuning language models.
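For instance, a hedged sketch that writes cleaned chunks to JSONL, a common fine-tuning input format (the field names are illustrative; match your trainer's schema):
import json
from langchain_community.document_loaders import FireCrawlLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = FireCrawlLoader(url="https://docs.example.com", mode="crawl").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
chunks = splitter.split_documents(documents)

with open("training_data.jsonl", "w") as f:
    for chunk in chunks:
        # "text" and "source" are illustrative field names
        record = {"text": chunk.page_content, "source": chunk.metadata.get("url", "")}
        f.write(json.dumps(record) + "\n")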
Troubleshooting
Issue: Documents are empty or incomplete
- Solution: Increase the waitFor parameter to give JavaScript-rendered content time to load
Issue: API rate limit exceeded
- Solution: Implement exponential backoff (see the sketch below) or reduce crawl limits
Issue: Metadata missing from documents
- Solution: Ensure you're using the latest version of langchain-community
Issue: Unable to scrape authenticated pages
- Solution: Verify authentication headers are correctly configured in pageOptions
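A minimal backoff sketch for the rate-limit case, assuming you simply want to retry the whole load (retry count and delays are illustrative):
import time
from langchain_community.document_loaders import FireCrawlLoader

def load_with_backoff(url, retries=4):
    # Retry the load with exponentially growing delays (1s, 2s, 4s, ...)
    for attempt in range(retries):
        try:
            return FireCrawlLoader(url=url, mode="scrape").load()
        except Exception as e:
            if attempt == retries - 1:
                raise
            delay = 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            time.sleep(delay)

documents = load_with_backoff("https://example.com")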
Conclusion
Integrating Firecrawl with LangChain provides a powerful combination for building AI applications that need to access and process web content. The official document loaders and tools make it easy to incorporate web scraping into RAG pipelines, agent workflows, and other LLM-powered applications.
By following the examples and best practices outlined above, you can build robust, production-ready systems that leverage the strengths of both Firecrawl's web scraping capabilities and LangChain's AI orchestration framework.