How do I build a custom MCP server for web scraping?
Building a custom MCP (Model Context Protocol) server for web scraping allows you to create specialized tools that AI assistants can use to extract data from websites. An MCP server acts as a bridge between AI models like Claude and your web scraping infrastructure, providing structured interfaces for scraping operations.
Understanding MCP Architecture
The Model Context Protocol is an open standard that enables AI models to interact with external tools and data sources. When you build an MCP server for web scraping, you're creating a service that exposes scraping capabilities through a standardized interface that AI assistants can discover and use.
An MCP server consists of three main components:
- Tools - Functions that the AI can call to perform specific scraping tasks
- Resources - Static or dynamic data sources that provide context
- Prompts - Predefined templates for common scraping workflows (a brief example appears after the resource code later in this guide)
Setting Up Your Development Environment
Prerequisites
Before building your custom MCP server, ensure you have the following installed:
# For Python-based MCP servers
python3 --version # Python 3.10 or higher
pip install mcp
# For TypeScript-based MCP servers
node --version # Node.js 18 or higher; the MCP SDK itself is installed per project in the next step
Project Initialization
Create a new directory for your MCP server:
mkdir scraping-mcp-server
cd scraping-mcp-server
For a Python project:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install mcp beautifulsoup4 requests
For a TypeScript project:
npm init -y
npm install @modelcontextprotocol/sdk puppeteer cheerio axios
npm install --save-dev @types/node typescript ts-node
Building a Python MCP Server
Here's a complete example of a Python-based MCP server for web scraping:
from mcp.server import Server, NotificationOptions
from mcp.server.models import InitializationOptions
import mcp.server.stdio
import mcp.types as types
import requests
from bs4 import BeautifulSoup
import json
import re
# Create MCP server instance
app = Server("web-scraper")
@app.list_tools()
async def list_tools() -> list[types.Tool]:
"""List available scraping tools."""
return [
types.Tool(
name="scrape_html",
description="Scrape HTML content from a URL",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to scrape"
},
"selector": {
"type": "string",
"description": "CSS selector to extract specific elements"
}
},
"required": ["url"]
}
),
types.Tool(
name="extract_links",
description="Extract all links from a webpage",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to extract links from"
},
"filter_pattern": {
"type": "string",
"description": "Optional regex pattern to filter links"
}
},
"required": ["url"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
"""Handle tool calls from the AI assistant."""
if name == "scrape_html":
url = arguments["url"]
selector = arguments.get("selector")
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
if selector:
elements = soup.select(selector)
content = "\n".join([elem.get_text(strip=True) for elem in elements])
            else:
                # Remove script and style tags so get_text() returns readable page text
                for tag in soup(["script", "style"]):
                    tag.decompose()
                content = soup.get_text(strip=True)
return [types.TextContent(
type="text",
text=json.dumps({
"url": url,
"content": content[:5000], # Limit content size
"status": "success"
})
)]
except Exception as e:
return [types.TextContent(
type="text",
text=json.dumps({"error": str(e), "status": "failed"})
)]
elif name == "extract_links":
url = arguments["url"]
filter_pattern = arguments.get("filter_pattern")
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
            links = []
            # Compile the optional filter once instead of importing re inside the loop
            pattern = re.compile(filter_pattern) if filter_pattern else None
            for link in soup.find_all('a', href=True):
                href = link['href']
                if pattern is None or pattern.search(href):
                    links.append(href)
return [types.TextContent(
type="text",
text=json.dumps({
"url": url,
"links": links[:100], # Limit to 100 links
"total_count": len(links),
"status": "success"
})
)]
except Exception as e:
return [types.TextContent(
type="text",
text=json.dumps({"error": str(e), "status": "failed"})
)]
raise ValueError(f"Unknown tool: {name}")
async def main():
"""Run the MCP server."""
async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
await app.run(
read_stream,
write_stream,
InitializationOptions(
server_name="web-scraper",
server_version="1.0.0",
capabilities=app.get_capabilities(
notification_options=NotificationOptions(),
experimental_capabilities={}
)
)
)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
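Save this as scraper_server.py (the same filename is referenced in the configuration and testing sections below). You don't normally run it by hand; the MCP client launches it and talks to it over stdio.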
Building a TypeScript MCP Server
For more complex scraping scenarios that require browser automation with Puppeteer, here's a TypeScript implementation:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
CallToolRequestSchema,
ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import puppeteer from "puppeteer";
import * as cheerio from "cheerio";
// Create MCP server
const server = new Server(
{
name: "puppeteer-scraper",
version: "1.0.0",
},
{
capabilities: {
tools: {},
},
}
);
// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
return {
tools: [
{
name: "scrape_dynamic_page",
description: "Scrape JavaScript-rendered content using Puppeteer",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "The URL to scrape",
},
waitForSelector: {
type: "string",
description: "CSS selector to wait for before scraping",
},
extractSelector: {
type: "string",
description: "CSS selector to extract content from",
},
},
required: ["url"],
},
},
{
name: "take_screenshot",
description: "Take a screenshot of a webpage",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "The URL to screenshot",
},
fullPage: {
type: "boolean",
description: "Capture full page or viewport only",
},
},
required: ["url"],
},
},
],
};
});
// Handle tool calls
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
if (name === "scrape_dynamic_page") {
const browser = await puppeteer.launch({ headless: "new" });
try {
const page = await browser.newPage();
await page.goto(args.url as string, { waitUntil: "networkidle2" });
// Wait for specific selector if provided
if (args.waitForSelector) {
await page.waitForSelector(args.waitForSelector as string, {
timeout: 10000,
});
}
const content = await page.content();
const $ = cheerio.load(content);
let extractedData: string;
if (args.extractSelector) {
extractedData = $(args.extractSelector as string)
.map((_, el) => $(el).text().trim())
.get()
.join("\n");
} else {
extractedData = $("body").text().trim();
}
return {
content: [
{
type: "text",
text: JSON.stringify({
url: args.url,
content: extractedData.slice(0, 5000),
status: "success",
}),
},
],
};
} catch (error) {
return {
content: [
{
type: "text",
text: JSON.stringify({
              error: error instanceof Error ? error.message : String(error),
status: "failed",
}),
},
],
isError: true,
};
} finally {
await browser.close();
}
}
if (name === "take_screenshot") {
const browser = await puppeteer.launch({ headless: "new" });
try {
const page = await browser.newPage();
await page.goto(args.url as string, { waitUntil: "networkidle2" });
const screenshot = await page.screenshot({
encoding: "base64",
fullPage: args.fullPage as boolean || false,
});
return {
content: [
{
type: "image",
data: screenshot,
mimeType: "image/png",
},
],
};
} catch (error) {
return {
content: [
{
type: "text",
text: JSON.stringify({
              error: error instanceof Error ? error.message : String(error),
status: "failed",
}),
},
],
isError: true,
};
} finally {
await browser.close();
}
}
throw new Error(`Unknown tool: ${name}`);
});
// Start the server
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
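  // Log to stderr: stdout is reserved for MCP protocol messages on the stdio transport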
console.error("Puppeteer MCP server running on stdio");
}
main().catch((error) => {
console.error("Server error:", error);
process.exit(1);
});
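Before wiring this server into a client, compile the TypeScript to JavaScript, for example with npx tsc, assuming a tsconfig.json that emits to dist/ so the dist/index.js path used in the client configuration below points at the built file.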
Adding Resources to Your MCP Server
Resources provide static or dynamic data that can be referenced by AI assistants. Here's how to add resource support:
@app.list_resources()
async def list_resources() -> list[types.Resource]:
"""List available resources."""
return [
types.Resource(
uri="scraper://config",
name="Scraper Configuration",
mimeType="application/json",
description="Current scraper configuration and limits"
)
]
@app.read_resource()
async def read_resource(uri: str) -> str:
"""Read a resource by URI."""
if uri == "scraper://config":
config = {
"max_concurrent_requests": 5,
"timeout_seconds": 30,
"user_agent": "CustomScraperBot/1.0",
"respect_robots_txt": True
}
return json.dumps(config, indent=2)
raise ValueError(f"Unknown resource: {uri}")
Configuring Your MCP Server
Create a configuration file for Claude Desktop or other MCP clients:
{
"mcpServers": {
"web-scraper": {
"command": "python",
"args": ["/path/to/your/scraper_server.py"],
"env": {
"USER_AGENT": "CustomBot/1.0"
}
},
"puppeteer-scraper": {
"command": "node",
"args": ["/path/to/your/dist/index.js"]
}
}
}
On macOS, add this to ~/Library/Application Support/Claude/claude_desktop_config.json. On Windows, use %APPDATA%\Claude\claude_desktop_config.json. Restart Claude Desktop after saving the file so the new servers are picked up.
Advanced Features
Handling Rate Limiting
Implement rate limiting to avoid overwhelming target servers:
import asyncio
from collections import defaultdict
import time
class RateLimiter:
def __init__(self, max_requests=10, time_window=60):
self.max_requests = max_requests
self.time_window = time_window
self.requests = defaultdict(list)
async def acquire(self, domain: str):
now = time.time()
# Remove old requests
self.requests[domain] = [
req_time for req_time in self.requests[domain]
if now - req_time < self.time_window
]
if len(self.requests[domain]) >= self.max_requests:
wait_time = self.time_window - (now - self.requests[domain][0])
await asyncio.sleep(wait_time)
        self.requests[domain].append(time.time())  # record the actual (post-wait) request time
# Usage in your tool
rate_limiter = RateLimiter(max_requests=10, time_window=60)
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "scrape_html":
from urllib.parse import urlparse
domain = urlparse(arguments["url"]).netloc
await rate_limiter.acquire(domain)
# ... rest of scraping logic
Error Handling and Retries
Implement robust error handling for your scraping operations:
async function scrapeWithRetry(
url: string,
maxRetries: number = 3
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const browser = await puppeteer.launch({ headless: "new" });
    try {
      const page = await browser.newPage();
      await page.goto(url, {
        waitUntil: "networkidle2",
        timeout: 30000,
      });
      return await page.content();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff before the next attempt
      await new Promise(resolve =>
        setTimeout(resolve, Math.pow(2, attempt) * 1000)
      );
    } finally {
      // Always release the browser, even if navigation or extraction fails
      await browser.close();
    }
  }
throw new Error("Max retries exceeded");
}
Testing Your MCP Server
Create a simple test script to verify your MCP server works correctly:
import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def test_scraper():
server_params = StdioServerParameters(
command="python",
args=["scraper_server.py"]
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# List available tools
tools = await session.list_tools()
print("Available tools:", json.dumps(tools, indent=2))
# Call a tool
result = await session.call_tool(
"scrape_html",
arguments={"url": "https://example.com"}
)
print("Scrape result:", json.dumps(result, indent=2))
if __name__ == "__main__":
asyncio.run(test_scraper())
Deployment Considerations
When deploying your MCP server for production use:
- Security: Validate and sanitize all URLs to prevent SSRF attacks (see the sketch after this list)
- Resource Limits: Set appropriate memory and CPU limits
- Logging: Implement comprehensive logging for debugging
- Monitoring: Track success rates, response times, and error rates
- Caching: Consider caching frequently accessed pages to reduce load
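For the security point, here is a minimal URL-validation sketch in Python; the helper name and the exact set of blocked address ranges are illustrative and should be adapted to your environment:
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject URLs that could be abused for SSRF against internal services."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the hostname and refuse private, loopback, link-local, and reserved addresses
        for info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                return False
    except (socket.gaierror, ValueError):
        return False
    return True

# In call_tool, before fetching:
# if not is_safe_url(arguments["url"]):
#     return [types.TextContent(type="text",
#         text=json.dumps({"error": "URL not allowed", "status": "failed"}))]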
Conclusion
Building a custom MCP server for web scraping empowers AI assistants with specialized scraping capabilities tailored to your needs. Whether you choose Python for simplicity or TypeScript for advanced browser automation, the Model Context Protocol provides a standardized way to expose your scraping tools to AI models. Start with basic HTML scraping and gradually add more sophisticated features like JavaScript rendering, screenshot capture, and data extraction as your requirements grow.
By following the examples and best practices outlined in this guide, you can create a robust, scalable MCP server that seamlessly integrates web scraping capabilities into AI-powered workflows.