How Do I Set Up an MCP Server for Web Scraping?
Setting up an MCP (Model Context Protocol) server for web scraping enables you to create a standardized interface for AI assistants and applications to interact with web scraping capabilities. MCP servers act as intermediaries that expose web scraping tools, resources, and prompts through a unified protocol, making it easier to integrate scraping functionality into AI-powered workflows.
Understanding MCP Servers for Web Scraping
The Model Context Protocol (MCP) is an open standard that allows AI applications to securely connect to data sources and tools. For web scraping, an MCP server provides:
- Standardized Tools: Expose scraping functions as MCP tools that AI assistants can call
- Resource Management: Serve scraped data and scraping configurations as MCP resources
- Prompt Templates: Provide pre-built prompts for common scraping tasks
- Security: Control access to scraping capabilities through proper authentication
Prerequisites
Before setting up an MCP server for web scraping, ensure you have:
- Node.js 16 or higher installed
- Basic understanding of TypeScript or JavaScript
- Familiarity with web scraping concepts
- A scraping library (Puppeteer, Playwright, or an API like WebScraping.AI)
Installation and Setup
Step 1: Install the MCP SDK
First, create a new Node.js project and install the MCP SDK:
mkdir mcp-webscraping-server
cd mcp-webscraping-server
npm init -y
npm install @modelcontextprotocol/sdk
npm install puppeteer # or your preferred scraping library
npm install typescript @types/node ts-node --save-dev
Step 2: Initialize TypeScript Configuration
Create a tsconfig.json file:
npx tsc --init
Update the configuration to target modern JavaScript and emit ES modules (matching the "type": "module" field in the package.json shown in Step 4):
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "Node16",
    "moduleResolution": "Node16",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}
Step 3: Create the MCP Server
Create a file src/index.ts with your MCP server implementation:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import puppeteer from "puppeteer";

// Create the MCP server
const server = new Server(
  {
    name: "webscraping-mcp-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available scraping tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_html",
        description: "Scrape the HTML content from a given URL",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            waitForSelector: {
              type: "string",
              description: "Optional CSS selector to wait for before scraping",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "scrape_text",
        description: "Extract clean text content from a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            selector: {
              type: "string",
              description: "Optional CSS selector to extract text from",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "scrape_with_javascript",
        description: "Execute custom JavaScript on a page and return results",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to visit",
            },
            script: {
              type: "string",
              description: "JavaScript code to execute on the page",
            },
          },
          required: ["url", "script"],
        },
      },
    ],
  };
});

// Implement tool execution handlers
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;
  try {
    if (name === "scrape_html") {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto(args.url as string, { waitUntil: "networkidle0" });
      if (args.waitForSelector) {
        await page.waitForSelector(args.waitForSelector as string);
      }
      const html = await page.content();
      await browser.close();
      return {
        content: [
          {
            type: "text",
            text: html,
          },
        ],
      };
    }

    if (name === "scrape_text") {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto(args.url as string, { waitUntil: "networkidle0" });
      let text: string;
      if (args.selector) {
        text = await page.$eval(
          args.selector as string,
          (el) => el.textContent || ""
        );
      } else {
        text = await page.evaluate(() => document.body.innerText);
      }
      await browser.close();
      return {
        content: [
          {
            type: "text",
            text: text.trim(),
          },
        ],
      };
    }

    if (name === "scrape_with_javascript") {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto(args.url as string, { waitUntil: "networkidle0" });
      const result = await page.evaluate(args.script as string);
      await browser.close();
      return {
        content: [
          {
            type: "text",
            text: JSON.stringify(result, null, 2),
          },
        ],
      };
    }

    throw new Error(`Unknown tool: ${name}`);
  } catch (error) {
    return {
      content: [
        {
          type: "text",
          text: `Error: ${error instanceof Error ? error.message : String(error)}`,
        },
      ],
      isError: true,
    };
  }
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("WebScraping MCP Server running on stdio");
}

main().catch((error) => {
  console.error("Server error:", error);
  process.exit(1);
});
Step 4: Add Build Scripts
Update your package.json to include build and start scripts:
{
  "name": "mcp-webscraping-server",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "ts-node --esm src/index.ts"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "^0.5.0",
    "puppeteer": "^21.0.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0",
    "ts-node": "^10.9.0",
    "typescript": "^5.0.0"
  }
}
Building the Server
Compile the TypeScript code:
npm run build
This creates the compiled JavaScript files in the dist directory.
Configuring Claude Desktop to Use Your MCP Server
To use your MCP server with Claude Desktop, add it to the configuration file:
On macOS
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "webscraping": {
      "command": "node",
      "args": ["/absolute/path/to/mcp-webscraping-server/dist/index.js"]
    }
  }
}
On Windows
Edit %APPDATA%\Claude\claude_desktop_config.json:
{
  "mcpServers": {
    "webscraping": {
      "command": "node",
      "args": ["C:\\absolute\\path\\to\\mcp-webscraping-server\\dist\\index.js"]
    }
  }
}
Advanced Features
Adding API-Based Scraping
For production use, consider integrating a robust scraping API instead of running browsers locally. Here's how to modify the server to use WebScraping.AI (the example imports node-fetch, so install it with npm install node-fetch, or use the global fetch built into Node.js 18+):
import fetch from "node-fetch";

const WEBSCRAPING_AI_API_KEY = process.env.WEBSCRAPING_AI_API_KEY;

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  if (name === "scrape_with_api") {
    const url = args.url as string;
    const apiUrl = `https://api.webscraping.ai/html?api_key=${WEBSCRAPING_AI_API_KEY}&url=${encodeURIComponent(url)}`;
    const response = await fetch(apiUrl);
    const html = await response.text();
    return {
      content: [
        {
          type: "text",
          text: html,
        },
      ],
    };
  }

  // ... other handlers
});
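If you go this route, remember that any new tool must also be announced by the ListToolsRequestSchema handler, or clients will never discover it. A minimal entry for the scrape_with_api tool, in the same style as the earlier tool definitions, might look like this:
{
  name: "scrape_with_api",
  description: "Fetch a page's rendered HTML through the WebScraping.AI API",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "The URL to scrape",
      },
    },
    required: ["url"],
  },
},
You will also need WEBSCRAPING_AI_API_KEY available in the environment the server is launched with; when using Claude Desktop, it can be supplied through the env field of the server's entry in claude_desktop_config.json.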
Implementing Resource Providers
MCP servers can also expose resources. Here's how to add scraped data as a resource (the Server constructor should then declare a resources capability alongside tools, e.g. capabilities: { tools: {}, resources: {} }):
import {
  ListResourcesRequestSchema,
  ReadResourceRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

// Store cached scraping results
const scrapedCache = new Map<string, string>();

server.setRequestHandler(ListResourcesRequestSchema, async () => {
  return {
    resources: Array.from(scrapedCache.keys()).map((url) => ({
      uri: `scraped://${url}`,
      name: `Scraped content from ${url}`,
      mimeType: "text/html",
    })),
  };
});

server.setRequestHandler(ReadResourceRequestSchema, async (request) => {
  const url = request.params.uri.replace("scraped://", "");
  const content = scrapedCache.get(url);
  if (!content) {
    throw new Error(`No cached content for ${url}`);
  }
  return {
    contents: [
      {
        uri: request.params.uri,
        mimeType: "text/html",
        text: content,
      },
    ],
  };
});
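None of the handlers shown so far actually write to this cache. As a sketch (assuming the scrapedCache map and the puppeteer import above are in scope), you could route scrapes through a small helper that records each result before returning it:
// Helper: scrape a URL and record the HTML in the resource cache.
async function scrapeAndCache(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    const html = await page.content();
    scrapedCache.set(url, html); // now discoverable as a scraped:// resource
    return html;
  } finally {
    await browser.close();
  }
}
Calling scrapeAndCache from the scrape_html handler instead of duplicating the Puppeteer boilerplate keeps the tool and resource sides of the server in sync.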
Error Handling and Timeouts
Implement robust error handling around Puppeteer's navigation timeouts:
async function scrapeWithTimeout(url: string, timeoutMs: number = 30000) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: "networkidle0",
      timeout: timeoutMs,
    });
    const html = await page.content();
    return html;
  } catch (error) {
    if (error instanceof Error && error.name === "TimeoutError") {
      throw new Error(`Page load timed out after ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    await browser.close();
  }
}
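With this helper in place, the scrape_html branch of the CallToolRequestSchema handler could be reduced to a single call, which also guarantees the browser is closed even when navigation fails (a sketch based on the handler from Step 3; the waitForSelector option would need to be threaded through the helper separately):
if (name === "scrape_html") {
  const html = await scrapeWithTimeout(args.url as string, 30000);
  return {
    content: [{ type: "text", text: html }],
  };
}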
Testing Your MCP Server
Create a test script to verify your server works correctly:
// test/test-server.ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function testServer() {
  // The stdio transport spawns and manages the server process itself.
  const transport = new StdioClientTransport({
    command: "node",
    args: ["dist/index.js"],
  });
  const client = new Client(
    {
      name: "test-client",
      version: "1.0.0",
    },
    {
      capabilities: {},
    }
  );
  await client.connect(transport);

  // List available tools
  const tools = await client.listTools();
  console.log("Available tools:", tools);

  // Test scraping
  const result = await client.callTool({
    name: "scrape_html",
    arguments: {
      url: "https://example.com",
    },
  });
  console.log("Scraping result:", result);

  await client.close();
}

testServer().catch(console.error);
Run the test:
npm run build && npx ts-node --esm test/test-server.ts
Best Practices
- Rate Limiting: Implement rate limiting to avoid overwhelming target websites (see the sketch after this list)
- User Agents: Rotate user agents and other request headers to reduce the chance of being blocked
- Caching: Cache results to reduce redundant requests
- Error Handling: Provide clear error messages and implement retries
- Resource Cleanup: Always close browser instances to prevent memory leaks
- Security: Validate and sanitize all input URLs and scripts
- Logging: Implement comprehensive logging for debugging
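For the first of these points, a minimal per-host rate limiter can be built from a timestamp map. The sketch below assumes a fixed two-second gap per host, which you would tune for your targets:
// Minimal per-host rate limiter (illustrative delay, not tuned for any site).
const lastRequestByHost = new Map<string, number>();
const MIN_DELAY_MS = 2000;

async function respectRateLimit(url: string): Promise<void> {
  const host = new URL(url).hostname;
  const lastRequest = lastRequestByHost.get(host) ?? 0;
  const waitMs = Math.max(0, lastRequest + MIN_DELAY_MS - Date.now());
  if (waitMs > 0) {
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  lastRequestByHost.set(host, Date.now());
}
Calling await respectRateLimit(url) at the top of each tool handler keeps consecutive requests to the same host spaced out.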
Common Use Cases
Scraping Dynamic Content
When working with JavaScript-heavy sites, you may need to wait for AJAX-driven content or specific elements to load before scraping:
{
  name: "scrape_dynamic_content",
  description: "Scrape content from dynamic websites",
  inputSchema: {
    type: "object",
    properties: {
      url: { type: "string" },
      waitForSelector: { type: "string" },
      waitForTimeout: { type: "number", default: 5000 },
    },
    required: ["url", "waitForSelector"],
  },
}
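The definition above only declares the schema; a matching branch in the CallToolRequestSchema handler might look like the following sketch, which treats waitForTimeout as the selector timeout before capturing the HTML:
if (name === "scrape_dynamic_content") {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(args.url as string, { waitUntil: "networkidle0" });
    // Wait until the element the caller asked for has rendered.
    await page.waitForSelector(args.waitForSelector as string, {
      timeout: (args.waitForTimeout as number | undefined) ?? 5000,
    });
    const html = await page.content();
    return { content: [{ type: "text", text: html }] };
  } finally {
    await browser.close();
  }
}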
Monitoring Network Requests
For advanced scraping, you might want to capture API calls:
async function scrapeWithNetworkMonitoring(url: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const requests: any[] = [];
  page.on("request", (request) => {
    requests.push({
      url: request.url(),
      method: request.method(),
    });
  });
  await page.goto(url, { waitUntil: "networkidle0" });
  // Capture the page content before closing the browser.
  const html = await page.content();
  await browser.close();
  return { html, requests };
}
Troubleshooting
Server Not Starting
- Verify your Node.js version with node --version (should be 16+)
- Check for TypeScript compilation errors: npm run build
- Ensure all dependencies are installed: npm install
Claude Desktop Not Detecting Server
- Verify the path in claude_desktop_config.json is absolute
- Check that the compiled dist/index.js file exists
- Restart Claude Desktop after configuration changes
- Check Claude Desktop logs for error messages
Scraping Failures
- Test URLs manually in a browser first
- Increase timeout values for slow-loading pages and retry transient failures (see the sketch after this list)
- Check if the website blocks automated access
- Verify network connectivity and DNS resolution
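When failures are intermittent (flaky networks, occasional slow responses) rather than hard blocks, wrapping scrape calls in a small retry helper often resolves them. The sketch below uses a linear backoff with illustrative values:
// Retry an async scraping operation a few times with a growing delay.
async function withRetries<T>(operation: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Wait a little longer before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, attempt * 1000));
      }
    }
  }
  throw lastError;
}
For example, withRetries(() => scrapeWithTimeout(url)) combines it with the timeout helper from the error handling section.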
Conclusion
Setting up an MCP server for web scraping creates a powerful, reusable interface for AI assistants to interact with web data. By following this guide, you've created a foundation that can be extended with additional tools, resources, and capabilities as your scraping needs evolve. The standardized MCP protocol ensures compatibility with various AI applications while maintaining security and control over your scraping infrastructure.
For production deployments, consider using managed scraping services like WebScraping.AI to handle the complexity of browser automation, proxy rotation, and anti-bot bypass mechanisms, allowing your MCP server to focus on orchestrating scraping workflows rather than managing infrastructure.