How Do I Set Up an MCP Server for Web Scraping?

Setting up an MCP (Model Context Protocol) server for web scraping enables you to create a standardized interface for AI assistants and applications to interact with web scraping capabilities. MCP servers act as intermediaries that expose web scraping tools, resources, and prompts through a unified protocol, making it easier to integrate scraping functionality into AI-powered workflows.

Understanding MCP Servers for Web Scraping

The Model Context Protocol (MCP) is an open standard that allows AI applications to securely connect to data sources and tools. For web scraping, an MCP server provides:

  • Standardized Tools: Expose scraping functions as MCP tools that AI assistants can call
  • Resource Management: Serve scraped data and scraping configurations as MCP resources
  • Prompt Templates: Provide pre-built prompts for common scraping tasks
  • Security: Control access to scraping capabilities through proper authentication

Prerequisites

Before setting up an MCP server for web scraping, ensure you have:

  • Node.js 18 or higher installed (required by the MCP TypeScript SDK)
  • Basic understanding of TypeScript or JavaScript
  • Familiarity with web scraping concepts
  • A scraping library (Puppeteer, Playwright, or an API like WebScraping.AI)

Installation and Setup

Step 1: Install the MCP SDK

First, create a new Node.js project and install the MCP SDK:

mkdir mcp-webscraping-server
cd mcp-webscraping-server
npm init -y
npm install @modelcontextprotocol/sdk
npm install puppeteer  # or your preferred scraping library
npm install typescript @types/node ts-node --save-dev

Step 2: Initialize TypeScript Configuration

Create a tsconfig.json file:

npx tsc --init

Update the configuration to emit ES modules, which the MCP SDK expects and which matches the "type": "module" setting added to package.json in Step 4:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "Node16",
    "moduleResolution": "Node16",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}

Step 3: Create the MCP Server

Create a file src/index.ts with your MCP server implementation:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import puppeteer, { Browser } from "puppeteer";

// Create the MCP server
const server = new Server(
  {
    name: "webscraping-mcp-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available scraping tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_html",
        description: "Scrape the HTML content from a given URL",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            waitForSelector: {
              type: "string",
              description: "Optional CSS selector to wait for before scraping",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "scrape_text",
        description: "Extract clean text content from a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            selector: {
              type: "string",
              description: "Optional CSS selector to extract text from",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "scrape_with_javascript",
        description: "Execute custom JavaScript on a page and return results",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to visit",
            },
            script: {
              type: "string",
              description: "JavaScript code to execute on the page",
            },
          },
          required: ["url", "script"],
        },
      },
    ],
  };
});

// Implement tool execution handlers
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args = {} } = request.params;

  // Declare the browser outside try so the finally block can always close it
  let browser: Browser | undefined;
  try {
    if (name === "scrape_html") {
      browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      await page.goto(args.url as string, { waitUntil: "networkidle0" });

      if (args.waitForSelector) {
        await page.waitForSelector(args.waitForSelector as string);
      }

      const html = await page.content();

      return {
        content: [
          {
            type: "text",
            text: html,
          },
        ],
      };
    }

    if (name === "scrape_text") {
      browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      await page.goto(args.url as string, { waitUntil: "networkidle0" });

      let text: string;
      if (args.selector) {
        text = await page.$eval(
          args.selector as string,
          (el) => el.textContent || ""
        );
      } else {
        text = await page.evaluate(() => document.body.innerText);
      }

      return {
        content: [
          {
            type: "text",
            text: text.trim(),
          },
        ],
      };
    }

    if (name === "scrape_with_javascript") {
      browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      await page.goto(args.url as string, { waitUntil: "networkidle0" });

      const result = await page.evaluate(args.script as string);

      return {
        content: [
          {
            type: "text",
            text: JSON.stringify(result, null, 2),
          },
        ],
      };
    }

    throw new Error(`Unknown tool: ${name}`);
  } catch (error) {
    return {
      content: [
        {
          type: "text",
          text: `Error: ${error instanceof Error ? error.message : String(error)}`,
        },
      ],
      isError: true,
    };
  } finally {
    // Always release the browser, even when a scrape fails partway through
    if (browser) await browser.close();
  }
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
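  // Log to stderr: stdout is reserved for MCP protocol messages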
  console.error("WebScraping MCP Server running on stdio");
}

main().catch((error) => {
  console.error("Server error:", error);
  process.exit(1);
});

Step 4: Add Build Scripts

Update your package.json to include build and start scripts:

{
  "name": "mcp-webscraping-server",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "ts-node src/index.ts"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "^0.5.0",
    "puppeteer": "^21.0.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0",
    "ts-node": "^10.9.0",
    "typescript": "^5.0.0"
  }
}

Building the Server

Compile the TypeScript code:

npm run build

This creates the compiled JavaScript files in the dist directory.
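
You can sanity-check the build by running the server directly. It should print its startup message to stderr and then wait for MCP messages on stdin (exit with Ctrl+C):

node dist/index.js
# WebScraping MCP Server running on stdio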

Configuring Claude Desktop to Use Your MCP Server

To use your MCP server with Claude Desktop, add it to the configuration file:

On macOS

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "webscraping": {
      "command": "node",
      "args": ["/absolute/path/to/mcp-webscraping-server/dist/index.js"]
    }
  }
}

On Windows

Edit %APPDATA%\Claude\claude_desktop_config.json:

{
  "mcpServers": {
    "webscraping": {
      "command": "node",
      "args": ["C:\\absolute\\path\\to\\mcp-webscraping-server\\dist\\index.js"]
    }
  }
}
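
Passing Environment Variables

If your server reads secrets such as WEBSCRAPING_AI_API_KEY (used in the Advanced Features section below), pass them through the env field of the server entry rather than hardcoding them in your source:

{
  "mcpServers": {
    "webscraping": {
      "command": "node",
      "args": ["/absolute/path/to/mcp-webscraping-server/dist/index.js"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your-api-key-here"
      }
    }
  }
}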

Advanced Features

Adding API-Based Scraping

For production use, consider integrating a robust scraping API instead of running browsers locally. Here's how to modify the server to use WebScraping.AI by adding a new branch to the existing CallToolRequestSchema handler (registering a second handler for the same request type would replace the first):

// Node 18+ provides a global fetch, so no extra HTTP client is required
const WEBSCRAPING_AI_API_KEY = process.env.WEBSCRAPING_AI_API_KEY;

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  if (name === "scrape_with_api") {
    const url = args.url as string;
    const apiUrl = `https://api.webscraping.ai/html?api_key=${WEBSCRAPING_AI_API_KEY}&url=${encodeURIComponent(url)}`;

    const response = await fetch(apiUrl);
    if (!response.ok) {
      throw new Error(`WebScraping.AI request failed with status ${response.status}`);
    }
    const html = await response.text();

    return {
      content: [
        {
          type: "text",
          text: html,
        },
      ],
    };
  }

  // ... other handlers
});
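
The new branch also needs a matching entry in the ListToolsRequestSchema handler so clients can discover it. A minimal sketch, following the same shape as the earlier tool definitions:

// Add this entry to the tools array returned by the ListTools handler
{
  name: "scrape_with_api",
  description: "Scrape a URL through the WebScraping.AI API (proxies and JS rendering handled remotely)",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "The URL to scrape",
      },
    },
    required: ["url"],
  },
}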

Implementing Resource Providers

MCP servers can also expose resources. Here's how to serve cached scraping results as resources; the cache itself is filled in by your tool handlers, as sketched below:

import {
  ListResourcesRequestSchema,
  ReadResourceRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

// Store cached scraping results
const scrapedCache = new Map<string, string>();

server.setRequestHandler(ListResourcesRequestSchema, async () => {
  return {
    resources: Array.from(scrapedCache.keys()).map((url) => ({
      uri: `scraped://${url}`,
      name: `Scraped content from ${url}`,
      mimeType: "text/html",
    })),
  };
});

server.setRequestHandler(ReadResourceRequestSchema, async (request) => {
  const url = request.params.uri.replace("scraped://", "");
  const content = scrapedCache.get(url);

  if (!content) {
    throw new Error(`No cached content for ${url}`);
  }

  return {
    contents: [
      {
        uri: request.params.uri,
        mimeType: "text/html",
        text: content,
      },
    ],
  };
});
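
Nothing fills the cache yet; populate it from the tool handlers after each successful scrape. For example, in the scrape_html branch:

// Inside the scrape_html branch, right after page.content() succeeds:
scrapedCache.set(args.url as string, html);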

Error Handling and Timeouts

Implement robust error handling for page-load timeouts, closing the browser in a finally block so it never leaks:

async function scrapeWithTimeout(url: string, timeoutMs: number = 30000) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  try {
    await page.goto(url, {
      waitUntil: "networkidle0",
      timeout: timeoutMs
    });

    const html = await page.content();
    return html;
  } catch (error) {
    if (error instanceof Error && error.name === "TimeoutError") {
      throw new Error(`Page load timed out after ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    await browser.close();
  }
}

Testing Your MCP Server

Create a test script to verify your server works correctly:

// test/test-server.ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function testServer() {
  // StdioClientTransport spawns and manages the server process itself
  const transport = new StdioClientTransport({
    command: "node",
    args: ["dist/index.js"],
  });

  const client = new Client(
    {
      name: "test-client",
      version: "1.0.0",
    },
    {
      capabilities: {},
    }
  );

  await client.connect(transport);

  // List available tools
  const tools = await client.listTools();
  console.log("Available tools:", tools);

  // Test scraping
  const result = await client.callTool({
    name: "scrape_html",
    arguments: {
      url: "https://example.com",
    },
  });

  console.log("Scraping result:", result);

  await client.close();
}

testServer().catch(console.error);

Run the test:

npm run build && npx ts-node --esm test/test-server.ts

Best Practices

  1. Rate Limiting: Implement rate limiting to avoid overwhelming target websites (see the sketch after this list)
  2. User Agents: Rotate user agents and other request headers to reduce the chance of being blocked
  3. Caching: Cache results to reduce redundant requests
  4. Error Handling: Provide clear error messages and implement retries
  5. Resource Cleanup: Always close browser instances to prevent memory leaks
  6. Security: Validate and sanitize all input URLs and scripts
  7. Logging: Implement comprehensive logging for debugging
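
As an illustration of the first practice, here is a minimal per-host rate limiter. The rateLimit helper and minIntervalMs parameter are hypothetical names for this sketch, not part of the MCP SDK:

// Hypothetical helper: spaces out requests to the same host
const lastRequestAt = new Map<string, number>();

async function rateLimit(url: string, minIntervalMs = 2000): Promise<void> {
  const host = new URL(url).hostname;
  const earliestNext = (lastRequestAt.get(host) ?? 0) + minIntervalMs;
  const waitMs = earliestNext - Date.now();
  if (waitMs > 0) {
    // Delay until the per-host interval has elapsed
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  lastRequestAt.set(host, Date.now());
}

// Usage inside a tool handler, before launching the scrape:
// await rateLimit(args.url as string);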

Common Use Cases

Scraping Dynamic Content

When working with JavaScript-heavy sites, you may need to handle AJAX requests in Puppeteer or wait for specific elements to load. The tool definition below declares the parameters; a handler sketch follows it:

{
  name: "scrape_dynamic_content",
  description: "Scrape content from dynamic websites",
  inputSchema: {
    type: "object",
    properties: {
      url: { type: "string" },
      waitForSelector: { type: "string" },
      waitForTimeout: { type: "number", default: 5000 },
    },
    required: ["url", "waitForSelector"],
  },
}
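
A matching handler branch might look like this. It is a sketch that goes inside the CallToolRequestSchema handler and reuses the browser variable from the main server code:

if (name === "scrape_dynamic_content") {
  browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(args.url as string, { waitUntil: "networkidle0" });

  // Wait for the element that signals the dynamic content has rendered
  await page.waitForSelector(args.waitForSelector as string, {
    timeout: (args.waitForTimeout as number) ?? 5000,
  });

  const html = await page.content();
  return { content: [{ type: "text", text: html }] };
}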

Monitoring Network Requests

For advanced scraping, you might want to capture API calls:

async function scrapeWithNetworkMonitoring(url: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  const requests: { url: string; method: string }[] = [];

  // Record every outgoing request the page makes
  page.on("request", (request) => {
    requests.push({
      url: request.url(),
      method: request.method(),
    });
  });

  await page.goto(url, { waitUntil: "networkidle0" });

  // Capture the page content before closing the browser
  const html = await page.content();
  await browser.close();

  return { html, requests };
}

Troubleshooting

Server Not Starting

  • Verify Node.js version: node --version (should be 18+)
  • Check for TypeScript compilation errors: npm run build
  • Ensure all dependencies are installed: npm install

Claude Desktop Not Detecting Server

  • Verify the path in claude_desktop_config.json is absolute
  • Check that the compiled dist/index.js file exists
  • Restart Claude Desktop after configuration changes
  • Check Claude Desktop logs for error messages (on macOS, typically under ~/Library/Logs/Claude/)

Scraping Failures

  • Test URLs manually in a browser first
  • Increase timeout values for slow-loading pages
  • Check if the website blocks automated access
  • Verify network connectivity and DNS resolution

Conclusion

Setting up an MCP server for web scraping creates a powerful, reusable interface for AI assistants to interact with web data. By following this guide, you've created a foundation that can be extended with additional tools, resources, and capabilities as your scraping needs evolve. The standardized MCP protocol ensures compatibility with various AI applications while maintaining security and control over your scraping infrastructure.

For production deployments, consider using managed scraping services like WebScraping.AI to handle the complexity of browser automation, proxy rotation, and anti-bot bypass mechanisms, allowing your MCP server to focus on orchestrating scraping workflows rather than managing infrastructure.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
