How do I build a custom MCP server for web scraping?

Building a custom MCP (Model Context Protocol) server for web scraping allows you to create specialized tools that AI assistants can use to extract data from websites. An MCP server acts as a bridge between AI models like Claude and your web scraping infrastructure, providing structured interfaces for scraping operations.

Understanding MCP Architecture

The Model Context Protocol is an open standard that enables AI models to interact with external tools and data sources. When you build an MCP server for web scraping, you're creating a service that exposes scraping capabilities through a standardized interface that AI assistants can discover and use.

An MCP server consists of three main components:

  1. Tools - Functions that the AI can call to perform specific scraping tasks
  2. Resources - Static or dynamic data sources that provide context
  3. Prompts - Pre-defined templates for common scraping workflows (an example appears after the Resources section below)

Setting Up Your Development Environment

Prerequisites

Before building your custom MCP server, ensure you have the following installed:

# For Python-based MCP servers
python3 --version  # Python 3.10 or higher
pip install mcp

# For TypeScript-based MCP servers
node --version  # Node.js 18 or higher
npm install -g @modelcontextprotocol/sdk

Project Initialization

Create a new directory for your MCP server:

mkdir scraping-mcp-server
cd scraping-mcp-server

For a Python project:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install mcp beautifulsoup4 requests

For a TypeScript project:

npm init -y
npm install @modelcontextprotocol/sdk puppeteer cheerio axios
npm install --save-dev @types/node typescript ts-node

Building a Python MCP Server

Here's a complete example of a Python-based MCP server for web scraping:

from mcp.server import Server, NotificationOptions
from mcp.server.models import InitializationOptions
import mcp.server.stdio
import mcp.types as types
import requests
from bs4 import BeautifulSoup
import json
import re

# Create MCP server instance
app = Server("web-scraper")

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    """List available scraping tools."""
    return [
        types.Tool(
            name="scrape_html",
            description="Scrape HTML content from a URL",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector to extract specific elements"
                    }
                },
                "required": ["url"]
            }
        ),
        types.Tool(
            name="extract_links",
            description="Extract all links from a webpage",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to extract links from"
                    },
                    "filter_pattern": {
                        "type": "string",
                        "description": "Optional regex pattern to filter links"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    """Handle tool calls from the AI assistant."""

    if name == "scrape_html":
        url = arguments["url"]
        selector = arguments.get("selector")

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')

            if selector:
                elements = soup.select(selector)
                content = "\n".join([elem.get_text(strip=True) for elem in elements])
            else:
                content = soup.get_text(strip=True)

            return [types.TextContent(
                type="text",
                text=json.dumps({
                    "url": url,
                    "content": content[:5000],  # Limit content size
                    "status": "success"
                })
            )]
        except Exception as e:
            return [types.TextContent(
                type="text",
                text=json.dumps({"error": str(e), "status": "failed"})
            )]

    elif name == "extract_links":
        url = arguments["url"]
        filter_pattern = arguments.get("filter_pattern")

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')

            links = []
            for link in soup.find_all('a', href=True):
                href = link['href']
                # Keep the link unless a filter pattern is given and it doesn't match
                if filter_pattern and not re.search(filter_pattern, href):
                    continue
                links.append(href)

            return [types.TextContent(
                type="text",
                text=json.dumps({
                    "url": url,
                    "links": links[:100],  # Limit to 100 links
                    "total_count": len(links),
                    "status": "success"
                })
            )]
        except Exception as e:
            return [types.TextContent(
                type="text",
                text=json.dumps({"error": str(e), "status": "failed"})
            )]

    raise ValueError(f"Unknown tool: {name}")

async def main():
    """Run the MCP server."""
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="web-scraper",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Building a TypeScript MCP Server

For scraping scenarios that need a real browser, such as JavaScript-rendered pages or screenshots, here's a TypeScript implementation built on Puppeteer:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import puppeteer from "puppeteer";
import * as cheerio from "cheerio";

// Create MCP server
const server = new Server(
  {
    name: "puppeteer-scraper",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_dynamic_page",
        description: "Scrape JavaScript-rendered content using Puppeteer",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            waitForSelector: {
              type: "string",
              description: "CSS selector to wait for before scraping",
            },
            extractSelector: {
              type: "string",
              description: "CSS selector to extract content from",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "take_screenshot",
        description: "Take a screenshot of a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to screenshot",
            },
            fullPage: {
              type: "boolean",
              description: "Capture full page or viewport only",
            },
          },
          required: ["url"],
        },
      },
    ],
  };
});

// Handle tool calls
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  if (name === "scrape_dynamic_page") {
    const browser = await puppeteer.launch({ headless: "new" });
    try {
      const page = await browser.newPage();
      await page.goto(args.url as string, { waitUntil: "networkidle2" });

      // Wait for specific selector if provided
      if (args.waitForSelector) {
        await page.waitForSelector(args.waitForSelector as string, {
          timeout: 10000,
        });
      }

      const content = await page.content();
      const $ = cheerio.load(content);

      let extractedData: string;
      if (args.extractSelector) {
        extractedData = $(args.extractSelector as string)
          .map((_, el) => $(el).text().trim())
          .get()
          .join("\n");
      } else {
        extractedData = $("body").text().trim();
      }

      return {
        content: [
          {
            type: "text",
            text: JSON.stringify({
              url: args.url,
              content: extractedData.slice(0, 5000),
              status: "success",
            }),
          },
        ],
      };
    } catch (error) {
      return {
        content: [
          {
            type: "text",
            text: JSON.stringify({
              error: error instanceof Error ? error.message : String(error),
              status: "failed",
            }),
          },
        ],
        isError: true,
      };
    } finally {
      await browser.close();
    }
  }

  if (name === "take_screenshot") {
    const browser = await puppeteer.launch({ headless: "new" });
    try {
      const page = await browser.newPage();
      await page.goto(args.url as string, { waitUntil: "networkidle2" });

      const screenshot = await page.screenshot({
        encoding: "base64",
        fullPage: args.fullPage as boolean || false,
      });

      return {
        content: [
          {
            type: "image",
            data: screenshot,
            mimeType: "image/png",
          },
        ],
      };
    } catch (error) {
      return {
        content: [
          {
            type: "text",
            text: JSON.stringify({
              error: error instanceof Error ? error.message : String(error),
              status: "failed",
            }),
          },
        ],
        isError: true,
      };
    } finally {
      await browser.close();
    }
  }

  throw new Error(`Unknown tool: ${name}`);
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("Puppeteer MCP server running on stdio");
}

main().catch((error) => {
  console.error("Server error:", error);
  process.exit(1);
});

Adding Resources to Your MCP Server

Resources provide static or dynamic data that can be referenced by AI assistants. Here's how to add resource support:

@app.list_resources()
async def list_resources() -> list[types.Resource]:
    """List available resources."""
    return [
        types.Resource(
            uri="scraper://config",
            name="Scraper Configuration",
            mimeType="application/json",
            description="Current scraper configuration and limits"
        )
    ]

@app.read_resource()
async def read_resource(uri: str) -> str:
    """Read a resource by URI."""
    if uri == "scraper://config":
        config = {
            "max_concurrent_requests": 5,
            "timeout_seconds": 30,
            "user_agent": "CustomScraperBot/1.0",
            "respect_robots_txt": True
        }
        return json.dumps(config, indent=2)

    raise ValueError(f"Unknown resource: {uri}")

Configuring Your MCP Server

Create a configuration file for Claude Desktop or other MCP clients:

{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/your/scraper_server.py"],
      "env": {
        "USER_AGENT": "CustomBot/1.0"
      }
    },
    "puppeteer-scraper": {
      "command": "node",
      "args": ["/path/to/your/dist/index.js"]
    }
  }
}

On macOS, add this to ~/Library/Application Support/Claude/claude_desktop_config.json. On Windows, use %APPDATA%\Claude\claude_desktop_config.json.
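
The Python server can be pointed at directly as a script, but the TypeScript server must be compiled before dist/index.js exists. Here's a minimal tsconfig.json for that build; the settings are a reasonable starting point rather than a requirement, and they assume the server code above is saved as index.ts in the project root with "type": "module" set in package.json:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "Node16",
    "moduleResolution": "Node16",
    "outDir": "dist",
    "esModuleInterop": true
  },
  "include": ["*.ts"]
}

Run npx tsc to produce dist/index.js before restarting your MCP client.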

Advanced Features

Handling Rate Limiting

Implement rate limiting to avoid overwhelming target servers:

import asyncio
from collections import defaultdict
import time

class RateLimiter:
    def __init__(self, max_requests=10, time_window=60):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = defaultdict(list)

    async def acquire(self, domain: str):
        now = time.time()
        # Remove old requests
        self.requests[domain] = [
            req_time for req_time in self.requests[domain]
            if now - req_time < self.time_window
        ]

        if len(self.requests[domain]) >= self.max_requests:
            wait_time = self.time_window - (now - self.requests[domain][0])
            await asyncio.sleep(wait_time)

        # Record the actual request time (it may have shifted while we waited)
        self.requests[domain].append(time.time())

# Usage in your tool
rate_limiter = RateLimiter(max_requests=10, time_window=60)

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "scrape_html":
        from urllib.parse import urlparse
        domain = urlparse(arguments["url"]).netloc
        await rate_limiter.acquire(domain)
        # ... rest of scraping logic

Error Handling and Retries

Implement robust error handling for your scraping operations:

async function scrapeWithRetry(
  url: string,
  maxRetries: number = 3
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const browser = await puppeteer.launch({ headless: "new" });
    try {
      const page = await browser.newPage();

      await page.goto(url, {
        waitUntil: "networkidle2",
        timeout: 30000,
      });

      return await page.content();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;

      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise(resolve =>
        setTimeout(resolve, Math.pow(2, attempt) * 1000)
      );
    } finally {
      // Always release the browser, even when navigation fails
      await browser.close();
    }
  }

  throw new Error("Max retries exceeded");
}
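
The same pattern works in the Python server from earlier by wrapping requests with exponential backoff. Here's a minimal sketch; the helper name fetch_with_retry is illustrative:

import asyncio
import requests

async def fetch_with_retry(url: str, max_retries: int = 3, timeout: int = 10) -> str:
    """Fetch a URL, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            # requests is blocking, so run it in a worker thread to keep the event loop free
            response = await asyncio.to_thread(requests.get, url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ...
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError("Max retries exceeded")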

Testing Your MCP Server

Create a simple test script to verify your MCP server works correctly:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def test_scraper():
    server_params = StdioServerParameters(
        command="python",
        args=["scraper_server.py"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools (list_tools returns a result object, not plain JSON)
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])

            # Call a tool
            result = await session.call_tool(
                "scrape_html",
                arguments={"url": "https://example.com"}
            )
            print("Scrape result:", result.content)

if __name__ == "__main__":
    asyncio.run(test_scraper())

Deployment Considerations

When deploying your MCP server for production use:

  1. Security: Validate and sanitize all URLs to prevent SSRF attacks (a validation sketch follows this list)
  2. Resource Limits: Set appropriate memory and CPU limits
  3. Logging: Implement comprehensive logging for debugging; with the stdio transport, write logs to stderr so they don't corrupt the protocol messages on stdout
  4. Monitoring: Track success rates, response times, and error rates
  5. Caching: Consider caching frequently accessed pages to reduce load
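
For the first item, here's a minimal URL validation sketch; the allowed schemes and blocked address ranges are assumptions to adapt to your environment:

import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject URLs that could reach internal services (basic SSRF guard)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if not parsed.hostname:
        return False
    try:
        # Resolve the hostname and reject private, loopback, link-local, and reserved addresses
        for info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                return False
    except (socket.gaierror, ValueError):
        return False
    return True

Call is_safe_url(url) at the top of each tool handler and return an error result when it fails.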

Conclusion

Building a custom MCP server for web scraping empowers AI assistants with specialized scraping capabilities tailored to your needs. Whether you choose Python for simplicity or TypeScript for advanced browser automation, the Model Context Protocol provides a standardized way to expose your scraping tools to AI models. Start with basic HTML scraping and gradually add more sophisticated features like JavaScript rendering, screenshot capture, and data extraction as your requirements grow.

By following the examples and best practices outlined in this guide, you can create a robust, scalable MCP server that seamlessly integrates web scraping capabilities into AI-powered workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
