What is an MCP Server and How Does It Work?
Model Context Protocol (MCP) is an open protocol developed by Anthropic that enables AI assistants like Claude to securely connect with external data sources, tools, and services. MCP servers act as intermediaries that expose specific functionality to AI models through a standardized interface, making it easier to integrate web scraping, database access, API interactions, and other capabilities into AI-powered workflows.
Understanding the MCP Architecture
MCP follows a client-server architecture where:
- MCP Host: The application embedding the AI model (like Claude Desktop or an IDE)
- MCP Client: The component within the host that communicates with MCP servers
- MCP Server: A lightweight service that exposes specific tools, resources, or prompts to the AI model
- Transport Layer: The communication mechanism (typically stdio or HTTP/SSE)
This architecture allows AI assistants to access real-time data, execute code, interact with APIs, and perform web scraping operations without requiring direct integration into the AI model itself.
Core Components of MCP Servers
1. Resources
Resources represent data that the AI can read. In web scraping contexts, resources might include:
- Cached HTML content from previously scraped pages
- Configuration files with scraping rules
- Database records containing scraped data
- API response templates
Example resource definition in TypeScript:
import { ListResourcesRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// `server` is an initialized Server instance (see the complete example later in this article)
server.setRequestHandler(ListResourcesRequestSchema, async () => {
  return {
    resources: [
      {
        uri: "scraper://config/settings",
        name: "Scraper Configuration",
        mimeType: "application/json",
        description: "Current web scraper settings"
      },
      {
        uri: "scraper://cache/latest",
        name: "Latest Scraped Content",
        mimeType: "text/html",
        description: "Most recently scraped webpage"
      }
    ]
  };
});
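Listing resources only advertises them; the server also needs a handler that returns a resource's contents when the client reads its URI. Here is a minimal sketch using the Python SDK's decorator style (introduced in the next example); the settings payload is purely illustrative:
import json

@app.read_resource()
async def read_resource(uri) -> str:
    # Return the content behind a URI advertised by list_resources
    if str(uri) == "scraper://config/settings":
        return json.dumps({"timeout": 10, "user_agent": "MCPScraper/1.0"})
    raise ValueError(f"Unknown resource: {uri}")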
2. Tools
Tools are functions that the AI can execute. For web scraping, tools might include:
- HTTP request execution
- HTML parsing and data extraction
- Screenshot capture
- Browser automation, similar to handling browser sessions in Puppeteer
- Proxy management
Example tool implementation in Python:
from mcp.server import Server
from mcp.types import Tool, TextContent
import httpx
from bs4 import BeautifulSoup

app = Server("web-scraper")

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_webpage",
            description="Scrape content from a webpage using HTTP requests",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector to extract specific elements"
                    },
                    "use_javascript": {
                        "type": "boolean",
                        "description": "Whether to render JavaScript"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_webpage":
        # Note: use_javascript is accepted but not handled here; see the
        # JavaScript-heavy sites section below for a browser-based tool
        url = arguments["url"]
        selector = arguments.get("selector")
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()  # fail fast on HTTP errors instead of parsing error pages
            soup = BeautifulSoup(response.text, 'html.parser')
            if selector:
                elements = soup.select(selector)
                content = "\n".join([el.get_text() for el in elements])
            else:
                content = soup.get_text()
        return [TextContent(
            type="text",
            text=f"Scraped content from {url}:\n\n{content}"
        )]
    raise ValueError(f"Unknown tool: {name}")
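As written, this example defines handlers but never starts the server. A minimal run loop, following the low-level Python SDK's stdio pattern, might look like this (a sketch; app is the Server instance defined above):
import asyncio
from mcp.server.stdio import stdio_server

async def main():
    # Serve over stdin/stdout so an MCP client can spawn this process
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())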
3. Prompts
Prompts are reusable templates that help guide the AI for specific tasks. For web scraping:
from mcp.types import Prompt, PromptArgument

@app.list_prompts()
async def list_prompts() -> list[Prompt]:
    return [
        Prompt(
            name="extract_product_data",
            description="Extract structured product information from e-commerce pages",
            arguments=[
                PromptArgument(
                    name="url",
                    description="E-commerce product page URL",
                    required=True
                )
            ]
        )
    ]
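A listed prompt needs a companion handler that returns the actual message template when the client requests it. A minimal sketch; the instruction wording here is illustrative:
from mcp.types import GetPromptResult, PromptMessage, TextContent

@app.get_prompt()
async def get_prompt(name: str, arguments: dict | None) -> GetPromptResult:
    if name == "extract_product_data":
        url = (arguments or {}).get("url", "")
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"Extract the product name, price, and availability from {url}, returning JSON."
                    )
                )
            ]
        )
    raise ValueError(f"Unknown prompt: {name}")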
How MCP Servers Work in Practice
Connection Flow
1. Discovery: The MCP client discovers available servers through configuration
2. Initialization: The client establishes a connection (typically via stdio)
3. Capability Negotiation: Client and server exchange supported features
4. Request/Response: The AI makes requests through the client to the server
5. Execution: The server executes the requested operation and returns results (see the client-side sketch below)
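The same flow, sketched from the client side with the official Python SDK (the server command mirrors the stdio configuration below):
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Discovery + initialization: spawn the configured server over stdio
    params = StdioServerParameters(command="python", args=["-m", "mcp_scraper_server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()           # capability negotiation
            tools = await session.list_tools()   # request/response
            result = await session.call_tool(    # execution on the server
                "scrape_webpage", {"url": "https://example.com"}
            )
            print(result)

asyncio.run(main())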
Transport Mechanisms
Stdio Transport (most common for local tools):
{
"mcpServers": {
"web-scraper": {
"command": "python",
"args": ["-m", "mcp_scraper_server"]
}
}
}
HTTP with SSE Transport (for remote servers; note that the exact configuration keys vary by MCP client):
{
"mcpServers": {
"remote-scraper": {
"url": "https://scraper.example.com/mcp",
"transport": "sse"
}
}
}
Building a Web Scraping MCP Server
Here's a complete example of a simple MCP server for web scraping in JavaScript:
#!/usr/bin/env node
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";
import * as cheerio from "cheerio";
const server = new Server(
{
name: "web-scraper-mcp",
version: "1.0.0",
},
{
    capabilities: {
      tools: {}  // this example registers tool handlers only
    },
}
);
// Define scraping tools
server.setRequestHandler("tools/list", async () => {
return {
tools: [
{
name: "fetch_html",
description: "Fetch HTML content from a URL",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "URL to fetch"
},
headers: {
type: "object",
description: "Optional HTTP headers"
}
},
required: ["url"]
}
},
{
name: "extract_data",
description: "Extract data using CSS selectors",
inputSchema: {
type: "object",
properties: {
html: {
type: "string",
description: "HTML content to parse"
},
selector: {
type: "string",
description: "CSS selector"
},
attribute: {
type: "string",
description: "Optional attribute to extract"
}
},
required: ["html", "selector"]
}
}
]
};
});
// Handle tool execution
server.setRequestHandler("tools/call", async (request) => {
const { name, arguments: args } = request.params;
try {
if (name === "fetch_html") {
const response = await axios.get(args.url, {
headers: args.headers || {
'User-Agent': 'Mozilla/5.0 (compatible; MCPScraper/1.0)'
},
timeout: 10000
});
return {
content: [
{
type: "text",
text: response.data
}
]
};
}
if (name === "extract_data") {
const $ = cheerio.load(args.html);
const elements = $(args.selector);
const results = [];
elements.each((i, el) => {
if (args.attribute) {
results.push($(el).attr(args.attribute));
} else {
results.push($(el).text().trim());
}
});
return {
content: [
{
type: "text",
text: JSON.stringify(results, null, 2)
}
]
};
}
throw new Error(`Unknown tool: ${name}`);
} catch (error) {
return {
content: [
{
type: "text",
text: `Error: ${error.message}`
}
],
isError: true
};
}
});
// Start the server
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Web Scraper MCP server running on stdio");
}
main().catch(console.error);
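Because this script uses ES module imports, run it from a package whose package.json sets "type": "module", or give the file an .mjs extension.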
Advanced Web Scraping with MCP
Handling JavaScript-Heavy Sites
For sites that render content with JavaScript (similar to handling AJAX requests using Puppeteer), you can integrate browser automation:
from playwright.async_api import async_playwright

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "scrape_dynamic":
        url = arguments["url"]
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url)
            # Wait until network activity settles so JS-rendered content is present
            await page.wait_for_load_state('networkidle')
            content = await page.content()
            await browser.close()
        return [TextContent(type="text", text=content)]
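Playwright's browser binaries are installed separately; run playwright install chromium once before using this tool.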
Error Handling and Retries
Robust error handling is crucial for production scraping:
async function fetchWithRetry(url, maxRetries = 3) {
for (let i = 0; i < maxRetries; i++) {
try {
const response = await axios.get(url, { timeout: 10000 });
return response.data;
} catch (error) {
if (i === maxRetries - 1) throw error;
const delay = Math.pow(2, i) * 1000; // Exponential backoff
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
Rate Limiting and Proxy Support
import asyncio
from datetime import datetime

class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.rate = requests_per_second
        self.last_request = None
        self._lock = asyncio.Lock()  # serialize concurrent tool calls

    async def acquire(self):
        async with self._lock:
            if self.last_request:
                elapsed = (datetime.now() - self.last_request).total_seconds()
                wait_time = (1 / self.rate) - elapsed
                if wait_time > 0:
                    await asyncio.sleep(wait_time)
            self.last_request = datetime.now()

# Usage in a tool handler
rate_limiter = RateLimiter(requests_per_second=2)

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    await rate_limiter.acquire()
    # ... perform scraping
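The heading also mentions proxies. A minimal round-robin rotation sketch with httpx follows; the proxy URLs are placeholders, and note that httpx has used both proxy (recent versions) and proxies (older versions) as the parameter name:
import itertools
import httpx

# Placeholder proxy endpoints; substitute your own
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

async def fetch_via_proxy(url: str) -> str:
    proxy = next(_proxy_cycle)  # rotate to the next proxy for each request
    async with httpx.AsyncClient(proxy=proxy, timeout=10.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text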
Security Considerations
When building MCP servers for web scraping:
- Input Validation: Always validate and sanitize URLs and parameters
- Rate Limiting: Implement request throttling to avoid overwhelming target sites
- Access Control: Restrict which domains can be scraped
- Error Disclosure: Don't expose sensitive error details to the AI
- Resource Limits: Set timeouts and memory limits
from urllib.parse import urlparse

ALLOWED_DOMAINS = ['example.com', 'api.example.org']

def validate_url(url: str) -> bool:
    domain = urlparse(url).netloc.lower()
    # Require an exact match or a true subdomain; a bare endswith() check
    # would also accept look-alike domains such as "evil-example.com"
    return any(
        domain == allowed or domain.endswith("." + allowed)
        for allowed in ALLOWED_DOMAINS
    )
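Wired into a tool handler, the check might look like this (a hypothetical snippet; the error text deliberately avoids revealing the allow-list):
if not validate_url(arguments["url"]):
    return [TextContent(type="text", text="Error: URL not permitted by server policy")]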
Configuration and Deployment
Local Development Setup
Install dependencies:
npm install @modelcontextprotocol/sdk axios cheerio
# or
pip install mcp httpx beautifulsoup4
Configure Claude Desktop (claude_desktop_config.json):
{
"mcpServers": {
"web-scraper": {
"command": "node",
"args": ["/path/to/scraper-server.js"]
}
}
}
Testing Your MCP Server
# Test with the MCP Inspector
npx @modelcontextprotocol/inspector node scraper-server.js
# Or, with the Python SDK's CLI extras installed (pip install "mcp[cli]")
mcp dev your_mcp_server.py
Use Cases for Web Scraping MCP Servers
- Competitive Intelligence: Automated monitoring of competitor websites
- Price Tracking: Real-time price comparison across e-commerce sites
- Content Aggregation: Collecting articles, news, or research papers
- SEO Analysis: Extracting meta tags, headers, and structured data
- Lead Generation: Gathering contact information from business directories
- Market Research: Analyzing product reviews and customer sentiment
- Data Validation: Verifying information across multiple sources
Best Practices
- Respect robots.txt: Check and honor robots.txt directives (see the sketch after this list)
- Use Appropriate User-Agents: Identify your scraper properly
- Implement Caching: Store results to minimize redundant requests
- Handle Pagination: Support multi-page data extraction efficiently
- Monitor Performance: Track success rates and response times
- Graceful Degradation: Fall back to simpler methods when complex ones fail
- Documentation: Clearly document available tools and their parameters
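The robots.txt item above can be implemented with the standard library alone. A minimal sketch using urllib.robotparser; the user agent string is an example, and whether to fail open when robots.txt is unreachable is a policy choice:
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "MCPScraper/1.0") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetch and parse the site's robots.txt
    except OSError:
        return True  # robots.txt unreachable: fail open here
    return parser.can_fetch(user_agent, url)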
Conclusion
MCP servers provide a powerful, standardized way to integrate web scraping capabilities into AI-assisted workflows. By implementing the Model Context Protocol, you can create reusable, secure, and maintainable scraping tools that work seamlessly with AI assistants like Claude. Whether you're building simple HTTP-based scrapers or complex browser automation tools similar to interacting with DOM elements in Puppeteer, MCP offers a flexible framework for exposing these capabilities to AI models.
The protocol's extensibility means you can start simple and gradually add more sophisticated features like JavaScript rendering, proxy rotation, and anti-bot detection as your needs grow. With proper error handling, rate limiting, and security measures, MCP servers can become reliable components in your data extraction infrastructure.