What is an MCP Server and How Does It Work?

Model Context Protocol (MCP) is an open protocol developed by Anthropic that enables AI assistants like Claude to securely connect with external data sources, tools, and services. MCP servers act as intermediaries that expose specific functionality to AI models through a standardized interface, making it easier to integrate web scraping, database access, API interactions, and other capabilities into AI-powered workflows.

Understanding the MCP Architecture

MCP follows a client-server architecture where:

  • MCP Host: The application embedding the AI model (like Claude Desktop or an IDE)
  • MCP Client: The component within the host that communicates with MCP servers
  • MCP Server: A lightweight service that exposes specific tools, resources, or prompts to the AI model
  • Transport Layer: The communication mechanism (typically stdio or HTTP/SSE)

This architecture allows AI assistants to access real-time data, execute code, interact with APIs, and perform web scraping operations without requiring direct integration into the AI model itself.
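
Under the hood, the client and server exchange JSON-RPC 2.0 messages over the transport. As a rough sketch (the tool name and arguments are illustrative), a tool invocation and its reply look like this:

# What the client writes to the server (one JSON-RPC message per line over stdio)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "scrape_webpage", "arguments": {"url": "https://example.com"}},
}

# What the server sends back for the same request id
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "<html>...</html>"}]},
}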

Core Components of MCP Servers

1. Resources

Resources represent data that the AI can read. In web scraping contexts, resources might include:

  • Cached HTML content from previously scraped pages
  • Configuration files with scraping rules
  • Database records containing scraped data
  • API response templates

Example resource definition in TypeScript:

import { ListResourcesRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// "server" is a Server instance like the one constructed in the full example below
server.setRequestHandler(ListResourcesRequestSchema, async () => {
  return {
    resources: [
      {
        uri: "scraper://config/settings",
        name: "Scraper Configuration",
        mimeType: "application/json",
        description: "Current web scraper settings"
      },
      {
        uri: "scraper://cache/latest",
        name: "Latest Scraped Content",
        mimeType: "text/html",
        description: "Most recently scraped webpage"
      }
    ]
  };
});
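
Listing resources only advertises them; the server also needs a read handler that returns their contents. Here's a minimal Python counterpart as a sketch, using the same mcp package as the tool example below (the settings payload is illustrative):

import json

from mcp.server import Server

app = Server("web-scraper")

@app.read_resource()
async def read_resource(uri) -> str:
    # Serve the configuration resource advertised in the listing above
    if str(uri) == "scraper://config/settings":
        return json.dumps({"timeout_seconds": 10, "max_pages": 100})
    raise ValueError(f"Unknown resource: {uri}")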

2. Tools

Tools are functions that the AI can execute. For web scraping, these might include fetching a page over HTTP, extracting elements with a CSS selector, or rendering JavaScript-heavy pages before extraction.

Example tool implementation in Python:

from mcp.server import Server
from mcp.types import Tool, TextContent
import httpx
from bs4 import BeautifulSoup

app = Server("web-scraper")

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_webpage",
            description="Scrape content from a webpage using HTTP requests",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector to extract specific elements"
                    },
                    "use_javascript": {
                        "type": "boolean",
                        "description": "Whether to render JavaScript"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_webpage":
        url = arguments["url"]
        selector = arguments.get("selector")
        # Note: use_javascript is declared in the schema; see the Playwright
        # example later for actually rendering JavaScript-heavy pages.

        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            if selector:
                elements = soup.select(selector)
                content = "\n".join([el.get_text() for el in elements])
            else:
                content = soup.get_text()

            return [TextContent(
                type="text",
                text=f"Scraped content from {url}:\n\n{content}"
            )]

    raise ValueError(f"Unknown tool: {name}")

3. Prompts

Prompts are reusable templates that help guide the AI for specific tasks. For web scraping:

from mcp.types import Prompt, PromptArgument

@app.list_prompts()
async def list_prompts() -> list[Prompt]:
    return [
        Prompt(
            name="extract_product_data",
            description="Extract structured product information from e-commerce pages",
            arguments=[
                PromptArgument(
                    name="url",
                    description="E-commerce product page URL",
                    required=True
                )
            ]
        )
    ]
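
Listing a prompt only advertises it; a companion get_prompt handler returns the filled-in messages when the client requests the template. A sketch (the instruction text is illustrative):

from mcp.types import GetPromptResult, PromptMessage, TextContent

@app.get_prompt()
async def get_prompt(name: str, arguments: dict | None) -> GetPromptResult:
    if name == "extract_product_data":
        url = (arguments or {}).get("url", "")
        return GetPromptResult(
            description="Extract structured product information",
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"Scrape {url} and return the product name, "
                             f"price, and availability as JSON.",
                    ),
                )
            ],
        )
    raise ValueError(f"Unknown prompt: {name}")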

How MCP Servers Work in Practice

Connection Flow

  1. Discovery: The MCP client discovers available servers through configuration
  2. Initialization: The client establishes a connection (typically via stdio)
  3. Capability Negotiation: Client and server exchange supported features
  4. Request/Response: The AI makes requests through the client to the server
  5. Execution: The server executes the requested operation and returns results
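
The client side of this flow can be sketched in Python with the official mcp SDK (the server command and tool arguments are illustrative):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # 1-2. Discover and launch the server as a subprocess over stdio
    params = StdioServerParameters(command="python", args=["-m", "mcp_scraper_server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            # 3. Capability negotiation
            await session.initialize()
            # 4-5. Request execution and receive results
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            result = await session.call_tool("scrape_webpage", {"url": "https://example.com"})
            print(result.content)

asyncio.run(main())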

Transport Mechanisms

Stdio Transport (most common for local tools):

{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["-m", "mcp_scraper_server"]
    }
  }
}

HTTP with SSE Transport (for remote servers):

{
  "mcpServers": {
    "remote-scraper": {
      "url": "https://scraper.example.com/mcp",
      "transport": "sse"
    }
  }
}

Building a Web Scraping MCP Server

Here's a complete example of a simple MCP server for web scraping in JavaScript:

#!/usr/bin/env node

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";
import * as cheerio from "cheerio";

const server = new Server(
  {
    name: "web-scraper-mcp",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
      resources: {}
    },
  }
);

// Define scraping tools
server.setRequestHandler("tools/list", async () => {
  return {
    tools: [
      {
        name: "fetch_html",
        description: "Fetch HTML content from a URL",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "URL to fetch"
            },
            headers: {
              type: "object",
              description: "Optional HTTP headers"
            }
          },
          required: ["url"]
        }
      },
      {
        name: "extract_data",
        description: "Extract data using CSS selectors",
        inputSchema: {
          type: "object",
          properties: {
            html: {
              type: "string",
              description: "HTML content to parse"
            },
            selector: {
              type: "string",
              description: "CSS selector"
            },
            attribute: {
              type: "string",
              description: "Optional attribute to extract"
            }
          },
          required: ["html", "selector"]
        }
      }
    ]
  };
});

// Handle tool execution
server.setRequestHandler("tools/call", async (request) => {
  const { name, arguments: args } = request.params;

  try {
    if (name === "fetch_html") {
      const response = await axios.get(args.url, {
        headers: args.headers || {
          'User-Agent': 'Mozilla/5.0 (compatible; MCPScraper/1.0)'
        },
        timeout: 10000
      });

      return {
        content: [
          {
            type: "text",
            text: response.data
          }
        ]
      };
    }

    if (name === "extract_data") {
      const $ = cheerio.load(args.html);
      const elements = $(args.selector);

      const results = [];
      elements.each((i, el) => {
        if (args.attribute) {
          results.push($(el).attr(args.attribute));
        } else {
          results.push($(el).text().trim());
        }
      });

      return {
        content: [
          {
            type: "text",
            text: JSON.stringify(results, null, 2)
          }
        ]
      };
    }

    throw new Error(`Unknown tool: ${name}`);
  } catch (error) {
    return {
      content: [
        {
          type: "text",
          text: `Error: ${error.message}`
        }
      ],
      isError: true
    };
  }
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("Web Scraper MCP server running on stdio");
}

main().catch(console.error);

Advanced Web Scraping with MCP

Handling JavaScript-Heavy Sites

For sites that require JavaScript rendering (similar to handling AJAX requests using Puppeteer), you can integrate browser automation:

# Requires: pip install playwright && playwright install chromium
from playwright.async_api import async_playwright

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_dynamic":
        url = arguments["url"]

        async with async_playwright() as p:
            browser = await p.chromium.launch()
            try:
                page = await browser.new_page()
                await page.goto(url)
                await page.wait_for_load_state('networkidle')
                content = await page.content()
            finally:
                # Always release the browser, even if navigation fails
                await browser.close()

            return [TextContent(type="text", text=content)]

Error Handling and Retries

Robust error handling is crucial for production scraping:

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await axios.get(url, { timeout: 10000 });
      return response.data;
    } catch (error) {
      if (i === maxRetries - 1) throw error;

      const delay = Math.pow(2, i) * 1000; // Exponential backoff
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

Rate Limiting and Proxy Support

import asyncio
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.rate = requests_per_second
        self.last_request = None

    async def acquire(self):
        if self.last_request:
            elapsed = (datetime.now() - self.last_request).total_seconds()
            wait_time = (1 / self.rate) - elapsed
            if wait_time > 0:
                await asyncio.sleep(wait_time)
        self.last_request = datetime.now()

# Usage in tool
rate_limiter = RateLimiter(requests_per_second=2)

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    await rate_limiter.acquire()
    # ... perform scraping

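Although this section's heading also promises proxy support, the rate limiter above covers only throttling. A minimal round-robin proxy sketch with httpx might look like this (the proxy URLs are placeholders, and the proxy keyword is spelled proxies in httpx versions before 0.26):

import itertools

import httpx

# Placeholder proxy pool; substitute real proxy URLs
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

async def fetch_via_proxy(url: str) -> str:
    proxy = next(_proxy_cycle)  # rotate to the next proxy in the pool
    async with httpx.AsyncClient(proxy=proxy, timeout=10.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text
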
Security Considerations

When building MCP servers for web scraping:

  1. Input Validation: Always validate and sanitize URLs and parameters
  2. Rate Limiting: Implement request throttling to avoid overwhelming target sites
  3. Access Control: Restrict which domains can be scraped
  4. Error Disclosure: Don't expose sensitive error details to the AI
  5. Resource Limits: Set timeouts and memory limits

For example, a domain allowlist keeps the scraper from being pointed at arbitrary hosts. Note the exact-match check: a bare endswith would also accept look-alike domains such as evilexample.com.

from urllib.parse import urlparse

ALLOWED_DOMAINS = ['example.com', 'api.example.org']

def validate_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ('http', 'https'):
        return False
    domain = parsed.netloc
    return any(
        domain == allowed or domain.endswith('.' + allowed)
        for allowed in ALLOWED_DOMAINS
    )

Configuration and Deployment

Local Development Setup

Install dependencies:

npm install @modelcontextprotocol/sdk axios cheerio
# or
pip install mcp httpx beautifulsoup4

Configure Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "web-scraper": {
      "command": "node",
      "args": ["/path/to/scraper-server.js"]
    }
  }
}

Testing Your MCP Server

# Test with the MCP Inspector
npx @modelcontextprotocol/inspector node scraper-server.js

# The same inspector works for Python servers
npx @modelcontextprotocol/inspector python -m your_mcp_server

Use Cases for Web Scraping MCP Servers

  1. Competitive Intelligence: Automated monitoring of competitor websites
  2. Price Tracking: Real-time price comparison across e-commerce sites
  3. Content Aggregation: Collecting articles, news, or research papers
  4. SEO Analysis: Extracting meta tags, headers, and structured data
  5. Lead Generation: Gathering contact information from business directories
  6. Market Research: Analyzing product reviews and customer sentiment
  7. Data Validation: Verifying information across multiple sources

Best Practices

  1. Respect robots.txt: Check and honor robots.txt directives (a sketch follows this list)
  2. Use Appropriate User-Agents: Identify your scraper properly
  3. Implement Caching: Store results to minimize redundant requests
  4. Handle Pagination: Support multi-page data extraction efficiently
  5. Monitor Performance: Track success rates and response times
  6. Graceful Degradation: Fall back to simpler methods when complex ones fail
  7. Documentation: Clearly document available tools and their parameters
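
The first practice can be enforced directly in the server; here is a minimal sketch using Python's standard-library robotparser (the user agent string is illustrative and should match the one your scraper actually sends):

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MCPScraper/1.0"  # illustrative; match your real User-Agent header

def is_allowed(url: str) -> bool:
    # Fetch and parse robots.txt for the target host
    parsed = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # treat an unreachable robots.txt as allow-by-default
    return parser.can_fetch(USER_AGENT, url)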

Conclusion

MCP servers provide a powerful, standardized way to integrate web scraping capabilities into AI-assisted workflows. By implementing the Model Context Protocol, you can create reusable, secure, and maintainable scraping tools that work seamlessly with AI assistants like Claude. Whether you're building simple HTTP-based scrapers or complex browser automation tools (similar to interacting with DOM elements in Puppeteer), MCP offers a flexible framework for exposing these capabilities to AI models.

The protocol's extensibility means you can start simple and gradually add more sophisticated features like JavaScript rendering, proxy rotation, and anti-bot detection as your needs grow. With proper error handling, rate limiting, and security measures, MCP servers can become reliable components in your data extraction infrastructure.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

