What is the MCP SDK and How Do I Use It?
The MCP SDK (Model Context Protocol Software Development Kit) is a collection of official libraries and tools that enable developers to build MCP servers and clients in multiple programming languages. For web scraping developers, the MCP SDK provides the foundation for creating AI-powered automation tools that can intelligently scrape websites, extract data, and integrate with external APIs through a standardized protocol.
The SDK abstracts away the complexity of the Model Context Protocol specification, providing high-level APIs for exposing tools, resources, and prompts that AI assistants like Claude can use to perform web scraping tasks.
Available MCP SDKs
Anthropic and the open-source community maintain MCP SDKs for multiple programming languages:
- Python: mcp - ideal for data science and ML-focused scraping workflows
- TypeScript/JavaScript: @modelcontextprotocol/sdk - perfect for Node.js and web-based automation
- Java: MCP Java SDK - for enterprise Java applications
- Kotlin: MCP Kotlin SDK - Android and JVM applications
- C#/.NET: MCP.NET SDK - Windows and cross-platform .NET apps
- Go: MCP Go SDK - high-performance concurrent scraping
- PHP: MCP PHP SDK - WordPress plugins and web applications
- Ruby: MCP Ruby SDK - Rails applications and scripts
- Rust: MCP Rust SDK - systems programming and performance-critical applications
- Swift: MCP Swift SDK - iOS and macOS applications
For web scraping, the Python and TypeScript SDKs are the most commonly used due to their rich ecosystem of HTTP clients, HTML parsers, and data processing libraries.
Installing the MCP SDK
Python Installation
# Install the MCP SDK for Python
pip install mcp
# Install additional dependencies for web scraping
pip install httpx beautifulsoup4 pandas
# For async HTTP requests
pip install aiohttp
TypeScript/JavaScript Installation
# Create a new Node.js project
mkdir webscraping-mcp-server
cd webscraping-mcp-server
npm init -y
# Install the MCP SDK
npm install @modelcontextprotocol/sdk
# Install web scraping dependencies
npm install axios cheerio
# For TypeScript development
npm install --save-dev typescript @types/node
npm install tsx
Core MCP SDK Concepts
1. Server Architecture
MCP servers built with the SDK typically communicate via standard input/output (stdio), which makes them easy to integrate with any MCP host; a minimal skeleton is sketched after this list. The SDK handles:
- Protocol serialization/deserialization
- Message routing and validation
- Error handling and recovery
- Capability negotiation with clients
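For orientation, here is a minimal Python skeleton using the same SDK APIs as the full example later in this article; the server name is illustrative and no tools are registered yet:
import asyncio

from mcp.server import Server
from mcp.server.stdio import stdio_server

# An empty server: it negotiates capabilities with the client but exposes nothing yet
app = Server("minimal-mcp-server")

async def main():
    # stdio_server() wires stdin/stdout to the protocol layer; the SDK takes care
    # of message framing, validation, and capability negotiation
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())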
2. Request Handlers
The SDK uses a request-response pattern where servers register handlers for specific operations:
- List Tools: Advertise available scraping operations
- Call Tool: Execute scraping tasks with parameters
- List Resources: Expose data sources (cached HTML, databases)
- Read Resource: Retrieve resource content
- List Prompts: Provide templated scraping workflows
3. Transport Layer
The SDK supports multiple transport mechanisms:
- Stdio Transport: Standard input/output (most common)
- HTTP/SSE Transport: Server-sent events for web-based clients
- WebSocket Transport: Bidirectional real-time communication
Building a Web Scraping MCP Server with Python
Let's build a comprehensive MCP server that integrates with WebScraping.AI's API:
import asyncio
import os
from typing import Any
import httpx
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
# Initialize server
app = Server("webscraping-ai-mcp")
# API configuration
WEBSCRAPING_AI_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY")
BASE_URL = "https://api.webscraping.ai"
@app.list_tools()
async def list_tools() -> list[Tool]:
"""Define available web scraping tools"""
return [
Tool(
name="scrape_html",
description="Scrape raw HTML from any URL with JavaScript rendering support",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to scrape"
},
"js": {
"type": "boolean",
"description": "Enable JavaScript rendering",
"default": True
},
"wait_for": {
"type": "string",
"description": "CSS selector to wait for before extracting HTML"
},
"proxy": {
"type": "string",
"enum": ["datacenter", "residential"],
"description": "Proxy type to use"
}
},
"required": ["url"]
}
),
Tool(
name="extract_text",
description="Extract clean, readable text content from a webpage",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to extract text from"
},
"return_links": {
"type": "boolean",
"description": "Include hyperlinks in the text output"
}
},
"required": ["url"]
}
),
Tool(
name="ask_question",
description="Ask natural language questions about webpage content using AI",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to analyze"
},
"question": {
"type": "string",
"description": "The question to ask about the page"
}
},
"required": ["url", "question"]
}
),
Tool(
name="extract_fields",
description="Extract structured data fields from a webpage using AI",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to extract data from"
},
"fields": {
"type": "object",
"description": "Object mapping field names to extraction instructions",
"additionalProperties": {
"type": "string"
}
}
},
"required": ["url", "fields"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
"""Execute web scraping tools"""
if not WEBSCRAPING_AI_KEY:
return [TextContent(
type="text",
text="Error: WEBSCRAPING_AI_API_KEY environment variable not set"
)]
async with httpx.AsyncClient(timeout=30.0) as client:
try:
if name == "scrape_html":
response = await client.get(
f"{BASE_URL}/html",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY,
"js": arguments.get("js", True),
"wait_for": arguments.get("wait_for"),
"proxy": arguments.get("proxy", "residential")
}
)
response.raise_for_status()
return [TextContent(
type="text",
text=f"HTML Content ({len(response.text)} characters):\n\n{response.text}"
)]
elif name == "extract_text":
response = await client.get(
f"{BASE_URL}/text",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY,
"return_links": arguments.get("return_links", False)
}
)
response.raise_for_status()
data = response.json()
return [TextContent(
type="text",
text=f"Extracted Text:\n\n{data.get('text', '')}"
)]
elif name == "ask_question":
response = await client.post(
f"{BASE_URL}/question",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY
},
json={"question": arguments["question"]}
)
response.raise_for_status()
data = response.json()
return [TextContent(
type="text",
text=f"Answer: {data.get('answer', 'No answer found')}"
)]
elif name == "extract_fields":
response = await client.post(
f"{BASE_URL}/fields",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY
},
json={"fields": arguments["fields"]}
)
response.raise_for_status()
data = response.json()
# Format extracted fields nicely
result = "Extracted Fields:\n\n"
for field, value in data.items():
result += f"{field}: {value}\n"
return [TextContent(type="text", text=result)]
except httpx.HTTPStatusError as e:
return [TextContent(
type="text",
text=f"HTTP Error {e.response.status_code}: {e.response.text}"
)]
except Exception as e:
return [TextContent(
type="text",
text=f"Error: {str(e)}"
)]
return [TextContent(type="text", text=f"Unknown tool: {name}")]
async def main():
"""Run the MCP server"""
async with stdio_server() as (read_stream, write_stream):
await app.run(
read_stream,
write_stream,
app.create_initialization_options()
)
if __name__ == "__main__":
asyncio.run(main())
Save this as webscraping_mcp_server.py and run it with:
export WEBSCRAPING_AI_API_KEY="your_api_key_here"
python webscraping_mcp_server.py
Building a Web Scraping MCP Server with TypeScript
Here's the equivalent implementation in TypeScript:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
CallToolRequestSchema,
ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios, { AxiosError } from "axios";
const WEBSCRAPING_AI_KEY = process.env.WEBSCRAPING_AI_API_KEY;
const BASE_URL = "https://api.webscraping.ai";
// Create server instance
const server = new Server(
{
name: "webscraping-ai-mcp",
version: "1.0.0",
},
{
capabilities: {
tools: {},
},
}
);
// Register available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
return {
tools: [
{
name: "scrape_html",
description: "Scrape raw HTML from any URL with JavaScript rendering",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "The URL to scrape",
},
js: {
type: "boolean",
description: "Enable JavaScript rendering",
default: true,
},
wait_for: {
type: "string",
description: "CSS selector to wait for",
},
proxy: {
type: "string",
enum: ["datacenter", "residential"],
description: "Proxy type",
},
},
required: ["url"],
},
},
{
name: "extract_text",
description: "Extract clean text content from a webpage",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "The URL to extract text from",
},
return_links: {
type: "boolean",
description: "Include hyperlinks in output",
},
},
required: ["url"],
},
},
{
name: "ask_question",
description: "Ask questions about webpage content using AI",
inputSchema: {
type: "object",
properties: {
url: { type: "string", description: "The URL to analyze" },
question: { type: "string", description: "Question to ask" },
},
required: ["url", "question"],
},
},
{
name: "extract_fields",
description: "Extract structured data using AI",
inputSchema: {
type: "object",
properties: {
url: { type: "string", description: "Target URL" },
fields: {
type: "object",
description: "Fields to extract",
additionalProperties: { type: "string" },
},
},
required: ["url", "fields"],
},
},
],
};
});
// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
if (!WEBSCRAPING_AI_KEY) {
return {
content: [
{
type: "text",
text: "Error: WEBSCRAPING_AI_API_KEY not set",
},
],
};
}
try {
if (name === "scrape_html") {
const response = await axios.get(`${BASE_URL}/html`, {
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
js: args.js ?? true,
wait_for: args.wait_for,
proxy: args.proxy || "residential",
},
});
return {
content: [
{
type: "text",
text: `HTML Content (${response.data.length} characters):\n\n${response.data}`,
},
],
};
}
if (name === "extract_text") {
const response = await axios.get(`${BASE_URL}/text`, {
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
return_links: args.return_links || false,
},
});
return {
content: [
{
type: "text",
text: `Extracted Text:\n\n${response.data.text}`,
},
],
};
}
if (name === "ask_question") {
const response = await axios.post(
`${BASE_URL}/question`,
{ question: args.question },
{
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
},
}
);
return {
content: [
{
type: "text",
text: `Answer: ${response.data.answer}`,
},
],
};
}
if (name === "extract_fields") {
const response = await axios.post(
`${BASE_URL}/fields`,
{ fields: args.fields },
{
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
},
}
);
const result =
"Extracted Fields:\n\n" +
Object.entries(response.data)
.map(([key, value]) => `${key}: ${value}`)
.join("\n");
return {
content: [{ type: "text", text: result }],
};
}
throw new Error(`Unknown tool: ${name}`);
} catch (error) {
const axiosError = error as AxiosError;
return {
content: [
{
type: "text",
text: `Error: ${axiosError.message}`,
},
],
};
}
});
// Start the server
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("WebScraping.AI MCP Server running on stdio");
}
main().catch(console.error);
Save as webscraping_mcp_server.ts and run with:
export WEBSCRAPING_AI_API_KEY="your_api_key_here"
npx tsx webscraping_mcp_server.ts
Configuring Your MCP Server
Claude Desktop Configuration
Once your server is built, configure it in Claude Desktop to make it accessible to Claude:
macOS Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"webscraping-ai": {
"command": "python",
"args": ["/absolute/path/to/webscraping_mcp_server.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here"
}
}
}
}
Windows Configuration (%APPDATA%\Claude\claude_desktop_config.json):
{
"mcpServers": {
"webscraping-ai": {
"command": "python",
"args": ["C:\\path\\to\\webscraping_mcp_server.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here"
}
}
}
}
For TypeScript servers:
{
"mcpServers": {
"webscraping-ai": {
"command": "npx",
"args": ["tsx", "/path/to/webscraping_mcp_server.ts"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here"
}
}
}
}
Advanced SDK Features
Adding Resources
Resources allow your MCP server to expose data that Claude can read. This is useful for caching scraped content:
import json

from mcp.types import Resource

# In-memory cache of recently scraped pages, keyed by URL (populated by the tool handlers)
recent_scrapes_cache: dict[str, str] = {}
@app.list_resources()
async def list_resources() -> list[Resource]:
return [
Resource(
uri="cache://recent-scrapes",
name="Recent Scraping Results",
mimeType="application/json",
description="Recently scraped webpage data"
)
]
@app.read_resource()
async def read_resource(uri: str) -> str:
if uri == "cache://recent-scrapes":
# Return cached scraping results
return json.dumps(recent_scrapes_cache)
raise ValueError(f"Unknown resource: {uri}")
Adding Prompts
Prompts provide pre-configured workflows that users can invoke:
from mcp.types import GetPromptResult, Prompt, PromptMessage, TextContent
@app.list_prompts()
async def list_prompts() -> list[Prompt]:
return [
Prompt(
name="scrape_product_page",
description="Extract product information from e-commerce pages",
arguments=[
{
"name": "url",
"description": "Product page URL",
"required": True
}
]
)
]
@app.get_prompt()
async def get_prompt(name: str, arguments: dict) -> GetPromptResult:
    if name == "scrape_product_page":
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"""Extract the following from {arguments['url']}:
- Product name
- Price
- Description
- Availability
- Customer rating
"""
                    )
                )
            ]
        )
    raise ValueError(f"Unknown prompt: {name}")
Error Handling Best Practices
Robust error handling is critical for production MCP servers:
import httpx
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
try:
# Validate inputs
if not arguments.get("url"):
raise ValueError("URL parameter is required")
# Validate URL format
from urllib.parse import urlparse
parsed = urlparse(arguments["url"])
if not all([parsed.scheme, parsed.netloc]):
raise ValueError("Invalid URL format")
# Execute scraping with timeout
async with httpx.AsyncClient(timeout=30.0) as client:
response = await client.get(...)
logger.info(f"Successfully executed {name} for {arguments['url']}")
return [TextContent(type="text", text=response.text)]
except httpx.TimeoutException:
logger.error(f"Timeout scraping {arguments.get('url')}")
return [TextContent(
type="text",
text="Error: Request timed out. The page took too long to load."
)]
except httpx.HTTPStatusError as e:
logger.error(f"HTTP {e.response.status_code} for {arguments.get('url')}")
return [TextContent(
type="text",
text=f"Error: HTTP {e.response.status_code} - {e.response.text}"
)]
except Exception as e:
logger.exception(f"Unexpected error in {name}")
return [TextContent(
type="text",
text=f"Unexpected error: {str(e)}"
)]
Testing Your MCP Server
Unit Testing with Python
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from webscraping_mcp_server import call_tool

@pytest.mark.asyncio
async def test_scrape_html_tool(monkeypatch):
    """Test HTML scraping tool"""
    # Make sure the handler sees an API key
    monkeypatch.setattr("webscraping_mcp_server.WEBSCRAPING_AI_KEY", "test-key")
    with patch("httpx.AsyncClient") as mock_client:
        # The response object is used synchronously (.text, .raise_for_status()),
        # so a plain MagicMock is enough
        mock_response = MagicMock()
        mock_response.text = "<html><body>Test</body></html>"
        # client.get() is awaited, so it must be an AsyncMock
        mock_client.return_value.__aenter__.return_value.get = AsyncMock(return_value=mock_response)

        result = await call_tool("scrape_html", {"url": "https://example.com"})

        assert len(result) == 1
        assert "Test" in result[0].text
Integration Testing
Test your server end-to-end using the MCP Inspector tool:
# Run the MCP Inspector against your server (no global install required)
npx @modelcontextprotocol/inspector python webscraping_mcp_server.py
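For a scripted end-to-end check, the Python SDK also ships client utilities that launch the server as a subprocess and talk to it over stdio. A minimal sketch, assuming the server file is named as above:
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test():
    params = StdioServerParameters(command="python", args=["webscraping_mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools exposed:", [tool.name for tool in tools.tools])

asyncio.run(smoke_test())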
Real-World Use Cases
1. E-commerce Price Monitoring
Build an MCP server that tracks product prices across multiple retailers, similar to handling browser sessions in Puppeteer:
# Excerpt: assumes the same imports, app, API key, and httpx client setup as the full server above
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "monitor_prices":
products = arguments["products"]
results = {}
for product in products:
response = await client.post(
f"{BASE_URL}/fields",
params={"url": product["url"], "api_key": API_KEY},
json={
"fields": {
"price": "current product price",
"availability": "in stock or out of stock"
}
}
)
results[product["name"]] = response.json()
return [TextContent(type="text", text=json.dumps(results, indent=2))]
2. Content Aggregation with AI
Extract and summarize content from multiple news sources:
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "aggregate_news":
sources = arguments["sources"]
summaries = []
for source in sources:
# Extract text
text_response = await client.get(
f"{BASE_URL}/text",
params={"url": source, "api_key": API_KEY}
)
# Ask for summary
summary_response = await client.post(
f"{BASE_URL}/question",
params={"url": source, "api_key": API_KEY},
json={"question": "Summarize the main points of this article in 2-3 sentences"}
)
            summaries.append({
                "url": source,
                "text_excerpt": text_response.json().get("text", "")[:500],
                "summary": summary_response.json()["answer"]
            })
return [TextContent(type="text", text=json.dumps(summaries, indent=2))]
3. SEO Audit Automation
Create automated SEO audits, similar to how you might use Puppeteer for SEO auditing:
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "seo_audit") {
const url = request.params.arguments.url;
// Scrape the page
const htmlResponse = await axios.get(`${BASE_URL}/html`, {
params: { url, api_key: WEBSCRAPING_AI_KEY, js: true },
});
// Extract SEO fields
const fieldsResponse = await axios.post(
`${BASE_URL}/fields`,
{
fields: {
title: "page title",
meta_description: "meta description",
h1_count: "number of h1 tags",
image_count: "number of images",
has_alt_tags: "do all images have alt tags",
},
},
{ params: { url, api_key: WEBSCRAPING_AI_KEY } }
);
    return {
      content: [
        {
          type: "text",
          text: `SEO Audit Results (page size: ${htmlResponse.data.length} characters):\n${JSON.stringify(fieldsResponse.data, null, 2)}`,
        },
      ],
    };
}
});
Performance Optimization
Concurrent Requests
Use async operations to scrape multiple pages simultaneously:
async def scrape_multiple(urls: list[str]) -> dict:
"""Scrape multiple URLs concurrently"""
async with httpx.AsyncClient() as client:
tasks = [
client.get(
f"{BASE_URL}/html",
params={"url": url, "api_key": WEBSCRAPING_AI_KEY}
)
for url in urls
]
responses = await asyncio.gather(*tasks, return_exceptions=True)
results = {}
for url, response in zip(urls, responses):
if isinstance(response, Exception):
results[url] = {"error": str(response)}
else:
results[url] = response.text
return results
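To expose this from the server, a hypothetical scrape_batch tool can wrap the helper; the sketch below assumes the same app, TextContent import, and an import json as in the full server above:
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_batch":
        results = await scrape_multiple(arguments["urls"])
        return [TextContent(type="text", text=json.dumps(results, indent=2))]
    return [TextContent(type="text", text=f"Unknown tool: {name}")]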
Caching Results
Implement caching to reduce API calls:
import hashlib
cache = {}
def cache_key(url: str) -> str:
return hashlib.md5(url.encode()).hexdigest()
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "scrape_cached":
url = arguments["url"]
key = cache_key(url)
# Check cache
if key in cache:
return [TextContent(type="text", text=f"[CACHED] {cache[key]}")]
# Scrape and cache
async with httpx.AsyncClient() as client:
response = await client.get(
f"{BASE_URL}/html",
params={"url": url, "api_key": WEBSCRAPING_AI_KEY}
)
cache[key] = response.text
return [TextContent(type="text", text=response.text)]
Deployment Strategies
Docker Deployment
Create a Dockerfile
for your MCP server:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY webscraping_mcp_server.py .
ENV WEBSCRAPING_AI_API_KEY=""
CMD ["python", "webscraping_mcp_server.py"]
Build and run:
docker build -t webscraping-mcp .
docker run -e WEBSCRAPING_AI_API_KEY="your_key" webscraping-mcp
Systemd Service (Linux)
Create /etc/systemd/system/webscraping-mcp.service
:
[Unit]
Description=WebScraping.AI MCP Server
After=network.target
[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/server
Environment="WEBSCRAPING_AI_API_KEY=your_key"
ExecStart=/usr/bin/python3 /path/to/webscraping_mcp_server.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable webscraping-mcp
sudo systemctl start webscraping-mcp
Security Considerations
- Environment Variables: Always use environment variables for API keys
- Input Validation: Validate all URLs and parameters
- Rate Limiting: Implement rate limiting to prevent abuse
- Error Messages: Don't expose sensitive information in error messages
- HTTPS Only: Only allow HTTPS URLs for scraping
- Timeout Protection: Set reasonable timeouts for all requests
from typing import Optional
from urllib.parse import urlparse
def validate_url(url: str) -> tuple[bool, Optional[str]]:
"""Validate URL security"""
try:
parsed = urlparse(url)
if parsed.scheme not in ["http", "https"]:
return False, "Only HTTP/HTTPS URLs allowed"
if not parsed.netloc:
return False, "Invalid URL format"
        # Block obvious local addresses (a full SSRF check should also resolve the
        # hostname and reject private IP ranges)
        if any(x in parsed.netloc.lower() for x in ["localhost", "127.0.0.1", "0.0.0.0"]):
return False, "Cannot scrape local URLs"
return True, None
except Exception as e:
return False, str(e)
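Rate limiting (point 3 above) can be as simple as enforcing a minimum interval between outbound API calls. The class below is a minimal sketch, with 5 requests per second chosen arbitrarily:
import asyncio
import time

class RateLimiter:
    """Allow at most `rate` outbound requests per second across all tools."""
    def __init__(self, rate: float = 5.0):
        self._min_interval = 1.0 / rate
        self._last_call = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = self._min_interval - (now - self._last_call)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_call = time.monotonic()

limiter = RateLimiter(rate=5.0)
# Inside a tool handler: `await limiter.wait()` before each request to the scraping API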
Conclusion
The MCP SDK provides a powerful, standardized way to build AI-powered web scraping tools that integrate seamlessly with Claude and other MCP-compatible assistants. By understanding the SDK's architecture, request handlers, and best practices, you can create robust scraping servers that handle everything from simple HTML extraction to complex multi-step data pipelines.
Whether you're building in Python, TypeScript, or another supported language, the MCP SDK abstracts away protocol complexity and lets you focus on building great scraping tools that users can control through natural language.
Start by installing the SDK for your preferred language, build a simple server with one or two tools, configure it in Claude Desktop, and expand from there. The combination of MCP's standardization and WebScraping.AI's powerful API creates endless possibilities for automated data extraction workflows.