What is the MCP SDK and How Do I Use It?
The MCP SDK (Model Context Protocol Software Development Kit) is a collection of official libraries and tools that enable developers to build MCP servers and clients in multiple programming languages. For web scraping developers, the MCP SDK provides the foundation for creating AI-powered automation tools that can intelligently scrape websites, extract data, and integrate with external APIs through a standardized protocol.
The SDK abstracts away the complexity of the Model Context Protocol specification, providing high-level APIs for exposing tools, resources, and prompts that AI assistants like Claude can use to perform web scraping tasks.
Available MCP SDKs
Anthropic and the open-source community maintain MCP SDKs for multiple programming languages:
- Python: mcp - ideal for data science and ML-focused scraping workflows
- TypeScript/JavaScript: @modelcontextprotocol/sdk - perfect for Node.js and web-based automation
- Java: MCP Java SDK - for enterprise Java applications
- Kotlin: MCP Kotlin SDK - Android and JVM applications
- C#/.NET: MCP.NET SDK - Windows and cross-platform .NET apps
- Go: MCP Go SDK - high-performance concurrent scraping
- PHP: MCP PHP SDK - WordPress plugins and web applications
- Ruby: MCP Ruby SDK - Rails applications and scripts
- Rust: MCP Rust SDK - systems programming and performance-critical applications
- Swift: MCP Swift SDK - iOS and macOS applications
For web scraping, the Python and TypeScript SDKs are the most commonly used due to their rich ecosystem of HTTP clients, HTML parsers, and data processing libraries.
Installing the MCP SDK
Python Installation
# Install the MCP SDK for Python
pip install mcp
# Install additional dependencies for web scraping
pip install httpx beautifulsoup4 pandas
# For async HTTP requests
pip install aiohttp
TypeScript/JavaScript Installation
# Create a new Node.js project
mkdir webscraping-mcp-server
cd webscraping-mcp-server
npm init -y
# Install the MCP SDK
npm install @modelcontextprotocol/sdk
# Install web scraping dependencies
npm install axios cheerio
# For TypeScript development
npm install --save-dev typescript @types/node
npm install tsx
Core MCP SDK Concepts
1. Server Architecture
MCP servers built with the SDK typically communicate via standard input/output (stdio), which makes them easy to integrate with any MCP host; a minimal skeleton is sketched after this list. The SDK handles:
- Protocol serialization/deserialization
- Message routing and validation
- Error handling and recovery
- Capability negotiation with clients
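For orientation, here is a minimal Python skeleton using the same SDK APIs as the full example later in this article; the server name is illustrative and no tools are registered yet:
import asyncio

from mcp.server import Server
from mcp.server.stdio import stdio_server

# An empty server: it negotiates capabilities with the client but exposes nothing yet
app = Server("minimal-mcp-server")

async def main():
    # stdio_server() wires stdin/stdout to the protocol layer; the SDK takes care
    # of message framing, validation, and capability negotiation
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())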
2. Request Handlers
The SDK uses a request-response pattern where servers register handlers for specific operations:
- List Tools: Advertise available scraping operations
- Call Tool: Execute scraping tasks with parameters
- List Resources: Expose data sources (cached HTML, databases)
- Read Resource: Retrieve resource content
- List Prompts: Provide templated scraping workflows
3. Transport Layer
The SDK supports multiple transport mechanisms:
- Stdio Transport: Standard input/output (most common)
- HTTP/SSE Transport: Server-sent events for web-based clients
- WebSocket Transport: Bidirectional real-time communication
Building a Web Scraping MCP Server with Python
Let's build a comprehensive MCP server that integrates with WebScraping.AI's API:
import asyncio
import os
from typing import Any
import httpx
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
# Initialize server
app = Server("webscraping-ai-mcp")
# API configuration
WEBSCRAPING_AI_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY")
BASE_URL = "https://api.webscraping.ai"
@app.list_tools()
async def list_tools() -> list[Tool]:
"""Define available web scraping tools"""
return [
Tool(
name="scrape_html",
description="Scrape raw HTML from any URL with JavaScript rendering support",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to scrape"
},
"js": {
"type": "boolean",
"description": "Enable JavaScript rendering",
"default": True
},
"wait_for": {
"type": "string",
"description": "CSS selector to wait for before extracting HTML"
},
"proxy": {
"type": "string",
"enum": ["datacenter", "residential"],
"description": "Proxy type to use"
}
},
"required": ["url"]
}
),
Tool(
name="extract_text",
description="Extract clean, readable text content from a webpage",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to extract text from"
},
"return_links": {
"type": "boolean",
"description": "Include hyperlinks in the text output"
}
},
"required": ["url"]
}
),
Tool(
name="ask_question",
description="Ask natural language questions about webpage content using AI",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to analyze"
},
"question": {
"type": "string",
"description": "The question to ask about the page"
}
},
"required": ["url", "question"]
}
),
Tool(
name="extract_fields",
description="Extract structured data fields from a webpage using AI",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to extract data from"
},
"fields": {
"type": "object",
"description": "Object mapping field names to extraction instructions",
"additionalProperties": {
"type": "string"
}
}
},
"required": ["url", "fields"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
"""Execute web scraping tools"""
if not WEBSCRAPING_AI_KEY:
return [TextContent(
type="text",
text="Error: WEBSCRAPING_AI_API_KEY environment variable not set"
)]
async with httpx.AsyncClient(timeout=30.0) as client:
try:
if name == "scrape_html":
response = await client.get(
f"{BASE_URL}/html",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY,
"js": arguments.get("js", True),
"wait_for": arguments.get("wait_for"),
"proxy": arguments.get("proxy", "residential")
}
)
response.raise_for_status()
return [TextContent(
type="text",
text=f"HTML Content ({len(response.text)} characters):\n\n{response.text}"
)]
elif name == "extract_text":
response = await client.get(
f"{BASE_URL}/text",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY,
"return_links": arguments.get("return_links", False)
}
)
response.raise_for_status()
data = response.json()
return [TextContent(
type="text",
text=f"Extracted Text:\n\n{data.get('text', '')}"
)]
elif name == "ask_question":
response = await client.post(
f"{BASE_URL}/question",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY
},
json={"question": arguments["question"]}
)
response.raise_for_status()
data = response.json()
return [TextContent(
type="text",
text=f"Answer: {data.get('answer', 'No answer found')}"
)]
elif name == "extract_fields":
response = await client.post(
f"{BASE_URL}/fields",
params={
"url": arguments["url"],
"api_key": WEBSCRAPING_AI_KEY
},
json={"fields": arguments["fields"]}
)
response.raise_for_status()
data = response.json()
# Format extracted fields nicely
result = "Extracted Fields:\n\n"
for field, value in data.items():
result += f"{field}: {value}\n"
return [TextContent(type="text", text=result)]
except httpx.HTTPStatusError as e:
return [TextContent(
type="text",
text=f"HTTP Error {e.response.status_code}: {e.response.text}"
)]
except Exception as e:
return [TextContent(
type="text",
text=f"Error: {str(e)}"
)]
return [TextContent(type="text", text=f"Unknown tool: {name}")]
async def main():
"""Run the MCP server"""
async with stdio_server() as (read_stream, write_stream):
await app.run(
read_stream,
write_stream,
app.create_initialization_options()
)
if __name__ == "__main__":
asyncio.run(main())
Save this as webscraping_mcp_server.py and run it with:
export WEBSCRAPING_AI_API_KEY="your_api_key_here"
python webscraping_mcp_server.py
Building a Web Scraping MCP Server with TypeScript
Here's the equivalent implementation in TypeScript:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
CallToolRequestSchema,
ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios, { AxiosError } from "axios";
const WEBSCRAPING_AI_KEY = process.env.WEBSCRAPING_AI_API_KEY;
const BASE_URL = "https://api.webscraping.ai";
// Create server instance
const server = new Server(
{
name: "webscraping-ai-mcp",
version: "1.0.0",
},
{
capabilities: {
tools: {},
},
}
);
// Register available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
return {
tools: [
{
name: "scrape_html",
description: "Scrape raw HTML from any URL with JavaScript rendering",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "The URL to scrape",
},
js: {
type: "boolean",
description: "Enable JavaScript rendering",
default: true,
},
wait_for: {
type: "string",
description: "CSS selector to wait for",
},
proxy: {
type: "string",
enum: ["datacenter", "residential"],
description: "Proxy type",
},
},
required: ["url"],
},
},
{
name: "extract_text",
description: "Extract clean text content from a webpage",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "The URL to extract text from",
},
return_links: {
type: "boolean",
description: "Include hyperlinks in output",
},
},
required: ["url"],
},
},
{
name: "ask_question",
description: "Ask questions about webpage content using AI",
inputSchema: {
type: "object",
properties: {
url: { type: "string", description: "The URL to analyze" },
question: { type: "string", description: "Question to ask" },
},
required: ["url", "question"],
},
},
{
name: "extract_fields",
description: "Extract structured data using AI",
inputSchema: {
type: "object",
properties: {
url: { type: "string", description: "Target URL" },
fields: {
type: "object",
description: "Fields to extract",
additionalProperties: { type: "string" },
},
},
required: ["url", "fields"],
},
},
],
};
});
// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
if (!WEBSCRAPING_AI_KEY) {
return {
content: [
{
type: "text",
text: "Error: WEBSCRAPING_AI_API_KEY not set",
},
],
};
}
try {
if (name === "scrape_html") {
const response = await axios.get(`${BASE_URL}/html`, {
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
js: args.js ?? true,
wait_for: args.wait_for,
proxy: args.proxy || "residential",
},
});
return {
content: [
{
type: "text",
text: `HTML Content (${response.data.length} characters):\n\n${response.data}`,
},
],
};
}
if (name === "extract_text") {
const response = await axios.get(`${BASE_URL}/text`, {
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
return_links: args.return_links || false,
},
});
return {
content: [
{
type: "text",
text: `Extracted Text:\n\n${response.data.text}`,
},
],
};
}
if (name === "ask_question") {
const response = await axios.post(
`${BASE_URL}/question`,
{ question: args.question },
{
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
},
}
);
return {
content: [
{
type: "text",
text: `Answer: ${response.data.answer}`,
},
],
};
}
if (name === "extract_fields") {
const response = await axios.post(
`${BASE_URL}/fields`,
{ fields: args.fields },
{
params: {
url: args.url,
api_key: WEBSCRAPING_AI_KEY,
},
}
);
const result =
"Extracted Fields:\n\n" +
Object.entries(response.data)
.map(([key, value]) => `${key}: ${value}`)
.join("\n");
return {
content: [{ type: "text", text: result }],
};
}
throw new Error(`Unknown tool: ${name}`);
} catch (error) {
const axiosError = error as AxiosError;
return {
content: [
{
type: "text",
text: `Error: ${axiosError.message}`,
},
],
};
}
});
// Start the server
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("WebScraping.AI MCP Server running on stdio");
}
main().catch(console.error);
Save as webscraping_mcp_server.ts and run with:
export WEBSCRAPING_AI_API_KEY="your_api_key_here"
npx tsx webscraping_mcp_server.ts
Configuring Your MCP Server
Claude Desktop Configuration
Once your server is built, configure it in Claude Desktop to make it accessible to Claude:
macOS Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"webscraping-ai": {
"command": "python",
"args": ["/absolute/path/to/webscraping_mcp_server.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here"
}
}
}
}
Windows Configuration (%APPDATA%\Claude\claude_desktop_config.json):
{
"mcpServers": {
"webscraping-ai": {
"command": "python",
"args": ["C:\\path\\to\\webscraping_mcp_server.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here"
}
}
}
}
For TypeScript servers:
{
"mcpServers": {
"webscraping-ai": {
"command": "npx",
"args": ["tsx", "/path/to/webscraping_mcp_server.ts"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here"
}
}
}
}
Advanced SDK Features
Adding Resources
Resources allow your MCP server to expose data that Claude can read. This is useful for caching scraped content:
import json

from mcp.types import Resource

# In-memory cache of recently scraped pages, keyed by URL (populated by the tool handlers)
recent_scrapes_cache: dict[str, str] = {}
@app.list_resources()
async def list_resources() -> list[Resource]:
return [
Resource(
uri="cache://recent-scrapes",
name="Recent Scraping Results",
mimeType="application/json",
description="Recently scraped webpage data"
)
]
@app.read_resource()
async def read_resource(uri: str) -> str:
if uri == "cache://recent-scrapes":
# Return cached scraping results
return json.dumps(recent_scrapes_cache)
raise ValueError(f"Unknown resource: {uri}")
Adding Prompts
Prompts provide pre-configured workflows that users can invoke:
from mcp.types import GetPromptResult, Prompt, PromptMessage, TextContent
@app.list_prompts()
async def list_prompts() -> list[Prompt]:
return [
Prompt(
name="scrape_product_page",
description="Extract product information from e-commerce pages",
arguments=[
{
"name": "url",
"description": "Product page URL",
"required": True
}
]
)
]
@app.get_prompt()
async def get_prompt(name: str, arguments: dict) -> GetPromptResult:
    if name == "scrape_product_page":
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"""Extract the following from {arguments['url']}:
- Product name
- Price
- Description
- Availability
- Customer rating
"""
                    )
                )
            ]
        )
    raise ValueError(f"Unknown prompt: {name}")
Error Handling Best Practices
Robust error handling is critical for production MCP servers:
import httpx
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
try:
# Validate inputs
if not arguments.get("url"):
raise ValueError("URL parameter is required")
# Validate URL format
from urllib.parse import urlparse
parsed = urlparse(arguments["url"])
if not all([parsed.scheme, parsed.netloc]):
raise ValueError("Invalid URL format")
# Execute scraping with timeout
async with httpx.AsyncClient(timeout=30.0) as client:
response = await client.get(...)
logger.info(f"Successfully executed {name} for {arguments['url']}")
return [TextContent(type="text", text=response.text)]
except httpx.TimeoutException:
logger.error(f"Timeout scraping {arguments.get('url')}")
return [TextContent(
type="text",
text="Error: Request timed out. The page took too long to load."
)]
except httpx.HTTPStatusError as e:
logger.error(f"HTTP {e.response.status_code} for {arguments.get('url')}")
return [TextContent(
type="text",
text=f"Error: HTTP {e.response.status_code} - {e.response.text}"
)]
except Exception as e:
logger.exception(f"Unexpected error in {name}")
return [TextContent(
type="text",
text=f"Unexpected error: {str(e)}"
)]
Testing Your MCP Server
Unit Testing with Python
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from webscraping_mcp_server import call_tool

@pytest.mark.asyncio
async def test_scrape_html_tool(monkeypatch):
    """Test HTML scraping tool"""
    # Make sure the handler sees an API key
    monkeypatch.setattr("webscraping_mcp_server.WEBSCRAPING_AI_KEY", "test-key")
    with patch("httpx.AsyncClient") as mock_client:
        # The response object is used synchronously (.text, .raise_for_status()),
        # so a plain MagicMock is enough
        mock_response = MagicMock()
        mock_response.text = "<html><body>Test</body></html>"
        # client.get() is awaited, so it must be an AsyncMock
        mock_client.return_value.__aenter__.return_value.get = AsyncMock(return_value=mock_response)

        result = await call_tool("scrape_html", {"url": "https://example.com"})

        assert len(result) == 1
        assert "Test" in result[0].text
Integration Testing
Test your server end-to-end using the MCP Inspector tool:
# Run the MCP Inspector against your server (no global install required)
npx @modelcontextprotocol/inspector python webscraping_mcp_server.py
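For a scripted end-to-end check, the Python SDK also ships client utilities that launch the server as a subprocess and talk to it over stdio. A minimal sketch, assuming the server file is named as above:
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test():
    params = StdioServerParameters(command="python", args=["webscraping_mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools exposed:", [tool.name for tool in tools.tools])

asyncio.run(smoke_test())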
Real-World Use Cases
1. E-commerce Price Monitoring
Build an MCP server that tracks product prices across multiple retailers, similar to handling browser sessions in Puppeteer:
# Excerpt: assumes the same imports, app, API key, and httpx client setup as the full server above
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "monitor_prices":
products = arguments["products"]
results = {}
for product in products:
response = await client.post(
f"{BASE_URL}/fields",
params={"url": product["url"], "api_key": API_KEY},
json={
"fields": {
"price": "current product price",
"availability": "in stock or out of stock"
}
}
)
results[product["name"]] = response.json()
return [TextContent(type="text", text=json.dumps(results, indent=2))]
2. Content Aggregation with AI
Extract and summarize content from multiple news sources:
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "aggregate_news":
sources = arguments["sources"]
summaries = []
for source in sources:
# Extract text
text_response = await client.get(
f"{BASE_URL}/text",
params={"url": source, "api_key": API_KEY}
)
# Ask for summary
summary_response = await client.post(
f"{BASE_URL}/question",
params={"url": source, "api_key": API_KEY},
json={"question": "Summarize the main points of this article in 2-3 sentences"}
)
            summaries.append({
                "url": source,
                "text_excerpt": text_response.json().get("text", "")[:500],
                "summary": summary_response.json()["answer"]
            })
return [TextContent(type="text", text=json.dumps(summaries, indent=2))]
3. SEO Audit Automation
Create automated SEO audits, similar to how you might use Puppeteer for SEO auditing:
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "seo_audit") {
const url = request.params.arguments.url;
// Scrape the page
const htmlResponse = await axios.get(`${BASE_URL}/html`, {
params: { url, api_key: WEBSCRAPING_AI_KEY, js: true },
});
// Extract SEO fields
const fieldsResponse = await axios.post(
`${BASE_URL}/fields`,
{
fields: {
title: "page title",
meta_description: "meta description",
h1_count: "number of h1 tags",
image_count: "number of images",
has_alt_tags: "do all images have alt tags",
},
},
{ params: { url, api_key: WEBSCRAPING_AI_KEY } }
);
    return {
      content: [
        {
          type: "text",
          text: `SEO Audit Results (page size: ${htmlResponse.data.length} characters):\n${JSON.stringify(fieldsResponse.data, null, 2)}`,
        },
      ],
    };
}
});
Performance Optimization
Concurrent Requests
Use async operations to scrape multiple pages simultaneously:
async def scrape_multiple(urls: list[str]) -> dict:
"""Scrape multiple URLs concurrently"""
async with httpx.AsyncClient() as client:
tasks = [
client.get(
f"{BASE_URL}/html",
params={"url": url, "api_key": WEBSCRAPING_AI_KEY}
)
for url in urls
]
responses = await asyncio.gather(*tasks, return_exceptions=True)
results = {}
for url, response in zip(urls, responses):
if isinstance(response, Exception):
results[url] = {"error": str(response)}
else:
results[url] = response.text
return results
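To expose this from the server, a hypothetical scrape_batch tool can wrap the helper; the sketch below assumes the same app, TextContent import, and an import json as in the full server above:
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_batch":
        results = await scrape_multiple(arguments["urls"])
        return [TextContent(type="text", text=json.dumps(results, indent=2))]
    return [TextContent(type="text", text=f"Unknown tool: {name}")]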
Caching Results
Implement caching to reduce API calls:
import hashlib
cache = {}
def cache_key(url: str) -> str:
return hashlib.md5(url.encode()).hexdigest()
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "scrape_cached":
url = arguments["url"]
key = cache_key(url)
# Check cache
if key in cache:
return [TextContent(type="text", text=f"[CACHED] {cache[key]}")]
# Scrape and cache
async with httpx.AsyncClient() as client:
response = await client.get(
f"{BASE_URL}/html",
params={"url": url, "api_key": WEBSCRAPING_AI_KEY}
)
cache[key] = response.text
return [TextContent(type="text", text=response.text)]
Deployment Strategies
Docker Deployment
Create a Dockerfile
for your MCP server:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY webscraping_mcp_server.py .
ENV WEBSCRAPING_AI_API_KEY=""
CMD ["python", "webscraping_mcp_server.py"]
Build and run:
docker build -t webscraping-mcp .
docker run -e WEBSCRAPING_AI_API_KEY="your_key" webscraping-mcp
Systemd Service (Linux)
Create /etc/systemd/system/webscraping-mcp.service
:
[Unit]
Description=WebScraping.AI MCP Server
After=network.target
[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/server
Environment="WEBSCRAPING_AI_API_KEY=your_key"
ExecStart=/usr/bin/python3 /path/to/webscraping_mcp_server.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable webscraping-mcp
sudo systemctl start webscraping-mcp
Security Considerations
- Environment Variables: Always use environment variables for API keys
- Input Validation: Validate all URLs and parameters
- Rate Limiting: Implement rate limiting to prevent abuse
- Error Messages: Don't expose sensitive information in error messages
- HTTPS Only: Only allow HTTPS URLs for scraping
- Timeout Protection: Set reasonable timeouts for all requests
from typing import Optional
from urllib.parse import urlparse
def validate_url(url: str) -> tuple[bool, Optional[str]]:
"""Validate URL security"""
try:
parsed = urlparse(url)
if parsed.scheme not in ["http", "https"]:
return False, "Only HTTP/HTTPS URLs allowed"
if not parsed.netloc:
return False, "Invalid URL format"
        # Block obvious local addresses (a full SSRF check should also resolve the
        # hostname and reject private IP ranges)
        if any(x in parsed.netloc.lower() for x in ["localhost", "127.0.0.1", "0.0.0.0"]):
return False, "Cannot scrape local URLs"
return True, None
except Exception as e:
return False, str(e)
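Rate limiting (point 3 above) can be as simple as enforcing a minimum interval between outbound API calls. The class below is a minimal sketch, with 5 requests per second chosen arbitrarily:
import asyncio
import time

class RateLimiter:
    """Allow at most `rate` outbound requests per second across all tools."""
    def __init__(self, rate: float = 5.0):
        self._min_interval = 1.0 / rate
        self._last_call = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = self._min_interval - (now - self._last_call)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_call = time.monotonic()

limiter = RateLimiter(rate=5.0)
# Inside a tool handler: `await limiter.wait()` before each request to the scraping API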
Conclusion
The MCP SDK provides a powerful, standardized way to build AI-powered web scraping tools that integrate seamlessly with Claude and other MCP-compatible assistants. By understanding the SDK's architecture, request handlers, and best practices, you can create robust scraping servers that handle everything from simple HTML extraction to complex multi-step data pipelines.
Whether you're building in Python, TypeScript, or another supported language, the MCP SDK abstracts away protocol complexity and lets you focus on building great scraping tools that users can control through natural language.
Start by installing the SDK for your preferred language, build a simple server with one or two tools, configure it in Claude Desktop, and expand from there. The combination of MCP's standardization and WebScraping.AI's powerful API creates endless possibilities for automated data extraction workflows.