What is an MCP Server and How Does It Work?
Model Context Protocol (MCP) is an open protocol developed by Anthropic that enables AI assistants like Claude to securely connect with external data sources, tools, and services. MCP servers act as intermediaries that expose specific functionality to AI models through a standardized interface, making it easier to integrate web scraping, database access, API interactions, and other capabilities into AI-powered workflows.
Understanding the MCP Architecture
MCP follows a client-server architecture where:
- MCP Host: The application embedding the AI model (like Claude Desktop or an IDE)
- MCP Client: The component within the host that communicates with MCP servers
- MCP Server: A lightweight service that exposes specific tools, resources, or prompts to the AI model
- Transport Layer: The communication mechanism (typically stdio or HTTP/SSE)
This architecture allows AI assistants to access real-time data, execute code, interact with APIs, and perform web scraping operations without requiring direct integration into the AI model itself.
Core Components of MCP Servers
1. Resources
Resources represent data that the AI can read. In web scraping contexts, resources might include:
- Cached HTML content from previously scraped pages
- Configuration files with scraping rules
- Database records containing scraped data
- API response templates
Example resource definition in TypeScript:
import { ListResourcesRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// `server` is an initialized Server instance (see the complete example later in this article)
server.setRequestHandler(ListResourcesRequestSchema, async () => {
  return {
    resources: [
      {
        uri: "scraper://config/settings",
        name: "Scraper Configuration",
        mimeType: "application/json",
        description: "Current web scraper settings"
      },
      {
        uri: "scraper://cache/latest",
        name: "Latest Scraped Content",
        mimeType: "text/html",
        description: "Most recently scraped webpage"
      }
    ]
  };
});
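Listing resources only advertises them; the server also needs a handler that returns a resource's contents when the client reads its URI. Here is a minimal sketch using the Python SDK's decorator style (introduced in the next example); the settings payload is purely illustrative:
import json

@app.read_resource()
async def read_resource(uri) -> str:
    # Return the content behind a URI advertised by list_resources
    if str(uri) == "scraper://config/settings":
        return json.dumps({"timeout": 10, "user_agent": "MCPScraper/1.0"})
    raise ValueError(f"Unknown resource: {uri}")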
2. Tools
Tools are functions that the AI can execute. For web scraping, tools might include:
- HTTP request execution
- HTML parsing and data extraction
- Screenshot capture
- Browser automation, similar to handling browser sessions in Puppeteer
- Proxy management
Example tool implementation in Python:
from mcp.server import Server
from mcp.types import Tool, TextContent
import httpx
from bs4 import BeautifulSoup

app = Server("web-scraper")

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_webpage",
            description="Scrape content from a webpage using HTTP requests",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector to extract specific elements"
                    },
                    "use_javascript": {
                        "type": "boolean",
                        "description": "Whether to render JavaScript"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_webpage":
        # Note: use_javascript is accepted but not handled here; see the
        # JavaScript-heavy sites section below for a browser-based tool
        url = arguments["url"]
        selector = arguments.get("selector")
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()  # fail fast on HTTP errors instead of parsing error pages
            soup = BeautifulSoup(response.text, 'html.parser')
            if selector:
                elements = soup.select(selector)
                content = "\n".join([el.get_text() for el in elements])
            else:
                content = soup.get_text()
        return [TextContent(
            type="text",
            text=f"Scraped content from {url}:\n\n{content}"
        )]
    raise ValueError(f"Unknown tool: {name}")
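As written, this example defines handlers but never starts the server. A minimal run loop, following the low-level Python SDK's stdio pattern, might look like this (a sketch; app is the Server instance defined above):
import asyncio
from mcp.server.stdio import stdio_server

async def main():
    # Serve over stdin/stdout so an MCP client can spawn this process
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())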
3. Prompts
Prompts are reusable templates that help guide the AI for specific tasks. For web scraping:
from mcp.types import Prompt, PromptArgument

@app.list_prompts()
async def list_prompts() -> list[Prompt]:
    return [
        Prompt(
            name="extract_product_data",
            description="Extract structured product information from e-commerce pages",
            arguments=[
                PromptArgument(
                    name="url",
                    description="E-commerce product page URL",
                    required=True
                )
            ]
        )
    ]
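A listed prompt needs a companion handler that returns the actual message template when the client requests it. A minimal sketch; the instruction wording here is illustrative:
from mcp.types import GetPromptResult, PromptMessage, TextContent

@app.get_prompt()
async def get_prompt(name: str, arguments: dict | None) -> GetPromptResult:
    if name == "extract_product_data":
        url = (arguments or {}).get("url", "")
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"Extract the product name, price, and availability from {url}, returning JSON."
                    )
                )
            ]
        )
    raise ValueError(f"Unknown prompt: {name}")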
How MCP Servers Work in Practice
Connection Flow
1. Discovery: The MCP client discovers available servers through configuration
2. Initialization: The client establishes a connection (typically via stdio)
3. Capability Negotiation: Client and server exchange supported features
4. Request/Response: The AI makes requests through the client to the server
5. Execution: The server executes the requested operation and returns results (see the client-side sketch below)
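The same flow, sketched from the client side with the official Python SDK (the server command mirrors the stdio configuration below):
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Discovery + initialization: spawn the configured server over stdio
    params = StdioServerParameters(command="python", args=["-m", "mcp_scraper_server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()           # capability negotiation
            tools = await session.list_tools()   # request/response
            result = await session.call_tool(    # execution on the server
                "scrape_webpage", {"url": "https://example.com"}
            )
            print(result)

asyncio.run(main())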
Transport Mechanisms
Stdio Transport (most common for local tools):
{
"mcpServers": {
"web-scraper": {
"command": "python",
"args": ["-m", "mcp_scraper_server"]
}
}
}
HTTP with SSE Transport (for remote servers; note that the exact configuration keys vary by MCP client):
{
"mcpServers": {
"remote-scraper": {
"url": "https://scraper.example.com/mcp",
"transport": "sse"
}
}
}
Building a Web Scraping MCP Server
Here's a complete example of a simple MCP server for web scraping in JavaScript:
#!/usr/bin/env node
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";
import * as cheerio from "cheerio";
const server = new Server(
{
name: "web-scraper-mcp",
version: "1.0.0",
},
{
    capabilities: {
      tools: {}  // this example registers tool handlers only
    },
}
);
// Define scraping tools
server.setRequestHandler("tools/list", async () => {
return {
tools: [
{
name: "fetch_html",
description: "Fetch HTML content from a URL",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "URL to fetch"
},
headers: {
type: "object",
description: "Optional HTTP headers"
}
},
required: ["url"]
}
},
{
name: "extract_data",
description: "Extract data using CSS selectors",
inputSchema: {
type: "object",
properties: {
html: {
type: "string",
description: "HTML content to parse"
},
selector: {
type: "string",
description: "CSS selector"
},
attribute: {
type: "string",
description: "Optional attribute to extract"
}
},
required: ["html", "selector"]
}
}
]
};
});
// Handle tool execution
server.setRequestHandler("tools/call", async (request) => {
const { name, arguments: args } = request.params;
try {
if (name === "fetch_html") {
const response = await axios.get(args.url, {
headers: args.headers || {
'User-Agent': 'Mozilla/5.0 (compatible; MCPScraper/1.0)'
},
timeout: 10000
});
return {
content: [
{
type: "text",
text: response.data
}
]
};
}
if (name === "extract_data") {
const $ = cheerio.load(args.html);
const elements = $(args.selector);
const results = [];
elements.each((i, el) => {
if (args.attribute) {
results.push($(el).attr(args.attribute));
} else {
results.push($(el).text().trim());
}
});
return {
content: [
{
type: "text",
text: JSON.stringify(results, null, 2)
}
]
};
}
throw new Error(`Unknown tool: ${name}`);
} catch (error) {
return {
content: [
{
type: "text",
text: `Error: ${error.message}`
}
],
isError: true
};
}
});
// Start the server
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Web Scraper MCP server running on stdio");
}
main().catch(console.error);
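Because this script uses ES module imports, run it from a package whose package.json sets "type": "module", or give the file an .mjs extension.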
Advanced Web Scraping with MCP
Handling JavaScript-Heavy Sites
For sites that render content with JavaScript (similar to handling AJAX requests using Puppeteer), you can integrate browser automation:
from playwright.async_api import async_playwright

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "scrape_dynamic":
        url = arguments["url"]
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url)
            # Wait until network activity settles so JS-rendered content is present
            await page.wait_for_load_state('networkidle')
            content = await page.content()
            await browser.close()
        return [TextContent(type="text", text=content)]
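Playwright's browser binaries are installed separately; run playwright install chromium once before using this tool.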
Error Handling and Retries
Robust error handling is crucial for production scraping:
async function fetchWithRetry(url, maxRetries = 3) {
for (let i = 0; i < maxRetries; i++) {
try {
const response = await axios.get(url, { timeout: 10000 });
return response.data;
} catch (error) {
if (i === maxRetries - 1) throw error;
const delay = Math.pow(2, i) * 1000; // Exponential backoff
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
Rate Limiting and Proxy Support
import asyncio
from datetime import datetime

class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.rate = requests_per_second
        self.last_request = None
        self._lock = asyncio.Lock()  # serialize concurrent tool calls

    async def acquire(self):
        async with self._lock:
            if self.last_request:
                elapsed = (datetime.now() - self.last_request).total_seconds()
                wait_time = (1 / self.rate) - elapsed
                if wait_time > 0:
                    await asyncio.sleep(wait_time)
            self.last_request = datetime.now()

# Usage in a tool handler
rate_limiter = RateLimiter(requests_per_second=2)

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    await rate_limiter.acquire()
    # ... perform scraping
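The heading also mentions proxies. A minimal round-robin rotation sketch with httpx follows; the proxy URLs are placeholders, and note that httpx has used both proxy (recent versions) and proxies (older versions) as the parameter name:
import itertools
import httpx

# Placeholder proxy endpoints; substitute your own
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

async def fetch_via_proxy(url: str) -> str:
    proxy = next(_proxy_cycle)  # rotate to the next proxy for each request
    async with httpx.AsyncClient(proxy=proxy, timeout=10.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text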
Security Considerations
When building MCP servers for web scraping:
- Input Validation: Always validate and sanitize URLs and parameters
- Rate Limiting: Implement request throttling to avoid overwhelming target sites
- Access Control: Restrict which domains can be scraped
- Error Disclosure: Don't expose sensitive error details to the AI
- Resource Limits: Set timeouts and memory limits
from urllib.parse import urlparse

ALLOWED_DOMAINS = ['example.com', 'api.example.org']

def validate_url(url: str) -> bool:
    domain = urlparse(url).netloc.lower()
    # Require an exact match or a true subdomain; a bare endswith() check
    # would also accept look-alike domains such as "evil-example.com"
    return any(
        domain == allowed or domain.endswith("." + allowed)
        for allowed in ALLOWED_DOMAINS
    )
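Wired into a tool handler, the check might look like this (a hypothetical snippet; the error text deliberately avoids revealing the allow-list):
if not validate_url(arguments["url"]):
    return [TextContent(type="text", text="Error: URL not permitted by server policy")]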
Configuration and Deployment
Local Development Setup
Install dependencies:
npm install @modelcontextprotocol/sdk axios cheerio
# or
pip install mcp httpx beautifulsoup4
Configure Claude Desktop (claude_desktop_config.json):
{
"mcpServers": {
"web-scraper": {
"command": "node",
"args": ["/path/to/scraper-server.js"]
}
}
}
Testing Your MCP Server
# Test with the MCP Inspector
npx @modelcontextprotocol/inspector node scraper-server.js
# Or, with the Python SDK's CLI extras installed (pip install "mcp[cli]")
mcp dev your_mcp_server.py
Use Cases for Web Scraping MCP Servers
- Competitive Intelligence: Automated monitoring of competitor websites
- Price Tracking: Real-time price comparison across e-commerce sites
- Content Aggregation: Collecting articles, news, or research papers
- SEO Analysis: Extracting meta tags, headers, and structured data
- Lead Generation: Gathering contact information from business directories
- Market Research: Analyzing product reviews and customer sentiment
- Data Validation: Verifying information across multiple sources
Best Practices
- Respect robots.txt: Check and honor robots.txt directives (see the sketch after this list)
- Use Appropriate User-Agents: Identify your scraper properly
- Implement Caching: Store results to minimize redundant requests
- Handle Pagination: Support multi-page data extraction efficiently
- Monitor Performance: Track success rates and response times
- Graceful Degradation: Fall back to simpler methods when complex ones fail
- Documentation: Clearly document available tools and their parameters
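The robots.txt item above can be implemented with the standard library alone. A minimal sketch using urllib.robotparser; the user agent string is an example, and whether to fail open when robots.txt is unreachable is a policy choice:
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "MCPScraper/1.0") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetch and parse the site's robots.txt
    except OSError:
        return True  # robots.txt unreachable: fail open here
    return parser.can_fetch(user_agent, url)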
Conclusion
MCP servers provide a powerful, standardized way to integrate web scraping capabilities into AI-assisted workflows. By implementing the Model Context Protocol, you can create reusable, secure, and maintainable scraping tools that work seamlessly with AI assistants like Claude. Whether you're building simple HTTP-based scrapers or complex browser automation tools similar to interacting with DOM elements in Puppeteer, MCP offers a flexible framework for exposing these capabilities to AI models.
The protocol's extensibility means you can start simple and gradually add more sophisticated features like JavaScript rendering, proxy rotation, and anti-bot detection as your needs grow. With proper error handling, rate limiting, and security measures, MCP servers can become reliable components in your data extraction infrastructure.