What is the Model Context Protocol?
The Model Context Protocol (MCP) is an open-source protocol developed by Anthropic that enables seamless communication between AI assistants (like Claude) and external data sources, tools, and services. For web scraping developers, MCP represents a paradigm shift in how you can automate data extraction workflows by allowing AI models to directly interact with scraping tools, databases, and APIs.
MCP provides a standardized way to extend AI capabilities beyond their training data, making them context-aware and able to perform real-time operations with external systems. This is particularly powerful for web scraping, where you need dynamic access to websites, parsing tools, and data storage solutions.
Core Concepts of MCP
MCP operates on a client-server architecture with three main components:
1. MCP Hosts (Clients)
These are applications that embed AI models, such as Claude Desktop, IDEs, or custom applications. The host initiates connections to MCP servers and manages the protocol communication.
2. MCP Servers
Lightweight programs that expose specific capabilities through the standardized protocol. For web scraping, an MCP server might provide:
- Access to web scraping APIs
- Browser automation tools
- HTML parsing utilities
- Data transformation functions
- Storage and database operations
3. Resources, Tools, and Prompts
MCP servers expose three types of capabilities:
- Resources: Data sources that the AI can read (HTML content, API responses, databases)
- Tools: Functions the AI can execute (scrape a URL, parse HTML, extract data)
- Prompts: Pre-configured templates for common workflows
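Under the hood, hosts and servers exchange JSON-RPC 2.0 messages to invoke these capabilities. As a rough sketch (the tool name and arguments here are illustrative), a tools/call request and its result look like this:

```python
import json

# Illustrative JSON-RPC 2.0 request a host sends to invoke a tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_url",
        "arguments": {"url": "https://example.com"},
    },
}

# Illustrative response: the server returns tool output as content blocks
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "<html>...</html>"}],
    },
}

print(json.dumps(request, indent=2))
```

The SDKs shown below generate and parse these messages for you; you only implement the handlers.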
MCP Architecture for Web Scraping
Here's how MCP enables AI-powered web scraping:
┌─────────────────┐         ┌──────────────────┐         ┌─────────────────┐
│    Claude AI    │◄───────►│    MCP Server    │◄───────►│  Scraping API   │
│    Assistant    │   MCP   │   (WebScraper)   │  HTTP   │ WebScraping.AI  │
└─────────────────┘         └──────────────────┘         └─────────────────┘
                                      │
                                      ▼
                             ┌──────────────────┐
                             │   Data Storage   │
                             │ (Database/File)  │
                             └──────────────────┘
Building an MCP Server for Web Scraping
Python Example
Here's a complete MCP server that exposes web scraping capabilities using the official MCP SDK:
import asyncio
import os

import httpx
from mcp.server import Server, NotificationOptions
from mcp.server.models import InitializationOptions
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

# Read the API key from the environment rather than hardcoding it
API_KEY = os.environ["WEBSCRAPING_AI_API_KEY"]

# Create MCP server instance
app = Server("webscraping-mcp-server")

# Define available tools
@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_url",
            description="Scrape HTML content from a URL with JavaScript rendering",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "wait_for": {
                        "type": "string",
                        "description": "CSS selector to wait for before returning"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_data",
            description="Extract structured data from HTML using AI",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "Target URL"
                    },
                    "fields": {
                        "type": "object",
                        "description": "Fields to extract with descriptions"
                    }
                },
                "required": ["url", "fields"]
            }
        )
    ]

# Implement tool execution
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_url":
        async with httpx.AsyncClient() as client:
            response = await client.get(
                "https://api.webscraping.ai/html",
                params={
                    "url": arguments["url"],
                    "api_key": API_KEY,
                    "js": "true",
                    "wait_for": arguments.get("wait_for")
                }
            )
        return [TextContent(type="text", text=f"HTML Content:\n{response.text}")]
    elif name == "extract_data":
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.webscraping.ai/fields",
                params={
                    "url": arguments["url"],
                    "api_key": API_KEY
                },
                json={"fields": arguments["fields"]}
            )
        return [TextContent(type="text", text=f"Extracted Data:\n{response.json()}")]
    raise ValueError(f"Unknown tool: {name}")

# Run the server over stdio
async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="webscraping-mcp",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    asyncio.run(main())
JavaScript/TypeScript Example
For Node.js environments, you can build an MCP server using the official SDK:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

// Create server instance
const server = new Server(
  {
    name: "webscraping-mcp-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_html",
        description: "Scrape HTML content from any URL with JavaScript rendering",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            js_timeout: {
              type: "number",
              description: "Maximum JavaScript rendering time in ms",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "extract_text",
        description: "Extract clean text content from a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to extract text from",
            },
          },
          required: ["url"],
        },
      },
    ],
  };
});

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  if (name === "scrape_html") {
    const response = await axios.get("https://api.webscraping.ai/html", {
      params: {
        url: args.url,
        api_key: process.env.WEBSCRAPING_AI_API_KEY,
        js: true,
        js_timeout: args.js_timeout || 2000,
      },
    });
    return {
      content: [
        {
          type: "text",
          text: response.data,
        },
      ],
    };
  }

  if (name === "extract_text") {
    const response = await axios.get("https://api.webscraping.ai/text", {
      params: {
        url: args.url,
        api_key: process.env.WEBSCRAPING_AI_API_KEY,
      },
    });
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(response.data, null, 2),
        },
      ],
    };
  }

  throw new Error(`Unknown tool: ${name}`);
});

// Start the server (log to stderr; stdout carries protocol messages)
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("WebScraping MCP Server running on stdio");
}

main().catch(console.error);
Configuring MCP Servers
To use your MCP server with Claude Desktop or other MCP hosts, you need to configure it in the MCP settings:
Claude Desktop Configuration (macOS)
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "webscraping": {
      "command": "python",
      "args": ["/path/to/your/mcp_server.py"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_api_key_here"
      }
    }
  }
}
Claude Desktop Configuration (Windows)
Edit %APPDATA%\Claude\claude_desktop_config.json with the same structure.
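If you manage several servers, you can merge entries into the config file programmatically instead of editing JSON by hand. A minimal sketch using only the standard library (register_server is a hypothetical helper, not part of any SDK):

```python
import json
from pathlib import Path

def register_server(config_path: Path, name: str, command: str,
                    args: list[str], env: dict[str, str]) -> dict:
    """Merge one MCP server entry into claude_desktop_config.json,
    preserving any servers already registered there."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text())
    config.setdefault("mcpServers", {})[name] = {
        "command": command,
        "args": args,
        "env": env,
    }
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

For example, `register_server(path, "webscraping", "python", ["/path/to/your/mcp_server.py"], {"WEBSCRAPING_AI_API_KEY": "your_api_key_here"})` produces the same entry shown above. Restart Claude Desktop after changing the file so it picks up the new server.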
Real-World Use Cases
1. Automated Product Monitoring
Similar to how you might handle AJAX requests using Puppeteer, MCP servers can automate the monitoring of dynamic product pages:
# In your MCP server
@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "monitor_price":
        async with httpx.AsyncClient() as client:
            # Scrape the product page, waiting for the price element to render
            html_response = await client.get(
                "https://api.webscraping.ai/html",
                params={
                    "url": arguments["product_url"],
                    "api_key": API_KEY,
                    "wait_for": ".price"
                }
            )
            # Extract the price using the AI question endpoint
            data_response = await client.post(
                "https://api.webscraping.ai/question",
                params={"url": arguments["product_url"], "api_key": API_KEY},
                json={"question": "What is the current price of this product?"}
            )
        return [TextContent(type="text", text=str(data_response.json()))]
2. Content Aggregation
MCP servers can aggregate content from multiple sources, handling complex scenarios like browser sessions:
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "aggregate_news") {
    const sources = request.params.arguments.sources;
    const results = [];

    for (const source of sources) {
      const response = await axios.get("https://api.webscraping.ai/text", {
        params: {
          url: source,
          api_key: process.env.WEBSCRAPING_AI_API_KEY,
        },
      });
      results.push(response.data);
    }

    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(results, null, 2),
        },
      ],
    };
  }
});
3. Data Pipeline Automation
Combine MCP with scraping APIs for end-to-end data pipelines:
# Install MCP server dependencies
pip install mcp httpx pandas
# Run your MCP server
python webscraping_mcp_server.py
Then interact naturally with Claude:
"Please scrape the product catalog from example.com, extract all prices,
and save them to a CSV file"
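The final step of such a pipeline, persisting the extracted fields, can be exposed as one more tool on the server. A sketch using only the standard library (save_prices_csv is a hypothetical helper, not part of the MCP SDK):

```python
import csv
from pathlib import Path

def save_prices_csv(rows: list[dict], path: str) -> int:
    """Write extracted product records to a CSV file.

    Column order follows the keys of the first record;
    returns the number of data rows written."""
    if not rows:
        return 0
    fieldnames = list(rows[0].keys())
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Registered as a tool, this lets Claude chain scrape, extract, and save calls itself in response to the request above.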
Security Considerations
When building MCP servers for web scraping:
- API Key Management: Store API keys in environment variables, never hardcode them
- Rate Limiting: Implement rate limiting to avoid overwhelming target sites
- Input Validation: Validate all URLs and parameters before passing to scraping APIs
- Error Handling: Implement robust error handling for network failures and timeouts
# Example security best practices
import os
from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    """Validate a URL before scraping"""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

# Use environment variables for sensitive data
API_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY")
if not API_KEY:
    raise ValueError("API key not configured")
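For the rate-limiting point, one common approach is a token bucket that tool handlers await before each outbound request. A minimal asyncio sketch (the rate and capacity values are illustrative):

```python
import asyncio
import time

class TokenBucket:
    """Allow at most `rate` requests per second, bursting up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

A handler would then call `await bucket.acquire()` before each scraping API request, keeping the server within whatever limit the target site or API plan allows.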
Advantages of MCP for Web Scraping
- Standardization: Single protocol works across different AI models and tools
- Modularity: Build reusable scraping components that work with any MCP host
- Context Awareness: AI can make intelligent decisions about scraping strategies
- Natural Language Control: Control complex scraping workflows through conversation
- Rapid Prototyping: Build and test scraping workflows without writing boilerplate code
Getting Started
To start building with MCP for web scraping:
1. Install the MCP SDK:

# Python
pip install mcp

# Node.js
npm install @modelcontextprotocol/sdk

2. Get a WebScraping.AI API Key: sign up at WebScraping.AI
3. Build Your First MCP Server: use the examples above as templates
4. Configure Claude Desktop: add your server to the MCP configuration
5. Test Your Integration: start conversing with Claude to trigger your scraping tools
Conclusion
The Model Context Protocol represents a fundamental shift in how developers can build AI-powered web scraping solutions. By creating standardized MCP servers, you can expose scraping capabilities that any MCP-compatible AI assistant can use through natural language, making complex data extraction workflows more accessible and maintainable.
Whether you're building automated monitoring systems, content aggregation pipelines, or custom data extraction tools, MCP provides the foundation for connecting AI intelligence with web scraping capabilities in a secure, standardized way.