What is Claude MCP and How Does It Integrate with Web Scraping?
Claude MCP refers to Anthropic's implementation of, and ecosystem around, the Model Context Protocol (MCP): an open standard that lets Claude interact with external tools, data sources, and services through a single, well-defined interface. For web scraping developers, Claude MCP opens up powerful possibilities for building AI-driven automation workflows that combine natural language understanding with robust data extraction.
Claude MCP allows developers to create custom servers that expose web scraping functionality directly to Claude, enabling conversational control over complex scraping operations. Instead of writing traditional scripts, you can build MCP servers that let Claude orchestrate scraping tasks, handle dynamic content, and make intelligent decisions based on the data it encounters.
Understanding Claude MCP Architecture
Claude MCP operates through a client-server model specifically designed for AI assistants:
The MCP Stack for Web Scraping
┌──────────────────────────────────────────────────┐
│  Claude Desktop / API                            │
│  (MCP Host - Anthropic)                          │
└──────────────────┬───────────────────────────────┘
                   │  MCP Protocol (JSON-RPC)
                   │
┌──────────────────▼───────────────────────────────┐
│  Your Custom MCP Server                          │
│  (Python/TypeScript/JavaScript)                  │
│   - Exposes scraping tools                       │
│   - Handles browser automation                   │
│   - Manages data extraction                      │
└──────────────────┬───────────────────────────────┘
                   │  HTTP/HTTPS
                   │
┌──────────────────▼───────────────────────────────┐
│  Web Scraping Infrastructure                     │
│   - Target websites                              │
│   - Scraping APIs (WebScraping.AI)               │
│   - Browser automation (Puppeteer/Playwright)    │
│   - Data storage                                 │
└──────────────────────────────────────────────────┘
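Under the hood, the host and your server exchange JSON-RPC 2.0 messages over a transport (stdio in the examples below). As a rough illustration of what a tool invocation looks like on the wire, here is a sketch of a tools/call request and its result, written as Python dictionaries (the id, URL, and text values are made up; the SDKs build and parse these messages for you):

# Illustrative only: the MCP SDKs handle this serialization automatically.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "scrape_html",
        "arguments": {"url": "https://example.com", "proxy": "datacenter"},
    },
}

tool_call_result = {
    "jsonrpc": "2.0",
    "id": 7,
    "result": {
        "content": [
            {"type": "text", "text": "HTML content from https://example.com: ..."}
        ]
    },
}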
Key Components
- Claude Desktop/API (MCP Host): The application running Claude that connects to your MCP servers
- MCP Server: Your custom server exposing web scraping capabilities
- Tools: Specific scraping functions Claude can invoke
- Resources: Data sources Claude can read (scraped content, databases)
- Prompts: Reusable templates for common scraping workflows
Building an MCP Server for Web Scraping with Claude
Python Implementation with Official SDK
Here's a production-ready MCP server that integrates web scraping capabilities with Claude:
import asyncio
import json
import os
from typing import Any, Sequence

import httpx
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent, ImageContent, EmbeddedResource

# Initialize MCP server
app = Server("claude-webscraping-server")

# WebScraping.AI API configuration
API_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY")
BASE_URL = "https://api.webscraping.ai"


@app.list_tools()
async def list_tools() -> list[Tool]:
    """Define available web scraping tools for Claude"""
    return [
        Tool(
            name="scrape_html",
            description="Scrape full HTML content from a URL with JavaScript rendering. Use for dynamic sites.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The target URL to scrape"
                    },
                    "wait_for": {
                        "type": "string",
                        "description": "CSS selector to wait for before returning content"
                    },
                    "js_timeout": {
                        "type": "integer",
                        "description": "JavaScript rendering timeout in milliseconds",
                        "default": 2000
                    },
                    "proxy": {
                        "type": "string",
                        "description": "Proxy type: datacenter or residential",
                        "enum": ["datacenter", "residential"]
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="scrape_text",
            description="Extract clean, readable text content from a webpage. Best for articles and content analysis.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to extract text from"
                    },
                    "return_links": {
                        "type": "boolean",
                        "description": "Include links found in the text",
                        "default": False
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="ask_question",
            description="Use AI to answer specific questions about webpage content without writing selectors.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The webpage URL"
                    },
                    "question": {
                        "type": "string",
                        "description": "Natural language question about the page content"
                    }
                },
                "required": ["url", "question"]
            }
        ),
        Tool(
            name="extract_structured_data",
            description="Extract specific fields from a webpage using AI-powered field extraction.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to extract data from"
                    },
                    "fields": {
                        "type": "object",
                        "description": "Object mapping field names to extraction instructions",
                        "additionalProperties": {"type": "string"}
                    }
                },
                "required": ["url", "fields"]
            }
        ),
        Tool(
            name="scrape_selected",
            description="Extract content matching a CSS selector from a webpage.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The target URL"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector to extract"
                    }
                },
                "required": ["url", "selector"]
            }
        )
    ]


@app.call_tool()
async def call_tool(name: str, arguments: Any) -> Sequence[TextContent | ImageContent | EmbeddedResource]:
    """Execute web scraping tools requested by Claude"""
    async with httpx.AsyncClient(timeout=30.0) as client:
        try:
            if name == "scrape_html":
                params = {
                    "url": arguments["url"],
                    "api_key": API_KEY,
                    "js": "true",
                    "js_timeout": arguments.get("js_timeout", 2000)
                }
                if "wait_for" in arguments:
                    params["wait_for"] = arguments["wait_for"]
                if "proxy" in arguments:
                    params["proxy"] = arguments["proxy"]

                response = await client.get(f"{BASE_URL}/html", params=params)
                response.raise_for_status()

                return [TextContent(
                    type="text",
                    text=f"HTML content from {arguments['url']}:\n\n{response.text}"
                )]

            elif name == "scrape_text":
                params = {
                    "url": arguments["url"],
                    "api_key": API_KEY,
                    "return_links": arguments.get("return_links", False)
                }

                response = await client.get(f"{BASE_URL}/text", params=params)
                response.raise_for_status()
                data = response.json()

                return [TextContent(
                    type="text",
                    text=f"Text content from {arguments['url']}:\n\n{json.dumps(data, indent=2)}"
                )]

            elif name == "ask_question":
                params = {
                    "url": arguments["url"],
                    "api_key": API_KEY,
                    "question": arguments["question"]
                }

                response = await client.get(f"{BASE_URL}/question", params=params)
                response.raise_for_status()

                return [TextContent(
                    type="text",
                    text=f"Answer: {response.text}"
                )]

            elif name == "extract_structured_data":
                params = {
                    "url": arguments["url"],
                    "api_key": API_KEY
                }

                response = await client.post(
                    f"{BASE_URL}/fields",
                    params=params,
                    json={"fields": arguments["fields"]}
                )
                response.raise_for_status()
                data = response.json()

                return [TextContent(
                    type="text",
                    text=f"Extracted data:\n\n{json.dumps(data, indent=2)}"
                )]

            elif name == "scrape_selected":
                params = {
                    "url": arguments["url"],
                    "api_key": API_KEY,
                    "selector": arguments["selector"]
                }

                response = await client.get(f"{BASE_URL}/selected", params=params)
                response.raise_for_status()
                data = response.json()

                return [TextContent(
                    type="text",
                    text=f"Selected content:\n\n{json.dumps(data, indent=2)}"
                )]

            else:
                raise ValueError(f"Unknown tool: {name}")

        except httpx.HTTPError as e:
            return [TextContent(
                type="text",
                text=f"Error scraping {arguments.get('url', 'URL')}: {str(e)}"
            )]


async def main():
    """Run the MCP server"""
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )


if __name__ == "__main__":
    asyncio.run(main())
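The server above registers tools only. As the Key Components list noted, MCP servers can also expose resources (data Claude can read directly) and prompts (reusable workflow templates). Here is a minimal sketch of what that could look like with the same Python SDK decorators; the resource URI, prompt name, and message text are placeholders, and error handling is omitted:

from mcp.types import Resource, Prompt, PromptArgument, GetPromptResult, PromptMessage

@app.list_resources()
async def list_resources() -> list[Resource]:
    # Expose previously scraped data as a readable resource (placeholder URI).
    return [Resource(
        uri="scrape://results/latest",
        name="Latest scrape results",
        mimeType="application/json"
    )]

@app.list_prompts()
async def list_prompts() -> list[Prompt]:
    # A reusable template for a common scraping workflow.
    return [Prompt(
        name="summarize_page",
        description="Scrape a URL and summarize its main content",
        arguments=[PromptArgument(name="url", description="Page to summarize", required=True)]
    )]

@app.get_prompt()
async def get_prompt(name: str, arguments: dict | None) -> GetPromptResult:
    url = (arguments or {}).get("url", "")
    return GetPromptResult(messages=[PromptMessage(
        role="user",
        content=TextContent(type="text", text=f"Use the scrape_text tool on {url} and summarize the result.")
    )])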
TypeScript/Node.js Implementation
For JavaScript developers, here's how to build an MCP server for Claude using Node.js:
#!/usr/bin/env node
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
  Tool,
} from "@modelcontextprotocol/sdk/types.js";
import axios, { AxiosError } from "axios";

// Initialize server
const server = new Server(
  {
    name: "claude-webscraping-mcp",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

const API_KEY = process.env.WEBSCRAPING_AI_API_KEY;
const BASE_URL = "https://api.webscraping.ai";

// Define scraping tools for Claude
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_dynamic_page",
        description: "Scrape JavaScript-heavy pages with full rendering support",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "URL to scrape",
            },
            wait_for: {
              type: "string",
              description: "CSS selector to wait for",
            },
            timeout: {
              type: "number",
              description: "Maximum wait time in milliseconds",
              default: 15000,
            },
          },
          required: ["url"],
        },
      },
      {
        name: "extract_product_data",
        description: "Extract structured product information from e-commerce pages",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "Product page URL",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "monitor_content_changes",
        description: "Check for content changes on a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "URL to monitor",
            },
            selector: {
              type: "string",
              description: "CSS selector to monitor for changes",
            },
          },
          required: ["url", "selector"],
        },
      },
    ] as Tool[],
  };
});

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  try {
    if (name === "scrape_dynamic_page") {
      const response = await axios.get(`${BASE_URL}/html`, {
        params: {
          url: args.url,
          api_key: API_KEY,
          js: true,
          wait_for: args.wait_for,
          timeout: args.timeout || 15000,
        },
      });

      return {
        content: [
          {
            type: "text",
            text: `Successfully scraped ${args.url}\n\nHTML Content:\n${response.data}`,
          },
        ],
      };
    }

    if (name === "extract_product_data") {
      // Use AI-powered field extraction
      const response = await axios.post(
        `${BASE_URL}/fields`,
        {
          fields: {
            title: "Product title or name",
            price: "Current price",
            original_price: "Original price before discount",
            availability: "In stock or out of stock",
            rating: "Customer rating",
            description: "Product description",
          },
        },
        {
          params: {
            url: args.url,
            api_key: API_KEY,
          },
        }
      );

      return {
        content: [
          {
            type: "text",
            text: `Product data from ${args.url}:\n\n${JSON.stringify(
              response.data,
              null,
              2
            )}`,
          },
        ],
      };
    }

    if (name === "monitor_content_changes") {
      const response = await axios.get(`${BASE_URL}/selected`, {
        params: {
          url: args.url,
          api_key: API_KEY,
          selector: args.selector,
        },
      });

      return {
        content: [
          {
            type: "text",
            text: `Current content for selector "${args.selector}":\n\n${JSON.stringify(
              response.data,
              null,
              2
            )}`,
          },
        ],
      };
    }

    throw new Error(`Unknown tool: ${name}`);
  } catch (error) {
    const errorMessage =
      error instanceof AxiosError
        ? `API Error: ${error.response?.data || error.message}`
        : `Error: ${error}`;

    return {
      content: [
        {
          type: "text",
          text: errorMessage,
        },
      ],
      isError: true,
    };
  }
});

// Start server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("Claude WebScraping MCP Server running");
}

main().catch(console.error);
Configuring Claude Desktop for Web Scraping
To connect your MCP server to Claude Desktop, you need to update the configuration file:
macOS Configuration
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "webscraping": {
      "command": "python",
      "args": ["/absolute/path/to/webscraping_mcp_server.py"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_api_key_here"
      }
    },
    "webscraping-node": {
      "command": "node",
      "args": ["/absolute/path/to/webscraping-mcp-server.js"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_api_key_here"
      }
    }
  }
}
Windows Configuration
Edit %APPDATA%\Claude\claude_desktop_config.json with the same structure.
Linux Configuration
Edit ~/.config/Claude/claude_desktop_config.json, again with the same structure.
Using Claude MCP for Web Scraping Workflows
Once configured, you can interact with Claude naturally to perform scraping tasks:
Example Conversations
Simple scraping:
You: "Please scrape the HTML from https://example.com and extract all product titles"
Claude: [Uses scrape_html tool, then analyzes the HTML to extract titles]
Complex automation:
You: "Monitor the price of this product every hour: https://shop.example.com/product/123"
Claude: [Uses extract_structured_data to get current price, can be combined with scheduling]
Multi-step workflows:
You: "Scrape the top 10 tech news sites, extract article headlines,
and summarize the main topics"
Claude: [Orchestrates multiple scrape_text calls, analyzes content, provides summary]
Advanced Integration Patterns
Combining MCP with Browser Automation
Similar to how you would handle browser sessions in Puppeteer or Playwright, you can create MCP tools that drive a real browser for steps an HTTP scraping API can't cover, such as logging in before extracting content:
from playwright.async_api import async_playwright

@app.call_tool()
async def call_tool(name: str, arguments: Any):
    if name == "scrape_with_auth":
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            context = await browser.new_context()
            page = await context.new_page()

            # Navigate and authenticate
            await page.goto(arguments["url"])
            await page.fill("#username", arguments["username"])
            await page.fill("#password", arguments["password"])
            await page.click("#login")

            # Wait for navigation
            await page.wait_for_load_state("networkidle")

            # Get content
            content = await page.content()
            await browser.close()

            return [TextContent(type="text", text=content)]
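Note that the snippet above launches and closes a fresh browser on every call. If you want a session that persists across tool calls, for example to stay logged in between scrapes, one option is to keep a module-level Playwright browser alive and reuse it. A rough sketch, assuming the server process stays running (shutdown handling omitted):

from playwright.async_api import async_playwright, Browser

_playwright = None
_browser: Browser | None = None

async def get_browser() -> Browser:
    """Start Playwright once and reuse the same browser for every tool call."""
    global _playwright, _browser
    if _browser is None:
        _playwright = await async_playwright().start()
        _browser = await _playwright.chromium.launch()
    return _browser

# Inside call_tool(), reuse the shared browser instead of launching a new one:
#     browser = await get_browser()
#     context = await browser.new_context()
#     page = await context.new_page()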
Handling Dynamic Content
For pages that load content dynamically via AJAX or client-side rendering, you can create tools that render JavaScript and wait for a specific selector before returning the HTML:
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "wait_for_dynamic_content") {
    const response = await axios.get(`${BASE_URL}/html`, {
      params: {
        url: request.params.arguments.url,
        api_key: API_KEY,
        js: true,
        wait_for: request.params.arguments.selector,
        js_timeout: 5000,
      },
    });

    return {
      content: [
        {
          type: "text",
          text: `Dynamic content loaded:\n${response.data}`,
        },
      ],
    };
  }
});
Real-World Use Cases
1. E-commerce Price Monitoring
# MCP tool for price tracking
@app.call_tool()
async def call_tool(name: str, arguments: Any):
    if name == "track_prices":
        products = arguments["products"]  # List of product URLs
        results = []

        async with httpx.AsyncClient() as client:
            for product_url in products:
                response = await client.post(
                    f"{BASE_URL}/fields",
                    params={"url": product_url, "api_key": API_KEY},
                    json={
                        "fields": {
                            "title": "Product name",
                            "price": "Current price",
                            "stock": "Availability status"
                        }
                    }
                )
                results.append(response.json())

        return [TextContent(
            type="text",
            text=f"Price tracking results:\n{json.dumps(results, indent=2)}"
        )]
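The tool above only returns a snapshot of current prices. Turning it into monitoring means running it on a schedule (from a cron job, or by asking Claude periodically) and comparing snapshots. A small, hypothetical helper for the comparison step, keyed on the title and price fields requested above:

def diff_prices(previous: list[dict], current: list[dict]) -> list[dict]:
    """Compare two price snapshots and report products whose price changed."""
    old_prices = {item.get("title"): item.get("price") for item in previous}
    changes = []
    for item in current:
        title, price = item.get("title"), item.get("price")
        if title in old_prices and old_prices[title] != price:
            changes.append({"title": title, "old_price": old_prices[title], "new_price": price})
    return changes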
2. Content Aggregation Pipeline
// Aggregate articles from multiple sources
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "aggregate_articles") {
    const sources = request.params.arguments.sources;
    const articles = [];

    for (const source of sources) {
      const response = await axios.get(`${BASE_URL}/text`, {
        params: {
          url: source,
          api_key: API_KEY,
          return_links: true,
        },
      });

      articles.push({
        source: source,
        content: response.data,
      });
    }

    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(articles, null, 2),
        },
      ],
    };
  }
});
3. Competitive Intelligence
@app.call_tool()
async def call_tool(name: str, arguments: Any):
    if name == "competitor_analysis":
        competitor_urls = arguments["competitors"]
        insights = []

        async with httpx.AsyncClient() as client:
            for url in competitor_urls:
                # Ask specific questions about each competitor
                questions = [
                    "What are the main product categories?",
                    "What is the pricing strategy?",
                    "What unique features are highlighted?"
                ]

                competitor_data = {"url": url, "insights": {}}

                for question in questions:
                    response = await client.get(
                        f"{BASE_URL}/question",
                        params={
                            "url": url,
                            "api_key": API_KEY,
                            "question": question
                        }
                    )
                    competitor_data["insights"][question] = response.text

                insights.append(competitor_data)

        return [TextContent(
            type="text",
            text=f"Competitor analysis:\n{json.dumps(insights, indent=2)}"
        )]
Best Practices for Claude MCP Web Scraping
1. Error Handling and Resilience
@app.call_tool()
async def call_tool(name: str, arguments: Any):
    max_retries = 3
    retry_delay = 2

    for attempt in range(max_retries):
        try:
            async with httpx.AsyncClient(timeout=30.0) as client:
                response = await client.get(
                    f"{BASE_URL}/html",
                    params={
                        "url": arguments["url"],
                        "api_key": API_KEY
                    }
                )
                response.raise_for_status()
                return [TextContent(type="text", text=response.text)]
        except httpx.HTTPError as e:
            if attempt < max_retries - 1:
                await asyncio.sleep(retry_delay)
                continue
            return [TextContent(
                type="text",
                text=f"Failed after {max_retries} attempts: {str(e)}"
            )]
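The fixed two-second delay works for a demo, but a common refinement is exponential backoff with jitter so repeated failures back off progressively instead of retrying at a constant rate. A small sketch that could replace the asyncio.sleep(retry_delay) call above:

import asyncio
import random

async def backoff_sleep(attempt: int, base: float = 1.0, cap: float = 30.0) -> None:
    """Sleep for base * 2^attempt seconds (capped), plus random jitter."""
    delay = min(cap, base * (2 ** attempt))
    await asyncio.sleep(delay + random.uniform(0, delay / 2))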
2. Rate Limiting and Throttling
class RateLimiter {
  private requests: number[] = [];
  private maxRequests: number;
  private timeWindow: number;

  constructor(maxRequests: number, timeWindowMs: number) {
    this.maxRequests = maxRequests;
    this.timeWindow = timeWindowMs;
  }

  async waitForSlot(): Promise<void> {
    const now = Date.now();
    this.requests = this.requests.filter((time) => now - time < this.timeWindow);

    if (this.requests.length >= this.maxRequests) {
      const oldestRequest = this.requests[0];
      const waitTime = this.timeWindow - (now - oldestRequest);
      await new Promise((resolve) => setTimeout(resolve, waitTime));
    }

    this.requests.push(Date.now());
  }
}

const limiter = new RateLimiter(10, 60000); // 10 requests per minute

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  await limiter.waitForSlot();
  // Proceed with scraping...
});
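If you are running the Python server from earlier instead, the same sliding-window idea translates directly to asyncio. A minimal sketch (the class name and limits are illustrative):

import asyncio
import time

class AsyncRateLimiter:
    """Allow at most max_requests calls per time_window seconds."""

    def __init__(self, max_requests: int, time_window: float):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests: list[float] = []
        self._lock = asyncio.Lock()

    async def wait_for_slot(self) -> None:
        async with self._lock:
            now = time.monotonic()
            self.requests = [t for t in self.requests if now - t < self.time_window]
            if len(self.requests) >= self.max_requests:
                await asyncio.sleep(self.time_window - (now - self.requests[0]))
            self.requests.append(time.monotonic())

limiter = AsyncRateLimiter(10, 60.0)  # 10 requests per minute
# In call_tool(): await limiter.wait_for_slot() before making the HTTP request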
3. Validation and Security
from urllib.parse import urlparse
import re

def validate_url(url: str) -> tuple[bool, str]:
    """Validate URL before scraping"""
    try:
        result = urlparse(url)

        if not all([result.scheme, result.netloc]):
            return False, "Invalid URL format"

        if result.scheme not in ['http', 'https']:
            return False, "Only HTTP/HTTPS protocols allowed"

        # Prevent internal network access
        if re.match(r'(localhost|127\.0\.0\.1|10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)', result.netloc):
            return False, "Internal network access not allowed"

        return True, ""
    except Exception as e:
        return False, f"URL validation error: {str(e)}"

@app.call_tool()
async def call_tool(name: str, arguments: Any):
    url = arguments.get("url")

    is_valid, error = validate_url(url)
    if not is_valid:
        return [TextContent(type="text", text=f"Invalid URL: {error}")]

    # Proceed with scraping...
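The regex above blocks the most common private ranges, but it misses cases such as 169.254.x.x link-local addresses, IPv6 loopback, and public hostnames that resolve to internal IPs. If your server will accept arbitrary URLs from a conversation, a stricter (though still not exhaustive) approach is to resolve the hostname and inspect the resulting addresses with the ipaddress module; a sketch:

import ipaddress
import socket
from urllib.parse import urlparse

def is_public_url(url: str) -> bool:
    """Reject URLs whose hostname resolves to private, loopback, link-local, or reserved addresses."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        resolved = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for family, _, _, _, sockaddr in resolved:
        address = ipaddress.ip_address(sockaddr[0].split("%")[0])
        if address.is_private or address.is_loopback or address.is_link_local or address.is_reserved:
            return False
    return True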
Installation and Setup
Prerequisites
# Python environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install MCP SDK and dependencies
pip install mcp httpx playwright
playwright install chromium  # downloads the browser binaries used by the Playwright examples

# Node.js environment
npm init -y
npm install @modelcontextprotocol/sdk axios playwright
Running Your MCP Server
# Python
python webscraping_mcp_server.py
# Node.js
node webscraping-mcp-server.js
# Make executable (Unix-like systems)
chmod +x webscraping-mcp-server.js
./webscraping-mcp-server.js
Debugging and Testing
Test Your MCP Server
# Use MCP Inspector for debugging
npx @modelcontextprotocol/inspector python webscraping_mcp_server.py
# Or for Node.js
npx @modelcontextprotocol/inspector node webscraping-mcp-server.js
Logging
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('mcp_server.log'),
        logging.StreamHandler()  # defaults to stderr, keeping stdout free for MCP protocol messages
    ]
)

logger = logging.getLogger(__name__)

@app.call_tool()
async def call_tool(name: str, arguments: Any):
    logger.info(f"Tool called: {name} with arguments: {arguments}")
    # Your tool logic...
Advantages of Claude MCP for Web Scraping
- Natural Language Control: Describe what you want to scrape in plain English
- Intelligent Decision Making: Claude can adapt scraping strategies based on page structure
- Context Awareness: Claude remembers previous scraping results and can build on them
- Error Recovery: Claude can troubleshoot failed scrapes and try alternative approaches
- Data Analysis: Immediate analysis and insights from scraped data
- Workflow Orchestration: Complex multi-step scraping pipelines through conversation
Conclusion
Claude MCP transforms web scraping from a purely programmatic task into a conversational, AI-assisted workflow. By building MCP servers that expose scraping capabilities, you enable Claude to orchestrate complex data extraction operations, make intelligent decisions about scraping strategies, and provide immediate insights from collected data.
Whether you're building automated monitoring systems, competitive intelligence tools, or content aggregation pipelines, Claude MCP provides a powerful foundation for combining AI intelligence with robust web scraping infrastructure. The standardized protocol ensures your scraping tools work seamlessly with Claude while maintaining security, reliability, and scalability.
Start building your Claude MCP web scraping server today and experience the future of AI-powered data extraction.