How do I use web scraping APIs with MCP servers?
Integrating web scraping APIs with Model Context Protocol (MCP) servers enables you to build powerful, AI-assisted data extraction workflows. MCP servers act as a bridge between AI models and external tools, making it possible to combine intelligent decision-making with robust scraping capabilities.
This guide will show you how to effectively use web scraping APIs within MCP server implementations, providing practical examples and best practices for production use.
Understanding MCP Server Architecture
MCP servers expose tools, resources, and prompts that AI models can interact with. When you integrate a web scraping API into an MCP server, you're creating a tool that the AI can call to fetch and process web data.
The basic flow works like this (a sample exchange follows the list):
- AI model requests data extraction through an MCP tool
- MCP server receives the request and calls the web scraping API
- Web scraping API fetches and processes the target page
- MCP server returns the extracted data to the AI model
- AI model processes and presents the results
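Under the hood, each tool invocation in steps 1-4 travels as a JSON-RPC message over the MCP transport. Here is a simplified sketch of the request and matching response for the scrape_html tool defined later in this guide; the field values are illustrative and some protocol details are omitted.

Request (AI model to MCP server):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape_html",
    "arguments": { "url": "https://example.com" }
  }
}

Response (MCP server to AI model):

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      { "type": "text", "text": "<html>...</html>" }
    ]
  }
}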
Setting Up an MCP Server with Web Scraping API Integration
Python Implementation
Here's a complete example of an MCP server that integrates a web scraping API using Python:
import asyncio
import os

from mcp.server import Server, NotificationOptions
from mcp.server.models import InitializationOptions
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import httpx

# Initialize the MCP server
app = Server("web-scraping-api-server")

# Your web scraping API configuration
API_KEY = os.getenv("WEBSCRAPING_API_KEY")
API_BASE_URL = "https://api.webscraping.ai"


@app.list_tools()
async def list_tools() -> list[Tool]:
    """Define available web scraping tools."""
    return [
        Tool(
            name="scrape_html",
            description="Extract HTML content from any webpage with JavaScript rendering support",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "wait_for": {
                        "type": "string",
                        "description": "CSS selector to wait for before returning content"
                    },
                    "proxy": {
                        "type": "string",
                        "enum": ["datacenter", "residential"],
                        "description": "Type of proxy to use"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_text",
            description="Extract clean text content from a webpage",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to extract text from"
                    },
                    "return_links": {
                        "type": "boolean",
                        "description": "Whether to include links in the response"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="ai_question",
            description="Ask a question about webpage content using AI",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to analyze"
                    },
                    "question": {
                        "type": "string",
                        "description": "Question to ask about the page content"
                    }
                },
                "required": ["url", "question"]
            }
        )
    ]


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    """Handle tool calls by invoking the web scraping API."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        if name == "scrape_html":
            params = {
                "url": arguments["url"],
                "api_key": API_KEY,
                "js": "true"
            }
            if "wait_for" in arguments:
                params["wait_for"] = arguments["wait_for"]
            if "proxy" in arguments:
                params["proxy"] = arguments["proxy"]

            response = await client.get(f"{API_BASE_URL}/html", params=params)
            response.raise_for_status()
            return [TextContent(
                type="text",
                text=f"Successfully scraped {arguments['url']}:\n\n{response.text}"
            )]

        elif name == "extract_text":
            params = {
                "url": arguments["url"],
                "api_key": API_KEY
            }
            if arguments.get("return_links"):
                params["return_links"] = "true"

            response = await client.get(f"{API_BASE_URL}/text", params=params)
            response.raise_for_status()
            return [TextContent(
                type="text",
                text=response.text
            )]

        elif name == "ai_question":
            params = {
                "url": arguments["url"],
                "question": arguments["question"],
                "api_key": API_KEY
            }

            response = await client.get(f"{API_BASE_URL}/question", params=params)
            response.raise_for_status()
            return [TextContent(
                type="text",
                text=response.text
            )]

        else:
            raise ValueError(f"Unknown tool: {name}")


async def main():
    """Run the MCP server."""
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="web-scraping-api-server",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )


if __name__ == "__main__":
    asyncio.run(main())
JavaScript/TypeScript Implementation
For Node.js environments, here's an equivalent implementation:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

const API_KEY = process.env.WEBSCRAPING_API_KEY;
const API_BASE_URL = "https://api.webscraping.ai";

// Create MCP server instance
const server = new Server(
  {
    name: "web-scraping-api-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_html",
        description: "Extract HTML content from any webpage with JavaScript rendering support",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            wait_for: {
              type: "string",
              description: "CSS selector to wait for before returning content",
            },
            proxy: {
              type: "string",
              enum: ["datacenter", "residential"],
              description: "Type of proxy to use",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "extract_text",
        description: "Extract clean text content from a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to extract text from",
            },
            return_links: {
              type: "boolean",
              description: "Whether to include links in the response",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "ai_question",
        description: "Ask a question about webpage content using AI",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to analyze",
            },
            question: {
              type: "string",
              description: "Question to ask about the page content",
            },
          },
          required: ["url", "question"],
        },
      },
    ],
  };
});

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  try {
    if (name === "scrape_html") {
      const params = {
        url: args.url,
        api_key: API_KEY,
        js: true,
      };
      if (args.wait_for) params.wait_for = args.wait_for;
      if (args.proxy) params.proxy = args.proxy;

      const response = await axios.get(`${API_BASE_URL}/html`, { params });
      return {
        content: [
          {
            type: "text",
            text: `Successfully scraped ${args.url}:\n\n${response.data}`,
          },
        ],
      };
    } else if (name === "extract_text") {
      const params = {
        url: args.url,
        api_key: API_KEY,
      };
      if (args.return_links) params.return_links = true;

      const response = await axios.get(`${API_BASE_URL}/text`, { params });
      return {
        content: [
          {
            type: "text",
            text: response.data,
          },
        ],
      };
    } else if (name === "ai_question") {
      const params = {
        url: args.url,
        question: args.question,
        api_key: API_KEY,
      };

      const response = await axios.get(`${API_BASE_URL}/question`, { params });
      return {
        content: [
          {
            type: "text",
            text: response.data,
          },
        ],
      };
    } else {
      throw new Error(`Unknown tool: ${name}`);
    }
  } catch (error) {
    return {
      content: [
        {
          type: "text",
          text: `Error: ${error.message}`,
        },
      ],
      isError: true,
    };
  }
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("Web Scraping API MCP server running on stdio");
}

main().catch((error) => {
  console.error("Fatal error in main():", error);
  process.exit(1);
});
Configuration and Installation
Installing Dependencies
For Python:
pip install mcp httpx python-dotenv
For JavaScript:
npm install @modelcontextprotocol/sdk axios dotenv
Environment Configuration
Create a .env file with your API credentials:
WEBSCRAPING_API_KEY=your_api_key_here
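The Python dependencies above include python-dotenv, so the server can load this file at startup instead of relying on the variable being exported in the shell. A minimal sketch, placed near the top of scraping_server.py:

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
API_KEY = os.getenv("WEBSCRAPING_API_KEY")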
Registering the MCP Server
Add your server to the Claude Desktop configuration file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "web-scraping-api": {
      "command": "python",
      "args": ["/path/to/your/scraping_server.py"],
      "env": {
        "WEBSCRAPING_API_KEY": "your_api_key_here"
      }
    }
  }
}
For Node.js:
{
  "mcpServers": {
    "web-scraping-api": {
      "command": "node",
      "args": ["/path/to/your/scraping_server.js"],
      "env": {
        "WEBSCRAPING_API_KEY": "your_api_key_here"
      }
    }
  }
}
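After editing the configuration file, restart Claude Desktop so it launches the new server and exposes its tools.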
Advanced Features and Best Practices
Handling Dynamic Content
When scraping single-page applications or pages with AJAX-loaded content, use the wait_for parameter to ensure content is fully loaded:
# In your MCP tool call
result = await call_tool("scrape_html", {
    "url": "https://example.com/spa",
    "wait_for": "div.product-list",
    "proxy": "residential"
})
This is particularly useful when handling AJAX requests or working with dynamic content that requires specific elements to load.
Error Handling and Retry Logic
Implement robust error handling in your MCP server:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
async def scrape_with_retry(url: str, **kwargs):
    """Scrape URL with automatic retry on failure."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        params = {"url": url, "api_key": API_KEY, **kwargs}
        response = await client.get(f"{API_BASE_URL}/html", params=params)
        response.raise_for_status()
        return response.text
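Note that tenacity is not part of the dependencies listed earlier; install it separately with pip install tenacity.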
Rate Limiting and Concurrency
When processing multiple URLs, implement proper rate limiting:
import asyncio
from asyncio import Semaphore

async def scrape_multiple_urls(urls: list[str], max_concurrent: int = 5):
    """Scrape multiple URLs with concurrency control."""
    semaphore = Semaphore(max_concurrent)

    async def scrape_one(url: str):
        async with semaphore:
            return await call_tool("scrape_html", {"url": url})

    results = await asyncio.gather(*[scrape_one(url) for url in urls])
    return results
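If a single failing URL should not abort the whole batch, pass return_exceptions=True to asyncio.gather so per-URL errors are returned alongside successful results instead of being raised.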
Caching Responses
Implement caching to reduce API calls and improve performance:
import hashlib
import json
import time

class ScrapingCache:
    def __init__(self, ttl: int = 3600):
        self.cache = {}
        self.ttl = ttl

    def get_cache_key(self, url: str, params: dict) -> str:
        """Generate cache key from URL and parameters."""
        cache_string = f"{url}:{json.dumps(params, sort_keys=True)}"
        return hashlib.md5(cache_string.encode()).hexdigest()

    async def get_or_scrape(self, url: str, params: dict):
        """Get from cache or scrape if not cached."""
        cache_key = self.get_cache_key(url, params)

        if cache_key in self.cache:
            cached_data, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                return cached_data

        # Scrape and cache
        result = await scrape_with_retry(url, **params)
        self.cache[cache_key] = (result, time.time())
        return result
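A brief usage sketch, run inside an async function and reusing the scrape_with_retry helper from the retry example (the TTL value is illustrative):

cache = ScrapingCache(ttl=1800)  # keep entries for 30 minutes

# Identical URL/parameter combinations within the TTL are served from memory
html = await cache.get_or_scrape("https://example.com", {"js": "true"})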
Monitoring and Logging
Add comprehensive logging to track API usage and debug issues:
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("web-scraping-mcp")

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    logger.info(f"Tool called: {name} with arguments: {arguments}")
    try:
        # perform_scraping stands in for the dispatch logic shown in the earlier call_tool example
        result = await perform_scraping(name, arguments)
        logger.info(f"Tool {name} completed successfully")
        return result
    except Exception as e:
        logger.error(f"Tool {name} failed: {str(e)}", exc_info=True)
        raise
Integration with Browser Automation
For scenarios requiring more control over browser sessions, you can combine web scraping APIs with browser automation tools through your MCP server. This allows you to leverage both approaches within the same workflow.
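As an illustration only (browser automation is not covered in the server code above), an additional MCP tool could drive a local headless browser with Playwright for multi-step interactions, while the API-backed tools handle bulk fetching. A minimal sketch, with the helper name and parameters assumed:

from playwright.async_api import async_playwright

async def scrape_with_browser(url: str, click_selector: str | None = None) -> str:
    """Fetch a page through a locally controlled browser session (hypothetical helper)."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        if click_selector:
            # Perform an interaction that a single stateless API call cannot
            await page.click(click_selector)
        html = await page.content()
        await browser.close()
        return html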
Testing Your MCP Server
Create a simple test script to verify your integration:
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def test_scraping_server():
    """Test the web scraping MCP server."""
    server_params = StdioServerParameters(
        command="python",
        args=["scraping_server.py"],
        env={"WEBSCRAPING_API_KEY": "your_key"}
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print("Available tools:", tools)

            # Test HTML scraping
            result = await session.call_tool(
                "scrape_html",
                {"url": "https://example.com"}
            )
            print("Scraping result:", result)

if __name__ == "__main__":
    asyncio.run(test_scraping_server())
Conclusion
Integrating web scraping APIs with MCP servers creates a powerful combination that enables AI models to intelligently extract and process web data. By following the patterns and best practices outlined in this guide, you can build robust, scalable scraping solutions that leverage the strengths of both technologies.
The key to success is proper error handling, rate limiting, and choosing the right scraping approach for your use case. Whether you're extracting structured data, monitoring content changes, or building AI-powered research tools, MCP servers provide the perfect framework for exposing web scraping capabilities to AI models.