How do I use an MCP server with Python for web scraping?
The Model Context Protocol (MCP) is an open standard that enables seamless integration between AI applications and external data sources. When it comes to web scraping, MCP servers written in Python provide a powerful way to expose scraping capabilities as standardized tools that AI assistants can leverage. This guide will show you how to build and use MCP servers with Python for web scraping projects.
What is MCP and Why Use It for Web Scraping?
MCP is an open protocol developed by Anthropic that allows AI applications to connect to various data sources and tools through a standardized interface. For web scraping, MCP servers act as intermediaries that:
- Expose web scraping functionality as callable tools
- Provide structured data extraction capabilities
- Handle complex browser automation tasks
- Integrate seamlessly with AI assistants like Claude
Using Python for MCP servers is particularly advantageous because Python has a rich ecosystem of web scraping libraries like BeautifulSoup, Scrapy, Selenium, and Playwright.
Prerequisites
Before building an MCP server for web scraping in Python, ensure you have:
- Python 3.10 or higher installed
- Basic understanding of async/await patterns in Python
- Familiarity with web scraping concepts
- Node.js (optional, for testing with MCP clients such as the MCP Inspector)
Installing the MCP Python SDK
First, install the official MCP SDK for Python:
pip install mcp
For web scraping capabilities, you'll also want to install scraping libraries:
# For HTML parsing
pip install beautifulsoup4 requests
# For browser automation
pip install playwright
playwright install chromium
# Or use Selenium
pip install selenium
Building a Basic MCP Server for Web Scraping
Here's a complete example of an MCP server that provides web scraping capabilities using Python:
import asyncio
import json
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import httpx
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
# Initialize the MCP server
app = Server("web-scraper")
# Tool 1: Simple HTML fetcher
@app.list_tools()
async def list_tools() -> list[Tool]:
return [
Tool(
name="fetch_html",
description="Fetch HTML content from a URL",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to fetch"
}
},
"required": ["url"]
}
),
Tool(
name="extract_text",
description="Extract text content from a webpage using CSS selectors",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to scrape"
},
"selector": {
"type": "string",
"description": "CSS selector to target elements"
}
},
"required": ["url", "selector"]
}
),
Tool(
name="scrape_dynamic",
description="Scrape JavaScript-rendered content using browser automation",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to scrape"
},
"wait_selector": {
"type": "string",
"description": "CSS selector to wait for before extracting content"
}
},
"required": ["url"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
if name == "fetch_html":
return await fetch_html(arguments["url"])
elif name == "extract_text":
return await extract_text(arguments["url"], arguments["selector"])
elif name == "scrape_dynamic":
return await scrape_dynamic(
arguments["url"],
arguments.get("wait_selector")
)
else:
raise ValueError(f"Unknown tool: {name}")
async def fetch_html(url: str) -> list[TextContent]:
"""Fetch raw HTML from a URL"""
async with httpx.AsyncClient() as client:
response = await client.get(url)
response.raise_for_status()
return [TextContent(
type="text",
text=response.text
)]
async def extract_text(url: str, selector: str) -> list[TextContent]:
"""Extract text content using CSS selectors"""
async with httpx.AsyncClient() as client:
response = await client.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
elements = soup.select(selector)
results = [element.get_text(strip=True) for element in elements]
return [TextContent(
type="text",
text=json.dumps(results, indent=2)
)]
async def scrape_dynamic(url: str, wait_selector: str | None = None) -> list[TextContent]:
    """Scrape JavaScript-rendered content using Playwright"""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url)
            if wait_selector:
                await page.wait_for_selector(wait_selector, timeout=10000)
            else:
                await page.wait_for_load_state('networkidle')
            content = await page.content()
        finally:
            # Close the browser even if navigation or waiting fails
            await browser.close()
        return [TextContent(
            type="text",
            text=content
        )]
async def main():
"""Run the MCP server"""
async with stdio_server() as (read_stream, write_stream):
await app.run(
read_stream,
write_stream,
app.create_initialization_options()
)
if __name__ == "__main__":
asyncio.run(main())
Save this as scraper_server.py and run it:
python scraper_server.py
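Since the server communicates over stdio, running it directly will simply wait for a client to connect. To exercise the tools interactively, you can use the MCP Inspector (this is where Node.js comes in), passing your server command as arguments per the Inspector's documented pattern:

npx @modelcontextprotocol/inspector python scraper_server.py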
Advanced Features: Adding Proxy Support and Error Handling
For production web scraping, you'll want to add features like proxy support, rate limiting, and robust error handling:
async def scrape_with_proxy(url: str, proxy: str | None = None, timeout: int = 30) -> list[TextContent]:
"""Scrape with proxy support and error handling"""
try:
async with async_playwright() as p:
browser = await p.chromium.launch(
proxy={"server": proxy} if proxy else None
)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
page = await context.new_page()
# Set timeout
page.set_default_timeout(timeout * 1000)
try:
await page.goto(url, wait_until='networkidle')
content = await page.content()
return [TextContent(
type="text",
text=content
)]
except Exception as e:
return [TextContent(
type="text",
text=f"Error scraping page: {str(e)}"
)]
finally:
await browser.close()
except Exception as e:
return [TextContent(
type="text",
text=f"Error initializing browser: {str(e)}"
)]
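Transient failures such as timeouts, rate-limit responses, and flaky proxies are routine in production scraping, so retrying with exponential backoff is worth adding alongside proxy support. Here is a minimal sketch using httpx; the retry_fetch helper and its retry parameters are illustrative choices for this example, not part of the MCP SDK:

import asyncio
import httpx
from mcp.types import TextContent

async def retry_fetch(url: str, retries: int = 3, backoff: float = 1.0) -> list[TextContent]:
    """Fetch a URL, retrying transient failures with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(url, timeout=30.0)
                response.raise_for_status()
                return [TextContent(type="text", text=response.text)]
        except httpx.TimeoutException as e:
            last_error = e
        except httpx.HTTPStatusError as e:
            # Retry only rate limits and server errors; other 4xx responses are permanent
            if e.response.status_code not in (429, 500, 502, 503, 504):
                raise
            last_error = e
        # Wait 1s, 2s, 4s, ... between attempts
        await asyncio.sleep(backoff * (2 ** attempt))
    return [TextContent(type="text", text=f"Failed after {retries} attempts: {last_error}")]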
Connecting to Your MCP Server from Claude Desktop
To use your Python MCP server with Claude Desktop, add it to your Claude configuration file:
On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"web-scraper": {
"command": "python",
"args": ["/path/to/your/scraper_server.py"]
}
}
}
After restarting Claude Desktop, you can use your scraping tools directly in conversations:
Can you scrape the pricing information from example.com using the extract_text tool?
Using MCP with the Playwright MCP Server
If you prefer not to build your own server, you can leverage the existing Playwright MCP server which provides comprehensive browser automation capabilities out of the box. Similar concepts apply when handling browser sessions or managing page navigation.
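For example, assuming Node.js is installed, Microsoft's Playwright MCP server can be registered in the same claude_desktop_config.json shown earlier. The @playwright/mcp package name below reflects its current distribution, so check the project's README if the invocation has changed:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}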
Best Practices for Python MCP Servers
1. Use Async/Await Properly
MCP servers in Python are built on asyncio, so all I/O operations should be async:
# Good - async HTTP requests
async with httpx.AsyncClient() as client:
response = await client.get(url)
# Avoid - blocking requests
import requests
response = requests.get(url) # This will block the event loop
2. Implement Rate Limiting
Prevent overwhelming target servers:
import asyncio
from asyncio import Semaphore
# Limit concurrent requests
semaphore = Semaphore(3)
async def scrape_with_limit(url: str):
async with semaphore:
await asyncio.sleep(1) # Rate limit delay
return await fetch_html(url)
3. Add Comprehensive Error Handling
Always handle network errors gracefully:
try:
async with httpx.AsyncClient() as client:
response = await client.get(url, timeout=30.0)
response.raise_for_status()
except httpx.TimeoutException:
return [TextContent(type="text", text="Request timed out")]
except httpx.HTTPStatusError as e:
return [TextContent(type="text", text=f"HTTP error: {e.response.status_code}")]
except Exception as e:
return [TextContent(type="text", text=f"Unexpected error: {str(e)}")]
4. Use Structured Data
Return data in structured formats like JSON when possible:
import json
from datetime import datetime

results = {
    "url": url,
    # Guard against pages without a <title> element
    "title": soup.title.get_text(strip=True) if soup.title else None,
    "links": [a['href'] for a in soup.find_all('a', href=True)],
    "scraped_at": datetime.now().isoformat()
}
return [TextContent(
    type="text",
    text=json.dumps(results, indent=2)
)]
Testing Your MCP Server
Create a simple test script (saved here as test_server.py, for example) to verify your server works:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def test_server():
server_params = StdioServerParameters(
command="python",
args=["scraper_server.py"]
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# List available tools
tools = await session.list_tools()
print("Available tools:", [tool.name for tool in tools.tools])
# Call a tool
result = await session.call_tool(
"extract_text",
{
"url": "https://example.com",
"selector": "h1"
}
)
print("Result:", result.content[0].text)
if __name__ == "__main__":
asyncio.run(test_server())
Integrating with Web Scraping APIs
For production scraping at scale, consider integrating your MCP server with specialized web scraping APIs. This approach handles challenges like proxy rotation, CAPTCHA solving, and JavaScript rendering automatically. You can wrap API calls in your MCP tools:
async def scrape_with_api(url: str, api_key: str) -> list[TextContent]:
"""Use WebScraping.AI API for reliable scraping"""
async with httpx.AsyncClient() as client:
response = await client.get(
"https://api.webscraping.ai/html",
params={
"url": url,
"api_key": api_key
}
)
response.raise_for_status()
return [TextContent(
type="text",
text=response.text
)]
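To expose this through your server, register it alongside the earlier tools. The sketch below follows the same list_tools/call_tool pattern from the basic server; the tool name scrape_api is this example's choice, not anything mandated by MCP:

# In list_tools(), add another Tool entry:
Tool(
    name="scrape_api",
    description="Fetch page HTML via the WebScraping.AI API",
    inputSchema={
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "The URL to scrape"},
            "api_key": {"type": "string", "description": "Your WebScraping.AI API key"}
        },
        "required": ["url", "api_key"]
    }
)

# In call_tool(), add a matching branch:
elif name == "scrape_api":
    return await scrape_with_api(arguments["url"], arguments["api_key"])

Passing the API key as a tool argument keeps the example simple; in practice you would likely read it from an environment variable so it never appears in conversation.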
Conclusion
Building MCP servers with Python for web scraping opens up powerful possibilities for AI-assisted data extraction. By following the patterns shown in this guide, you can create robust, reusable scraping tools that integrate seamlessly with AI assistants. Whether you're building simple HTML parsers or complex browser automation workflows, MCP provides a standardized way to expose these capabilities.
Start with basic tools, add features incrementally, and always follow web scraping best practices including respecting robots.txt, implementing rate limiting, and handling errors gracefully. As you become more comfortable with the MCP SDK, you can expand your server to include specialized scraping capabilities tailored to your specific use cases.