How do I use MCP server with Python for web scraping?

The Model Context Protocol (MCP) is an open standard that enables seamless integration between AI applications and external data sources. When it comes to web scraping, MCP servers written in Python provide a powerful way to expose scraping capabilities as standardized tools that AI assistants can leverage. This guide will show you how to build and use MCP servers with Python for web scraping projects.

What is MCP and Why Use It for Web Scraping?

MCP (Model Context Protocol) is a protocol developed by Anthropic that allows AI applications to connect to various data sources and tools through a standardized interface. For web scraping, MCP servers act as intermediaries that:

  • Expose web scraping functionality as callable tools
  • Provide structured data extraction capabilities
  • Handle complex browser automation tasks
  • Integrate seamlessly with AI assistants like Claude

Using Python for MCP servers is particularly advantageous because Python has a rich ecosystem of web scraping libraries like BeautifulSoup, Scrapy, Selenium, and Playwright.

Prerequisites

Before building an MCP server for web scraping in Python, ensure you have:

  • Python 3.10 or higher installed
  • Basic understanding of async/await patterns in Python
  • Familiarity with web scraping concepts
  • Node.js (for testing with MCP clients)

Installing the MCP Python SDK

First, install the official MCP SDK for Python:

pip install mcp

For web scraping capabilities, you'll also want to install scraping libraries:

# For HTML parsing and async HTTP requests
pip install beautifulsoup4 httpx

# For browser automation
pip install playwright
playwright install chromium

# Or use Selenium
pip install selenium
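
Before writing the server, it's worth confirming the toolchain works. The following sanity check (a minimal sketch; the filename check_setup.py is just a suggestion) verifies that the MCP SDK imports and that Playwright can launch the Chromium build it downloaded:

import asyncio

# Importing mcp confirms the SDK installed correctly
import mcp

from playwright.async_api import async_playwright

async def main():
    # Launch and immediately close a headless Chromium instance
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await browser.close()
    print("MCP SDK and Playwright are ready")

asyncio.run(main())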

Building a Basic MCP Server for Web Scraping

Here's a complete example of an MCP server that provides web scraping capabilities using Python:

import asyncio
import json
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import httpx
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

# Initialize the MCP server
app = Server("web-scraper")

# Register the tools this server exposes
@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="fetch_html",
            description="Fetch HTML content from a URL",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to fetch"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_text",
            description="Extract text content from a webpage using CSS selectors",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector to target elements"
                    }
                },
                "required": ["url", "selector"]
            }
        ),
        Tool(
            name="scrape_dynamic",
            description="Scrape JavaScript-rendered content using browser automation",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "wait_selector": {
                        "type": "string",
                        "description": "CSS selector to wait for before extracting content"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "fetch_html":
        return await fetch_html(arguments["url"])
    elif name == "extract_text":
        return await extract_text(arguments["url"], arguments["selector"])
    elif name == "scrape_dynamic":
        return await scrape_dynamic(
            arguments["url"],
            arguments.get("wait_selector")
        )
    else:
        raise ValueError(f"Unknown tool: {name}")

async def fetch_html(url: str) -> list[TextContent]:
    """Fetch raw HTML from a URL"""
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        return [TextContent(
            type="text",
            text=response.text
        )]

async def extract_text(url: str, selector: str) -> list[TextContent]:
    """Extract text content using CSS selectors"""
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    elements = soup.select(selector)

    results = [element.get_text(strip=True) for element in elements]

    return [TextContent(
        type="text",
        text=json.dumps(results, indent=2)
    )]

async def scrape_dynamic(url: str, wait_selector: str | None = None) -> list[TextContent]:
    """Scrape JavaScript-rendered content using Playwright"""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        await page.goto(url)

        if wait_selector:
            await page.wait_for_selector(wait_selector, timeout=10000)
        else:
            await page.wait_for_load_state('networkidle')

        content = await page.content()
        await browser.close()

        return [TextContent(
            type="text",
            text=content
        )]

async def main():
    """Run the MCP server"""
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())

Save this as scraper_server.py and run it:

python scraper_server.py
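
Because the server communicates over stdio, it will simply sit and wait for a client when launched directly. For interactive testing, the MCP Inspector (a Node.js tool, which is why Node.js appears in the prerequisites) can connect to the server and let you invoke tools from a web UI:

npx @modelcontextprotocol/inspector python scraper_server.py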

Advanced Features: Adding Proxy Support and Error Handling

For production web scraping, you'll want to add features like proxy support and robust error handling (rate limiting is covered under best practices below):

async def scrape_with_proxy(url: str, proxy: str | None = None, timeout: int = 30) -> list[TextContent]:
    """Scrape with proxy support and error handling"""
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                proxy={"server": proxy} if proxy else None
            )

            context = await browser.new_context(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            )

            page = await context.new_page()

            # Set timeout
            page.set_default_timeout(timeout * 1000)

            try:
                await page.goto(url, wait_until='networkidle')
                content = await page.content()

                return [TextContent(
                    type="text",
                    text=content
                )]
            except Exception as e:
                return [TextContent(
                    type="text",
                    text=f"Error scraping page: {str(e)}"
                )]
            finally:
                await browser.close()

    except Exception as e:
        return [TextContent(
            type="text",
            text=f"Error initializing browser: {str(e)}"
        )]

Connecting to Your MCP Server from Claude Desktop

To use your Python MCP server with Claude Desktop, add it to your Claude configuration file:

On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

On Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/your/scraper_server.py"]
    }
  }
}
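
If Claude Desktop launches the wrong interpreter, or your scraping libraries live in a virtual environment, point command at the absolute path of that environment's Python instead (the paths below are placeholders):

{
  "mcpServers": {
    "web-scraper": {
      "command": "/path/to/your/venv/bin/python",
      "args": ["/path/to/your/scraper_server.py"]
    }
  }
}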

After restarting Claude Desktop, you can use your scraping tools directly in conversations:

Can you scrape the pricing information from example.com using the extract_text tool?

Using MCP with the Playwright MCP Server

If you prefer not to build your own server, you can use the existing Playwright MCP server, which provides comprehensive browser automation tools (page navigation, clicking, content extraction) out of the box. The same concepts covered above for handling browser sessions and managing page navigation apply there as well.
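
A typical Claude Desktop entry for it (assuming the @playwright/mcp npm package) looks like this:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}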

Best Practices for Python MCP Servers

1. Use Async/Await Properly

MCP servers in Python are built on asyncio, so all I/O operations should be async:

# Good - async HTTP requests
async with httpx.AsyncClient() as client:
    response = await client.get(url)

# Avoid - blocking requests
import requests
response = requests.get(url)  # This will block the event loop

2. Implement Rate Limiting

Prevent overwhelming target servers:

import asyncio
from asyncio import Semaphore

# Limit concurrent requests
semaphore = Semaphore(3)

async def scrape_with_limit(url: str):
    async with semaphore:
        await asyncio.sleep(1)  # Rate limit delay
        return await fetch_html(url)

3. Add Comprehensive Error Handling

Always handle network errors gracefully:

try:
    async with httpx.AsyncClient() as client:
        response = await client.get(url, timeout=30.0)
        response.raise_for_status()
except httpx.TimeoutException:
    return [TextContent(type="text", text="Request timed out")]
except httpx.HTTPStatusError as e:
    return [TextContent(type="text", text=f"HTTP error: {e.response.status_code}")]
except Exception as e:
    return [TextContent(type="text", text=f"Unexpected error: {str(e)}")]

4. Use Structured Data

Return data in structured formats like JSON when possible:

from datetime import datetime

# Guard against pages without a <title> element
title_tag = soup.find('title')

results = {
    "url": url,
    "title": title_tag.get_text(strip=True) if title_tag else None,
    "links": [a['href'] for a in soup.find_all('a', href=True)],
    "scraped_at": datetime.now().isoformat()
}

return [TextContent(
    type="text",
    text=json.dumps(results, indent=2)
)]

Testing Your MCP Server

Create a simple test script to verify your server works (run it from the same directory as scraper_server.py, since the server path below is relative):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def test_server():
    server_params = StdioServerParameters(
        command="python",
        args=["scraper_server.py"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])

            # Call a tool
            result = await session.call_tool(
                "extract_text",
                {
                    "url": "https://example.com",
                    "selector": "h1"
                }
            )
            print("Result:", result.content[0].text)

if __name__ == "__main__":
    asyncio.run(test_server())

Integrating with Web Scraping APIs

For production scraping at scale, consider integrating your MCP server with specialized web scraping APIs. This approach handles challenges like proxy rotation, CAPTCHA solving, and JavaScript rendering automatically. You can wrap API calls in your MCP tools:

async def scrape_with_api(url: str, api_key: str) -> list[TextContent]:
    """Use WebScraping.AI API for reliable scraping"""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.webscraping.ai/html",
            params={
                "url": url,
                "api_key": api_key
            }
        )
        response.raise_for_status()

        return [TextContent(
            type="text",
            text=response.text
        )]

Conclusion

Building MCP servers with Python for web scraping opens up powerful possibilities for AI-assisted data extraction. By following the patterns shown in this guide, you can create robust, reusable scraping tools that integrate seamlessly with AI assistants. Whether you're building simple HTML parsers or complex browser automation workflows, the MCP protocol provides a standardized way to expose these capabilities.

Start with basic tools, add features incrementally, and always follow web scraping best practices including respecting robots.txt, implementing rate limiting, and handling errors gracefully. As you become more comfortable with the MCP SDK, you can expand your server to include specialized scraping capabilities tailored to your specific use cases.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
