What is the Model Context Protocol?

The Model Context Protocol (MCP) is an open protocol developed by Anthropic that enables seamless communication between AI assistants (like Claude) and external data sources, tools, and services. For web scraping developers, MCP changes how you can automate data extraction workflows: it lets AI models interact directly with scraping tools, databases, and APIs.

MCP provides a standardized way to extend AI capabilities beyond their training data, making them context-aware and able to perform real-time operations with external systems. This is particularly powerful for web scraping, where you need dynamic access to websites, parsing tools, and data storage solutions.

Core Concepts of MCP

MCP operates on a client-server architecture with three main components:

1. MCP Hosts (Clients)

These are applications that embed AI models, such as Claude Desktop, IDEs, or custom applications. The host initiates connections to MCP servers and manages the protocol communication.

2. MCP Servers

Lightweight programs that expose specific capabilities through the standardized protocol. For web scraping, an MCP server might provide:

- Access to web scraping APIs
- Browser automation tools
- HTML parsing utilities
- Data transformation functions
- Storage and database operations

3. Resources, Tools, and Prompts

MCP servers expose three types of capabilities:

- Resources: Data sources that the AI can read (HTML content, API responses, databases)
- Tools: Functions the AI can execute (scrape a URL, parse HTML, extract data)
- Prompts: Pre-configured templates for common workflows
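Under the hood, hosts and servers exchange these capabilities as JSON-RPC 2.0 messages. As a rough sketch of the wire format (field names follow the MCP specification; the scrape_url tool shown is illustrative, not part of any real server):

```python
import json

# Sketch of a JSON-RPC 2.0 response to a "tools/list" request,
# as defined by the MCP specification. The request id and the
# tool itself are illustrative values.
tools_list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "scrape_url",
                "description": "Scrape HTML content from a URL",
                "inputSchema": {
                    "type": "object",
                    "properties": {"url": {"type": "string"}},
                    "required": ["url"],
                },
            }
        ]
    },
}

print(json.dumps(tools_list_response, indent=2))
```

The `inputSchema` is plain JSON Schema, which is what lets the AI model figure out how to call each tool without any server-specific glue code.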

MCP Architecture for Web Scraping

Here's how MCP enables AI-powered web scraping:

┌─────────────────┐         ┌──────────────────┐         ┌─────────────────┐
│   Claude AI     │◄───────►│   MCP Server     │◄───────►│  Scraping API   │
│   Assistant     │   MCP   │   (WebScraper)   │  HTTP   │  WebScraping.AI │
└─────────────────┘         └──────────────────┘         └─────────────────┘
                                     │
                                     ▼
                            ┌──────────────────┐
                            │   Data Storage   │
                            │   (Database/File)│
                            └──────────────────┘

Building an MCP Server for Web Scraping

Python Example

Here's a complete MCP server that exposes web scraping capabilities using the official Python MCP SDK:

import asyncio
import os

import httpx
from mcp.server import Server, NotificationOptions
from mcp.server.models import InitializationOptions
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

# Read the API key from the environment (see Security Considerations below)
API_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY", "")

# Create MCP server instance
app = Server("webscraping-mcp-server")

# Define available tools
@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_url",
            description="Scrape HTML content from a URL with JavaScript rendering",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "wait_for": {
                        "type": "string",
                        "description": "CSS selector to wait for before returning"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_data",
            description="Extract structured data from HTML using AI",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "Target URL"
                    },
                    "fields": {
                        "type": "object",
                        "description": "Fields to extract with descriptions"
                    }
                },
                "required": ["url", "fields"]
            }
        )
    ]

# Implement tool execution
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "scrape_url":
        params = {
            "url": arguments["url"],
            "api_key": API_KEY,
            "js": "true"
        }
        # Only send wait_for when the caller provided it
        if arguments.get("wait_for"):
            params["wait_for"] = arguments["wait_for"]
        async with httpx.AsyncClient(timeout=30) as client:
            response = await client.get("https://api.webscraping.ai/html", params=params)
            response.raise_for_status()
            return [TextContent(
                type="text",
                text=f"HTML Content:\n{response.text}"
            )]

    elif name == "extract_data":
        # Fields are passed as fields[name] query parameters,
        # matching the curl examples later in this article
        params = {"url": arguments["url"], "api_key": API_KEY}
        for field_name, description in arguments["fields"].items():
            params[f"fields[{field_name}]"] = description
        async with httpx.AsyncClient(timeout=30) as client:
            response = await client.get("https://api.webscraping.ai/ai/fields", params=params)
            response.raise_for_status()
            return [TextContent(
                type="text",
                text=f"Extracted Data:\n{response.json()}"
            )]

    raise ValueError(f"Unknown tool: {name}")

# Run the server
async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="webscraping-mcp",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    asyncio.run(main())
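When a user asks Claude to scrape a page, the host invokes the tool with a tools/call request over the stdio transport, which carries one JSON-RPC message per line. A sketch of that request (shape per the MCP specification; the id and arguments are illustrative):

```python
import json

# Sketch of the JSON-RPC "tools/call" request an MCP host sends
# to invoke the scrape_url tool defined above. Values are illustrative.
tools_call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "scrape_url",
        "arguments": {"url": "https://example.com", "wait_for": ".price"},
    },
}

# Over stdio, each message is serialized as a single line of JSON
wire_line = json.dumps(tools_call_request)
print(wire_line)
```

Your `call_tool` handler receives the `name` and `arguments` from this message; the SDK handles the framing and dispatch for you.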

JavaScript/TypeScript Example

For Node.js environments, you can build an MCP server using the official SDK:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

// Create server instance
const server = new Server(
  {
    name: "webscraping-mcp-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_html",
        description: "Scrape HTML content from any URL with JavaScript rendering",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            js_timeout: {
              type: "number",
              description: "Maximum JavaScript rendering time in ms",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "extract_text",
        description: "Extract clean text content from a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to extract text from",
            },
          },
          required: ["url"],
        },
      },
    ],
  };
});

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  if (name === "scrape_html") {
    const response = await axios.get("https://api.webscraping.ai/html", {
      params: {
        url: args.url,
        api_key: process.env.WEBSCRAPING_AI_API_KEY,
        js: true,
        js_timeout: args.js_timeout || 2000,
      },
    });

    return {
      content: [
        {
          type: "text",
          text: response.data,
        },
      ],
    };
  }

  if (name === "extract_text") {
    const response = await axios.get("https://api.webscraping.ai/text", {
      params: {
        url: args.url,
        api_key: process.env.WEBSCRAPING_AI_API_KEY,
      },
    });

    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(response.data, null, 2),
        },
      ],
    };
  }

  throw new Error(`Unknown tool: ${name}`);
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("WebScraping MCP Server running on stdio");
}

main().catch(console.error);

Configuring MCP Servers

To use your MCP server with Claude Desktop or other MCP hosts, you need to configure it in the MCP settings:

Claude Desktop Configuration (macOS)

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "webscraping": {
      "command": "python",
      "args": ["/path/to/your/mcp_server.py"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_api_key_here"
      }
    }
  }
}

Claude Desktop Configuration (Windows)

Edit %APPDATA%\Claude\claude_desktop_config.json with the same structure.
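If you built the Node.js server instead, the same configuration file points at node; the script path here is a placeholder:

```json
{
  "mcpServers": {
    "webscraping": {
      "command": "node",
      "args": ["/path/to/your/mcp_server.js"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_api_key_here"
      }
    }
  }
}
```

After editing the file, restart Claude Desktop so it picks up the new server.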

Real-World Use Cases

1. Automated Product Monitoring

Similar to how you might handle AJAX requests using Puppeteer, MCP servers can automate the monitoring of dynamic product pages:

# In your MCP server (API_KEY read from the environment as above)
@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "monitor_price":
        async with httpx.AsyncClient(timeout=30) as client:
            # Scrape product page, waiting for the price element to render
            html_response = await client.get(
                "https://api.webscraping.ai/html",
                params={
                    "url": arguments["product_url"],
                    "api_key": API_KEY,
                    "wait_for": ".price"
                }
            )

            # Ask the AI question endpoint about the price
            # (question is passed as a query parameter)
            data_response = await client.get(
                "https://api.webscraping.ai/ai/question",
                params={
                    "url": arguments["product_url"],
                    "api_key": API_KEY,
                    "question": "What is the current price of this product?"
                }
            )

            return [TextContent(type="text", text=data_response.text)]

2. Content Aggregation

MCP servers can aggregate content from multiple sources, handling complex scenarios like browser sessions:

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "aggregate_news") {
    const sources = request.params.arguments.sources;
    const results = [];

    // Fetch each source in turn; Promise.all could parallelize these requests
    for (const source of sources) {
      const response = await axios.get("https://api.webscraping.ai/text", {
        params: {
          url: source,
          api_key: process.env.WEBSCRAPING_AI_API_KEY,
        },
      });
      results.push(response.data);
    }

    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(results, null, 2),
        },
      ],
    };
  }
});

3. Data Pipeline Automation

Combine MCP with scraping APIs for end-to-end data pipelines:

# Install MCP server dependencies
pip install mcp httpx pandas

# Run your MCP server
python webscraping_mcp_server.py

Then interact naturally with Claude: "Please scrape the product catalog from example.com, extract all prices, and save them to a CSV file"
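The final "save to CSV" step of such a pipeline can be sketched as a helper a storage tool would call. The save_rows_to_csv name and the list-of-flat-dicts input format are assumptions for illustration, not part of any SDK:

```python
import csv
from pathlib import Path

def save_rows_to_csv(rows: list[dict], path: str) -> int:
    """Write extracted records to a CSV file; returns the row count.

    `rows` is assumed to be a list of flat dicts, e.g. the parsed
    output of an extract_data tool call. A hypothetical save_to_csv
    MCP tool could call this from its handler.
    """
    if not rows:
        return 0
    fieldnames = list(rows[0].keys())
    with Path(path).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Example usage with hypothetical scraped product data
count = save_rows_to_csv(
    [{"name": "Widget", "price": "9.99"}, {"name": "Gadget", "price": "19.99"}],
    "products.csv",
)
print(count)  # → 2
```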

Security Considerations

When building MCP servers for web scraping:

  1. API Key Management: Store API keys in environment variables, never hardcode them
  2. Rate Limiting: Implement rate limiting to avoid overwhelming target sites
  3. Input Validation: Validate all URLs and parameters before passing to scraping APIs
  4. Error Handling: Implement robust error handling for network failures and timeouts

# Example security best practices
import os
from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    """Validate a URL before scraping: require http(s) and a hostname"""
    try:
        result = urlparse(url)
        return result.scheme in ("http", "https") and bool(result.netloc)
    except ValueError:
        return False

# Use environment variables for sensitive data
API_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY")
if not API_KEY:
    raise ValueError("API key not configured")
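Point 2 above, rate limiting, can be sketched with a small standard-library limiter that spaces out calls to the scraping API; the two-requests-per-second rate is an arbitrary example:

```python
import time

class RateLimiter:
    """Minimal limiter: enforce a minimum interval between calls."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate  # seconds between calls
        self.last_call = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; returns seconds slept."""
        now = time.monotonic()
        delay = max(0.0, self.last_call + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self.last_call = time.monotonic()
        return delay

# Example: space scraping calls at least 0.5s apart (2 requests/second)
limiter = RateLimiter(rate=2)
for _ in range(3):
    limiter.wait()
    # ... call the scraping API here ...
```

In an async server like the ones above you would use asyncio.sleep instead of time.sleep so the event loop isn't blocked; the bookkeeping stays the same.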

Advantages of MCP for Web Scraping

  1. Standardization: Single protocol works across different AI models and tools
  2. Modularity: Build reusable scraping components that work with any MCP host
  3. Context Awareness: AI can make intelligent decisions about scraping strategies
  4. Natural Language Control: Control complex scraping workflows through conversation
  5. Rapid Prototyping: Build and test scraping workflows without writing boilerplate code

Getting Started

To start building with MCP for web scraping:

  1. Install MCP SDK:
# Python
pip install mcp

# Node.js
npm install @modelcontextprotocol/sdk
  2. Get a WebScraping.AI API Key: Sign up at WebScraping.AI

  3. Build Your First MCP Server: Use the examples above as templates

  4. Configure Claude Desktop: Add your server to the MCP configuration

  5. Test Your Integration: Start conversing with Claude to trigger your scraping tools

Conclusion

The Model Context Protocol represents a fundamental shift in how developers can build AI-powered web scraping solutions. By creating standardized MCP servers, you can expose scraping capabilities that any MCP-compatible AI assistant can use through natural language, making complex data extraction workflows more accessible and maintainable.

Whether you're building automated monitoring systems, content aggregation pipelines, or custom data extraction tools, MCP provides the foundation for connecting AI intelligence with web scraping capabilities in a secure, standardized way.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
