How do I use web scraping APIs with MCP servers?

Integrating web scraping APIs with Model Context Protocol (MCP) servers enables you to build powerful, AI-assisted data extraction workflows. MCP servers act as a bridge between AI models and external tools, making it possible to combine intelligent decision-making with robust scraping capabilities.

This guide will show you how to effectively use web scraping APIs within MCP server implementations, providing practical examples and best practices for production use.

Understanding MCP Server Architecture

MCP servers expose tools, resources, and prompts that AI models can interact with. When you integrate a web scraping API into an MCP server, you're creating a tool that the AI can call to fetch and process web data.

The basic flow works like this (a sketch of the underlying messages follows the list):

  1. AI model requests data extraction through an MCP tool
  2. MCP server receives the request and calls the web scraping API
  3. Web scraping API fetches and processes the target page
  4. MCP server returns the extracted data to the AI model
  5. AI model processes and presents the results
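
Under the hood, each step is a JSON-RPC 2.0 message exchanged over the server's transport (typically stdio). As a rough sketch of steps 1 and 4 (the SDKs used below construct and parse these envelopes for you, so you never build them by hand):

# What the AI model's client sends for a tool call (step 1)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_html",
        "arguments": {"url": "https://example.com"}
    }
}

# What the MCP server sends back (step 4)
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "<html>...</html>"}],
        "isError": False
    }
}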

Setting Up an MCP Server with Web Scraping API Integration

Python Implementation

Here's a complete example of an MCP server that integrates a web scraping API using Python:

import asyncio
import os
from mcp.server import Server, NotificationOptions
from mcp.server.models import InitializationOptions
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import httpx

# Initialize the MCP server
app = Server("web-scraping-api-server")

# Your web scraping API configuration
API_KEY = os.getenv("WEBSCRAPING_API_KEY")
API_BASE_URL = "https://api.webscraping.ai"

@app.list_tools()
async def list_tools() -> list[Tool]:
    """Define available web scraping tools."""
    return [
        Tool(
            name="scrape_html",
            description="Extract HTML content from any webpage with JavaScript rendering support",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "wait_for": {
                        "type": "string",
                        "description": "CSS selector to wait for before returning content"
                    },
                    "proxy": {
                        "type": "string",
                        "enum": ["datacenter", "residential"],
                        "description": "Type of proxy to use"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_text",
            description="Extract clean text content from a webpage",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to extract text from"
                    },
                    "return_links": {
                        "type": "boolean",
                        "description": "Whether to include links in the response"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="ai_question",
            description="Ask a question about webpage content using AI",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to analyze"
                    },
                    "question": {
                        "type": "string",
                        "description": "Question to ask about the page content"
                    }
                },
                "required": ["url", "question"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    """Handle tool calls by invoking the web scraping API."""

    async with httpx.AsyncClient(timeout=30.0) as client:
        if name == "scrape_html":
            params = {
                "url": arguments["url"],
                "api_key": API_KEY,
                "js": "true"
            }

            if "wait_for" in arguments:
                params["wait_for"] = arguments["wait_for"]
            if "proxy" in arguments:
                params["proxy"] = arguments["proxy"]

            response = await client.get(f"{API_BASE_URL}/html", params=params)
            response.raise_for_status()

            return [TextContent(
                type="text",
                text=f"Successfully scraped {arguments['url']}:\n\n{response.text}"
            )]

        elif name == "extract_text":
            params = {
                "url": arguments["url"],
                "api_key": API_KEY
            }

            if arguments.get("return_links"):
                params["return_links"] = "true"

            response = await client.get(f"{API_BASE_URL}/text", params=params)
            response.raise_for_status()

            return [TextContent(
                type="text",
                text=response.text
            )]

        elif name == "ai_question":
            params = {
                "url": arguments["url"],
                "question": arguments["question"],
                "api_key": API_KEY
            }

            response = await client.get(f"{API_BASE_URL}/question", params=params)
            response.raise_for_status()

            return [TextContent(
                type="text",
                text=response.text
            )]

        else:
            raise ValueError(f"Unknown tool: {name}")

async def main():
    """Run the MCP server."""
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="web-scraping-api-server",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    asyncio.run(main())

JavaScript/TypeScript Implementation

For Node.js environments, here's an equivalent implementation:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

const API_KEY = process.env.WEBSCRAPING_API_KEY;
const API_BASE_URL = "https://api.webscraping.ai";

// Create MCP server instance
const server = new Server(
  {
    name: "web-scraping-api-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_html",
        description: "Extract HTML content from any webpage with JavaScript rendering support",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to scrape",
            },
            wait_for: {
              type: "string",
              description: "CSS selector to wait for before returning content",
            },
            proxy: {
              type: "string",
              enum: ["datacenter", "residential"],
              description: "Type of proxy to use",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "extract_text",
        description: "Extract clean text content from a webpage",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to extract text from",
            },
            return_links: {
              type: "boolean",
              description: "Whether to include links in the response",
            },
          },
          required: ["url"],
        },
      },
      {
        name: "ai_question",
        description: "Ask a question about webpage content using AI",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "The URL to analyze",
            },
            question: {
              type: "string",
              description: "Question to ask about the page content",
            },
          },
          required: ["url", "question"],
        },
      },
    ],
  };
});

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  try {
    if (name === "scrape_html") {
      const params = {
        url: args.url,
        api_key: API_KEY,
        js: true,
      };

      if (args.wait_for) params.wait_for = args.wait_for;
      if (args.proxy) params.proxy = args.proxy;

      const response = await axios.get(`${API_BASE_URL}/html`, { params });

      return {
        content: [
          {
            type: "text",
            text: `Successfully scraped ${args.url}:\n\n${response.data}`,
          },
        ],
      };
    } else if (name === "extract_text") {
      const params = {
        url: args.url,
        api_key: API_KEY,
      };

      if (args.return_links) params.return_links = true;

      const response = await axios.get(`${API_BASE_URL}/text`, { params });

      return {
        content: [
          {
            type: "text",
            text: response.data,
          },
        ],
      };
    } else if (name === "ai_question") {
      const params = {
        url: args.url,
        question: args.question,
        api_key: API_KEY,
      };

      const response = await axios.get(`${API_BASE_URL}/ai/question`, { params });

      return {
        content: [
          {
            type: "text",
            text: response.data,
          },
        ],
      };
    } else {
      throw new Error(`Unknown tool: ${name}`);
    }
  } catch (error) {
    return {
      content: [
        {
          type: "text",
          text: `Error: ${error.message}`,
        },
      ],
      isError: true,
    };
  }
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("Web Scraping API MCP server running on stdio");
}

main().catch((error) => {
  console.error("Fatal error in main():", error);
  process.exit(1);
});

Configuration and Installation

Installing Dependencies

For Python:

pip install mcp httpx python-dotenv

For JavaScript:

npm install @modelcontextprotocol/sdk axios dotenv

Environment Configuration

Create a .env file with your API credentials:

WEBSCRAPING_API_KEY=your_api_key_here
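
The python-dotenv package from the install step can load this file at startup. A minimal sketch for the top of your server script (assuming the .env file sits in the working directory):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ
API_KEY = os.getenv("WEBSCRAPING_API_KEY")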

Registering the MCP Server

Add your server to the Claude Desktop configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "web-scraping-api": {
      "command": "python",
      "args": ["/path/to/your/scraping_server.py"],
      "env": {
        "WEBSCRAPING_API_KEY": "your_api_key_here"
      }
    }
  }
}

For Node.js:

{
  "mcpServers": {
    "web-scraping-api": {
      "command": "node",
      "args": ["/path/to/your/scraping_server.js"],
      "env": {
        "WEBSCRAPING_API_KEY": "your_api_key_here"
      }
    }
  }
}

Advanced Features and Best Practices

Handling Dynamic Content

When scraping single-page applications or pages with AJAX-loaded content, use the wait_for parameter to ensure content is fully loaded:

# In your MCP tool call
result = await call_tool("scrape_html", {
    "url": "https://example.com/spa",
    "wait_for": "div.product-list",
    "proxy": "residential"
})

This is particularly useful for pages that load content via AJAX or other dynamic mechanisms, where specific elements must appear before the content is complete.

Error Handling and Retry Logic

Implement robust error handling in your MCP server:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
async def scrape_with_retry(url: str, **kwargs):
    """Scrape URL with automatic retry on failure."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        params = {"url": url, "api_key": API_KEY, **kwargs}
        response = await client.get(f"{API_BASE_URL}/html", params=params)
        response.raise_for_status()
        return response.text
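
The scrape_html branch of call_tool can then delegate to this helper. A sketch (the keyword arguments are merged straight into the API query parameters):

# Inside the scrape_html branch of call_tool
html = await scrape_with_retry(
    arguments["url"],
    js="true",
    proxy=arguments.get("proxy", "datacenter")
)
return [TextContent(type="text", text=html)]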

Rate Limiting and Concurrency

When processing multiple URLs, implement proper rate limiting:

import asyncio
from asyncio import Semaphore

async def scrape_multiple_urls(urls: list[str], max_concurrent: int = 5):
    """Scrape multiple URLs with concurrency control."""
    semaphore = Semaphore(max_concurrent)

    async def scrape_one(url: str):
        async with semaphore:
            return await call_tool("scrape_html", {"url": url})

    # Note: one failed URL will raise and cancel the batch; pass
    # return_exceptions=True to asyncio.gather to collect errors per URL
    results = await asyncio.gather(*[scrape_one(url) for url in urls])
    return results

Caching Responses

Implement caching to reduce API calls and improve performance:

import hashlib
import json
import time

class ScrapingCache:
    def __init__(self, ttl: int = 3600):
        self.cache = {}
        self.ttl = ttl

    def get_cache_key(self, url: str, params: dict) -> str:
        """Generate cache key from URL and parameters."""
        cache_string = f"{url}:{json.dumps(params, sort_keys=True)}"
        return hashlib.md5(cache_string.encode()).hexdigest()

    async def get_or_scrape(self, url: str, params: dict):
        """Get from cache or scrape if not cached."""
        cache_key = self.get_cache_key(url, params)

        if cache_key in self.cache:
            cached_data, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                return cached_data

        # Scrape and cache
        result = await scrape_with_retry(url, **params)
        self.cache[cache_key] = (result, time.time())
        return result
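
Usage is then a drop-in replacement for calling the API directly; identical URL-and-parameter combinations are served from memory until the TTL expires:

cache = ScrapingCache(ttl=1800)  # keep responses for 30 minutes

async def cached_scrape(url: str) -> str:
    # The same url + params always map to the same cache key
    return await cache.get_or_scrape(url, {"js": "true"})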

Monitoring and Logging

Add comprehensive logging to track API usage and debug issues:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("web-scraping-mcp")

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    logger.info(f"Tool called: {name} with arguments: {arguments}")

    try:
        # Dispatch to your scraping logic; perform_scraping stands in for
        # the tool handlers implemented earlier
        result = await perform_scraping(name, arguments)
        logger.info(f"Tool {name} completed successfully")
        return result
    except Exception as e:
        logger.error(f"Tool {name} failed: {str(e)}", exc_info=True)
        raise

Integration with Browser Automation

For scenarios requiring more control over browser sessions, you can combine web scraping APIs with browser automation tools through your MCP server. This allows you to leverage both approaches within the same workflow.
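
As an illustration, here is a minimal sketch of a local-browser helper you could expose as an additional MCP tool, assuming Playwright is installed (pip install playwright && playwright install chromium); the take_screenshot helper and its parameters are illustrative, not part of the scraping API:

from playwright.async_api import async_playwright

async def take_screenshot(url: str, selector: str | None = None) -> bytes:
    """Capture a screenshot with a locally controlled headless Chromium."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            if selector:
                # Capture just one element when a selector is given
                element = await page.wait_for_selector(selector)
                return await element.screenshot()
            return await page.screenshot(full_page=True)
        finally:
            await browser.close()

Registered as another Tool in list_tools, this lets the AI pick between the hosted API for scale and a local browser session for finer control, all within one workflow.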

Testing Your MCP Server

Create a simple test script to verify your integration:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def test_scraping_server():
    """Test the web scraping MCP server."""
    server_params = StdioServerParameters(
        command="python",
        args=["scraping_server.py"],
        env={"WEBSCRAPING_API_KEY": "your_key"}
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print("Available tools:", tools)

            # Test HTML scraping
            result = await session.call_tool(
                "scrape_html",
                {"url": "https://example.com"}
            )
            print("Scraping result:", result)

if __name__ == "__main__":
    asyncio.run(test_scraping_server())

Conclusion

Integrating web scraping APIs with MCP servers creates a powerful combination that enables AI models to intelligently extract and process web data. By following the patterns and best practices outlined in this guide, you can build robust, scalable scraping solutions that leverage the strengths of both technologies.

The key to success is proper error handling, rate limiting, and choosing the right scraping approach for your use case. Whether you're extracting structured data, monitoring content changes, or building AI-powered research tools, MCP servers provide the perfect framework for exposing web scraping capabilities to AI models.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
