What are the Available MCP Server Tools for Data Extraction?

MCP (Model Context Protocol) servers provide a powerful ecosystem of tools for data extraction and web scraping. These tools enable AI assistants to interact with web browsers, APIs, and scraping services through a standardized interface. Understanding the available tools helps you build more effective automated data extraction workflows.

Core MCP Tool Categories for Data Extraction

MCP servers expose data extraction tools in three main categories:

  1. Browser Automation Tools: Control headless browsers for dynamic content extraction
  2. HTTP Request Tools: Make API calls to scraping services and web endpoints
  3. Data Processing Tools: Transform, validate, and structure extracted data
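
To make the tool interface concrete, here is a minimal sketch of how an MCP client discovers and calls these tools over stdio using the official Python SDK. The server command and the scrape_page tool name are placeholders; substitute whichever server and tool you actually run.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder: point this at any MCP server executable
server_params = StdioServerParameters(command="python", args=["my_scraper_server.py"])

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools the server exposes
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Call a (hypothetical) extraction tool by name
            result = await session.call_tool("scrape_page", {"url": "https://example.com"})
            print(result.content)

asyncio.run(main())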

Official MCP Servers for Web Scraping

1. Playwright MCP Server

The Playwright MCP Server provides comprehensive browser automation capabilities for extracting data from modern web applications. It's one of the most powerful MCP servers for handling dynamic content and complex user interactions.

Available Tools

browser_navigate

{
  "name": "browser_navigate",
  "description": "Navigate to a URL",
  "inputSchema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to navigate to"
      }
    },
    "required": ["url"]
  }
}

browser_snapshot

{
  "name": "browser_snapshot",
  "description": "Capture accessibility snapshot of the current page",
  "inputSchema": {
    "type": "object",
    "properties": {}
  }
}

browser_click

{
  "name": "browser_click",
  "description": "Click on an element",
  "inputSchema": {
    "type": "object",
    "properties": {
      "element": {
        "type": "string",
        "description": "Human-readable element description"
      },
      "ref": {
        "type": "string",
        "description": "Exact target element reference"
      }
    },
    "required": ["element", "ref"]
  }
}

browser_type

{
  "name": "browser_type",
  "description": "Type text into an element",
  "inputSchema": {
    "type": "object",
    "properties": {
      "element": {
        "type": "string",
        "description": "Human-readable element description"
      },
      "ref": {
        "type": "string",
        "description": "Exact target element reference"
      },
      "text": {
        "type": "string",
        "description": "Text to type"
      }
    },
    "required": ["element", "ref", "text"]
  }
}

browser_evaluate

{
  "name": "browser_evaluate",
  "description": "Execute JavaScript in the browser context",
  "inputSchema": {
    "type": "object",
    "properties": {
      "function": {
        "type": "string",
        "description": "JavaScript function to execute"
      }
    },
    "required": ["function"]
  }
}

browser_take_screenshot

{
  "name": "browser_take_screenshot",
  "description": "Capture screenshot of the page or element",
  "inputSchema": {
    "type": "object",
    "properties": {
      "element": {
        "type": "string",
        "description": "Element to screenshot (optional)"
      },
      "fullPage": {
        "type": "boolean",
        "description": "Capture full scrollable page"
      }
    }
  }
}

browser_fill_form

{
  "name": "browser_fill_form",
  "description": "Fill multiple form fields",
  "inputSchema": {
    "type": "object",
    "properties": {
      "fields": {
        "type": "array",
        "description": "Array of field objects to fill"
      }
    },
    "required": ["fields"]
  }
}

browser_wait_for

{
  "name": "browser_wait_for",
  "description": "Wait for text to appear or time to pass",
  "inputSchema": {
    "type": "object",
    "properties": {
      "text": {
        "type": "string",
        "description": "Text to wait for"
      },
      "time": {
        "type": "number",
        "description": "Time to wait in seconds"
      }
    }
  }
}

Practical Example with Playwright MCP

# Using Playwright MCP tools through Claude
# This demonstrates how the AI can interact with the tools

# Tool call sequence for extracting product data:
# 1. Navigate to product page
{
  "tool": "browser_navigate",
  "arguments": {
    "url": "https://example-shop.com/product/123"
  }
}

# 2. Wait for content to load
{
  "tool": "browser_wait_for",
  "arguments": {
    "text": "Add to Cart",
    "time": 5
  }
}

# 3. Capture page snapshot to analyze structure
{
  "tool": "browser_snapshot",
  "arguments": {}
}

# 4. Execute JavaScript to extract data
{
  "tool": "browser_evaluate",
  "arguments": {
    "function": "() => { return { title: document.querySelector('h1').textContent, price: document.querySelector('.price').textContent, description: document.querySelector('.description').textContent }; }"
  }
}

Similar to how you would navigate to different pages using Puppeteer, the Playwright MCP server handles navigation and page interactions through tool calls.

2. Puppeteer MCP Server

The official ecosystem is less built out for Puppeteer than for Playwright, but a Puppeteer-based MCP server with similar functionality is straightforward to build:

Custom Puppeteer MCP Tools

scrape_page

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import puppeteer from "puppeteer";

const server = new Server({
  name: "puppeteer-scraper",
  version: "1.0.0",
}, {
  capabilities: { tools: {} }
});

server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_page",
        description: "Scrape page content using Puppeteer",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "URL to scrape"
            },
            selector: {
              type: "string",
              description: "CSS selector to extract"
            },
            waitForSelector: {
              type: "string",
              description: "Selector to wait for before extraction"
            }
          },
          required: ["url"]
        }
      }
    ]
  };
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "scrape_page") {
    const { url, selector, waitForSelector } = request.params.arguments;

    const browser = await puppeteer.launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle2" });

      if (waitForSelector) {
        await page.waitForSelector(waitForSelector);
      }

      let content;
      if (selector) {
        content = await page.$$eval(selector, elements =>
          elements.map(el => el.textContent)
        );
      } else {
        content = await page.content();
      }

      return {
        content: [{
          type: "text",
          text: JSON.stringify(content, null, 2)
        }]
      };
    } finally {
      // Always release the browser, even if navigation or extraction throws
      await browser.close();
    }
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

// Connect over stdio so an MCP client (e.g., Claude Desktop) can spawn the server
const transport = new StdioServerTransport();
await server.connect(transport);

3. WebScraping.AI MCP Server

A custom MCP server integrating with the WebScraping.AI API provides AI-powered data extraction capabilities:

WebScraping.AI Tools

scrape_html

from mcp.server import Server
from mcp.types import Tool, TextContent
import httpx
import os

app = Server("webscraping-ai")

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_html",
            description="Scrape HTML with JavaScript rendering and proxy rotation",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to scrape"},
                    "js": {"type": "boolean", "description": "Enable JavaScript rendering"},
                    "wait_for": {"type": "string", "description": "CSS selector to wait for"},
                    "timeout": {"type": "number", "description": "Request timeout in ms"},
                    "proxy": {"type": "string", "description": "Proxy type: datacenter or residential"}
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_text",
            description="Extract clean text content from a webpage",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to extract text from"},
                    "text_format": {"type": "string", "description": "Format: plain, xml, or json"}
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="ask_question",
            description="Ask AI a question about webpage content",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to analyze"},
                    "question": {"type": "string", "description": "Question to ask about the page"}
                },
                "required": ["url", "question"]
            }
        ),
        Tool(
            name="extract_fields",
            description="Extract structured data fields using AI",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to extract from"},
                    "fields": {
                        "type": "object",
                        "description": "Field definitions with descriptions"
                    }
                },
                "required": ["url", "fields"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    api_key = os.environ.get("WEBSCRAPING_AI_API_KEY")
    base_url = "https://api.webscraping.ai"

    async with httpx.AsyncClient() as client:
        if name == "scrape_html":
            params = {
                "url": arguments["url"],
                "api_key": api_key,
                "js": arguments.get("js", True),
                "timeout": arguments.get("timeout", 15000),
                "proxy": arguments.get("proxy", "residential")
            }
            # Only send wait_for when provided; an empty value changes behavior
            if arguments.get("wait_for"):
                params["wait_for"] = arguments["wait_for"]
            response = await client.get(f"{base_url}/html", params=params)
            return [TextContent(type="text", text=response.text)]

        elif name == "extract_text":
            response = await client.get(
                f"{base_url}/text",
                params={
                    "url": arguments["url"],
                    "api_key": api_key,
                    "text_format": arguments.get("text_format", "json")
                }
            )
            return [TextContent(type="text", text=response.text)]

        elif name == "ask_question":
            response = await client.get(
                f"{base_url}/ai/question",
                params={
                    "url": arguments["url"],
                    "api_key": api_key,
                    "question": arguments["question"]
                }
            )
            return [TextContent(type="text", text=response.text)]

        elif name == "extract_fields":
            response = await client.get(
                f"{base_url}/ai/fields",
                params={
                    "url": arguments["url"],
                    "api_key": api_key,
                    # fields[name]=description matches the curl examples below
                    **{f"fields[{k}]": v for k, v in arguments["fields"].items()}
                }
            )
            return [TextContent(type="text", text=response.text)]

    raise ValueError(f"Unknown tool: {name}")
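
The server above defines tools but never starts a transport. A minimal stdio entry point for the low-level Python SDK looks roughly like this sketch:

import asyncio
from mcp.server.stdio import stdio_server

async def main():
    # Serve requests over stdin/stdout so Claude Desktop can spawn the process
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())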

4. HTTP Request MCP Server

A simple HTTP-request MCP server (the official Fetch server is one example) provides basic request capabilities through a tool like this:

HTTP Tools

fetch_url

{
  "name": "fetch_url",
  "description": "Fetch content from a URL",
  "inputSchema": {
    "type": "object",
    "properties": {
      "url": {"type": "string"},
      "method": {"type": "string", "enum": ["GET", "POST", "PUT", "DELETE"]},
      "headers": {"type": "object"},
      "body": {"type": "string"}
    },
    "required": ["url"]
  }
}
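
The schema alone doesn't show what a handler does with these arguments. Here is a minimal sketch of a handler backing fetch_url with httpx; the parameter names mirror the schema above, and everything else is an assumption:

import httpx
from mcp.types import TextContent

async def handle_fetch_url(arguments: dict) -> list[TextContent]:
    # Defaults mirror the optional properties in the schema above
    method = arguments.get("method", "GET")
    async with httpx.AsyncClient(follow_redirects=True, timeout=30) as client:
        response = await client.request(
            method,
            arguments["url"],
            headers=arguments.get("headers"),
            content=arguments.get("body"),
        )
        return [TextContent(type="text", text=response.text)]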

Building a Multi-Tool MCP Server

Combine multiple extraction methods in a single MCP server:

from mcp.server import Server
from mcp.types import Tool, TextContent
import httpx
import asyncio
import os
from bs4 import BeautifulSoup

app = Server("comprehensive-scraper")

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="scrape_with_api",
            description="Scrape using WebScraping.AI API with proxy rotation",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "wait_for": {"type": "string"},
                    "proxy": {"type": "string", "enum": ["datacenter", "residential"]}
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="parse_html",
            description="Parse HTML and extract specific elements",
            inputSchema={
                "type": "object",
                "properties": {
                    "html": {"type": "string"},
                    "selector": {"type": "string"},
                    "attribute": {"type": "string"}
                },
                "required": ["html", "selector"]
            }
        ),
        Tool(
            name="extract_structured_data",
            description="Extract structured data using AI",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "schema": {"type": "object"}
                },
                "required": ["url", "schema"]
            }
        ),
        Tool(
            name="batch_scrape",
            description="Scrape multiple URLs concurrently",
            inputSchema={
                "type": "object",
                "properties": {
                    "urls": {"type": "array", "items": {"type": "string"}},
                    "max_concurrent": {"type": "number"}
                },
                "required": ["urls"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "parse_html":
        soup = BeautifulSoup(arguments["html"], "html.parser")
        elements = soup.select(arguments["selector"])

        if arguments.get("attribute"):
            result = [el.get(arguments["attribute"]) for el in elements]
        else:
            result = [el.get_text(strip=True) for el in elements]

        return [TextContent(type="text", text=str(result))]

    elif name == "batch_scrape":
        urls = arguments["urls"]
        max_concurrent = arguments.get("max_concurrent", 5)
        semaphore = asyncio.Semaphore(max_concurrent)

        async def scrape_one(url):
            async with semaphore:
                async with httpx.AsyncClient() as client:
                    response = await client.get(
                        "https://api.webscraping.ai/html",
                        params={"url": url, "api_key": os.environ["API_KEY"]}
                    )
                    return {"url": url, "content": response.text}

        results = await asyncio.gather(*[scrape_one(url) for url in urls])
        return [TextContent(type="text", text=str(results))]

    # scrape_with_api and extract_structured_data would follow the same pattern
    # as the WebScraping.AI server above; omitted here for brevity
    raise ValueError(f"Unknown tool: {name}")

Data Processing Tools

Beyond extraction, MCP servers can provide data transformation tools:

transform_data

{
  "name": "transform_data",
  "description": "Transform extracted data into different formats",
  "inputSchema": {
    "type": "object",
    "properties": {
      "data": {"type": "string"},
      "format": {"type": "string", "enum": ["json", "csv", "xml"]},
      "schema": {"type": "object"}
    }
  }
}

validate_data

{
  "name": "validate_data",
  "description": "Validate extracted data against schema",
  "inputSchema": {
    "type": "object",
    "properties": {
      "data": {"type": "object"},
      "schema": {"type": "object"}
    }
  }
}
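
To make transform_data concrete, here is a sketch of the json and csv branches (the xml branch is omitted, and the assumption is that extracted data arrives as a JSON array of flat objects):

import csv
import io
import json

def transform_data(data: str, format: str) -> str:
    # Assume a non-empty JSON array of flat objects
    records = json.loads(data)
    if format == "json":
        return json.dumps(records, indent=2)
    if format == "csv":
        output = io.StringIO()
        writer = csv.DictWriter(output, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
        return output.getvalue()
    raise ValueError(f"Unsupported format: {format}")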

Installing and Configuring MCP Servers

Installation

# Install the official Playwright MCP server
npm install -g @playwright/mcp

# Install Python MCP SDK for custom servers
pip install mcp httpx beautifulsoup4

# Install Node.js MCP SDK
npm install @modelcontextprotocol/sdk

Configuration

Configure MCP servers in Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-playwright"]
    },
    "webscraping-ai": {
      "command": "python",
      "args": ["/path/to/webscraping_mcp_server.py"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_api_key"
      }
    },
    "custom-scraper": {
      "command": "node",
      "args": ["/path/to/scraper-server.js"],
      "env": {
        "API_KEY": "your_key"
      }
    }
  }
}

Best Practices for MCP Data Extraction Tools

1. Error Handling

Just like when you handle errors in Puppeteer, implement robust error handling in MCP tools:

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    try:
        # Tool implementation
        result = await extract_data(arguments)
        return [TextContent(type="text", text=result)]
    except httpx.TimeoutException:
        return [TextContent(
            type="text",
            text="Error: Request timed out. Try increasing timeout or using a different proxy."
        )]
    except httpx.HTTPStatusError as e:
        return [TextContent(
            type="text",
            text=f"Error: HTTP {e.response.status_code} - {e.response.text}"
        )]
    except Exception as e:
        return [TextContent(
            type="text",
            text=f"Error: {str(e)}"
        )]

2. Rate Limiting

from asyncio import Semaphore

rate_limiter = Semaphore(5)  # Max 5 concurrent requests

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    async with rate_limiter:
        # Tool implementation
        pass
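
Note that a semaphore caps concurrency, not request rate. If a target enforces a requests-per-second limit, pair it with an interval-based limiter; a minimal sketch:

import asyncio
import time

class RateLimiter:
    """Allow at most one request per min_interval seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.min_interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

limiter = RateLimiter(min_interval=0.5)  # roughly 2 requests per second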

3. Caching

import hashlib
import time

cache = {}

async def get_cached_content(url: str, ttl: int = 3600):
    cache_key = hashlib.md5(url.encode()).hexdigest()

    if cache_key in cache:
        cached_data, timestamp = cache[cache_key]
        if time.time() - timestamp < ttl:
            return cached_data

    # Fetch fresh data; fetch_url stands in for your server's actual fetch routine
    data = await fetch_url(url)
    cache[cache_key] = (data, time.time())
    return data

4. Input Validation

from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    try:
        result = urlparse(url)
        return all([result.scheme in ['http', 'https'], result.netloc])
    except ValueError:
        return False

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    url = arguments.get("url")
    if not validate_url(url):
        return [TextContent(type="text", text="Error: Invalid URL format")]
    # Continue with tool execution

Real-World Applications

E-commerce Product Monitoring

# MCP server tool for monitoring product prices
Tool(
    name="monitor_product",
    description="Monitor product price and availability",
    inputSchema={
        "type": "object",
        "properties": {
            "product_url": {"type": "string"},
            "check_interval": {"type": "number"},
            "alert_threshold": {"type": "number"}
        }
    }
)
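
A handler behind monitor_product could delegate price extraction to the WebScraping.AI fields endpoint shown earlier. A hedged sketch; the response shape and the threshold comparison are assumptions:

import os
import httpx

async def check_product(product_url: str, alert_threshold: float) -> dict:
    # Extract the current price with the AI fields endpoint
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.webscraping.ai/ai/fields",
            params={
                "url": product_url,
                "api_key": os.environ["WEBSCRAPING_AI_API_KEY"],
                "fields[price]": "Current product price as a plain number",
            },
        )
        # Assumes the API returns a JSON object keyed by field name
        price = float(response.json()["price"])
    return {"url": product_url, "price": price, "alert": price <= alert_threshold}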

News Aggregation

# Aggregate news from multiple sources
Tool(
    name="aggregate_news",
    description="Collect and aggregate news articles",
    inputSchema={
        "type": "object",
        "properties": {
            "sources": {"type": "array", "items": {"type": "string"}},
            "keywords": {"type": "array", "items": {"type": "string"}},
            "date_range": {"type": "string"}
        }
    }
)

Market Research

# Extract competitor data
Tool(
    name="analyze_competitors",
    description="Extract and analyze competitor data",
    inputSchema={
        "type": "object",
        "properties": {
            "competitor_urls": {"type": "array"},
            "metrics": {"type": "array", "items": {"type": "string"}}
        }
    }
)

Conclusion

The MCP ecosystem provides a rich set of tools for data extraction, from browser automation with Playwright to AI-powered extraction with WebScraping.AI. By understanding and leveraging these tools, you can build sophisticated, AI-driven data extraction workflows that adapt to complex scraping scenarios. Whether you're using official MCP servers or building custom ones, the standardized protocol ensures your tools work seamlessly with AI assistants like Claude.

Start by exploring the Playwright MCP server for browser automation needs, then expand to custom servers integrating specialized APIs and processing capabilities. With proper error handling, rate limiting, and validation, MCP tools provide a robust foundation for production-grade data extraction systems.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
