What are the Available MCP Server Tools for Data Extraction?
MCP (Model Context Protocol) servers provide a powerful ecosystem of tools for data extraction and web scraping. These tools enable AI assistants to interact with web browsers, APIs, and scraping services through a standardized interface. Understanding the available tools helps you build more effective automated data extraction workflows.
Core MCP Tool Categories for Data Extraction
MCP servers expose data extraction tools in three main categories:
- Browser Automation Tools: Control headless browsers for dynamic content extraction
- HTTP Request Tools: Make API calls to scraping services and web endpoints
- Data Processing Tools: Transform, validate, and structure extracted data
Official MCP Servers for Web Scraping
1. Playwright MCP Server
The Playwright MCP Server provides comprehensive browser automation capabilities for extracting data from modern web applications. It's one of the most powerful MCP servers for handling dynamic content and complex user interactions.
Available Tools
browser_navigate
{
"name": "browser_navigate",
"description": "Navigate to a URL",
"inputSchema": {
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to navigate to"
}
},
"required": ["url"]
}
}
browser_snapshot
{
"name": "browser_snapshot",
"description": "Capture accessibility snapshot of the current page",
"inputSchema": {
"type": "object",
"properties": {}
}
}
browser_click
{
"name": "browser_click",
"description": "Click on an element",
"inputSchema": {
"type": "object",
"properties": {
"element": {
"type": "string",
"description": "Human-readable element description"
},
"ref": {
"type": "string",
"description": "Exact target element reference"
}
},
"required": ["element", "ref"]
}
}
browser_type
{
"name": "browser_type",
"description": "Type text into an element",
"inputSchema": {
"type": "object",
"properties": {
"element": {
"type": "string",
"description": "Human-readable element description"
},
"ref": {
"type": "string",
"description": "Exact target element reference"
},
"text": {
"type": "string",
"description": "Text to type"
}
},
"required": ["element", "ref", "text"]
}
}
browser_evaluate
{
"name": "browser_evaluate",
"description": "Execute JavaScript in the browser context",
"inputSchema": {
"type": "object",
"properties": {
"function": {
"type": "string",
"description": "JavaScript function to execute"
}
},
"required": ["function"]
}
}
browser_take_screenshot
{
"name": "browser_take_screenshot",
"description": "Capture screenshot of the page or element",
"inputSchema": {
"type": "object",
"properties": {
"element": {
"type": "string",
"description": "Element to screenshot (optional)"
},
"fullPage": {
"type": "boolean",
"description": "Capture full scrollable page"
}
}
}
}
browser_fill_form
{
"name": "browser_fill_form",
"description": "Fill multiple form fields",
"inputSchema": {
"type": "object",
"properties": {
"fields": {
"type": "array",
"description": "Array of field objects to fill"
}
},
"required": ["fields"]
}
}
browser_wait_for
{
"name": "browser_wait_for",
"description": "Wait for text to appear or time to pass",
"inputSchema": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "Text to wait for"
},
"time": {
"type": "number",
"description": "Time to wait in seconds"
}
}
}
}
Practical Example with Playwright MCP
# Using Playwright MCP tools through Claude
# This demonstrates how the AI can interact with the tools
# Tool call sequence for extracting product data:
# 1. Navigate to product page
{
"tool": "browser_navigate",
"arguments": {
"url": "https://example-shop.com/product/123"
}
}
# 2. Wait for content to load
{
"tool": "browser_wait_for",
"arguments": {
"text": "Add to Cart",
"time": 5
}
}
# 3. Capture page snapshot to analyze structure
{
"tool": "browser_snapshot",
"arguments": {}
}
# 4. Execute JavaScript to extract data
{
"tool": "browser_evaluate",
"arguments": {
"function": "() => { return { title: document.querySelector('h1').textContent, price: document.querySelector('.price').textContent, description: document.querySelector('.description').textContent }; }"
}
}
Similar to how you would navigate to different pages using Puppeteer, the Playwright MCP server handles navigation and page interactions through tool calls.
2. Puppeteer MCP Server
Puppeteer does not have an official MCP server as feature-rich as Playwright's, but you can build a Puppeteer-based MCP server that provides similar functionality:
Custom Puppeteer MCP Tools
scrape_page
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import puppeteer from "puppeteer";
const server = new Server({
name: "puppeteer-scraper",
version: "1.0.0",
}, {
capabilities: { tools: {} }
});
server.setRequestHandler(ListToolsRequestSchema, async () => {
return {
tools: [
{
name: "scrape_page",
description: "Scrape page content using Puppeteer",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "URL to scrape"
},
selector: {
type: "string",
description: "CSS selector to extract"
},
waitForSelector: {
type: "string",
description: "Selector to wait for before extraction"
}
},
required: ["url"]
}
}
]
};
});
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "scrape_page") {
const { url, selector, waitForSelector } = request.params.arguments;
    const browser = await puppeteer.launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle2" });
      if (waitForSelector) {
        await page.waitForSelector(waitForSelector);
      }
      let content;
      if (selector) {
        content = await page.$$eval(selector, elements =>
          elements.map(el => el.textContent)
        );
      } else {
        content = await page.content();
      }
      return {
        content: [{
          type: "text",
          text: JSON.stringify(content, null, 2)
        }]
      };
    } finally {
      // Always release the browser, even if navigation or extraction throws
      await browser.close();
    }
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

// Connect over stdio so an MCP client (e.g., Claude Desktop) can launch this server
const transport = new StdioServerTransport();
await server.connect(transport);
3. WebScraping.AI MCP Server
A custom MCP server that integrates with the WebScraping.AI API provides AI-powered data extraction capabilities:
WebScraping.AI Tools
scrape_html
import os

import httpx
from mcp.server import Server
from mcp.types import Tool, TextContent

app = Server("webscraping-ai")
@app.list_tools()
async def list_tools() -> list[Tool]:
return [
Tool(
name="scrape_html",
description="Scrape HTML with JavaScript rendering and proxy rotation",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string", "description": "URL to scrape"},
"js": {"type": "boolean", "description": "Enable JavaScript rendering"},
"wait_for": {"type": "string", "description": "CSS selector to wait for"},
"timeout": {"type": "number", "description": "Request timeout in ms"},
"proxy": {"type": "string", "description": "Proxy type: datacenter or residential"}
},
"required": ["url"]
}
),
Tool(
name="extract_text",
description="Extract clean text content from a webpage",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string", "description": "URL to extract text from"},
"text_format": {"type": "string", "description": "Format: plain, xml, or json"}
},
"required": ["url"]
}
),
Tool(
name="ask_question",
description="Ask AI a question about webpage content",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string", "description": "URL to analyze"},
"question": {"type": "string", "description": "Question to ask about the page"}
},
"required": ["url", "question"]
}
),
Tool(
name="extract_fields",
description="Extract structured data fields using AI",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string", "description": "URL to extract from"},
"fields": {
"type": "object",
"description": "Field definitions with descriptions"
}
},
"required": ["url", "fields"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
api_key = os.environ.get("WEBSCRAPING_AI_API_KEY")
base_url = "https://api.webscraping.ai"
async with httpx.AsyncClient() as client:
if name == "scrape_html":
response = await client.get(
f"{base_url}/html",
params={
"url": arguments["url"],
"api_key": api_key,
"js": arguments.get("js", True),
"wait_for": arguments.get("wait_for"),
"timeout": arguments.get("timeout", 15000),
"proxy": arguments.get("proxy", "residential")
}
)
return [TextContent(type="text", text=response.text)]
elif name == "extract_text":
response = await client.get(
f"{base_url}/text",
params={
"url": arguments["url"],
"api_key": api_key,
"text_format": arguments.get("text_format", "json")
}
)
return [TextContent(type="text", text=response.text)]
elif name == "ask_question":
response = await client.post(
f"{base_url}/question",
params={"url": arguments["url"], "api_key": api_key},
json={"question": arguments["question"]}
)
return [TextContent(type="text", text=response.text)]
elif name == "extract_fields":
response = await client.post(
f"{base_url}/fields",
params={"url": arguments["url"], "api_key": api_key},
json={"fields": arguments["fields"]}
)
return [TextContent(type="text", text=response.text)]
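To run this server, it needs an entry point that connects it to a transport. Here is a minimal stdio entry point for the app defined above, following the Python SDK's low-level pattern:
import asyncio

from mcp.server.stdio import stdio_server

async def main():
    # Serve over stdin/stdout so an MCP client such as Claude Desktop can launch the process
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())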
4. HTTP Request MCP Server
Fetch-style MCP servers (such as the reference Fetch server or a custom one) provide basic HTTP request capabilities. A generic tool definition looks like this:
HTTP Tools
fetch_url
{
"name": "fetch_url",
"description": "Fetch content from a URL",
"inputSchema": {
"type": "object",
"properties": {
"url": {"type": "string"},
"method": {"type": "string", "enum": ["GET", "POST", "PUT", "DELETE"]},
"headers": {"type": "object"},
"body": {"type": "string"}
},
"required": ["url"]
}
}
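The schema above only declares the tool; the handler behind it can be a thin wrapper over an HTTP client. Here is a minimal Python sketch using httpx (the server name and response format are illustrative, not the reference Fetch server's actual implementation):
import httpx
from mcp.server import Server
from mcp.types import TextContent

fetch_app = Server("http-fetcher")

@fetch_app.call_tool()
async def handle_fetch(name: str, arguments: dict) -> list[TextContent]:
    if name != "fetch_url":
        raise ValueError(f"Unknown tool: {name}")
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.request(
            method=arguments.get("method", "GET"),
            url=arguments["url"],
            headers=arguments.get("headers"),
            content=arguments.get("body"),
        )
    # Include the status code so the model can reason about failures
    return [TextContent(type="text", text=f"HTTP {response.status_code}\n{response.text}")]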
Building a Multi-Tool MCP Server
Combine multiple extraction methods in a single MCP server:
import asyncio
import os

import httpx
from bs4 import BeautifulSoup
from mcp.server import Server
from mcp.types import Tool, TextContent

app = Server("comprehensive-scraper")
@app.list_tools()
async def list_tools() -> list[Tool]:
return [
Tool(
name="scrape_with_api",
description="Scrape using WebScraping.AI API with proxy rotation",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string"},
"wait_for": {"type": "string"},
"proxy": {"type": "string", "enum": ["datacenter", "residential"]}
},
"required": ["url"]
}
),
Tool(
name="parse_html",
description="Parse HTML and extract specific elements",
inputSchema={
"type": "object",
"properties": {
"html": {"type": "string"},
"selector": {"type": "string"},
"attribute": {"type": "string"}
},
"required": ["html", "selector"]
}
),
Tool(
name="extract_structured_data",
description="Extract structured data using AI",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string"},
"schema": {"type": "object"}
},
"required": ["url", "schema"]
}
),
Tool(
name="batch_scrape",
description="Scrape multiple URLs concurrently",
inputSchema={
"type": "object",
"properties": {
"urls": {"type": "array", "items": {"type": "string"}},
"max_concurrent": {"type": "number"}
},
"required": ["urls"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    # Handlers for scrape_with_api and extract_structured_data are omitted here;
    # a sketch of scrape_with_api follows after this block
    if name == "parse_html":
soup = BeautifulSoup(arguments["html"], "html.parser")
elements = soup.select(arguments["selector"])
if arguments.get("attribute"):
result = [el.get(arguments["attribute"]) for el in elements]
else:
result = [el.get_text(strip=True) for el in elements]
return [TextContent(type="text", text=str(result))]
elif name == "batch_scrape":
urls = arguments["urls"]
max_concurrent = arguments.get("max_concurrent", 5)
semaphore = asyncio.Semaphore(max_concurrent)
async def scrape_one(url):
async with semaphore:
async with httpx.AsyncClient() as client:
response = await client.get(
"https://api.webscraping.ai/html",
params={"url": url, "api_key": os.environ["API_KEY"]}
)
return {"url": url, "content": response.text}
results = await asyncio.gather(*[scrape_one(url) for url in urls])
return [TextContent(type="text", text=str(results))]
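For completeness, here is a sketch of the scrape_with_api handler referenced in the comment above. It reuses the same WebScraping.AI /html endpoint as earlier; the 30-second timeout and the datacenter default are arbitrary choices:
import os

import httpx
from mcp.types import TextContent

async def handle_scrape_with_api(arguments: dict) -> list[TextContent]:
    # Build query params, sending optional ones only when supplied
    params = {
        "url": arguments["url"],
        "api_key": os.environ["API_KEY"],
        "proxy": arguments.get("proxy", "datacenter"),
    }
    if arguments.get("wait_for"):
        params["wait_for"] = arguments["wait_for"]
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get("https://api.webscraping.ai/html", params=params)
    return [TextContent(type="text", text=response.text)]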
Data Processing Tools
Beyond extraction, MCP servers can provide data transformation tools:
transform_data
{
"name": "transform_data",
"description": "Transform extracted data into different formats",
"inputSchema": {
"type": "object",
"properties": {
"data": {"type": "string"},
"format": {"type": "string", "enum": ["json", "csv", "xml"]},
"schema": {"type": "object"}
}
}
}
validate_data
{
"name": "validate_data",
"description": "Validate extracted data against schema",
"inputSchema": {
"type": "object",
"properties": {
"data": {"type": "object"},
"schema": {"type": "object"}
}
}
}
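Handlers for these two tools can lean on the standard library plus the jsonschema package (pip install jsonschema). A minimal sketch; the function names are illustrative, and transform_data covers only the JSON-to-CSV path here:
import csv
import io
import json

import jsonschema
from mcp.types import TextContent

async def handle_transform_data(arguments: dict) -> list[TextContent]:
    records = json.loads(arguments["data"])  # assumes a JSON array of flat objects
    if arguments.get("format") == "csv" and records:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
        return [TextContent(type="text", text=buffer.getvalue())]
    # JSON passthrough; XML omitted in this sketch
    return [TextContent(type="text", text=json.dumps(records, indent=2))]

async def handle_validate_data(arguments: dict) -> list[TextContent]:
    try:
        jsonschema.validate(instance=arguments["data"], schema=arguments["schema"])
        return [TextContent(type="text", text="Data is valid")]
    except jsonschema.ValidationError as e:
        return [TextContent(type="text", text=f"Validation error: {e.message}")]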
Installing and Configuring MCP Servers
Installation
# Install the Playwright MCP server
npm install -g @playwright/mcp
# Install Python MCP SDK for custom servers
pip install mcp httpx beautifulsoup4
# Install Node.js MCP SDK
npm install @modelcontextprotocol/sdk
Configuration
Configure MCP servers in Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
"mcpServers": {
"playwright": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-playwright"]
},
"webscraping-ai": {
"command": "python",
"args": ["/path/to/webscraping_mcp_server.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key"
}
},
"custom-scraper": {
"command": "node",
"args": ["/path/to/scraper-server.js"],
"env": {
"API_KEY": "your_key"
}
}
}
}
Best Practices for MCP Data Extraction Tools
1. Error Handling
Just like when you handle errors in Puppeteer, implement robust error handling in MCP tools:
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    try:
        # extract_data is a placeholder for your tool's actual implementation
        result = await extract_data(arguments)
        return [TextContent(type="text", text=result)]
except httpx.TimeoutException:
return [TextContent(
type="text",
text="Error: Request timed out. Try increasing timeout or using a different proxy."
)]
except httpx.HTTPStatusError as e:
return [TextContent(
type="text",
text=f"Error: HTTP {e.response.status_code} - {e.response.text}"
)]
except Exception as e:
return [TextContent(
type="text",
text=f"Error: {str(e)}"
)]
2. Rate Limiting
from asyncio import Semaphore
rate_limiter = Semaphore(5) # Max 5 concurrent requests
@app.call_tool()
async def call_tool(name: str, arguments: dict):
async with rate_limiter:
# Tool implementation
pass
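A semaphore caps concurrency, not request rate. If a target enforces a requests-per-second limit, a small hand-rolled limiter (a sketch, not a library API) can enforce it:
import asyncio
import time

class RateLimiter:
    """Allow at most `rate` acquisitions per `period` seconds."""

    def __init__(self, rate: int, period: float = 1.0):
        self.rate = rate
        self.period = period
        self._timestamps: list[float] = []
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            # Keep only timestamps still inside the current window
            self._timestamps = [t for t in self._timestamps if now - t < self.period]
            if len(self._timestamps) >= self.rate:
                # Sleep until the oldest request ages out of the window
                await asyncio.sleep(self.period - (now - self._timestamps[0]))
            self._timestamps.append(time.monotonic())

limiter = RateLimiter(rate=5)  # at most 5 requests per second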
3. Caching
import hashlib
import time
cache = {}
async def get_cached_content(url: str, ttl: int = 3600):
cache_key = hashlib.md5(url.encode()).hexdigest()
if cache_key in cache:
cached_data, timestamp = cache[cache_key]
if time.time() - timestamp < ttl:
return cached_data
    # Fetch fresh data (fetch_url is whatever fetch helper your server defines)
    data = await fetch_url(url)
cache[cache_key] = (data, time.time())
return data
4. Input Validation
from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    try:
        result = urlparse(url)
        return result.scheme in ("http", "https") and bool(result.netloc)
    except (ValueError, AttributeError):
        return False
@app.call_tool()
async def call_tool(name: str, arguments: dict):
url = arguments.get("url")
if not validate_url(url):
return [TextContent(type="text", text="Error: Invalid URL format")]
# Continue with tool execution
Real-World Applications
E-commerce Product Monitoring
# MCP server tool for monitoring product prices
Tool(
name="monitor_product",
description="Monitor product price and availability",
inputSchema={
"type": "object",
"properties": {
"product_url": {"type": "string"},
"check_interval": {"type": "number"},
"alert_threshold": {"type": "number"}
}
}
)
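A handler behind this tool could compose the extraction primitives shown earlier. A hypothetical sketch; extract_product_fields is an assumed helper wrapping a fields-style extraction call, not a real API:
from mcp.types import TextContent

async def handle_monitor_product(arguments: dict) -> list[TextContent]:
    # extract_product_fields is a hypothetical helper built on the fields endpoint shown earlier
    fields = await extract_product_fields(
        arguments["product_url"],
        {"price": "Current price as a number", "in_stock": "Is the product available?"},
    )
    price = float(fields.get("price", 0))
    threshold = arguments.get("alert_threshold")
    if threshold is not None and price <= threshold:
        return [TextContent(type="text", text=f"ALERT: price {price} is at or below {threshold}")]
    return [TextContent(type="text", text=f"Price: {price}, in stock: {fields.get('in_stock')}")]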
News Aggregation
# Aggregate news from multiple sources
Tool(
name="aggregate_news",
description="Collect and aggregate news articles",
inputSchema={
"type": "object",
"properties": {
"sources": {"type": "array", "items": {"type": "string"}},
"keywords": {"type": "array", "items": {"type": "string"}},
"date_range": {"type": "string"}
}
}
)
Market Research
# Extract competitor data
Tool(
name="analyze_competitors",
description="Extract and analyze competitor data",
inputSchema={
"type": "object",
"properties": {
"competitor_urls": {"type": "array"},
"metrics": {"type": "array", "items": {"type": "string"}}
}
}
)
Conclusion
The MCP ecosystem provides a rich set of tools for data extraction, from browser automation with Playwright to AI-powered extraction with WebScraping.AI. By understanding and leveraging these tools, you can build sophisticated, AI-driven data extraction workflows that adapt to complex scraping scenarios. Whether you're using official MCP servers or building custom ones, the standardized protocol ensures your tools work seamlessly with AI assistants like Claude.
Start by exploring the Playwright MCP server for browser automation needs, then expand to custom servers integrating specialized APIs and processing capabilities. With proper error handling, rate limiting, and validation, MCP tools provide a robust foundation for production-grade data extraction systems.