What are MCP server examples for data extraction?
Model Context Protocol (MCP) servers provide powerful tools for data extraction and web scraping. These servers act as intermediaries between AI models and various data sources, offering standardized interfaces for browser automation, API interactions, and custom data extraction workflows. This guide explores practical MCP server examples that developers can use to build robust data extraction pipelines.
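Because every MCP server advertises its capabilities through the same tools/list request, a quick way to explore any of the servers below is to connect and enumerate its tools. Here is a minimal Python sketch; the command and path are placeholders for whichever server you run:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools():
    # Placeholder command/path -- point this at any MCP server
    server_params = StdioServerParameters(
        command="node",
        args=["/path/to/mcp-server/index.js"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Every MCP server answers the same discovery request
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(list_server_tools())
Checking the tools/list output first is worthwhile because tool names and argument schemas vary between server implementations.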
Popular MCP Server Examples for Data Extraction
1. Puppeteer MCP Server
The Puppeteer MCP server is one of the most widely used implementations for browser automation and web scraping. It provides a comprehensive set of tools for interacting with web pages through the Chrome DevTools Protocol.
Installation:
npm install @modelcontextprotocol/server-puppeteer
Basic Configuration (claude_desktop_config.json):
{
  "mcpServers": {
    "puppeteer": {
      "command": "node",
      "args": [
        "/path/to/node_modules/@modelcontextprotocol/server-puppeteer/dist/index.js"
      ]
    }
  }
}
Example Usage in Python:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Connect to the Puppeteer MCP server over stdio
    server_params = StdioServerParameters(
        command="node",
        args=["/path/to/puppeteer-mcp-server/index.js"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Navigate to a page
            result = await session.call_tool(
                "puppeteer_navigate",
                arguments={"url": "https://example.com"}
            )
            # Capture a screenshot (the reference server requires a name argument)
            screenshot = await session.call_tool(
                "puppeteer_screenshot",
                arguments={"name": "example-page"}
            )

asyncio.run(main())
JavaScript Example:
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "node",
  args: ["./puppeteer-mcp-server/index.js"]
});

const client = new Client(
  { name: "data-extractor", version: "1.0.0" },
  { capabilities: {} }
);

await client.connect(transport);

// Navigate and extract data
await client.callTool({
  name: "puppeteer_navigate",
  arguments: { url: "https://example.com/products" }
});

// Click elements and scrape content
await client.callTool({
  name: "puppeteer_click",
  arguments: { selector: ".load-more-button" }
});

const data = await client.callTool({
  name: "puppeteer_evaluate",
  arguments: {
    script: `
      Array.from(document.querySelectorAll('.product')).map(el => ({
        title: el.querySelector('.title')?.textContent,
        price: el.querySelector('.price')?.textContent,
        image: el.querySelector('img')?.src
      }))
    `
  }
});

// Parse the tool's text payload (if your server wraps results
// in extra text, strip the wrapper before parsing)
const products = JSON.parse(data.content[0].text);
2. Playwright MCP Server
A Playwright-based MCP server offers cross-browser support (Chromium, Firefox, and WebKit) and is ideal for data extraction from complex web applications. Several implementations exist; the examples below assume the community package @executeautomation/playwright-mcp-server, which exposes playwright_*-style tools. Whichever server you choose, confirm the exact tool names and argument schemas via its tools/list response.
Installation:
npm install -g @executeautomation/playwright-mcp-server
npx playwright install chromium
Configuration Example:
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    }
  }
}
Python Implementation:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_data_with_playwright():
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@executeautomation/playwright-mcp-server"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Navigate to the target page
            await session.call_tool(
                "playwright_navigate",
                arguments={"url": "https://api-docs-site.com"}
            )
            # Wait for dynamic content (tool names vary by server;
            # check tools/list if this call fails)
            await session.call_tool(
                "playwright_wait_for_selector",
                arguments={"selector": ".api-endpoint", "timeout": 5000}
            )
            # Extract structured data
            api_endpoints = await session.call_tool(
                "playwright_evaluate",
                arguments={
                    "script": """
                        Array.from(document.querySelectorAll('.api-endpoint')).map(endpoint => ({
                            method: endpoint.querySelector('.method')?.textContent,
                            path: endpoint.querySelector('.path')?.textContent,
                            description: endpoint.querySelector('.description')?.textContent
                        }))
                    """
                }
            )
            return api_endpoints

# Run the extraction
data = asyncio.run(extract_data_with_playwright())
print(data)
3. Custom HTTP/REST API MCP Server
For extracting data from REST APIs, a custom MCP server can provide structured access to external data sources.
Server Implementation (Node.js):
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

const server = new Server(
  { name: "api-extractor", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

// Define tools for data extraction (the SDK dispatches handlers
// by request schema, not by raw method string)
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "fetch_api_data",
      description: "Fetch data from a REST API endpoint",
      inputSchema: {
        type: "object",
        properties: {
          url: { type: "string" },
          method: { type: "string", enum: ["GET", "POST"] },
          headers: { type: "object" },
          body: { type: "object" }
        },
        required: ["url"]
      }
    },
    {
      name: "extract_json_field",
      description: "Extract a specific field from JSON data using a dot-separated path",
      inputSchema: {
        type: "object",
        properties: {
          data: { type: "object" },
          path: { type: "string" }
        },
        required: ["data", "path"]
      }
    }
  ]
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "fetch_api_data") {
    const { url, method = "GET", headers = {}, body } = request.params.arguments;
    try {
      const response = await axios({ method, url, headers, data: body });
      return {
        content: [{
          type: "text",
          text: JSON.stringify(response.data, null, 2)
        }]
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error.message}` }],
        isError: true
      };
    }
  }
  if (request.params.name === "extract_json_field") {
    const { data, path } = request.params.arguments;
    // Walk a dot-separated path like "user.address.city"
    const value = path.split(".").reduce((obj, key) => obj?.[key], data);
    return {
      content: [{
        type: "text",
        text: JSON.stringify(value, null, 2)
      }]
    };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);
Client Usage:
import json

async def extract_from_api():
    server_params = StdioServerParameters(
        command="node",
        args=["./api-extractor-server.js"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Fetch data from the API
            result = await session.call_tool(
                "fetch_api_data",
                arguments={
                    "url": "https://api.github.com/repos/microsoft/playwright/issues",
                    "headers": {"Accept": "application/vnd.github.v3+json"}
                }
            )
            # Tool results arrive as text content; parse the JSON payload
            issues = json.loads(result.content[0].text)
            # The server's path syntax is dot-separated (no wildcards),
            # so extract from one object at a time
            titles = []
            for issue in issues:
                field = await session.call_tool(
                    "extract_json_field",
                    arguments={"data": issue, "path": "title"}
                )
                titles.append(field.content[0].text)
            return titles
4. Filesystem MCP Server for Log Analysis
When extracting data from log files or local datasets, a filesystem MCP server provides efficient file access.
Installation:
npm install @modelcontextprotocol/server-filesystem
Configuration:
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/data/directory"
      ]
    }
  }
}
Python Example for Log Extraction:
async def extract_log_data():
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "/var/logs"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Read the log file (the path must fall inside the allowed directory)
            result = await session.call_tool(
                "read_file",
                arguments={"path": "/var/logs/application.log"}
            )
            log_text = result.content[0].text
            # Parse and extract error entries
            errors = [line for line in log_text.split("\n") if "ERROR" in line]
            return errors
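Substring matching works for a quick pass, but structured extraction is more robust. Here is a hedged sketch that parses matched lines into records, assuming a hypothetical "timestamp LEVEL message" layout; adapt the regex to your actual log format:
import re

# Illustrative pattern for lines like "2024-05-01 12:00:00 ERROR Something failed"
# (hypothetical format -- adjust to your actual log layout)
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>ERROR|WARN|INFO)\s+"
    r"(?P<message>.*)"
)

def parse_log_lines(log_text):
    """Turn raw log text into a list of structured records."""
    records = []
    for line in log_text.split("\n"):
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records

# Usage: errors = [r for r in parse_log_lines(log_text) if r["level"] == "ERROR"]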
5. Database MCP Server for Structured Data
For extracting data from databases, a custom MCP server can provide query capabilities.
PostgreSQL MCP Server Example:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import pg from "pg";

const pool = new pg.Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD
});

const server = new Server(
  { name: "postgres-extractor", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [{
    name: "query_database",
    description: "Execute a SQL query and extract data",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string" },
        params: { type: "array" }
      },
      required: ["query"]
    }
  }]
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "query_database") {
    const { query, params = [] } = request.params.arguments;
    try {
      // Parameterized queries guard against SQL injection
      const result = await pool.query(query, params);
      return {
        content: [{
          type: "text",
          text: JSON.stringify(result.rows, null, 2)
        }]
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error.message}` }],
        isError: true
      };
    }
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);
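For symmetry with the earlier examples, here is a minimal client sketch for this server. The filename postgres-extractor-server.js, the orders table, and the date filter are illustrative assumptions:
import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_recent_orders():
    # Hypothetical filename for the server implementation above
    server_params = StdioServerParameters(
        command="node",
        args=["./postgres-extractor-server.js"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Parameterized query; the table and columns are illustrative
            result = await session.call_tool(
                "query_database",
                arguments={
                    "query": "SELECT id, total FROM orders WHERE created_at > $1",
                    "params": ["2024-01-01"]
                }
            )
            return json.loads(result.content[0].text)

rows = asyncio.run(extract_recent_orders())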
Advanced Data Extraction Patterns
Combining Multiple MCP Servers
You can use multiple MCP servers together for complex extraction workflows:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def multi_source_extraction():
    # Puppeteer server for web scraping
    puppeteer_params = StdioServerParameters(
        command="node",
        args=["./puppeteer-mcp-server/index.js"]
    )
    # Filesystem server for data storage
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "./data"]
    )
    async with stdio_client(puppeteer_params) as (p_read, p_write), \
               stdio_client(fs_params) as (f_read, f_write):
        async with ClientSession(p_read, p_write) as puppeteer_session, \
                   ClientSession(f_read, f_write) as fs_session:
            await puppeteer_session.initialize()
            await fs_session.initialize()
            # Scrape data with Puppeteer
            await puppeteer_session.call_tool(
                "puppeteer_navigate",
                arguments={"url": "https://data-source.com"}
            )
            scraped = await puppeteer_session.call_tool(
                "puppeteer_evaluate",
                arguments={"script": "document.body.innerText"}
            )
            # Save the extracted text to the filesystem
            await fs_session.call_tool(
                "write_file",
                arguments={
                    "path": "scraped_data.txt",
                    "content": scraped.content[0].text
                }
            )
Handling Pagination and Dynamic Content
When dealing with paginated content, much as when handling AJAX requests directly in Puppeteer, MCP servers can automate the extraction loop:
// `client` is a connected MCP Client instance (see the earlier JavaScript example)
async function extractPaginatedData(client) {
  const allData = [];
  let currentPage = 1;
  let hasNextPage = true;

  while (hasNextPage) {
    // Navigate to the current page
    await client.callTool({
      name: "puppeteer_navigate",
      arguments: { url: `https://example.com/data?page=${currentPage}` }
    });

    // Wait for content to load (tool availability varies by server;
    // if yours has no wait tool, poll via puppeteer_evaluate instead)
    await client.callTool({
      name: "puppeteer_wait_for_selector",
      arguments: { selector: ".data-item" }
    });

    // Extract data from the current page
    const pageData = await client.callTool({
      name: "puppeteer_evaluate",
      arguments: {
        script: `
          Array.from(document.querySelectorAll('.data-item')).map(item => ({
            id: item.dataset.id,
            title: item.querySelector('h2')?.textContent,
            content: item.querySelector('p')?.textContent
          }))
        `
      }
    });
    // Assumes the server returns the raw JSON result as text;
    // strip any wrapper text first if yours adds one
    allData.push(...JSON.parse(pageData.content[0].text));

    // Check whether an enabled "next page" control exists
    const nextCheck = await client.callTool({
      name: "puppeteer_evaluate",
      arguments: {
        script: "document.querySelector('.next-page:not(.disabled)') !== null"
      }
    });
    hasNextPage = nextCheck.content[0].text.includes("true");
    currentPage++;
  }

  return allData;
}
Error Handling and Retries
Implement robust error handling when working with MCP servers for data extraction, especially around timeouts. The example below uses the tenacity retry library (pip install tenacity):
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def resilient_extraction(session, url):
    try:
        # Navigate with a client-side timeout; some servers also accept
        # a timeout argument on the navigate tool itself
        await asyncio.wait_for(
            session.call_tool(
                "puppeteer_navigate",
                arguments={"url": url}
            ),
            timeout=35
        )
        # Extract data with error handling
        try:
            data = await session.call_tool(
                "puppeteer_evaluate",
                arguments={"script": "document.querySelector('.data').textContent"}
            )
            return data
        except Exception as e:
            print(f"Extraction error: {e}")
            return None
    except asyncio.TimeoutError:
        print(f"Timeout while loading {url}")
        raise
    except Exception as e:
        print(f"Navigation error: {e}")
        raise
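A brief usage sketch, assuming an initialized Puppeteer ClientSession like the ones above; the URLs are placeholders:
async def extract_many(session):
    # Placeholder URLs -- replace with your targets
    urls = ["https://example.com/a", "https://example.com/b"]
    results = []
    for url in urls:
        data = await resilient_extraction(session, url)
        if data is not None:  # extraction errors return None
            results.append(data.content[0].text)
    return results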
Best Practices for MCP-Based Data Extraction
- Use Resource Management: Properly initialize and close MCP sessions to prevent resource leaks
- Implement Rate Limiting: Add delays between requests to avoid overwhelming target servers (a sketch covering this and the next two practices follows the list)
- Cache Responses: Store extracted data to minimize redundant requests
- Validate Data: Always validate extracted data structure before processing
- Monitor Performance: Track extraction speed and success rates
- Handle Authentication: Securely manage credentials when accessing protected resources
- Log Activities: Maintain detailed logs for debugging and compliance
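To make the rate-limiting, caching, and validation practices concrete, here is a minimal sketch of a wrapper around call_tool. The interval policy, cache key scheme, and validate callback are illustrative assumptions, not part of any MCP SDK:
import asyncio
import json
import time

class ExtractionHelper:
    """Rate-limits, caches, and validates MCP tool calls (illustrative sketch)."""

    def __init__(self, session, min_interval=1.0):
        self.session = session
        self.min_interval = min_interval  # assumed policy: seconds between calls
        self._last_call = 0.0
        self._cache = {}

    async def call(self, tool, arguments, validate=None):
        # Cache: reuse results for identical tool invocations
        key = (tool, json.dumps(arguments, sort_keys=True))
        if key in self._cache:
            return self._cache[key]

        # Rate limit: wait out the remainder of the interval
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)

        result = await self.session.call_tool(tool, arguments=arguments)
        self._last_call = time.monotonic()

        # Validate: reject results that don't match the expected structure
        if validate is not None and not validate(result):
            raise ValueError(f"Unexpected result structure from {tool}")

        self._cache[key] = result
        return result
Wrap a ClientSession once and route all tool calls through call() so every extraction path shares the same limits and cache.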
Conclusion
MCP servers provide a standardized, powerful approach to data extraction across various sources. Whether you're using Puppeteer for browser automation, custom servers for API access, or filesystem servers for local data processing, the Model Context Protocol offers a consistent interface that simplifies complex extraction workflows. By combining multiple MCP servers and implementing proper error handling, developers can build robust, scalable data extraction pipelines that integrate seamlessly with AI-powered applications.
The examples provided in this guide demonstrate practical implementations you can adapt to your specific data extraction needs, from simple web scraping to complex multi-source data aggregation workflows.