What are MCP server examples for data extraction?

Model Context Protocol (MCP) servers provide powerful tools for data extraction and web scraping. These servers act as intermediaries between AI models and various data sources, offering standardized interfaces for browser automation, API interactions, and custom data extraction workflows. This guide explores practical MCP server examples that developers can use to build robust data extraction pipelines.
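
Because every MCP server speaks the same protocol, a client can discover a server's capabilities before calling anything. The snippet below is a minimal sketch (assuming Node.js and the Python mcp package are installed) that launches the reference filesystem server and prints the tools it exposes; the same discovery pattern works with every server shown in this guide.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools():
    # Launch the reference filesystem server over stdio; any MCP server starts the same way
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "./data"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Every server answers the same tools/list request
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(list_server_tools())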

Popular MCP Server Examples for Data Extraction

1. Puppeteer MCP Server

The Puppeteer MCP server is one of the most widely used implementations for browser automation and web scraping. It provides a comprehensive set of tools for interacting with web pages through the Chrome DevTools Protocol.

Installation:

npm install @modelcontextprotocol/server-puppeteer

Basic Configuration (claude_desktop_config.json):

{
  "mcpServers": {
    "puppeteer": {
      "command": "node",
      "args": [
        "/path/to/node_modules/@modelcontextprotocol/server-puppeteer/dist/index.js"
      ]
    }
  }
}

Example Usage in Python:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Connect to Puppeteer MCP server
server_params = StdioServerParameters(
    command="node",
    args=["/path/to/puppeteer-mcp-server/index.js"]
)

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()

        # Navigate to a page
        result = await session.call_tool(
            "puppeteer_navigate",
            arguments={"url": "https://example.com"}
        )

        # Capture a full-page screenshot for later analysis
        screenshot = await session.call_tool(
            "puppeteer_screenshot",
            arguments={"fullPage": True}
        )

JavaScript Example:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "node",
  args: ["./puppeteer-mcp-server/index.js"]
});

const client = new Client({
  name: "data-extractor",
  version: "1.0.0"
}, {
  capabilities: {}
});

await client.connect(transport);

// Navigate and extract data
const result = await client.callTool({
  name: "puppeteer_navigate",
  arguments: { url: "https://example.com/products" }
});

// Click elements and scrape content
await client.callTool({
  name: "puppeteer_click",
  arguments: { selector: ".load-more-button" }
});

const data = await client.callTool({
  name: "puppeteer_evaluate",
  arguments: {
    expression: `
      Array.from(document.querySelectorAll('.product')).map(el => ({
        title: el.querySelector('.title').textContent,
        price: el.querySelector('.price').textContent,
        image: el.querySelector('img').src
      }))
    `
  }
});

2. Playwright MCP Server

The Playwright MCP server offers cross-browser support and is ideal for data extraction from complex web applications. When you need to manage browser sessions across Chromium, Firefox, and WebKit, Playwright provides excellent compatibility.

Installation:

npm install @modelcontextprotocol/server-playwright
npx playwright install chromium

Configuration Example:

{
  "mcpServers": {
    "playwright": {
      "command": "node",
      "args": ["./node_modules/@modelcontextprotocol/server-playwright/dist/index.js"],
      "env": {
        "BROWSER": "chromium"
      }
    }
  }
}

Python Implementation:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_data_with_playwright():
    server_params = StdioServerParameters(
        command="node",
        args=["./playwright-mcp-server/index.js"],
        env={"BROWSER": "chromium"}
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Navigate to target page
            await session.call_tool(
                "playwright_navigate",
                arguments={"url": "https://api-docs-site.com"}
            )

            # Wait for dynamic content
            await session.call_tool(
                "playwright_wait_for_selector",
                arguments={"selector": ".api-endpoint", "timeout": 5000}
            )

            # Extract structured data
            api_endpoints = await session.call_tool(
                "playwright_evaluate",
                arguments={
                    "expression": """
                        Array.from(document.querySelectorAll('.api-endpoint')).map(endpoint => ({
                            method: endpoint.querySelector('.method').textContent,
                            path: endpoint.querySelector('.path').textContent,
                            description: endpoint.querySelector('.description').textContent
                        }))
                    """
                }
            )

            return api_endpoints

# Run the extraction
data = asyncio.run(extract_data_with_playwright())
print(data)

3. Custom HTTP/REST API MCP Server

For extracting data from REST APIs, a custom MCP server can provide structured access to external data sources.

Server Implementation (Node.js):

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

const server = new Server(
  {
    name: "api-extractor",
    version: "1.0.0"
  },
  {
    capabilities: {
      tools: {}
    }
  }
);

// Define tools for data extraction
server.setRequestHandler("tools/list", async () => ({
  tools: [
    {
      name: "fetch_api_data",
      description: "Fetch data from a REST API endpoint",
      inputSchema: {
        type: "object",
        properties: {
          url: { type: "string" },
          method: { type: "string", enum: ["GET", "POST"] },
          headers: { type: "object" },
          body: { type: "object" }
        },
        required: ["url"]
      }
    },
    {
      name: "extract_json_field",
      description: "Extract specific fields from JSON data",
      inputSchema: {
        type: "object",
        properties: {
          data: { type: "object" },
          path: { type: "string" }
        },
        required: ["data", "path"]
      }
    }
  ]
}));

server.setRequestHandler("tools/call", async (request) => {
  if (request.params.name === "fetch_api_data") {
    const { url, method = "GET", headers = {}, body } = request.params.arguments;

    try {
      const response = await axios({
        method,
        url,
        headers,
        data: body
      });

      return {
        content: [{
          type: "text",
          text: JSON.stringify(response.data, null, 2)
        }]
      };
    } catch (error) {
      return {
        content: [{
          type: "text",
          text: `Error: ${error.message}`
        }],
        isError: true
      };
    }
  }

  if (request.params.name === "extract_json_field") {
    const { data, path } = request.params.arguments;
    const value = path.split('.').reduce((obj, key) => obj?.[key], data);

    return {
      content: [{
        type: "text",
        text: JSON.stringify(value, null, 2)
      }]
    };
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);

Client Usage:

import json

async def extract_from_api():
    server_params = StdioServerParameters(
        command="node",
        args=["./api-extractor-server.js"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Fetch data from API
            result = await session.call_tool(
                "fetch_api_data",
                arguments={
                    "url": "https://api.github.com/repos/microsoft/playwright/issues",
                    "headers": {"Accept": "application/vnd.github.v3+json"}
                }
            )

            # Parse the JSON payload returned as text content
            issues = json.loads(result.content[0].text)

            # Extract the title of the first issue with a simple dot path
            first_title = await session.call_tool(
                "extract_json_field",
                arguments={
                    "data": issues,
                    "path": "0.title"
                }
            )

            return first_title

4. Filesystem MCP Server for Log Analysis

When extracting data from log files or local datasets, a filesystem MCP server provides efficient file access.

Installation:

npm install @modelcontextprotocol/server-filesystem

Configuration:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/data/directory"
      ]
    }
  }
}

Python Example for Log Extraction:

async def extract_log_data():
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "/var/logs"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Read the log file (the tool returns the file contents as text content)
            result = await session.call_tool(
                "read_file",
                arguments={"path": "/var/logs/application.log"}
            )
            log_content = result.content[0].text

            # Parse and extract error entries
            errors = []
            for line in log_content.split('\n'):
                if 'ERROR' in line:
                    errors.append(line)

            return errors

5. Database MCP Server for Structured Data

For extracting data from databases, a custom MCP server can provide query capabilities.

PostgreSQL MCP Server Example:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import pg from "pg";

const pool = new pg.Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD
});

const server = new Server(
  { name: "postgres-extractor", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

server.setRequestHandler("tools/list", async () => ({
  tools: [{
    name: "query_database",
    description: "Execute SQL query and extract data",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string" },
        params: { type: "array" }
      },
      required: ["query"]
    }
  }]
}));

server.setRequestHandler("tools/call", async (request) => {
  if (request.params.name === "query_database") {
    const { query, params = [] } = request.params.arguments;

    try {
      const result = await pool.query(query, params);
      return {
        content: [{
          type: "text",
          text: JSON.stringify(result.rows, null, 2)
        }]
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error.message}` }],
        isError: true
      };
    }
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);
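
Client Usage (a minimal sketch; the server file name, connection details, and the users table below are placeholders):

import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_from_postgres():
    # Hypothetical path to the server above; the DB_* values are placeholders
    server_params = StdioServerParameters(
        command="node",
        args=["./postgres-extractor-server.js"],
        env={
            "DB_HOST": "localhost",
            "DB_NAME": "analytics",
            "DB_USER": "readonly_user",
            "DB_PASSWORD": "change-me"
        }
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Run a parameterized query through the query_database tool
            result = await session.call_tool(
                "query_database",
                arguments={
                    "query": "SELECT id, email FROM users WHERE created_at > $1",
                    "params": ["2024-01-01"]
                }
            )

            # The server returns the rows as JSON text content
            return json.loads(result.content[0].text)

rows = asyncio.run(extract_from_postgres())
print(rows)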

Advanced Data Extraction Patterns

Combining Multiple MCP Servers

You can use multiple MCP servers together for complex extraction workflows:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def multi_source_extraction():
    # Connect to Puppeteer server for web scraping
    puppeteer_params = StdioServerParameters(
        command="node",
        args=["./puppeteer-mcp-server/index.js"]
    )

    # Connect to filesystem server for data storage
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "./data"]
    )

    async with stdio_client(puppeteer_params) as (p_read, p_write), \
               stdio_client(fs_params) as (f_read, f_write):

        async with ClientSession(p_read, p_write) as puppeteer_session, \
                   ClientSession(f_read, f_write) as fs_session:

            await puppeteer_session.initialize()
            await fs_session.initialize()

            # Scrape data with Puppeteer
            await puppeteer_session.call_tool(
                "puppeteer_navigate",
                arguments={"url": "https://data-source.com"}
            )

            scraped_result = await puppeteer_session.call_tool(
                "puppeteer_evaluate",
                arguments={"expression": "document.body.innerText"}
            )
            scraped_text = scraped_result.content[0].text

            # Save the extracted text inside the filesystem server's allowed directory
            await fs_session.call_tool(
                "write_file",
                arguments={
                    "path": "./data/scraped_data.txt",
                    "content": scraped_text
                }
            )

Handling Pagination and Dynamic Content

When dealing with paginated content, much as you would when handling AJAX requests with Puppeteer directly, MCP servers can automate the extraction loop:

async function extractPaginatedData(session) {
  const allData = [];
  let currentPage = 1;
  let hasNextPage = true;

  while (hasNextPage) {
    // Navigate to page
    await session.callTool({
      name: "puppeteer_navigate",
      arguments: { url: `https://example.com/data?page=${currentPage}` }
    });

    // Wait for content to load
    await session.callTool({
      name: "puppeteer_wait_for_selector",
      arguments: { selector: ".data-item" }
    });

    // Extract data from current page
    const pageData = await session.callTool({
      name: "puppeteer_evaluate",
      arguments: {
        expression: `
          Array.from(document.querySelectorAll('.data-item')).map(item => ({
            id: item.dataset.id,
            title: item.querySelector('h2').textContent,
            content: item.querySelector('p').textContent
          }))
        `
      }
    });

    allData.push(...JSON.parse(pageData.content[0].text));

    // Check whether an enabled "next page" link exists on the current page
    const nextCheck = await session.callTool({
      name: "puppeteer_evaluate",
      arguments: {
        expression: "document.querySelector('.next-page:not(.disabled)') !== null"
      }
    });

    hasNextPage = nextCheck.content[0].text.includes("true");
    currentPage++;
  }

  return allData;
}

Error Handling and Retries

Implement robust error handling when working with MCP servers for data extraction, especially when handling timeouts:

import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def resilient_extraction(session, url):
    try:
        # Navigate with timeout
        await asyncio.wait_for(
            session.call_tool(
                "puppeteer_navigate",
                arguments={"url": url, "timeout": 30000}
            ),
            timeout=35
        )

        # Extract data with error handling
        try:
            data = await session.call_tool(
                "puppeteer_evaluate",
                arguments={"expression": "document.querySelector('.data').textContent"}
            )
            return data
        except Exception as e:
            print(f"Extraction error: {e}")
            return None

    except asyncio.TimeoutError:
        print(f"Timeout while loading {url}")
        raise
    except Exception as e:
        print(f"Navigation error: {e}")
        raise
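
The retry-wrapped helper can then be awaited inside a normal session. The sketch below assumes the same Puppeteer MCP server path used earlier and a placeholder URL:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server_params = StdioServerParameters(
        command="node",
        args=["./puppeteer-mcp-server/index.js"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Retries, backoff, and timeouts are handled inside resilient_extraction
            data = await resilient_extraction(session, "https://example.com/listings")
            if data is not None:
                print(data.content[0].text)

asyncio.run(main())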

Best Practices for MCP-Based Data Extraction

  1. Use Resource Management: Properly initialize and close MCP sessions to prevent resource leaks
  2. Implement Rate Limiting: Add delays between requests to avoid overwhelming target servers (see the sketch after this list)
  3. Cache Responses: Store extracted data to minimize redundant requests
  4. Validate Data: Always validate extracted data structure before processing
  5. Monitor Performance: Track extraction speed and success rates
  6. Handle Authentication: Securely manage credentials when accessing protected resources
  7. Log Activities: Maintain detailed logs for debugging and compliance
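
As an illustration of the rate-limiting point above, the sketch below (a minimal example, assuming an already-initialized session connected to a Puppeteer MCP server and a list of target URLs) adds a fixed delay between requests:

import asyncio

async def scrape_with_delay(session, urls, delay_seconds=2.0):
    """Visit each URL through a Puppeteer MCP server, pausing between requests."""
    results = []
    for url in urls:
        await session.call_tool("puppeteer_navigate", arguments={"url": url})

        page_text = await session.call_tool(
            "puppeteer_evaluate",
            arguments={"expression": "document.body.innerText"}
        )
        results.append(page_text.content[0].text)

        # Rate limiting: wait before hitting the next page
        await asyncio.sleep(delay_seconds)

    return results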

Conclusion

MCP servers provide a standardized, powerful approach to data extraction across various sources. Whether you're using Puppeteer for browser automation, custom servers for API access, or filesystem servers for local data processing, the Model Context Protocol offers a consistent interface that simplifies complex extraction workflows. By combining multiple MCP servers and implementing proper error handling, developers can build robust, scalable data extraction pipelines that integrate seamlessly with AI-powered applications.

The examples provided in this guide demonstrate practical implementations you can adapt to your specific data extraction needs, from simple web scraping to complex multi-source data aggregation workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
