What are MCP server examples for data extraction?

Model Context Protocol (MCP) servers provide powerful tools for data extraction and web scraping. These servers act as intermediaries between AI models and various data sources, offering standardized interfaces for browser automation, API interactions, and custom data extraction workflows. This guide explores practical MCP server examples that developers can use to build robust data extraction pipelines.
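
Because every MCP server speaks the same protocol, a client can discover a server's capabilities before calling anything. The snippet below is a minimal sketch (assuming Node.js and the Python mcp package are installed) that launches the reference filesystem server and prints the tools it exposes; the same discovery pattern works with every server shown in this guide.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools():
    # Launch the reference filesystem server over stdio; any MCP server starts the same way
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "./data"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Every server answers the same tools/list request
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(list_server_tools())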

Popular MCP Server Examples for Data Extraction

1. Puppeteer MCP Server

The Puppeteer MCP server is one of the most widely used implementations for browser automation and web scraping. It provides a comprehensive set of tools for interacting with web pages through the Chrome DevTools Protocol.

Installation:

npm install @modelcontextprotocol/server-puppeteer

Basic Configuration (claude_desktop_config.json):

{
  "mcpServers": {
    "puppeteer": {
      "command": "node",
      "args": [
        "/path/to/node_modules/@modelcontextprotocol/server-puppeteer/dist/index.js"
      ]
    }
  }
}

Example Usage in Python:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Connect to Puppeteer MCP server
server_params = StdioServerParameters(
    command="node",
    args=["/path/to/puppeteer-mcp-server/index.js"]
)

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()

        # Navigate to a page
        result = await session.call_tool(
            "puppeteer_navigate",
            arguments={"url": "https://example.com"}
        )

        # Capture a full-page screenshot for later analysis
        screenshot = await session.call_tool(
            "puppeteer_screenshot",
            arguments={"fullPage": True}
        )

JavaScript Example:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "node",
  args: ["./puppeteer-mcp-server/index.js"]
});

const client = new Client({
  name: "data-extractor",
  version: "1.0.0"
}, {
  capabilities: {}
});

await client.connect(transport);

// Navigate and extract data
const result = await client.callTool({
  name: "puppeteer_navigate",
  arguments: { url: "https://example.com/products" }
});

// Click elements and scrape content
await client.callTool({
  name: "puppeteer_click",
  arguments: { selector: ".load-more-button" }
});

const data = await client.callTool({
  name: "puppeteer_evaluate",
  arguments: {
    expression: `
      Array.from(document.querySelectorAll('.product')).map(el => ({
        title: el.querySelector('.title').textContent,
        price: el.querySelector('.price').textContent,
        image: el.querySelector('img').src
      }))
    `
  }
});

2. Playwright MCP Server

The Playwright MCP server offers cross-browser support and is ideal for data extraction from complex web applications. When you need to manage browser sessions across Chromium, Firefox, and WebKit, Playwright provides excellent compatibility.

Installation:

npm install @modelcontextprotocol/server-playwright
npx playwright install chromium

Configuration Example:

{
  "mcpServers": {
    "playwright": {
      "command": "node",
      "args": ["./node_modules/@modelcontextprotocol/server-playwright/dist/index.js"],
      "env": {
        "BROWSER": "chromium"
      }
    }
  }
}

Python Implementation:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_data_with_playwright():
    server_params = StdioServerParameters(
        command="node",
        args=["./playwright-mcp-server/index.js"],
        env={"BROWSER": "chromium"}
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Navigate to target page
            await session.call_tool(
                "playwright_navigate",
                arguments={"url": "https://api-docs-site.com"}
            )

            # Wait for dynamic content
            await session.call_tool(
                "playwright_wait_for_selector",
                arguments={"selector": ".api-endpoint", "timeout": 5000}
            )

            # Extract structured data
            api_endpoints = await session.call_tool(
                "playwright_evaluate",
                arguments={
                    "expression": """
                        Array.from(document.querySelectorAll('.api-endpoint')).map(endpoint => ({
                            method: endpoint.querySelector('.method').textContent,
                            path: endpoint.querySelector('.path').textContent,
                            description: endpoint.querySelector('.description').textContent
                        }))
                    """
                }
            )

            return api_endpoints

# Run the extraction
data = asyncio.run(extract_data_with_playwright())
print(data)

3. Custom HTTP/REST API MCP Server

For extracting data from REST APIs, a custom MCP server can provide structured access to external data sources.

Server Implementation (Node.js):

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";

const server = new Server(
  {
    name: "api-extractor",
    version: "1.0.0"
  },
  {
    capabilities: {
      tools: {}
    }
  }
);

// Define tools for data extraction
server.setRequestHandler("tools/list", async () => ({
  tools: [
    {
      name: "fetch_api_data",
      description: "Fetch data from a REST API endpoint",
      inputSchema: {
        type: "object",
        properties: {
          url: { type: "string" },
          method: { type: "string", enum: ["GET", "POST"] },
          headers: { type: "object" },
          body: { type: "object" }
        },
        required: ["url"]
      }
    },
    {
      name: "extract_json_field",
      description: "Extract specific fields from JSON data",
      inputSchema: {
        type: "object",
        properties: {
          data: { type: "object" },
          path: { type: "string" }
        },
        required: ["data", "path"]
      }
    }
  ]
}));

server.setRequestHandler("tools/call", async (request) => {
  if (request.params.name === "fetch_api_data") {
    const { url, method = "GET", headers = {}, body } = request.params.arguments;

    try {
      const response = await axios({
        method,
        url,
        headers,
        data: body
      });

      return {
        content: [{
          type: "text",
          text: JSON.stringify(response.data, null, 2)
        }]
      };
    } catch (error) {
      return {
        content: [{
          type: "text",
          text: `Error: ${error.message}`
        }],
        isError: true
      };
    }
  }

  if (request.params.name === "extract_json_field") {
    const { data, path } = request.params.arguments;
    const value = path.split('.').reduce((obj, key) => obj?.[key], data);

    return {
      content: [{
        type: "text",
        text: JSON.stringify(value, null, 2)
      }]
    };
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);

Client Usage:

import json

async def extract_from_api():
    server_params = StdioServerParameters(
        command="node",
        args=["./api-extractor-server.js"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Fetch data from API
            result = await session.call_tool(
                "fetch_api_data",
                arguments={
                    "url": "https://api.github.com/repos/microsoft/playwright/issues",
                    "headers": {"Accept": "application/vnd.github.v3+json"}
                }
            )

            # Parse the JSON payload returned as text content
            issues = json.loads(result.content[0].text)

            # Extract the title of the first issue with a simple dot path
            first_title = await session.call_tool(
                "extract_json_field",
                arguments={
                    "data": issues,
                    "path": "0.title"
                }
            )

            return first_title

4. Filesystem MCP Server for Log Analysis

When extracting data from log files or local datasets, a filesystem MCP server provides efficient file access.

Installation:

npm install @modelcontextprotocol/server-filesystem

Configuration:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/data/directory"
      ]
    }
  }
}

Python Example for Log Extraction:

async def extract_log_data():
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "/var/logs"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Read the log file (the tool returns the file contents as text content)
            result = await session.call_tool(
                "read_file",
                arguments={"path": "/var/logs/application.log"}
            )
            log_content = result.content[0].text

            # Parse and extract error entries
            errors = []
            for line in log_content.split('\n'):
                if 'ERROR' in line:
                    errors.append(line)

            return errors

5. Database MCP Server for Structured Data

For extracting data from databases, a custom MCP server can provide query capabilities.

PostgreSQL MCP Server Example:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import pg from "pg";

const pool = new pg.Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD
});

const server = new Server(
  { name: "postgres-extractor", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

server.setRequestHandler("tools/list", async () => ({
  tools: [{
    name: "query_database",
    description: "Execute SQL query and extract data",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string" },
        params: { type: "array" }
      },
      required: ["query"]
    }
  }]
}));

server.setRequestHandler("tools/call", async (request) => {
  if (request.params.name === "query_database") {
    const { query, params = [] } = request.params.arguments;

    try {
      const result = await pool.query(query, params);
      return {
        content: [{
          type: "text",
          text: JSON.stringify(result.rows, null, 2)
        }]
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error.message}` }],
        isError: true
      };
    }
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);
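
Client Usage (a minimal sketch; the server file name, connection details, and the users table below are placeholders):

import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_from_postgres():
    # Hypothetical path to the server above; the DB_* values are placeholders
    server_params = StdioServerParameters(
        command="node",
        args=["./postgres-extractor-server.js"],
        env={
            "DB_HOST": "localhost",
            "DB_NAME": "analytics",
            "DB_USER": "readonly_user",
            "DB_PASSWORD": "change-me"
        }
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Run a parameterized query through the query_database tool
            result = await session.call_tool(
                "query_database",
                arguments={
                    "query": "SELECT id, email FROM users WHERE created_at > $1",
                    "params": ["2024-01-01"]
                }
            )

            # The server returns the rows as JSON text content
            return json.loads(result.content[0].text)

rows = asyncio.run(extract_from_postgres())
print(rows)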

Advanced Data Extraction Patterns

Combining Multiple MCP Servers

You can use multiple MCP servers together for complex extraction workflows:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def multi_source_extraction():
    # Connect to Puppeteer server for web scraping
    puppeteer_params = StdioServerParameters(
        command="node",
        args=["./puppeteer-mcp-server/index.js"]
    )

    # Connect to filesystem server for data storage
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "./data"]
    )

    async with stdio_client(puppeteer_params) as (p_read, p_write), \
               stdio_client(fs_params) as (f_read, f_write):

        async with ClientSession(p_read, p_write) as puppeteer_session, \
                   ClientSession(f_read, f_write) as fs_session:

            await puppeteer_session.initialize()
            await fs_session.initialize()

            # Scrape data with Puppeteer
            await puppeteer_session.call_tool(
                "puppeteer_navigate",
                arguments={"url": "https://data-source.com"}
            )

            scraped_result = await puppeteer_session.call_tool(
                "puppeteer_evaluate",
                arguments={"expression": "document.body.innerText"}
            )
            scraped_text = scraped_result.content[0].text

            # Save the extracted text inside the filesystem server's allowed directory
            await fs_session.call_tool(
                "write_file",
                arguments={
                    "path": "./data/scraped_data.txt",
                    "content": scraped_text
                }
            )

Handling Pagination and Dynamic Content

When dealing with paginated content, much as you would when handling AJAX requests with Puppeteer directly, MCP servers can automate the extraction loop:

async function extractPaginatedData(session) {
  const allData = [];
  let currentPage = 1;
  let hasNextPage = true;

  while (hasNextPage) {
    // Navigate to page
    await session.callTool({
      name: "puppeteer_navigate",
      arguments: { url: `https://example.com/data?page=${currentPage}` }
    });

    // Wait for content to load
    await session.callTool({
      name: "puppeteer_wait_for_selector",
      arguments: { selector: ".data-item" }
    });

    // Extract data from current page
    const pageData = await session.callTool({
      name: "puppeteer_evaluate",
      arguments: {
        expression: `
          Array.from(document.querySelectorAll('.data-item')).map(item => ({
            id: item.dataset.id,
            title: item.querySelector('h2').textContent,
            content: item.querySelector('p').textContent
          }))
        `
      }
    });

    allData.push(...JSON.parse(pageData.content[0].text));

    // Check whether an enabled "next page" link exists on the current page
    const nextCheck = await session.callTool({
      name: "puppeteer_evaluate",
      arguments: {
        expression: "document.querySelector('.next-page:not(.disabled)') !== null"
      }
    });

    hasNextPage = nextCheck.content[0].text.includes("true");
    currentPage++;
  }

  return allData;
}

Error Handling and Retries

Implement robust error handling when working with MCP servers for data extraction, especially when handling timeouts:

import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def resilient_extraction(session, url):
    try:
        # Navigate with timeout
        await asyncio.wait_for(
            session.call_tool(
                "puppeteer_navigate",
                arguments={"url": url, "timeout": 30000}
            ),
            timeout=35
        )

        # Extract data with error handling
        try:
            data = await session.call_tool(
                "puppeteer_evaluate",
                arguments={"expression": "document.querySelector('.data').textContent"}
            )
            return data
        except Exception as e:
            print(f"Extraction error: {e}")
            return None

    except asyncio.TimeoutError:
        print(f"Timeout while loading {url}")
        raise
    except Exception as e:
        print(f"Navigation error: {e}")
        raise
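
The retry-wrapped helper can then be awaited inside a normal session. The sketch below assumes the same Puppeteer MCP server path used earlier and a placeholder URL:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server_params = StdioServerParameters(
        command="node",
        args=["./puppeteer-mcp-server/index.js"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Retries, backoff, and timeouts are handled inside resilient_extraction
            data = await resilient_extraction(session, "https://example.com/listings")
            if data is not None:
                print(data.content[0].text)

asyncio.run(main())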

Best Practices for MCP-Based Data Extraction

  1. Use Resource Management: Properly initialize and close MCP sessions to prevent resource leaks
  2. Implement Rate Limiting: Add delays between requests to avoid overwhelming target servers (see the sketch after this list)
  3. Cache Responses: Store extracted data to minimize redundant requests
  4. Validate Data: Always validate extracted data structure before processing
  5. Monitor Performance: Track extraction speed and success rates
  6. Handle Authentication: Securely manage credentials when accessing protected resources
  7. Log Activities: Maintain detailed logs for debugging and compliance
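
As an illustration of the rate-limiting point above, the sketch below (a minimal example, assuming an already-initialized session connected to a Puppeteer MCP server and a list of target URLs) adds a fixed delay between requests:

import asyncio

async def scrape_with_delay(session, urls, delay_seconds=2.0):
    """Visit each URL through a Puppeteer MCP server, pausing between requests."""
    results = []
    for url in urls:
        await session.call_tool("puppeteer_navigate", arguments={"url": url})

        page_text = await session.call_tool(
            "puppeteer_evaluate",
            arguments={"expression": "document.body.innerText"}
        )
        results.append(page_text.content[0].text)

        # Rate limiting: wait before hitting the next page
        await asyncio.sleep(delay_seconds)

    return results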

Conclusion

MCP servers provide a standardized, powerful approach to data extraction across various sources. Whether you're using Puppeteer for browser automation, custom servers for API access, or filesystem servers for local data processing, the Model Context Protocol offers a consistent interface that simplifies complex extraction workflows. By combining multiple MCP servers and implementing proper error handling, developers can build robust, scalable data extraction pipelines that integrate seamlessly with AI-powered applications.

The examples provided in this guide demonstrate practical implementations you can adapt to your specific data extraction needs, from simple web scraping to complex multi-source data aggregation workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
