How do I use the MCP API for data extraction?

The Model Context Protocol (MCP) API provides a powerful interface for extracting data from web pages through standardized tools and resources. Whether you're building automated scraping workflows, integrating AI-powered data extraction, or creating custom web automation tools, the MCP API offers a flexible and efficient approach to accessing web content.

Understanding MCP API Architecture

The MCP API operates on a client-server architecture where:

  • MCP Server: Exposes tools and resources for web interaction (browser automation, scraping, etc.)
  • MCP Client: Connects to servers and invokes their tools
  • Tools: Executable functions like browser_snapshot, browser_click, or custom scraping operations
  • Resources: Accessible data endpoints that can be read without execution

This architecture allows you to build modular, reusable data extraction pipelines that can leverage multiple specialized servers simultaneously.
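
To make the distinction concrete, here is a minimal sketch contrasting the two access patterns. It assumes an already-initialized ClientSession (shown in the setup section below); the tool name matches the browser server used throughout this article, while the resource URI is purely illustrative and depends on what the connected server exposes.

from pydantic import AnyUrl

async def tools_vs_resources(session):
    # Tools are executed with arguments and may have side effects (navigation, clicks, ...)
    await session.call_tool("browser_navigate", arguments={"url": "https://example.com"})

    # Resources are read by URI without executing anything (this URI is hypothetical)
    result = await session.read_resource(AnyUrl("resource://example/cached-page"))
    return result.contents[0].text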

Setting Up MCP API Access

Installing the MCP SDK

First, install the MCP SDK in your preferred language:

Python:

pip install mcp

JavaScript/Node.js:

npm install @modelcontextprotocol/sdk

Connecting to an MCP Server

To use the MCP API for data extraction, you need to connect to an MCP server that provides scraping capabilities.

Python Example:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Configure server parameters
server_params = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-playwright"]
)

async def connect_to_mcp():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print(f"Available tools: {[tool.name for tool in tools.tools]}")

            # Note: the session is only valid inside these context managers,
            # so perform your extraction work here rather than returning the
            # session after the block exits (see the connection-reuse pattern
            # under "Optimizing MCP API Performance" below).

JavaScript Example:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function connectToMCP() {
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', '@playwright/mcp@latest']
  });

  const client = new Client({
    name: 'data-extraction-client',
    version: '1.0.0'
  }, {
    capabilities: {}
  });

  await client.connect(transport);

  // List available tools
  const tools = await client.listTools();
  console.log('Available tools:', tools.tools.map(t => t.name));

  return client;
}

Extracting Data with MCP Tools

Using Browser Snapshot for Data Extraction

The browser_snapshot tool captures the accessibility tree of a web page, providing a structured view of content that's ideal for extraction.

Python Example:

async def extract_page_data(session, url):
    # Navigate to the target URL
    await session.call_tool("browser_navigate", arguments={"url": url})

    # Wait for page to load
    await session.call_tool("browser_wait_for", arguments={"time": 2})

    # Capture page snapshot
    snapshot = await session.call_tool("browser_snapshot", arguments={})

    # Parse snapshot data
    page_content = snapshot.content[0].text
    print(f"Page structure:\n{page_content}")

    return page_content

JavaScript Example:

async function extractPageData(client, url) {
  // Navigate to the target URL
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Wait for page to load
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { time: 2 }
  });

  // Capture page snapshot
  const snapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  const pageContent = snapshot.content[0].text;
  console.log('Page structure:', pageContent);

  return pageContent;
}

Extracting Specific Elements

To extract specific data elements, combine navigation with targeted interaction tools:

async def extract_product_details(session, product_url):
    # Navigate to product page
    await session.call_tool("browser_navigate", arguments={"url": product_url})

    # Take a snapshot to find element refs (each element in the snapshot is
    # annotated with a ref such as "e12")
    snapshot = await session.call_tool("browser_snapshot", arguments={})

    # Extract the price element; replace the placeholder ref with the one
    # shown in the snapshot for the element you want
    price_element = await session.call_tool("browser_evaluate", arguments={
        "element": "price display",
        "ref": "e12",  # placeholder: use the ref from the snapshot
        "function": "(element) => element.textContent"
    })

    # Extract the product title
    title_element = await session.call_tool("browser_evaluate", arguments={
        "element": "product title",
        "ref": "e13",  # placeholder: use the ref from the snapshot
        "function": "(element) => element.textContent"
    })

    return {
        "price": price_element.content[0].text,
        "title": title_element.content[0].text
    }

Handling Dynamic Content

For pages with AJAX-loaded content, you can use MCP's wait tools, much as you would handle AJAX requests using Puppeteer:

async function extractDynamicContent(client, url) {
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Wait for specific text to appear
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { text: 'Loading complete' }
  });

  // Or wait for text to disappear
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { textGone: 'Loading...' }
  });

  const snapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  return snapshot.content[0].text;
}
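
For completeness, the same flow with the Python client looks like this (same tools and arguments as the JavaScript example above):

async def extract_dynamic_content(session, url):
    await session.call_tool("browser_navigate", arguments={"url": url})

    # Wait for specific text to appear...
    await session.call_tool("browser_wait_for", arguments={"text": "Loading complete"})

    # ...or for a loading indicator to disappear
    await session.call_tool("browser_wait_for", arguments={"textGone": "Loading..."})

    snapshot = await session.call_tool("browser_snapshot", arguments={})
    return snapshot.content[0].text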

Reading MCP Resources for Data Access

Resources provide a way to access data without executing tools. They're ideal for cached or static data sources.

Python Example:

async def read_resource_data(session):
    # List available resources
    resources = await session.list_resources()

    for resource in resources.resources:
        print(f"Resource: {resource.name} - {resource.uri}")

        # Read resource content
        content = await session.read_resource(resource.uri)
        print(f"Content: {content.contents[0].text}")

JavaScript Example:

async function readResourceData(client) {
  // List available resources
  const resources = await client.listResources();

  for (const resource of resources.resources) {
    console.log(`Resource: ${resource.name} - ${resource.uri}`);

    // Read resource content
    const content = await client.readResource({ uri: resource.uri });
    console.log('Content:', content.contents[0].text);
  }
}

Advanced Data Extraction Patterns

Pagination and Multi-Page Extraction

To extract data across multiple pages, much as you would navigate between pages using Puppeteer:

async def extract_paginated_data(session, base_url, max_pages=10):
    all_data = []

    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        await session.call_tool("browser_navigate", arguments={"url": url})

        # Wait for content to load
        await session.call_tool("browser_wait_for", arguments={"time": 1})

        # Extract data from current page
        snapshot = await session.call_tool("browser_snapshot", arguments={})
        page_data = parse_snapshot(snapshot.content[0].text)
        all_data.extend(page_data)

        # Check whether a next page exists (see the check_next_page sketch below)
        has_next = await check_next_page(session)
        if not has_next:
            break

    return all_data

def parse_snapshot(snapshot_text):
    # Custom parsing logic based on your needs
    # Extract structured data from snapshot text
    return []
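
The check_next_page helper referenced above is not an MCP tool; it's something you implement for the target site. A minimal sketch, assuming the site exposes a pagination link labeled "Next" that appears in the snapshot text:

async def check_next_page(session):
    # Take a fresh snapshot and look for a "Next" link in the accessibility tree.
    # The exact string to match depends on the site's markup and on how the
    # snapshot renders links, so treat this as a starting point.
    snapshot = await session.call_tool("browser_snapshot", arguments={})
    return 'link "Next"' in snapshot.content[0].text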

Form Filling and Data Submission

When you need to interact with forms to access data:

async function submitFormAndExtract(client, url) {
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Fill form fields. The ref values below are illustrative placeholders;
  // in practice, pass the exact element refs returned by a prior
  // browser_snapshot call (the same applies to browser_click below).
  await client.callTool({
    name: 'browser_fill_form',
    arguments: {
      fields: [
        {
          name: 'search query',
          type: 'textbox',
          ref: 'input[name="q"]',
          value: 'data extraction'
        },
        {
          name: 'category filter',
          type: 'combobox',
          ref: 'select[name="category"]',
          value: 'Technology'
        }
      ]
    }
  });

  // Submit form by clicking button
  await client.callTool({
    name: 'browser_click',
    arguments: {
      element: 'submit button',
      ref: 'button[type="submit"]'
    }
  });

  // Wait for results
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { time: 3 }
  });

  // Extract results
  const results = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  return results.content[0].text;
}

Error Handling and Retry Logic

Robust data extraction requires proper error handling, much like handling errors in Puppeteer:

import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def extract_with_retry(session, url):
    try:
        await session.call_tool("browser_navigate", arguments={"url": url})

        # Set a timeout
        snapshot = await asyncio.wait_for(
            session.call_tool("browser_snapshot", arguments={}),
            timeout=30.0
        )

        return snapshot.content[0].text

    except asyncio.TimeoutError:
        print(f"Timeout extracting {url}, retrying...")
        raise

    except Exception as e:
        print(f"Error extracting {url}: {e}")
        raise

async def safe_extract(session, url):
    try:
        return await extract_with_retry(session, url)
    except Exception as e:
        print(f"Failed after retries: {e}")
        return None

Optimizing MCP API Performance

Connection Pooling

Reuse MCP connections for multiple extractions:

class MCPExtractor:
    def __init__(self, server_params):
        self.server_params = server_params
        self.session = None

    async def __aenter__(self):
        self.client = stdio_client(self.server_params)
        self.read, self.write = await self.client.__aenter__()
        self.session = ClientSession(self.read, self.write)
        await self.session.__aenter__()
        await self.session.initialize()
        return self

    async def __aexit__(self, *args):
        await self.session.__aexit__(*args)
        await self.client.__aexit__(*args)

    async def extract(self, url):
        await self.session.call_tool("browser_navigate", arguments={"url": url})
        snapshot = await self.session.call_tool("browser_snapshot", arguments={})
        return snapshot.content[0].text

# Usage
async def batch_extract(urls):
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-playwright"]
    )

    async with MCPExtractor(server_params) as extractor:
        results = []
        for url in urls:
            data = await extractor.extract(url)
            results.append(data)
        return results

Rate Limiting

Implement rate limiting to avoid overwhelming target servers:

import { setTimeout } from 'timers/promises';

async function extractWithRateLimit(client, urls, delayMs = 1000) {
  const results = [];

  for (const url of urls) {
    await client.callTool({
      name: 'browser_navigate',
      arguments: { url }
    });

    const snapshot = await client.callTool({
      name: 'browser_snapshot',
      arguments: {}
    });

    results.push({
      url,
      data: snapshot.content[0].text
    });

    // Wait before next request
    await setTimeout(delayMs);
  }

  return results;
}
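
A Python version of the same pattern uses asyncio.sleep for the delay between requests:

import asyncio

async def extract_with_rate_limit(session, urls, delay_seconds=1.0):
    results = []

    for url in urls:
        await session.call_tool("browser_navigate", arguments={"url": url})
        snapshot = await session.call_tool("browser_snapshot", arguments={})

        results.append({"url": url, "data": snapshot.content[0].text})

        # Wait before the next request
        await asyncio.sleep(delay_seconds)

    return results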

Best Practices

  1. Always initialize sessions properly: Ensure MCP sessions are properly initialized before making tool calls
  2. Handle timeouts gracefully: Set appropriate timeouts for network-dependent operations
  3. Validate tool availability: Check that required tools are available before attempting to use them (see the sketch after this list)
  4. Parse snapshots efficiently: Use structured parsing for accessibility snapshots rather than regex on raw HTML
  5. Respect website terms of service: Implement rate limiting and user-agent identification
  6. Use resources for static data: Leverage MCP resources for cached or configuration data instead of repeated tool calls
  7. Close connections properly: Always clean up sessions and transport connections to avoid resource leaks
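
As an example of point 3, here is a minimal sketch that verifies the tools you depend on are actually exposed by the server before extraction starts (assuming an initialized session):

async def require_tools(session, required):
    # Compare the tools the server exposes against the ones this pipeline needs
    available = {tool.name for tool in (await session.list_tools()).tools}
    missing = set(required) - available
    if missing:
        raise RuntimeError(f"MCP server is missing required tools: {missing}")

# Usage:
# await require_tools(session, ["browser_navigate", "browser_snapshot"])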

Conclusion

The MCP API provides a robust, standardized interface for web data extraction that combines browser automation capabilities with AI-powered tooling. By leveraging MCP servers, you can build scalable extraction pipelines that are maintainable, modular, and efficient. Whether you're extracting simple text content or complex interactive data, the MCP protocol offers the flexibility and power needed for modern web scraping workflows.

For production workloads requiring robust proxy support, CAPTCHA solving, and enterprise-grade reliability, consider using specialized scraping APIs like WebScraping.AI that handle these complexities for you while providing similar programmatic access to web data.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
