How do I use the MCP API for data extraction?
The Model Context Protocol (MCP) API provides a powerful interface for extracting data from web pages through standardized tools and resources. Whether you're building automated scraping workflows, integrating AI-powered data extraction, or creating custom web automation tools, the MCP API offers a flexible and efficient approach to accessing web content.
Understanding MCP API Architecture
The MCP API operates on a client-server architecture where:
- MCP Server: Exposes tools and resources for web interaction (browser automation, scraping, etc.)
- MCP Client: Connects to servers and invokes their tools
- Tools: Executable functions like `browser_snapshot`, `browser_click`, or custom scraping operations
- Resources: Accessible data endpoints that can be read without execution
This architecture allows you to build modular, reusable data extraction pipelines that can leverage multiple specialized servers simultaneously.
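Under the hood, these interactions are JSON-RPC 2.0 messages. The sketch below shows roughly what a tool call and a resource read look like on the wire; field names follow the MCP specification, while the URL and resource URI are placeholders.

```python
# Rough shape of the JSON-RPC 2.0 messages an MCP client sends.
# The method names come from the MCP specification; ids are arbitrary.

tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "browser_navigate",                   # a tool exposed by the server
        "arguments": {"url": "https://example.com"},   # placeholder URL
    },
}

resource_read_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "resources/read",
    "params": {"uri": "file:///cached/data.json"},     # hypothetical resource URI
}
```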
Setting Up MCP API Access
Installing the MCP SDK
First, install the MCP SDK in your preferred language:
Python:

```bash
pip install mcp
```

JavaScript/Node.js:

```bash
npm install @modelcontextprotocol/sdk
```
Connecting to an MCP Server
To use the MCP API for data extraction, you need to connect to an MCP server that provides scraping capabilities.
Python Example:
```python
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Configure server parameters
server_params = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-playwright"]
)

async def connect_to_mcp():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print(f"Available tools: {[tool.name for tool in tools.tools]}")

            # Run extraction tool calls here: the session closes as soon as
            # these context managers exit, so avoid returning it for later use
```
JavaScript Example:
```javascript
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function connectToMCP() {
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-playwright']
  });

  const client = new Client({
    name: 'data-extraction-client',
    version: '1.0.0'
  }, {
    capabilities: {}
  });

  await client.connect(transport);

  // List available tools
  const tools = await client.listTools();
  console.log('Available tools:', tools.tools.map(t => t.name));

  return client;
}
```
Extracting Data with MCP Tools
Using Browser Snapshot for Data Extraction
The `browser_snapshot` tool captures the accessibility tree of a web page, providing a structured view of content that's ideal for extraction.
Python Example:
```python
async def extract_page_data(session, url):
    # Navigate to the target URL
    await session.call_tool("browser_navigate", arguments={"url": url})

    # Wait for the page to load
    await session.call_tool("browser_wait_for", arguments={"time": 2})

    # Capture page snapshot
    snapshot = await session.call_tool("browser_snapshot", arguments={})

    # Parse snapshot data
    page_content = snapshot.content[0].text
    print(f"Page structure:\n{page_content}")

    return page_content
```
JavaScript Example:
```javascript
async function extractPageData(client, url) {
  // Navigate to the target URL
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Wait for the page to load
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { time: 2 }
  });

  // Capture page snapshot
  const snapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  const pageContent = snapshot.content[0].text;
  console.log('Page structure:', pageContent);
  return pageContent;
}
```
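The snippets above define coroutines but don't show them being run. Here is a minimal driver sketch, assuming the Python `extract_page_data` function above lives in the same module and using the same Playwright server command as the connection example:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-playwright"]
)

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # extract_page_data is the function defined earlier in this section
            content = await extract_page_data(session, "https://example.com")
            print(content[:500])  # preview the first part of the snapshot

if __name__ == "__main__":
    asyncio.run(main())
```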
Extracting Specific Elements
To extract specific data elements, combine navigation with targeted interaction tools:
```python
async def extract_product_details(session, product_url):
    # Navigate to the product page
    await session.call_tool("browser_navigate", arguments={"url": product_url})

    # Take a snapshot to find element refs
    snapshot = await session.call_tool("browser_snapshot", arguments={})

    # Extract the price text (in practice, pass the element's ref from the
    # snapshot alongside the human-readable description)
    price_element = await session.call_tool("browser_evaluate", arguments={
        "element": "price display",
        "function": "(element) => element.textContent"
    })

    # Extract the product title the same way
    title_element = await session.call_tool("browser_evaluate", arguments={
        "element": "product title",
        "function": "(element) => element.textContent"
    })

    return {
        "price": price_element.content[0].text,
        "title": title_element.content[0].text
    }
```
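The example above assumes you already have each element's ref from the snapshot. With the Playwright MCP server, snapshot lines typically carry markers like `[ref=e12]`; assuming that format, a small hypothetical helper such as `find_ref` can pull the ref for an element whose line mentions a given label:

```python
import re

def find_ref(snapshot_text, label):
    """Return the ref token from the first snapshot line mentioning label.

    Assumes the Playwright MCP snapshot format, where elements appear as
    lines like:  - link "Add to cart" [ref=e42]
    """
    for line in snapshot_text.splitlines():
        if label.lower() in line.lower():
            match = re.search(r"\[ref=([^\]]+)\]", line)
            if match:
                return match.group(1)
    return None

# Hypothetical usage with the snapshot captured above:
# price_ref = find_ref(snapshot.content[0].text, "price")
```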
Handling Dynamic Content
For pages with AJAX-loaded content, similar to handling AJAX requests using Puppeteer, you can use MCP's wait functions:
```javascript
async function extractDynamicContent(client, url) {
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Wait for specific text to appear
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { text: 'Loading complete' }
  });

  // Or wait for text to disappear
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { textGone: 'Loading...' }
  });

  const snapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  return snapshot.content[0].text;
}
```
Reading MCP Resources for Data Access
Resources provide a way to access data without executing tools. They're ideal for cached or static data sources.
Python Example:
```python
async def read_resource_data(session):
    # List available resources
    resources = await session.list_resources()

    for resource in resources.resources:
        print(f"Resource: {resource.name} - {resource.uri}")

        # Read resource content
        content = await session.read_resource(resource.uri)
        print(f"Content: {content.contents[0].text}")
```
JavaScript Example:
```javascript
async function readResourceData(client) {
  // List available resources
  const resources = await client.listResources();

  for (const resource of resources.resources) {
    console.log(`Resource: ${resource.name} - ${resource.uri}`);

    // Read resource content
    const content = await client.readResource({ uri: resource.uri });
    console.log('Content:', content.contents[0].text);
  }
}
```
Advanced Data Extraction Patterns
Pagination and Multi-Page Extraction
To extract data across multiple pages, similar to navigating to different pages using Puppeteer:
```python
async def extract_paginated_data(session, base_url, max_pages=10):
    all_data = []

    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        await session.call_tool("browser_navigate", arguments={"url": url})

        # Wait for content to load
        await session.call_tool("browser_wait_for", arguments={"time": 1})

        # Extract data from the current page
        snapshot = await session.call_tool("browser_snapshot", arguments={})
        page_data = parse_snapshot(snapshot.content[0].text)
        all_data.extend(page_data)

        # Stop if there is no next page
        has_next = await check_next_page(session)
        if not has_next:
            break

    return all_data

def parse_snapshot(snapshot_text):
    # Custom parsing logic based on your needs:
    # extract structured data from the snapshot text
    return []

async def check_next_page(session):
    # Look for a "next page" control in a fresh snapshot;
    # implement this check for the site you are scraping
    snapshot = await session.call_tool("browser_snapshot", arguments={})
    return "Next" in snapshot.content[0].text
```
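The `parse_snapshot` stub is intentionally left open because parsing depends on the target site. As one possible sketch, assuming the YAML-like snapshot text produced by the Playwright MCP server, you could collect the quoted labels of link elements like this:

```python
import re

def parse_snapshot(snapshot_text):
    """Example parser: collect the quoted labels of link elements.

    Assumes snapshot lines such as:  - link "Product name" [ref=e17]
    Adapt the pattern to whichever elements you actually need.
    """
    items = []
    for line in snapshot_text.splitlines():
        match = re.search(r'link "([^"]+)"', line)
        if match:
            items.append({"title": match.group(1)})
    return items
```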
Form Filling and Data Submission
When you need to interact with forms to access data:
```javascript
async function submitFormAndExtract(client, url) {
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Fill form fields
  await client.callTool({
    name: 'browser_fill_form',
    arguments: {
      fields: [
        {
          name: 'search query',
          type: 'textbox',
          ref: 'input[name="q"]',
          value: 'data extraction'
        },
        {
          name: 'category filter',
          type: 'combobox',
          ref: 'select[name="category"]',
          value: 'Technology'
        }
      ]
    }
  });

  // Submit the form by clicking the button
  await client.callTool({
    name: 'browser_click',
    arguments: {
      element: 'submit button',
      ref: 'button[type="submit"]'
    }
  });

  // Wait for results
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { time: 3 }
  });

  // Extract results
  const results = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  return results.content[0].text;
}
```
Error Handling and Retry Logic
Robust data extraction requires proper error handling, much like handling errors in Puppeteer:
```python
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def extract_with_retry(session, url):
    try:
        await session.call_tool("browser_navigate", arguments={"url": url})

        # Set a timeout
        snapshot = await asyncio.wait_for(
            session.call_tool("browser_snapshot", arguments={}),
            timeout=30.0
        )

        return snapshot.content[0].text
    except asyncio.TimeoutError:
        print(f"Timeout extracting {url}, retrying...")
        raise
    except Exception as e:
        print(f"Error extracting {url}: {e}")
        raise

async def safe_extract(session, url):
    try:
        return await extract_with_retry(session, url)
    except Exception as e:
        print(f"Failed after retries: {e}")
        return None
```
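Retries handle transport failures, but a tool call can also complete while reporting a tool-level failure, which the MCP result object exposes via its `isError` flag. A small wrapper like the following (a sketch; the helper name is made up) surfaces those failures as exceptions so the retry logic above can catch them:

```python
async def call_tool_checked(session, name, arguments):
    """Call a tool and raise if the server reports a tool-level error.

    result.isError is True when the tool ran but failed, for example
    when navigation to an unreachable URL is rejected by the server.
    """
    result = await session.call_tool(name, arguments=arguments)
    if result.isError:
        error_text = result.content[0].text if result.content else "unknown error"
        raise RuntimeError(f"Tool {name} failed: {error_text}")
    return result

# Hypothetical usage inside extract_with_retry:
# snapshot = await call_tool_checked(session, "browser_snapshot", {})
```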
Optimizing MCP API Performance
Connection Pooling
Reuse MCP connections for multiple extractions:
```python
class MCPExtractor:
    def __init__(self, server_params):
        self.server_params = server_params
        self.session = None

    async def __aenter__(self):
        self.client = stdio_client(self.server_params)
        self.read, self.write = await self.client.__aenter__()
        self.session = ClientSession(self.read, self.write)
        await self.session.__aenter__()
        await self.session.initialize()
        return self

    async def __aexit__(self, *args):
        await self.session.__aexit__(*args)
        await self.client.__aexit__(*args)

    async def extract(self, url):
        await self.session.call_tool("browser_navigate", arguments={"url": url})
        snapshot = await self.session.call_tool("browser_snapshot", arguments={})
        return snapshot.content[0].text

# Usage
async def batch_extract(urls):
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-playwright"]
    )

    async with MCPExtractor(server_params) as extractor:
        results = []
        for url in urls:
            data = await extractor.extract(url)
            results.append(data)
        return results
```
Rate Limiting
Implement rate limiting to avoid overwhelming target servers:
```javascript
import { setTimeout } from 'timers/promises';

async function extractWithRateLimit(client, urls, delayMs = 1000) {
  const results = [];

  for (const url of urls) {
    await client.callTool({
      name: 'browser_navigate',
      arguments: { url }
    });

    const snapshot = await client.callTool({
      name: 'browser_snapshot',
      arguments: {}
    });

    results.push({
      url,
      data: snapshot.content[0].text
    });

    // Wait before the next request
    await setTimeout(delayMs);
  }

  return results;
}
```
Best Practices
- Always initialize sessions properly: Ensure MCP sessions are properly initialized before making tool calls
- Handle timeouts gracefully: Set appropriate timeouts for network-dependent operations
- Validate tool availability: Check that required tools are available before attempting to use them (see the sketch after this list)
- Parse snapshots efficiently: Use structured parsing for accessibility snapshots rather than regex on raw HTML
- Respect website terms of service: Implement rate limiting and user-agent identification
- Use resources for static data: Leverage MCP resources for cached or configuration data instead of repeated tool calls
- Close connections properly: Always clean up sessions and transport connections to avoid resource leaks
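For the tool-availability check mentioned above, a minimal guard can compare the server's tool list against the names your pipeline depends on (the required tool names here are just examples):

```python
async def ensure_tools_available(session, required):
    """Verify that the connected MCP server exposes every tool we depend on."""
    tools = await session.list_tools()
    available = {tool.name for tool in tools.tools}
    missing = set(required) - available
    if missing:
        raise RuntimeError(f"MCP server is missing required tools: {sorted(missing)}")

# Example: fail fast before starting an extraction run
# await ensure_tools_available(session, ["browser_navigate", "browser_snapshot"])
```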
Conclusion
The MCP API provides a robust, standardized interface for web data extraction that combines browser automation capabilities with AI-powered tooling. By leveraging MCP servers, you can build scalable extraction pipelines that are maintainable, modular, and efficient. Whether you're extracting simple text content or complex interactive data, the MCP protocol offers the flexibility and power needed for modern web scraping workflows.
For production workloads requiring robust proxy support, CAPTCHA solving, and enterprise-grade reliability, consider using specialized scraping APIs like WebScraping.AI that handle these complexities for you while providing similar programmatic access to web data.