How do I use MCP resources for structured data extraction?
MCP (Model Context Protocol) resources provide a powerful abstraction for accessing structured data from a variety of sources within AI-powered applications. In MCP, resources represent data sources that can be read and processed, which makes them well suited to web scraping and data extraction workflows. This guide shows how to use MCP resources effectively for structured data extraction.
Understanding MCP Resources
MCP resources are URI-addressable data sources that servers expose to clients. Unlike tools (which perform actions), resources are read-only entities that provide access to data. When used for web scraping, resources can represent:
- HTML content from web pages
- JSON data from APIs
- Structured datasets
- Database query results
- File system contents
Resources are particularly useful for structured data extraction because they provide a standardized way to access and process data across different sources and formats.
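To make this concrete, a server can expose a dataset as a resource in just a few lines. The following is a minimal sketch using the FastMCP helper from the official mcp Python SDK; the dataset://products URI and the sample record are purely illustrative:
import json

from mcp.server.fastmcp import FastMCP

# A hypothetical server exposing one read-only resource
mcp = FastMCP("demo-data-server")

@mcp.resource("dataset://products")
def products_resource() -> str:
    """Return sample product records as JSON text (illustrative data only)."""
    return json.dumps([
        {"name": "Widget", "price": "9.99", "availability": "in stock"},
    ])

if __name__ == "__main__":
    # Serve the resource over stdio so MCP clients can list and read it
    mcp.run()
Clients can then discover this resource with list_resources and fetch its contents with read_resource, exactly as shown in the connection examples below.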
Setting Up MCP for Data Extraction
Installing Required Dependencies
Python Setup:
# Install the MCP SDK for Python
pip install mcp
# Install additional libraries for web scraping
pip install beautifulsoup4 lxml requests
JavaScript/TypeScript Setup:
# Install the MCP SDK for Node.js
npm install @modelcontextprotocol/sdk
# Install web scraping dependencies
npm install cheerio axios playwright
Connecting to MCP Resources
Python Implementation
Here's how to connect to an MCP server and access resources for data extraction:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def extract_data_from_resources():
# Configure server connection
server_params = StdioServerParameters(
command="mcp-server-playwright", # or your custom MCP server
args=[]
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
# Initialize the session
await session.initialize()
# List available resources
resources = await session.list_resources()
print("Available resources:")
for resource in resources.resources:
print(f" - {resource.uri}: {resource.name}")
# Read a specific resource
resource_uri = "page://https://example.com/data"
content = await session.read_resource(resource_uri)
return content
# Run the async function
data = asyncio.run(extract_data_from_resources())
print(f"Extracted data: {data}")
JavaScript Implementation
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
async function extractDataFromResources() {
// Create transport layer
const transport = new StdioClientTransport({
command: 'mcp-server-playwright',
args: []
});
// Create and connect client
const client = new Client({
name: 'data-extraction-client',
version: '1.0.0'
}, {
capabilities: {
resources: {}
}
});
await client.connect(transport);
// List available resources
const resources = await client.listResources();
console.log('Available resources:');
resources.resources.forEach(resource => {
console.log(` - ${resource.uri}: ${resource.name}`);
});
// Read specific resource
const resourceUri = 'page://https://example.com/data';
const content = await client.readResource({ uri: resourceUri });
return content;
}
// Execute the extraction
extractDataFromResources()
.then(data => console.log('Extracted data:', data))
.catch(error => console.error('Error:', error));
Extracting Structured Data from Resources
Using Playwright MCP Server
The Playwright MCP server provides resources for accessing web page content. Here's how to extract structured data:
Python Example:
import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from bs4 import BeautifulSoup
async def scrape_product_data(url):
server_params = StdioServerParameters(
command="npx",
args=["-y", "@modelcontextprotocol/server-playwright"]
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# Navigate to the page using a tool
await session.call_tool("browser_navigate", {"url": url})
# Wait for content to load
await session.call_tool("browser_wait_for", {"time": 2})
# Get page snapshot as a resource
snapshot = await session.call_tool("browser_snapshot", {})
# Parse the HTML content
soup = BeautifulSoup(snapshot.content[0].text, 'html.parser')
# Extract structured data
products = []
for item in soup.select('.product-item'):
product = {
'name': item.select_one('.product-name').text.strip(),
'price': item.select_one('.product-price').text.strip(),
'rating': item.select_one('.product-rating').get('data-rating'),
'availability': item.select_one('.availability').text.strip()
}
products.append(product)
return products
# Extract data
products = asyncio.run(scrape_product_data("https://example-shop.com/products"))
print(json.dumps(products, indent=2))
JavaScript Example:
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
import * as cheerio from 'cheerio';
async function scrapeProductData(url) {
const transport = new StdioClientTransport({
command: 'npx',
args: ['-y', '@modelcontextprotocol/server-playwright']
});
const client = new Client({
name: 'product-scraper',
version: '1.0.0'
}, {
capabilities: { tools: {}, resources: {} }
});
await client.connect(transport);
// Navigate to the page
await client.callTool({
name: 'browser_navigate',
arguments: { url }
});
// Wait for dynamic content
await client.callTool({
name: 'browser_wait_for',
arguments: { time: 2 }
});
// Get page snapshot
const snapshot = await client.callTool({
name: 'browser_snapshot',
arguments: {}
});
// Parse HTML content
const $ = cheerio.load(snapshot.content[0].text);
// Extract structured data
const products = [];
$('.product-item').each((i, elem) => {
const product = {
name: $(elem).find('.product-name').text().trim(),
price: $(elem).find('.product-price').text().trim(),
rating: $(elem).find('.product-rating').attr('data-rating'),
availability: $(elem).find('.availability').text().trim()
};
products.push(product);
});
await client.close();
return products;
}
// Execute scraping
scrapeProductData('https://example-shop.com/products')
.then(products => console.log(JSON.stringify(products, null, 2)))
.catch(error => console.error('Error:', error));
Advanced Data Extraction Patterns
Handling Dynamic Content
When working with JavaScript-heavy websites, you need to wait for dynamic content to load before extracting it, similar to how to handle AJAX requests using Puppeteer:
async def extract_dynamic_data(url, wait_text):
    # Reuses the server_params defined in the earlier setup examples
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Navigate to page
            await session.call_tool("browser_navigate", {"url": url})
            # Wait until the given text appears on the page
            # (browser_wait_for matches visible text, not CSS selectors)
            await session.call_tool("browser_wait_for", {
                "text": wait_text
            })
            # Extract data after content loads
            snapshot = await session.call_tool("browser_snapshot", {})
            # Process the content with your own parsing helper
            return parse_content(snapshot.content[0].text)
Paginated Data Extraction
Extract data across multiple pages:
async def extract_paginated_data(base_url, max_pages=10):
all_data = []
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
for page_num in range(1, max_pages + 1):
url = f"{base_url}?page={page_num}"
# Navigate to page
await session.call_tool("browser_navigate", {"url": url})
await session.call_tool("browser_wait_for", {"time": 1})
# Get snapshot
snapshot = await session.call_tool("browser_snapshot", {})
# Extract data from current page
page_data = extract_items(snapshot.content[0].text)
all_data.extend(page_data)
# Check if there's a next page
soup = BeautifulSoup(snapshot.content[0].text, 'html.parser')
if not soup.select_one('.next-page'):
break
return all_data
def extract_items(html):
soup = BeautifulSoup(html, 'html.parser')
items = []
for item in soup.select('.data-item'):
items.append({
'title': item.select_one('.title').text.strip(),
'description': item.select_one('.description').text.strip(),
'metadata': item.get('data-metadata')
})
return items
Form Interaction and Data Extraction
Similar to how to handle authentication in Puppeteer, you can fill out forms and extract the resulting data:
async function extractSearchResults(searchQuery) {
const transport = new StdioClientTransport({
command: 'npx',
args: ['-y', '@modelcontextprotocol/server-playwright']
});
const client = new Client({
name: 'search-scraper',
version: '1.0.0'
}, {
capabilities: { tools: {}, resources: {} }
});
await client.connect(transport);
// Navigate to search page
await client.callTool({
name: 'browser_navigate',
arguments: { url: 'https://example.com/search' }
});
// Take snapshot to get form element references
const initialSnapshot = await client.callTool({
name: 'browser_snapshot',
arguments: {}
});
// Fill search form
await client.callTool({
name: 'browser_fill_form',
arguments: {
fields: [
{
name: 'search input',
type: 'textbox',
ref: 'search-input-ref', // From snapshot
value: searchQuery
}
]
}
});
// Submit and wait
await client.callTool({
name: 'browser_press_key',
arguments: { key: 'Enter' }
});
await client.callTool({
name: 'browser_wait_for',
arguments: { time: 2 }
});
// Extract results
const resultsSnapshot = await client.callTool({
name: 'browser_snapshot',
arguments: {}
});
const $ = cheerio.load(resultsSnapshot.content[0].text);
const results = [];
$('.search-result').each((i, elem) => {
results.push({
title: $(elem).find('.result-title').text().trim(),
url: $(elem).find('.result-link').attr('href'),
snippet: $(elem).find('.result-snippet').text().trim()
});
});
await client.close();
return results;
}
Error Handling and Best Practices
Robust Error Handling
import logging
from typing import Optional, List, Dict
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
async def safe_extract_data(url: str, max_retries: int = 3) -> Optional[List[Dict]]:
"""Extract data with error handling and retries"""
for attempt in range(max_retries):
try:
server_params = StdioServerParameters(
command="npx",
args=["-y", "@modelcontextprotocol/server-playwright"]
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# Navigate with timeout
await asyncio.wait_for(
session.call_tool("browser_navigate", {"url": url}),
timeout=30.0
)
# Wait for content
await session.call_tool("browser_wait_for", {"time": 2})
# Get snapshot
snapshot = await session.call_tool("browser_snapshot", {})
# Extract and validate data
data = extract_and_validate(snapshot.content[0].text)
if data:
logger.info(f"Successfully extracted {len(data)} items")
return data
else:
logger.warning(f"No data found on attempt {attempt + 1}")
except asyncio.TimeoutError:
logger.error(f"Timeout on attempt {attempt + 1}")
except Exception as e:
logger.error(f"Error on attempt {attempt + 1}: {str(e)}")
if attempt < max_retries - 1:
await asyncio.sleep(2 ** attempt) # Exponential backoff
logger.error(f"Failed to extract data after {max_retries} attempts")
return None
def extract_and_validate(html: str) -> List[Dict]:
"""Extract data and validate structure"""
soup = BeautifulSoup(html, 'html.parser')
items = []
for elem in soup.select('.data-item'):
try:
item = {
'id': elem.get('data-id'),
'title': elem.select_one('.title').text.strip(),
'value': elem.select_one('.value').text.strip()
}
# Validate required fields
if all(item.values()):
items.append(item)
else:
logger.warning(f"Skipping incomplete item: {item}")
except AttributeError as e:
logger.warning(f"Error parsing item: {e}")
continue
return items
Best Practices for MCP Resource Extraction
- Always validate resource URIs before attempting to read them
- Implement proper error handling with retries and exponential backoff
- Use appropriate wait strategies when dealing with dynamic content, similar to how to handle timeouts in Puppeteer
- Clean and validate extracted data before processing
- Close connections properly to avoid resource leaks
- Implement rate limiting to respect server resources (see the combined rate-limiting and caching sketch after this list)
- Cache resources when appropriate to minimize redundant requests
- Log extraction activities for debugging and monitoring
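As a concrete illustration of the rate-limiting and caching recommendations, here is a minimal sketch that wraps the navigate-and-snapshot pattern used throughout this guide. The PoliteFetcher class and its parameters are illustrative, not part of any SDK; it assumes the same session object and Playwright tool names as the earlier examples:
import asyncio
import time

class PoliteFetcher:
    """Fetch page snapshots with a minimum delay between requests and a
    simple in-memory cache (illustrative sketch, not a library API)."""

    def __init__(self, session, min_interval: float = 1.0):
        self.session = session            # an initialized MCP ClientSession
        self.min_interval = min_interval  # minimum seconds between requests
        self._last_request = 0.0
        self._cache: dict[str, str] = {}
        self._lock = asyncio.Lock()

    async def get_snapshot(self, url: str) -> str:
        # Serve repeated requests for the same URL from the cache
        if url in self._cache:
            return self._cache[url]
        async with self._lock:
            # Basic rate limiting: sleep until min_interval has elapsed
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                await asyncio.sleep(wait)
            await self.session.call_tool("browser_navigate", {"url": url})
            snapshot = await self.session.call_tool("browser_snapshot", {})
            self._last_request = time.monotonic()
        text = snapshot.content[0].text
        self._cache[url] = text
        return text
For long-running jobs you would also want to bound the cache size and persist it between runs, but the pattern stays the same.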
Data Transformation and Output
Converting to Structured Formats
import json
import csv
from datetime import datetime
async def extract_and_export(url: str, output_format: str = 'json'):
    """Extract data and export it as JSON, CSV, or JSON Lines"""
    data = await safe_extract_data(url)
    if not data:
        raise ValueError("No data extracted")
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    if output_format == 'json':
        filename = f'extracted_data_{timestamp}.json'
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)
    elif output_format == 'csv':
        filename = f'extracted_data_{timestamp}.csv'
        with open(filename, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
    elif output_format == 'jsonl':
        filename = f'extracted_data_{timestamp}.jsonl'
        with open(filename, 'w') as f:
            for item in data:
                f.write(json.dumps(item) + '\n')
    else:
        raise ValueError(f"Unsupported output format: {output_format}")
    print(f"Data exported to {filename}")
    return filename
Conclusion
MCP resources provide a powerful and standardized approach to structured data extraction. By leveraging MCP servers like Playwright, you can build robust web scraping workflows that handle complex scenarios including dynamic content, authentication, and pagination. The key to success is implementing proper error handling, using appropriate wait strategies, and validating extracted data.
For production deployments, consider combining MCP resources with specialized web scraping APIs that handle proxy rotation, CAPTCHA solving, and rate limiting automatically. This hybrid approach gives you the flexibility of MCP's standardized interface while benefiting from managed infrastructure for reliable, large-scale data extraction.