How Do I Integrate Multiple MCP Servers in One Scraping Workflow?
Integrating multiple MCP (Model Context Protocol) servers in a single scraping workflow allows you to leverage specialized capabilities from different servers simultaneously. This approach enables complex data extraction scenarios where you might need browser automation, API integration, data processing, and storage operations working together seamlessly.
Understanding Multi-Server MCP Architecture
The Model Context Protocol enables clients to connect to multiple MCP servers concurrently. Each server can provide different tools, resources, and capabilities that complement each other in a scraping workflow. For example, you might use one MCP server for browser automation, another for data transformation, and a third for database operations.
Benefits of Multi-Server Integration
- Specialized functionality: Each server handles what it does best
- Modularity: Easier to maintain and update individual components
- Scalability: Distribute workload across multiple servers
- Flexibility: Mix and match capabilities based on project needs
- Resilience: If one server fails, others can continue operating
Setting Up Multiple MCP Servers
Configuration in Claude Desktop
To configure multiple MCP servers, edit your Claude Desktop configuration file (claude_desktop_config.json):
{
  "mcpServers": {
    "playwright-server": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    },
    "webscraping-server": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-webscraping"]
    },
    "database-server": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres"],
      "env": {
        "POSTGRES_CONNECTION_STRING": "postgresql://user:pass@localhost/scraping_db"
      }
    },
    "filesystem-server": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem"],
      "env": {
        "ALLOWED_DIRECTORIES": "/path/to/scraping/output"
      }
    }
  }
}
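After saving the file, restart Claude Desktop so the new entries are picked up; every server listed under mcpServers is launched and exposed to the model side by side.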
Python-Based MCP Client Configuration
When building a custom MCP client in Python, you can connect to multiple servers programmatically:
import asyncio
from contextlib import AsyncExitStack

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def create_multi_server_client():
    """Initialize connections to multiple MCP servers.

    Returns the sessions plus an AsyncExitStack that owns the underlying
    transports; call `await stack.aclose()` when you are finished.
    """
    # Server configurations
    servers = {
        'playwright': StdioServerParameters(
            command='npx',
            args=['-y', '@executeautomation/playwright-mcp-server']
        ),
        'webscraping': StdioServerParameters(
            command='npx',
            args=['-y', '@modelcontextprotocol/server-webscraping']
        ),
        'database': StdioServerParameters(
            command='npx',
            args=['-y', '@modelcontextprotocol/server-postgres'],
            env={'POSTGRES_CONNECTION_STRING': 'postgresql://user:pass@localhost/db'}
        )
    }

    # stdio_client() and ClientSession are async context managers, so keep
    # them open on an AsyncExitStack instead of awaiting them directly
    stack = AsyncExitStack()
    sessions = {}
    for server_name, params in servers.items():
        read, write = await stack.enter_async_context(stdio_client(params))
        session = await stack.enter_async_context(ClientSession(read, write))
        await session.initialize()
        sessions[server_name] = session
        print(f"Connected to {server_name} MCP server")

    return sessions, stack

# Usage
async def main():
    sessions, stack = await create_multi_server_client()
    try:
        # List available tools from all servers
        for server_name, session in sessions.items():
            tools = await session.list_tools()
            print(f"\n{server_name} tools: {[tool.name for tool in tools.tools]}")
    finally:
        await stack.aclose()

asyncio.run(main())
Building a Multi-Server Scraping Workflow
Example: E-commerce Product Scraper
Here's a practical example that combines multiple MCP servers to scrape product data, process it, and store it in a database:
import asyncio
import json

async def scrape_ecommerce_products(sessions, url):
    """
    Multi-server workflow for scraping e-commerce products

    Flow:
    1. Playwright server - Navigate and extract data
    2. WebScraping server - Parse additional content
    3. Database server - Store results
    4. Filesystem server - Save images
    """
    playwright_session = sessions['playwright']
    webscraping_session = sessions['webscraping']
    db_session = sessions['database']
    fs_session = sessions['filesystem']

    # Step 1: Navigate using Playwright for JavaScript-heavy pages
    print("Navigating to product page...")
    await playwright_session.call_tool(
        'browser_navigate',
        arguments={'url': url}
    )

    # Wait for content to load
    await playwright_session.call_tool(
        'browser_wait_for',
        arguments={'time': 2}
    )

    # Take snapshot of the page
    snapshot_result = await playwright_session.call_tool(
        'browser_snapshot',
        arguments={}
    )

    # Step 2: Extract product links from the snapshot
    product_links = []
    # Parse snapshot_result.content to find product URLs
    # (Actual parsing logic would go here)

    # Step 3: For each product, use WebScraping server for detailed extraction
    products = []
    for product_url in product_links[:5]:  # Limit to 5 for example
        print(f"Scraping product: {product_url}")

        # Use WebScraping.AI server to extract structured data
        result = await webscraping_session.call_tool(
            'webscraping_ai_fields',
            arguments={
                'url': product_url,
                'fields': {
                    'title': 'Product title',
                    'price': 'Current price',
                    'description': 'Product description',
                    'image_url': 'Main product image URL',
                    'rating': 'Customer rating',
                    'availability': 'Stock status'
                }
            }
        )
        # Tool results arrive as content blocks; this assumes the server
        # returns the extracted fields as JSON text
        product_data = json.loads(result.content[0].text)
        products.append(product_data)

    # Step 4: Store in database
    print("Storing products in database...")
    for product in products:
        await db_session.call_tool(
            'execute_query',
            arguments={
                'query': '''
                    INSERT INTO products
                        (title, price, description, image_url, rating, availability, scraped_at)
                    VALUES ($1, $2, $3, $4, $5, $6, NOW())
                    ON CONFLICT (title) DO UPDATE
                    SET price = EXCLUDED.price,
                        availability = EXCLUDED.availability,
                        scraped_at = NOW()
                ''',
                'params': [
                    product['title'],
                    product['price'],
                    product['description'],
                    product['image_url'],
                    product['rating'],
                    product['availability']
                ]
            }
        )

    # Step 5: Download and save product images using filesystem server
    print("Saving product images...")
    for idx, product in enumerate(products):
        if product.get('image_url'):
            # Download image (you'd need an HTTP client here)
            # Then save using filesystem server
            await fs_session.call_tool(
                'write_file',
                arguments={
                    'path': f'/path/to/scraping/output/images/product_{idx}.jpg',
                    'content': '...'  # Image binary data
                }
            )

    return products
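The image-saving step above glosses over the actual download. Below is a minimal sketch of what it might look like: download_product_image is a hypothetical helper, httpx is just one choice of HTTP client, and passing base64-encoded text to the filesystem server's write_file tool is an assumption about what that server accepts.

import base64

import httpx

async def download_product_image(fs_session, image_url: str, dest_path: str) -> None:
    """Fetch an image and hand it to the filesystem MCP server.

    Sketch only: assumes write_file accepts base64-encoded text; if your
    filesystem server only handles plain text, write the bytes to disk
    directly (e.g. with pathlib) instead of going through the server.
    """
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get(image_url)
        response.raise_for_status()
        encoded = base64.b64encode(response.content).decode('ascii')

    await fs_session.call_tool(
        'write_file',
        arguments={'path': dest_path, 'content': encoded}
    )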
JavaScript/Node.js Implementation
For Node.js-based workflows, you can use the MCP SDK similarly:
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function createMultiServerWorkflow() {
  // Initialize multiple server connections
  const servers = {
    playwright: new Client(
      { name: 'playwright-client', version: '1.0.0' },
      { capabilities: {} }
    ),
    webscraping: new Client(
      { name: 'webscraping-client', version: '1.0.0' },
      { capabilities: {} }
    )
  };

  // Connect to Playwright server
  const playwrightTransport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', '@executeautomation/playwright-mcp-server']
  });
  await servers.playwright.connect(playwrightTransport);

  // Connect to WebScraping server
  const webscrapingTransport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-webscraping']
  });
  await servers.webscraping.connect(webscrapingTransport);

  return servers;
}

async function scrapeWithMultipleServers(url) {
  const servers = await createMultiServerWorkflow();

  try {
    // Use Playwright for dynamic content
    await servers.playwright.callTool({
      name: 'browser_navigate',
      arguments: { url }
    });

    // Get page snapshot (parse it here if you need links or page structure)
    const snapshot = await servers.playwright.callTool({
      name: 'browser_snapshot',
      arguments: {}
    });

    // Extract specific fields using WebScraping.AI
    const productData = await servers.webscraping.callTool({
      name: 'webscraping_ai_fields',
      arguments: {
        url: url,
        fields: {
          title: 'Product title',
          price: 'Product price',
          description: 'Product description'
        }
      }
    });

    console.log('Scraped data:', productData);
    return productData;
  } finally {
    // Clean up connections
    await servers.playwright.close();
    await servers.webscraping.close();
  }
}

// Execute the workflow
scrapeWithMultipleServers('https://example.com/products/item-123')
  .catch((error) => console.error('Workflow failed:', error));
Best Practices for Multi-Server Integration
1. Error Handling and Fallbacks
When working with multiple servers, implement robust error handling so that one failing server does not abort the whole workflow, much as you would when handling browser events in Puppeteer:
async def call_tool_with_fallback(sessions, primary_server, fallback_server, tool_name, arguments):
    """Call a tool with automatic fallback to another server"""
    try:
        result = await sessions[primary_server].call_tool(tool_name, arguments)
        return result
    except Exception as e:
        print(f"Primary server {primary_server} failed: {e}")
        print(f"Attempting fallback to {fallback_server}...")
        try:
            result = await sessions[fallback_server].call_tool(tool_name, arguments)
            return result
        except Exception as fallback_error:
            print(f"Fallback server also failed: {fallback_error}")
            raise
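This pattern only helps when both servers expose the same tool name and argument schema, which in practice usually means running two instances of the same server (for example, pointed at different proxies or API keys). A hypothetical usage, assuming a second 'webscraping_backup' entry in your configuration:

# Hypothetical usage: 'webscraping_backup' is an assumed second server
# entry, not one defined in the configuration above
product = await call_tool_with_fallback(
    sessions,
    primary_server='webscraping',
    fallback_server='webscraping_backup',
    tool_name='webscraping_ai_fields',
    arguments={
        'url': 'https://example.com/products/item-123',
        'fields': {'title': 'Product title', 'price': 'Current price'}
    }
)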
2. Resource Management
Properly manage connections to avoid resource leaks:
async def workflow_with_cleanup(urls):
    sessions, stack = None, None
    try:
        sessions, stack = await create_multi_server_client()
        for url in urls:
            await scrape_ecommerce_products(sessions, url)
    except Exception as e:
        print(f"Workflow error: {e}")
        raise
    finally:
        # Closing the exit stack shuts down every session and transport
        if stack is not None:
            await stack.aclose()
            print("Closed all MCP sessions")
3. Parallel Processing
Leverage multiple servers for concurrent operations when you need to process many pages in parallel:
async def parallel_multi_server_scraping(urls, sessions):
    """Process multiple URLs concurrently using multiple servers"""

    async def scrape_single_url(url):
        try:
            # Each URL gets its own workflow
            return await scrape_ecommerce_products(sessions, url)
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None

    # Process up to 5 URLs concurrently
    semaphore = asyncio.Semaphore(5)

    async def bounded_scrape(url):
        async with semaphore:
            return await scrape_single_url(url)

    results = await asyncio.gather(
        *[bounded_scrape(url) for url in urls],
        return_exceptions=True
    )

    # Drop failures (None) and any exceptions that escaped the per-URL handler
    return [r for r in results if r is not None and not isinstance(r, Exception)]
4. Server Health Monitoring
Implement health checks for your MCP servers:
async def check_server_health(sessions):
    """Verify all MCP servers are responding"""
    health_status = {}
    for server_name, session in sessions.items():
        try:
            # Try listing tools as a health check
            tools = await session.list_tools()
            health_status[server_name] = {
                'status': 'healthy',
                'tools_count': len(tools.tools)
            }
        except Exception as e:
            health_status[server_name] = {
                'status': 'unhealthy',
                'error': str(e)
            }
    return health_status
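One way to put this to work is to gate each batch of URLs on a health check and skip the batch (or trigger a reconnect) when something is down. A minimal sketch reusing the functions defined above; run_batches and the batch size are illustrative, not part of any SDK:

async def run_batches(urls, sessions):
    """Run scraping in batches, checking server health before each one."""
    batch_size = 10  # arbitrary batch size for illustration
    for start in range(0, len(urls), batch_size):
        health = await check_server_health(sessions)
        unhealthy = [name for name, info in health.items() if info['status'] != 'healthy']
        if unhealthy:
            print(f"Skipping batch, unhealthy servers: {unhealthy}")
            continue
        await parallel_multi_server_scraping(urls[start:start + batch_size], sessions)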
Common Patterns and Use Cases
Pattern 1: Browser + API Hybrid Scraping
Combine browser automation for JavaScript-rendered content with API calls for structured data:
async def hybrid_scraping_workflow(sessions, base_url):
    """Use browser for navigation, API for data extraction"""
    # Navigate with browser
    await sessions['playwright'].call_tool(
        'browser_navigate',
        arguments={'url': base_url}
    )

    # Get current URL (might have redirected)
    url_result = await sessions['playwright'].call_tool(
        'browser_evaluate',
        arguments={'function': '() => window.location.href'}
    )
    # Tool results come back as content blocks, not plain strings
    current_url = url_result.content[0].text

    # Extract data via API for efficiency
    data = await sessions['webscraping'].call_tool(
        'webscraping_ai_fields',
        arguments={
            'url': current_url,
            'fields': {
                'title': 'Page title',
                'content': 'Main content'
            }
        }
    )

    return data
Pattern 2: Data Pipeline with Validation
Create a multi-stage pipeline with validation at each step:
import json

async def validated_scraping_pipeline(sessions, urls):
    """Multi-server pipeline with validation"""
    results = []

    for url in urls:
        # Stage 1: Scrape
        raw_result = await sessions['webscraping'].call_tool(
            'webscraping_ai_question',
            arguments={
                'url': url,
                'question': 'Extract all product information as JSON'
            }
        )
        # Parse the tool's text response into a dict (assumes the answer is JSON)
        raw_data = json.loads(raw_result.content[0].text)

        # Stage 2: Validate
        if not validate_product_data(raw_data):
            print(f"Invalid data from {url}, using browser fallback")
            # Fallback to browser-based extraction (sketched below)
            raw_data = await browser_based_extraction(sessions['playwright'], url)

        # Stage 3: Transform
        cleaned_data = transform_product_data(raw_data)

        # Stage 4: Store
        await sessions['database'].call_tool(
            'execute_query',
            arguments={
                'query': 'INSERT INTO products (...) VALUES (...)',
                'params': cleaned_data
            }
        )

        results.append(cleaned_data)

    return results

def validate_product_data(data):
    """Validate scraped data meets requirements"""
    required_fields = ['title', 'price', 'description']
    return all(field in data for field in required_fields)
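The pipeline calls two helpers that are not defined in this article: browser_based_extraction and transform_product_data. What they look like depends entirely on your target site and schema; the following is a minimal sketch that assumes the fallback reuses the Playwright server's browser_evaluate tool (returning stringified JSON as its text content) and that the transform only normalizes the price field. The CSS selectors are placeholders.

import json
import re

async def browser_based_extraction(playwright_session, url):
    """Fallback: render the page in the browser and pull fields with JavaScript."""
    await playwright_session.call_tool('browser_navigate', arguments={'url': url})
    result = await playwright_session.call_tool(
        'browser_evaluate',
        arguments={
            'function': '''() => JSON.stringify({
                title: document.querySelector('h1')?.innerText ?? '',
                price: document.querySelector('.price')?.innerText ?? '',
                description: document.querySelector('.description')?.innerText ?? ''
            })'''
        }
    )
    return json.loads(result.content[0].text)

def transform_product_data(raw_data):
    """Normalize scraped fields before storage (example: strip currency symbols)."""
    cleaned = dict(raw_data)
    price_digits = re.sub(r'[^0-9.]', '', str(cleaned.get('price', '')))
    cleaned['price'] = float(price_digits) if price_digits else None
    return cleaned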
Troubleshooting Multi-Server Setups
Connection Issues
If servers fail to connect:
- Verify each server is installed: npx -y <server-package> --version
- Check server logs in Claude Desktop (Help → View Logs)
- Ensure environment variables are correctly set
- Test servers individually before combining them (see the sketch below)
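A quick way to do that last check is to connect to a single server on its own and list its tools, using the same Python SDK as above; smoke_test_server is just an illustrative name:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test_server(command: str, args: list[str]) -> None:
    """Connect to one MCP server, list its tools, and disconnect."""
    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(smoke_test_server('npx', ['-y', '@executeautomation/playwright-mcp-server']))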
Performance Optimization
- Connection pooling: Reuse sessions across multiple operations
- Caching: Cache responses from servers when appropriate
- Timeout configuration: Set appropriate timeouts for each server type
- Rate limiting: Implement rate limiting to avoid overwhelming the servers or the target sites (a combined timeout and rate-limit wrapper is sketched below)
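The last two points can be handled with a thin wrapper around call_tool. This is a minimal sketch, assuming the sessions dict from earlier; the class name, timeout value, and concurrency cap are arbitrary choices, not part of the MCP SDK:

import asyncio

class ThrottledMCPCaller:
    """Wrap call_tool with a timeout and a cap on concurrent calls."""

    def __init__(self, sessions, timeout: float = 30.0, max_concurrent: int = 3):
        self.sessions = sessions
        self.timeout = timeout
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def call(self, server_name: str, tool_name: str, arguments: dict):
        async with self.semaphore:  # simple concurrency/rate limit
            return await asyncio.wait_for(
                self.sessions[server_name].call_tool(tool_name, arguments),
                timeout=self.timeout,
            )

Routing every tool call through one wrapper like this also gives you a single place to add retries or response caching later.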
Conclusion
Integrating multiple MCP servers in a scraping workflow provides powerful capabilities for complex data extraction tasks. By combining specialized servers for browser automation, API access, data storage, and file operations, you can build robust, scalable scraping solutions that handle diverse requirements efficiently.
The key to success is proper architecture planning, robust error handling, and understanding each server's strengths. Start with simple two-server integrations and gradually expand as your requirements grow.
For more advanced techniques, explore how different tools work together, such as combining browser automation with specialized extraction methods to create comprehensive data collection systems.