What are the best MCP server tools for JSON extraction?
Model Context Protocol (MCP) servers provide powerful tools for extracting JSON data from web pages, APIs, and dynamic content. This guide explores the most effective MCP server tools and techniques for JSON extraction in web scraping workflows.
Understanding MCP Tools for JSON Extraction
MCP servers expose various tools that can be leveraged for JSON extraction. These tools range from browser automation capabilities to specialized data extraction functions. The key to successful JSON extraction lies in selecting the right tool for your specific use case.
Primary MCP Server Tools for JSON Data
1. Browser Automation Tools (Playwright/Puppeteer MCP)
The Playwright and Puppeteer MCP servers offer the most comprehensive JSON extraction capabilities through browser automation. These tools can:
- Execute JavaScript to access JSON data in dynamic applications
- Intercept network requests to capture API responses
- Evaluate scripts to extract embedded JSON from pages
2. WebScraping.AI MCP Integration
The WebScraping.AI MCP server provides AI-powered extraction capabilities specifically designed for structured data:
- Field extraction with natural language instructions
- Question-based data retrieval
- Automatic JSON parsing and validation
3. HTTP Client MCP Tools
Standard HTTP client tools in MCP servers enable direct API interaction (see the sketch after this list):
- Direct JSON API consumption
- Response parsing and transformation
- Header and authentication management
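For instance, a minimal sketch of direct JSON API consumption might look like the following. The http_get tool name and the mcp_client.call_tool() interface are assumptions here, since tool names vary between MCP servers:

# Hypothetical sketch: "http_get" and the call_tool() signature are
# assumptions; check your MCP server's tool list for the real names.
import json

async def fetch_json(mcp_client, url: str, token: str = None):
    headers = {"Accept": "application/json"}
    if token:
        # Header and authentication management
        headers["Authorization"] = f"Bearer {token}"
    response = await mcp_client.call_tool("http_get", url=url, headers=headers)
    # Parse the raw response body into a Python dict
    return json.loads(response["body"])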
Extracting JSON with Playwright MCP
The Playwright MCP server provides the most robust solution for extracting JSON from modern web applications. Here's how to use it effectively:
Capturing Network JSON Responses
// Using Playwright MCP to intercept JSON API calls
const { browser_navigate, browser_wait_for, browser_network_requests } = mcpPlaywright;

// Navigate to the page
await browser_navigate({ url: 'https://example.com/data-page' });

// Wait for content to load
await browser_wait_for({ time: 3 });

// Get all network requests
const requests = await browser_network_requests({});

// Filter for JSON API responses
const jsonRequests = requests.filter(req =>
  req.url.includes('/api/') &&
  req.response?.headers['content-type']?.includes('application/json')
);

// Parse JSON from responses
jsonRequests.forEach(req => {
  const jsonData = JSON.parse(req.response.body);
  console.log('Extracted JSON:', jsonData);
});
Extracting Embedded JSON Data
Many websites embed JSON data directly in <script> tags for client-side rendering. The Playwright MCP server excels at extracting this data:
// Extract JSON from script tags
const jsonData = await browser_evaluate({
  element: 'script tag containing JSON data',
  function: `() => {
    const scriptTags = document.querySelectorAll('script[type="application/json"]');
    const dataArray = [];
    scriptTags.forEach(script => {
      try {
        const jsonContent = JSON.parse(script.textContent);
        dataArray.push(jsonContent);
      } catch (e) {
        console.error('Failed to parse JSON:', e);
      }
    });
    return dataArray;
  }`
});
Extracting JSON from Single Page Applications
When working with single-page applications (SPAs), you often need to wait for AJAX requests to complete before extracting JSON:
// Navigate and wait for specific data to load
await browser_navigate({ url: 'https://spa-example.com' });

// Wait for the API call to complete
await browser_wait_for({
  text: 'Data loaded' // Wait for a UI indicator
});

// Extract the data from the window object
const appData = await browser_evaluate({
  element: 'page data object',
  function: `() => {
    // Many SPAs store data in the window object
    return window.__INITIAL_STATE__ ||
      window.__PRELOADED_STATE__ ||
      window.APP_DATA;
  }`
});

console.log('SPA JSON Data:', JSON.stringify(appData, null, 2));
Using WebScraping.AI MCP for JSON Extraction
The WebScraping.AI MCP server provides specialized tools for AI-powered JSON extraction:
Field-Based JSON Extraction
# Using WebScraping.AI MCP for structured JSON extraction
import json

# Import path depends on your MCP client setup
from mcp import webscraping_ai

# Define the fields you want to extract
fields = {
    "title": "The main title of the product",
    "price": "The product price in USD",
    "availability": "Whether the product is in stock",
    "rating": "The average customer rating",
    "reviews_count": "Total number of reviews"
}

# Extract structured JSON data
result = await webscraping_ai.webscraping_ai_fields(
    url="https://example.com/product/12345",
    fields=fields,
    js=True  # Enable JavaScript rendering
)

# The response is already in JSON format
product_data = result['data']
print(json.dumps(product_data, indent=2))
Question-Based JSON Extraction
# Extract specific data using natural language questions
response = await webscraping_ai.webscraping_ai_question(
    url="https://example.com/pricing",
    question="What are all the pricing tiers and their features? Return as JSON."
)

# The AI will structure the response as JSON
pricing_json = json.loads(response['answer'])
Python Implementation with MCP Servers
Here's a comprehensive Python example combining multiple MCP tools for JSON extraction:
import json
import asyncio
from typing import Dict, List, Any

class MCPJSONExtractor:
    def __init__(self, mcp_client):
        self.client = mcp_client

    async def extract_from_api(self, url: str) -> Dict[str, Any]:
        """Extract JSON directly from API endpoints"""
        response = await self.client.call_tool(
            'webscraping_ai_html',
            url=url,
            format='json'
        )
        return json.loads(response['html'])

    async def extract_from_page(self, url: str, selector: str) -> List[Dict]:
        """Extract JSON from page elements"""
        # Get HTML content
        html_response = await self.client.call_tool(
            'webscraping_ai_selected',
            url=url,
            selector=selector
        )

        # Parse JSON from selected elements
        elements = html_response['selected']
        json_data = []

        for element in elements:
            try:
                data = json.loads(element['text'])
                json_data.append(data)
            except json.JSONDecodeError:
                continue

        return json_data

    async def extract_with_browser(self, url: str) -> Dict[str, Any]:
        """Extract JSON using browser automation"""
        # Navigate to page
        await self.client.call_tool(
            'browser_navigate',
            url=url
        )

        # Execute JavaScript to extract JSON
        result = await self.client.call_tool(
            'browser_evaluate',
            element='window data',
            function="""() => {
                // Common patterns for JSON data
                const patterns = [
                    () => window.__NEXT_DATA__,
                    () => window.__INITIAL_STATE__,
                    () => document.getElementById('__NEXT_DATA__')?.textContent,
                    () => {
                        const scripts = document.querySelectorAll('script[type="application/json"]');
                        return Array.from(scripts).map(s => JSON.parse(s.textContent));
                    }
                ];

                for (const pattern of patterns) {
                    try {
                        const data = pattern();
                        if (data) return data;
                    } catch (e) {
                        continue;
                    }
                }
                return null;
            }"""
        )
        return result

# Usage example
async def main():
    extractor = MCPJSONExtractor(mcp_client)  # assumes an initialized MCP client

    # Extract from multiple sources
    api_data = await extractor.extract_from_api('https://api.example.com/data')
    page_data = await extractor.extract_from_page(
        'https://example.com/products',
        'script[type="application/ld+json"]'
    )
    browser_data = await extractor.extract_with_browser('https://example.com/app')

    # Combine and process
    combined_data = {
        'api': api_data,
        'structured_data': page_data,
        'app_state': browser_data
    }

    print(json.dumps(combined_data, indent=2))

asyncio.run(main())
JavaScript/Node.js Implementation
For Node.js environments, here's how to implement JSON extraction with MCP servers:
// Note: assumes an MCP client library exposing callTool(); adjust the
// require() to whichever MCP client package your environment provides.
const { MCPClient } = require('@anthropic/mcp-client');

class JSONExtractor {
  constructor(mcpClient) {
    this.client = mcpClient;
  }

  async extractFromNetworkRequests(url) {
    // Navigate to page
    await this.client.callTool('browser_navigate', { url });

    // Wait for page to load
    await this.client.callTool('browser_wait_for', { time: 2 });

    // Get network requests
    const requests = await this.client.callTool('browser_network_requests', {});

    // Filter and parse JSON responses
    const jsonData = requests
      .filter(req => {
        const contentType = req.response?.headers['content-type'] || '';
        return contentType.includes('application/json');
      })
      .map(req => {
        try {
          return {
            url: req.url,
            data: JSON.parse(req.response.body)
          };
        } catch (e) {
          return null;
        }
      })
      .filter(item => item !== null);

    return jsonData;
  }

  async extractStructuredData(url) {
    // Use WebScraping.AI for structured data extraction
    const result = await this.client.callTool('webscraping_ai_question', {
      url,
      question: 'Extract all structured data from this page and return it as valid JSON'
    });

    try {
      return JSON.parse(result.answer);
    } catch (e) {
      // If not valid JSON, try to extract from a markdown code block
      const jsonMatch = result.answer.match(/```(?:json)?\n([\s\S]*?)\n```/);
      if (jsonMatch) {
        return JSON.parse(jsonMatch[1]);
      }
      throw new Error('Could not parse JSON from response');
    }
  }

  async extractPageState(url) {
    await this.client.callTool('browser_navigate', { url });

    const state = await this.client.callTool('browser_evaluate', {
      element: 'application state',
      function: `() => {
        // Extract Redux state if the app exposes its store globally
        if (window.store && typeof window.store.getState === 'function') {
          return window.store.getState();
        }
        // Extract Next.js data
        if (window.__NEXT_DATA__) {
          return window.__NEXT_DATA__;
        }
        // Extract Nuxt.js data
        if (window.__NUXT__) {
          return window.__NUXT__;
        }
        return null;
      }`
    });

    return state;
  }
}

// Usage
(async () => {
  const client = new MCPClient();
  const extractor = new JSONExtractor(client);

  // Extract JSON from different sources
  const networkData = await extractor.extractFromNetworkRequests(
    'https://example.com/products'
  );

  const structuredData = await extractor.extractStructuredData(
    'https://example.com/api-docs'
  );

  const pageState = await extractor.extractPageState(
    'https://app.example.com'
  );

  console.log('Extracted JSON Data:', {
    network: networkData,
    structured: structuredData,
    state: pageState
  });
})();
Advanced JSON Extraction Techniques
Handling Paginated JSON APIs
When dealing with paginated data, you need to extract JSON across multiple requests:
import json

async def extract_paginated_json(base_url: str, max_pages: int = 10):
    all_data = []
    page = 1

    while page <= max_pages:
        # Navigate to paginated URL
        url = f"{base_url}?page={page}"
        await browser_navigate(url=url)

        # Extract JSON from network response
        requests = await browser_network_requests()
        json_response = next(
            (req for req in requests if '/api/items' in req['url']),
            None
        )

        if json_response:
            data = json.loads(json_response['response']['body'])
            all_data.extend(data['items'])

            # Check if there's a next page
            if not data.get('has_next'):
                break

        page += 1

    return all_data
Extracting JSON from iFrames
When JSON data is embedded within iframes, similar to handling iframes in Puppeteer, you need a specialized approach:
async function extractJSONFromIframe(iframeSelector) {
  const jsonData = await browser_evaluate({
    element: 'iframe content',
    function: `() => {
      // Note: this only works for same-origin iframes; cross-origin
      // frames block contentDocument access
      const iframe = document.querySelector('${iframeSelector}');
      const iframeDoc = iframe.contentDocument || iframe.contentWindow.document;

      // Look for JSON in iframe scripts
      const scripts = iframeDoc.querySelectorAll('script[type="application/json"]');
      const data = [];

      scripts.forEach(script => {
        try {
          data.push(JSON.parse(script.textContent));
        } catch (e) {}
      });

      return data;
    }`
  });

  return jsonData;
}
Monitoring and Validating JSON Extraction
To ensure reliable JSON extraction, implement validation and monitoring, much as you would when monitoring network requests:
import json
import jsonschema
from typing import Dict, Any

class JSONValidator:
    def __init__(self, schema: Dict[str, Any]):
        self.schema = schema

    def validate(self, data: Any) -> bool:
        try:
            jsonschema.validate(instance=data, schema=self.schema)
            return True
        except jsonschema.exceptions.ValidationError as e:
            print(f"Validation error: {e.message}")
            return False

    async def extract_and_validate(self, url: str):
        # Extract JSON data
        result = await webscraping_ai_question(
            url=url,
            question="Extract all product data as JSON"
        )

        try:
            data = json.loads(result['answer'])

            # Validate against the schema
            if self.validate(data):
                return data
            else:
                raise ValueError("Extracted JSON does not match expected schema")
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON: {e}")

# Define expected schema
product_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "name": {"type": "string"},
                    "price": {"type": "number"}
                },
                "required": ["id", "name", "price"]
            }
        }
    },
    "required": ["products"]
}

# Use validator
validator = JSONValidator(product_schema)
valid_data = await validator.extract_and_validate('https://example.com/products')
Best Practices for JSON Extraction
- Use the Right Tool: Browser automation for dynamic content, direct HTTP clients for APIs, AI-powered extraction for complex structures
- Implement Error Handling: Always wrap JSON parsing in try-catch blocks and validate the data structure
- Cache and Rate Limit: Store extracted JSON and throttle requests to avoid redundant traffic
- Validate Schema: Use JSON Schema validation to ensure data consistency
- Monitor Network Traffic: Identify JSON endpoints by watching network requests during page load
- Handle Authentication: Many JSON APIs require proper authentication headers
- Parse Incrementally: For large JSON datasets, use a streaming parser to reduce memory usage (see the sketch after this list)
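As referenced in the last item, here is a minimal sketch of incremental parsing using the third-party ijson library; the file name and the "products.item" path are illustrative assumptions matching the product schema defined earlier:

# Hedged sketch: requires the third-party "ijson" package (pip install ijson)
import ijson

def stream_products(path: str):
    # Stream each element of the top-level "products" array one at a time,
    # keeping memory usage flat regardless of file size
    with open(path, "rb") as f:
        for product in ijson.items(f, "products.item"):
            yield product

for product in stream_products("products.json"):
    print(product["name"], product["price"])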
Conclusion
MCP servers provide a comprehensive toolkit for JSON extraction, from browser automation with Playwright and Puppeteer to AI-powered extraction with WebScraping.AI. The best tool depends on your use case: browser automation for dynamic web applications, direct HTTP clients for REST APIs, and AI-powered extraction for complex, unstructured data. By combining these tools and following the best practices above, you can build robust JSON extraction pipelines that handle a wide range of web scraping scenarios.