What are the best MCP server tools for JSON extraction?

Model Context Protocol (MCP) servers provide powerful tools for extracting JSON data from web pages, APIs, and dynamic content. This guide explores the most effective MCP server tools and techniques for JSON extraction in web scraping workflows.

Understanding MCP Tools for JSON Extraction

MCP servers expose various tools that can be leveraged for JSON extraction. These tools range from browser automation capabilities to specialized data extraction functions. The key to successful JSON extraction lies in selecting the right tool for your specific use case.

Primary MCP Server Tools for JSON Data

1. Browser Automation Tools (Playwright/Puppeteer MCP)

The Playwright and Puppeteer MCP servers offer the most comprehensive JSON extraction capabilities through browser automation. These tools can:

  • Execute JavaScript to access JSON data in dynamic applications
  • Intercept network requests to capture API responses
  • Evaluate scripts to extract embedded JSON from pages

2. WebScraping.AI MCP Integration

The WebScraping.AI MCP server provides AI-powered extraction capabilities specifically designed for structured data:

  • Field extraction with natural language instructions
  • Question-based data retrieval
  • Automatic JSON parsing and validation

3. HTTP Client MCP Tools

Standard HTTP client tools in MCP servers enable direct API interaction (see the sketch after this list):

  • Direct JSON API consumption
  • Response parsing and transformation
  • Header and authentication management
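
For instance, here is a minimal Python sketch of direct JSON API consumption through a generic MCP HTTP tool. The fetch tool name and the call_tool interface are assumptions; adapt them to your server's actual tool schema:

import json

async def fetch_json_api(client, url: str, token: str | None = None):
    """Call a JSON API through a hypothetical MCP 'fetch' tool."""
    headers = {"Accept": "application/json"}
    if token:
        # Header and authentication management handled per request
        headers["Authorization"] = f"Bearer {token}"

    # The tool is assumed to return the raw response body as text
    response = await client.call_tool("fetch", url=url, headers=headers)
    return json.loads(response["body"])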

Extracting JSON with Playwright MCP

The Playwright MCP server provides the most robust solution for extracting JSON from modern web applications. Here's how to use it effectively:

Capturing Network JSON Responses

// Using Playwright MCP to intercept JSON API calls
// (mcpPlaywright: an object exposing the Playwright MCP tools, assumed in scope)
const { browser_navigate, browser_wait_for, browser_network_requests } = mcpPlaywright;

// Navigate to the page
await browser_navigate({ url: 'https://example.com/data-page' });

// Wait for content to load
await browser_wait_for({ time: 3 });

// Get all network requests
const requests = await browser_network_requests({});

// Filter for JSON API responses
const jsonRequests = requests.filter(req =>
  req.url.includes('/api/') &&
  req.response?.headers['content-type']?.includes('application/json')
);

// Parse JSON from responses
jsonRequests.forEach(req => {
  const jsonData = JSON.parse(req.response.body);
  console.log('Extracted JSON:', jsonData);
});

Extracting Embedded JSON Data

Many websites embed JSON data directly in <script> tags for client-side rendering. The Playwright MCP server excels at extracting this data:

// Extract JSON from script tags
const jsonData = await browser_evaluate({
  element: 'script tag containing JSON data',
  function: `() => {
    const scriptTags = document.querySelectorAll('script[type="application/json"]');
    const dataArray = [];

    scriptTags.forEach(script => {
      try {
        const jsonContent = JSON.parse(script.textContent);
        dataArray.push(jsonContent);
      } catch (e) {
        console.error('Failed to parse JSON:', e);
      }
    });

    return dataArray;
  }`
});

Extracting JSON from Single Page Applications

When working with single-page applications (SPAs), you often need to wait for AJAX requests to complete before extracting JSON:

// Navigate and wait for specific data to load
await browser_navigate({ url: 'https://spa-example.com' });

// Wait for the API call to complete
await browser_wait_for({
  text: 'Data loaded' // Wait for UI indicator
});

// Extract the data from window object
const appData = await browser_evaluate({
  element: 'page data object',
  function: `() => {
    // Many SPAs store data in window object
    return window.__INITIAL_STATE__ ||
           window.__PRELOADED_STATE__ ||
           window.APP_DATA;
  }`
});

console.log('SPA JSON Data:', JSON.stringify(appData, null, 2));

Using WebScraping.AI MCP for JSON Extraction

The WebScraping.AI MCP server provides specialized tools for AI-powered JSON extraction:

Field-Based JSON Extraction

# Using WebScraping.AI MCP for structured JSON extraction
# (assumes an MCP client that exposes the webscraping_ai tool wrappers)
import json

from mcp import webscraping_ai

# Define the fields you want to extract
fields = {
    "title": "The main title of the product",
    "price": "The product price in USD",
    "availability": "Whether the product is in stock",
    "rating": "The average customer rating",
    "reviews_count": "Total number of reviews"
}

# Extract structured JSON data
result = await webscraping_ai.webscraping_ai_fields(
    url="https://example.com/product/12345",
    fields=fields,
    js=True  # Enable JavaScript rendering
)

# The response is already in JSON format
product_data = result['data']
print(json.dumps(product_data, indent=2))

Question-Based JSON Extraction

# Extract specific data using natural language questions
response = await webscraping_ai.webscraping_ai_question(
    url="https://example.com/pricing",
    question="What are all the pricing tiers and their features? Return as JSON."
)

# The AI will structure the response as JSON
pricing_json = json.loads(response['answer'])

Python Implementation with MCP Servers

Here's a comprehensive Python example combining multiple MCP tools for JSON extraction:

import json
import asyncio
from typing import Dict, List, Any

class MCPJSONExtractor:
    def __init__(self, mcp_client):
        self.client = mcp_client

    async def extract_from_api(self, url: str) -> Dict[str, Any]:
        """Extract JSON directly from API endpoints"""
        response = await self.client.call_tool(
            'webscraping_ai_html',
            url=url,
            format='json'
        )

        return json.loads(response['html'])

    async def extract_from_page(self, url: str, selector: str) -> List[Dict]:
        """Extract JSON from page elements"""
        # Get HTML content
        html_response = await self.client.call_tool(
            'webscraping_ai_selected',
            url=url,
            selector=selector
        )

        # Parse JSON from selected elements
        elements = html_response['selected']
        json_data = []

        for element in elements:
            try:
                data = json.loads(element['text'])
                json_data.append(data)
            except json.JSONDecodeError:
                continue

        return json_data

    async def extract_with_browser(self, url: str) -> Dict[str, Any]:
        """Extract JSON using browser automation"""
        # Navigate to page
        await self.client.call_tool(
            'browser_navigate',
            url=url
        )

        # Execute JavaScript to extract JSON
        result = await self.client.call_tool(
            'browser_evaluate',
            element='window data',
            function="""() => {
                // Common patterns for JSON data
                const patterns = [
                    () => window.__NEXT_DATA__,
                    () => window.__INITIAL_STATE__,
                    () => JSON.parse(document.getElementById('__NEXT_DATA__').textContent),
                    () => {
                        const scripts = document.querySelectorAll('script[type="application/json"]');
                        return Array.from(scripts).map(s => JSON.parse(s.textContent));
                    }
                ];

                for (let pattern of patterns) {
                    try {
                        const data = pattern();
                        if (data) return data;
                    } catch (e) {
                        continue;
                    }
                }

                return null;
            }"""
        )

        return result

# Usage example
async def main():
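    # mcp_client: an initialized MCP client session, assumed to exist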
    extractor = MCPJSONExtractor(mcp_client)

    # Extract from multiple sources
    api_data = await extractor.extract_from_api('https://api.example.com/data')
    page_data = await extractor.extract_from_page(
        'https://example.com/products',
        'script[type="application/ld+json"]'
    )
    browser_data = await extractor.extract_with_browser('https://example.com/app')

    # Combine and process
    combined_data = {
        'api': api_data,
        'structured_data': page_data,
        'app_state': browser_data
    }

    print(json.dumps(combined_data, indent=2))

asyncio.run(main())

JavaScript/Node.js Implementation

For Node.js environments, here's how to implement JSON extraction with MCP servers:

// Note: the client import below is illustrative; use your MCP client
// library (e.g. @modelcontextprotocol/sdk) and adapt callTool accordingly
const { MCPClient } = require('@anthropic/mcp-client');

class JSONExtractor {
  constructor(mcpClient) {
    this.client = mcpClient;
  }

  async extractFromNetworkRequests(url) {
    // Navigate to page
    await this.client.callTool('browser_navigate', { url });

    // Wait for page to load
    await this.client.callTool('browser_wait_for', { time: 2 });

    // Get network requests
    const requests = await this.client.callTool('browser_network_requests', {});

    // Filter and parse JSON responses
    const jsonData = requests
      .filter(req => {
        const contentType = req.response?.headers['content-type'] || '';
        return contentType.includes('application/json');
      })
      .map(req => {
        try {
          return {
            url: req.url,
            data: JSON.parse(req.response.body)
          };
        } catch (e) {
          return null;
        }
      })
      .filter(item => item !== null);

    return jsonData;
  }

  async extractStructuredData(url) {
    // Use WebScraping.AI for structured data extraction
    const result = await this.client.callTool('webscraping_ai_question', {
      url,
      question: 'Extract all structured data from this page and return it as valid JSON'
    });

    try {
      return JSON.parse(result.answer);
    } catch (e) {
      // If not valid JSON, try to extract from a markdown code block
      const jsonMatch = result.answer.match(/```(?:json)?\n([\s\S]*?)\n```/);
      if (jsonMatch) {
        return JSON.parse(jsonMatch[1]);
      }
      throw new Error('Could not parse JSON from response');
    }
  }

  async extractPageState(url) {
    await this.client.callTool('browser_navigate', { url });

    const state = await this.client.callTool('browser_evaluate', {
      element: 'application state',
      function: `() => {
        // Extract Redux state only if the app itself exposes its store
        // globally (there is no standard global for Redux state)
        if (window.store && typeof window.store.getState === 'function') {
          return window.store.getState();
        }

        // Extract Next.js data
        if (window.__NEXT_DATA__) {
          return window.__NEXT_DATA__;
        }

        // Extract Nuxt.js data
        if (window.__NUXT__) {
          return window.__NUXT__;
        }

        return null;
      }`
    });

    return state;
  }
}

// Usage (MCPClient construction and configuration depend on your client library)
(async () => {
  const client = new MCPClient();
  const extractor = new JSONExtractor(client);

  // Extract JSON from different sources
  const networkData = await extractor.extractFromNetworkRequests(
    'https://example.com/products'
  );

  const structuredData = await extractor.extractStructuredData(
    'https://example.com/api-docs'
  );

  const pageState = await extractor.extractPageState(
    'https://app.example.com'
  );

  console.log('Extracted JSON Data:', {
    network: networkData,
    structured: structuredData,
    state: pageState
  });
})();

Advanced JSON Extraction Techniques

Handling Paginated JSON APIs

When dealing with paginated data, you need to extract JSON across multiple requests:

# Assumes browser_navigate and browser_network_requests are MCP tool
# wrappers in scope, as in the Playwright examples above
import json

async def extract_paginated_json(base_url: str, max_pages: int = 10):
    all_data = []
    page = 1

    while page <= max_pages:
        # Navigate to paginated URL
        url = f"{base_url}?page={page}"
        await browser_navigate(url=url)

        # Extract JSON from network response
        requests = await browser_network_requests()

        json_response = next(
            (req for req in requests if '/api/items' in req['url']),
            None
        )

        if json_response:
            data = json.loads(json_response['response']['body'])
            all_data.extend(data['items'])

            # Check if there's a next page
            if not data.get('has_next'):
                break

        page += 1

    return all_data

Extracting JSON from iFrames

When JSON data is embedded within iframes, similar to handling iframes in Puppeteer, you need a specialized approach:

async function extractJSONFromIframe(iframeSelector) {
  const jsonData = await browser_evaluate({
    element: 'iframe content',
    function: `() => {
      const iframe = document.querySelector('${iframeSelector}');
      if (!iframe) return [];
      // contentDocument is only accessible for same-origin iframes;
      // a cross-origin frame will throw a security error here
      const iframeDoc = iframe.contentDocument || iframe.contentWindow.document;

      // Look for JSON in iframe scripts
      const scripts = iframeDoc.querySelectorAll('script[type="application/json"]');
      const data = [];

      scripts.forEach(script => {
        try {
          data.push(JSON.parse(script.textContent));
        } catch (e) {}
      });

      return data;
    }`
  });

  return jsonData;
}

Monitoring and Validating JSON Extraction

To ensure reliable JSON extraction, implement validation and monitoring similar to monitoring network requests:

import json

import jsonschema
from typing import Any, Dict

class JSONValidator:
    def __init__(self, schema: Dict[str, Any]):
        self.schema = schema

    def validate(self, data: Any) -> bool:
        try:
            jsonschema.validate(instance=data, schema=self.schema)
            return True
        except jsonschema.exceptions.ValidationError as e:
            print(f"Validation error: {e.message}")
            return False

    async def extract_and_validate(self, url: str):
        # Extract JSON data (webscraping_ai_question wrapper assumed in scope)
        result = await webscraping_ai_question(
            url=url,
            question="Extract all product data as JSON"
        )

        try:
            data = json.loads(result['answer'])

            # Validate against schema
            if self.validate(data):
                return data
            else:
                raise ValueError("Extracted JSON does not match expected schema")
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON: {e}")

# Define expected schema
product_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "name": {"type": "string"},
                    "price": {"type": "number"}
                },
                "required": ["id", "name", "price"]
            }
        }
    },
    "required": ["products"]
}

# Use validator (run inside an async context, since extract_and_validate is awaited)
validator = JSONValidator(product_schema)
valid_data = await validator.extract_and_validate('https://example.com/products')

Best Practices for JSON Extraction

  1. Use the Right Tool: Browser automation for dynamic content, direct HTTP for APIs, AI-powered extraction for complex structures

  2. Implement Error Handling: Always wrap JSON parsing in try-catch blocks and validate data structure

  3. Cache and Rate Limit: Store extracted JSON to minimize redundant requests (see the sketch after this list)

  4. Validate Schema: Use JSON Schema validation to ensure data consistency

  5. Monitor Network Traffic: Identify JSON endpoints by monitoring network requests during page load

  6. Handle Authentication: Many JSON APIs require proper authentication headers

  7. Parse Incrementally: For large JSON datasets, consider streaming parsers to reduce memory usage
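
As a concrete illustration of practices 2 and 3, here is a Python sketch of a caching, rate-limited wrapper that can sit around any of the extraction functions above (the extract_fn interface is an assumption):

import asyncio
import json
import time
from pathlib import Path

class CachedExtractor:
    """Cache extracted JSON on disk and space out live requests."""

    def __init__(self, extract_fn, cache_dir: str = "json_cache", min_interval: float = 1.0):
        self.extract_fn = extract_fn          # async callable: url -> dict
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.min_interval = min_interval      # seconds between live requests
        self._last_request = 0.0

    def _cache_path(self, url: str) -> Path:
        safe = "".join(c if c.isalnum() else "_" for c in url)
        return self.cache_dir / f"{safe}.json"

    async def get(self, url: str) -> dict:
        # Serve from cache first to avoid redundant requests
        path = self._cache_path(url)
        if path.exists():
            return json.loads(path.read_text())

        # Simple rate limit: wait until min_interval has elapsed
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

        data = await self.extract_fn(url)
        path.write_text(json.dumps(data))
        return data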

Conclusion

MCP servers provide a comprehensive toolkit for JSON extraction, from browser automation with Playwright and Puppeteer to AI-powered extraction with WebScraping.AI. The best tool depends on your use case: browser automation for dynamic web applications, direct HTTP clients for REST APIs, and AI-powered extraction for complex, unstructured data. By combining these tools and following the best practices above, you can build robust JSON extraction pipelines that cover a wide range of web scraping scenarios.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
