How do I use MCP resources for structured data extraction?

MCP (Model Context Protocol) resources provide a powerful abstraction for accessing and extracting structured data from various sources within AI-powered applications. Resources in MCP represent data sources that can be read and processed, making them ideal for web scraping and data extraction workflows. This guide will show you how to effectively use MCP resources for structured data extraction.

Understanding MCP Resources

MCP resources are URI-addressable data sources that servers expose to clients. Unlike tools (which perform actions), resources are read-only entities that provide access to data. When used for web scraping, resources can represent:

  • HTML content from web pages
  • JSON data from APIs
  • Structured datasets
  • Database query results
  • File system contents

Resources are particularly useful for structured data extraction because they provide a standardized way to access and process data across different sources and formats.
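
To make the distinction concrete, here is a minimal sketch of a server that exposes a read-only resource using the official MCP Python SDK's FastMCP helper. The URI scheme and the inline dataset are illustrative assumptions, not a required convention:

import json

from mcp.server.fastmcp import FastMCP

# A tiny server that exposes one read-only resource and no tools
mcp = FastMCP("dataset-server")

@mcp.resource("dataset://products")
def products_resource() -> str:
    """Return a structured dataset as JSON text; clients read it, never mutate it."""
    return json.dumps([
        {"name": "Widget", "price": "9.99"},
        {"name": "Gadget", "price": "19.99"},
    ])

if __name__ == "__main__":
    mcp.run()  # serves the resource over stdio by default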

Setting Up MCP for Data Extraction

Installing Required Dependencies

Python Setup:

# Install the MCP SDK and the Anthropic SDK for Python
pip install mcp anthropic

# Install additional libraries for web scraping
pip install beautifulsoup4 lxml requests

JavaScript/TypeScript Setup:

# Install the MCP SDK for Node.js
npm install @modelcontextprotocol/sdk

# Install web scraping dependencies
npm install cheerio axios playwright

Connecting to MCP Resources

Python Implementation

Here's how to connect to an MCP server and access resources for data extraction:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def extract_data_from_resources():
    # Configure server connection
    server_params = StdioServerParameters(
        command="npx",  # or point this at your own custom MCP server
        args=["-y", "@modelcontextprotocol/server-playwright"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the session
            await session.initialize()

            # List available resources
            resources = await session.list_resources()

            print("Available resources:")
            for resource in resources.resources:
                print(f"  - {resource.uri}: {resource.name}")

            # Read a specific resource (URI schemes are defined by the server)
            resource_uri = "page://https://example.com/data"
            content = await session.read_resource(resource_uri)

            return content

# Run the async function
data = asyncio.run(extract_data_from_resources())
print(f"Extracted data: {data}")

JavaScript Implementation

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function extractDataFromResources() {
  // Create transport layer
  const transport = new StdioClientTransport({
    command: 'mcp-server-playwright',
    args: []
  });

  // Create and connect client
  const client = new Client({
    name: 'data-extraction-client',
    version: '1.0.0'
  }, {
    capabilities: {
      resources: {}
    }
  });

  await client.connect(transport);

  // List available resources
  const resources = await client.listResources();

  console.log('Available resources:');
  resources.resources.forEach(resource => {
    console.log(`  - ${resource.uri}: ${resource.name}`);
  });

  // Read specific resource
  const resourceUri = 'page://https://example.com/data';
  const content = await client.readResource({ uri: resourceUri });

  return content;
}

// Execute the extraction
extractDataFromResources()
  .then(data => console.log('Extracted data:', data))
  .catch(error => console.error('Error:', error));

Extracting Structured Data from Resources

Using Playwright MCP Server

The Playwright MCP server provides resources for accessing web page content. Here's how to extract structured data:

Python Example:

import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from bs4 import BeautifulSoup

async def scrape_product_data(url):
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-playwright"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Navigate to the page using a tool
            await session.call_tool("browser_navigate", {"url": url})

            # Wait for content to load
            await session.call_tool("browser_wait_for", {"time": 2})

            # Capture the current page content via the snapshot tool
            snapshot = await session.call_tool("browser_snapshot", {})

            # Parse the HTML content
            soup = BeautifulSoup(snapshot.content[0].text, 'html.parser')

            # Extract structured data
            products = []
            for item in soup.select('.product-item'):
                product = {
                    'name': item.select_one('.product-name').text.strip(),
                    'price': item.select_one('.product-price').text.strip(),
                    'rating': item.select_one('.product-rating').get('data-rating'),
                    'availability': item.select_one('.availability').text.strip()
                }
                products.append(product)

            return products

# Extract data
products = asyncio.run(scrape_product_data("https://example-shop.com/products"))
print(json.dumps(products, indent=2))

JavaScript Example:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
import * as cheerio from 'cheerio';

async function scrapeProductData(url) {
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-playwright']
  });

  const client = new Client({
    name: 'product-scraper',
    version: '1.0.0'
  }, {
    capabilities: { tools: {}, resources: {} }
  });

  await client.connect(transport);

  // Navigate to the page
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url }
  });

  // Wait for dynamic content
  await client.callTool({
    name: 'browser_wait_for',
    arguments: { time: 2 }
  });

  // Get page snapshot
  const snapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  // Parse HTML content
  const $ = cheerio.load(snapshot.content[0].text);

  // Extract structured data
  const products = [];
  $('.product-item').each((i, elem) => {
    const product = {
      name: $(elem).find('.product-name').text().trim(),
      price: $(elem).find('.product-price').text().trim(),
      rating: $(elem).find('.product-rating').attr('data-rating'),
      availability: $(elem).find('.availability').text().trim()
    };
    products.push(product);
  });

  await client.close();
  return products;
}

// Execute scraping
scrapeProductData('https://example-shop.com/products')
  .then(products => console.log(JSON.stringify(products, null, 2)))
  .catch(error => console.error('Error:', error));

Advanced Data Extraction Patterns

Handling Dynamic Content

When working with JavaScript-heavy websites, a challenge similar to handling AJAX requests using Puppeteer, you need to wait for dynamic content to load before taking a snapshot:

async def extract_dynamic_data(url, wait_text):
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-playwright"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Navigate to page
            await session.call_tool("browser_navigate", {"url": url})

            # Wait until the expected text appears on the page
            await session.call_tool("browser_wait_for", {
                "text": wait_text
            })

            # Extract data after content loads
            snapshot = await session.call_tool("browser_snapshot", {})

            # Process the content (parse_content is your own parsing helper,
            # e.g. something like extract_items below)
            return parse_content(snapshot.content[0].text)

Paginated Data Extraction

Extract data across multiple pages:

async def extract_paginated_data(base_url, max_pages=10):
    all_data = []

    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-playwright"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            for page_num in range(1, max_pages + 1):
                url = f"{base_url}?page={page_num}"

                # Navigate to page
                await session.call_tool("browser_navigate", {"url": url})
                await session.call_tool("browser_wait_for", {"time": 1})

                # Get snapshot
                snapshot = await session.call_tool("browser_snapshot", {})

                # Extract data from current page
                page_data = extract_items(snapshot.content[0].text)
                all_data.extend(page_data)

                # Check if there's a next page
                soup = BeautifulSoup(snapshot.content[0].text, 'html.parser')
                if not soup.select_one('.next-page'):
                    break

            return all_data

def extract_items(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = []

    for item in soup.select('.data-item'):
        items.append({
            'title': item.select_one('.title').text.strip(),
            'description': item.select_one('.description').text.strip(),
            'metadata': item.get('data-metadata')
        })

    return items
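
A quick usage sketch for the helpers above, with an illustrative catalog URL:

# Illustrative run of the paginated extractor defined above
all_items = asyncio.run(extract_paginated_data("https://example-shop.com/products", max_pages=5))
print(f"Collected {len(all_items)} items")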

Form Interaction and Data Extraction

Much as you would when handling authentication in Puppeteer, you can fill out forms and extract the resulting data:

async function extractSearchResults(searchQuery) {
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-playwright']
  });

  const client = new Client({
    name: 'search-scraper',
    version: '1.0.0'
  }, {
    capabilities: { tools: {}, resources: {} }
  });

  await client.connect(transport);

  // Navigate to search page
  await client.callTool({
    name: 'browser_navigate',
    arguments: { url: 'https://example.com/search' }
  });

  // Take snapshot to get form element references
  const initialSnapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  // Fill search form
  await client.callTool({
    name: 'browser_fill_form',
    arguments: {
      fields: [
        {
          name: 'search input',
          type: 'textbox',
          ref: 'search-input-ref', // From snapshot
          value: searchQuery
        }
      ]
    }
  });

  // Submit and wait
  await client.callTool({
    name: 'browser_press_key',
    arguments: { key: 'Enter' }
  });

  await client.callTool({
    name: 'browser_wait_for',
    arguments: { time: 2 }
  });

  // Extract results
  const resultsSnapshot = await client.callTool({
    name: 'browser_snapshot',
    arguments: {}
  });

  const $ = cheerio.load(resultsSnapshot.content[0].text);
  const results = [];

  $('.search-result').each((i, elem) => {
    results.push({
      title: $(elem).find('.result-title').text().trim(),
      url: $(elem).find('.result-link').attr('href'),
      snippet: $(elem).find('.result-snippet').text().trim()
    });
  });

  await client.close();
  return results;
}

Error Handling and Best Practices

Robust Error Handling

import logging
from typing import Optional, List, Dict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

async def safe_extract_data(url: str, max_retries: int = 3) -> Optional[List[Dict]]:
    """Extract data with error handling and retries"""

    for attempt in range(max_retries):
        try:
            server_params = StdioServerParameters(
                command="npx",
                args=["-y", "@modelcontextprotocol/server-playwright"]
            )

            async with stdio_client(server_params) as (read, write):
                async with ClientSession(read, write) as session:
                    await session.initialize()

                    # Navigate with timeout
                    await asyncio.wait_for(
                        session.call_tool("browser_navigate", {"url": url}),
                        timeout=30.0
                    )

                    # Wait for content
                    await session.call_tool("browser_wait_for", {"time": 2})

                    # Get snapshot
                    snapshot = await session.call_tool("browser_snapshot", {})

                    # Extract and validate data
                    data = extract_and_validate(snapshot.content[0].text)

                    if data:
                        logger.info(f"Successfully extracted {len(data)} items")
                        return data
                    else:
                        logger.warning(f"No data found on attempt {attempt + 1}")

        except asyncio.TimeoutError:
            logger.error(f"Timeout on attempt {attempt + 1}")
        except Exception as e:
            logger.error(f"Error on attempt {attempt + 1}: {str(e)}")

        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    logger.error(f"Failed to extract data after {max_retries} attempts")
    return None

def extract_and_validate(html: str) -> List[Dict]:
    """Extract data and validate structure"""
    soup = BeautifulSoup(html, 'html.parser')
    items = []

    for elem in soup.select('.data-item'):
        try:
            item = {
                'id': elem.get('data-id'),
                'title': elem.select_one('.title').text.strip(),
                'value': elem.select_one('.value').text.strip()
            }

            # Validate required fields
            if all(item.values()):
                items.append(item)
            else:
                logger.warning(f"Skipping incomplete item: {item}")

        except AttributeError as e:
            logger.warning(f"Error parsing item: {e}")
            continue

    return items

Best Practices for MCP Resource Extraction

  1. Always validate resource URIs before attempting to read them
  2. Implement proper error handling with retries and exponential backoff
  3. Use appropriate wait strategies when dealing with dynamic content, much as you would when handling timeouts in Puppeteer
  4. Clean and validate extracted data before processing
  5. Close connections properly to avoid resource leaks
  6. Implement rate limiting to respect server resources
  7. Cache resources when appropriate to minimize redundant requests (a minimal sketch of both appears after this list)
  8. Log extraction activities for debugging and monitoring
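
As a minimal sketch of points 6 and 7, the helper below throttles navigation calls and caches snapshot text by URL. The delay value and cache policy are illustrative assumptions rather than anything mandated by MCP:

import asyncio
import time

class ThrottledCache:
    """Illustrative helper: rate-limits page fetches and caches snapshot text by URL."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval  # assumed polite delay between navigations, in seconds
        self._last_request = 0.0
        self._cache: dict[str, str] = {}

    async def fetch(self, session, url: str) -> str:
        # Serve from cache when this URL has already been fetched
        if url in self._cache:
            return self._cache[url]

        # Enforce a minimum delay between navigations
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)

        await session.call_tool("browser_navigate", {"url": url})
        snapshot = await session.call_tool("browser_snapshot", {})
        self._last_request = time.monotonic()

        text = snapshot.content[0].text
        self._cache[url] = text
        return text

Pass in the same ClientSession used in the earlier examples and tune min_interval to what the target site can tolerate.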

Data Transformation and Output

Converting to Structured Formats

import json
import csv
from datetime import datetime

async def extract_and_export(url: str, output_format: str = 'json'):
    """Extract data and export in various formats"""

    data = await safe_extract_data(url)

    if not data:
        raise ValueError("No data extracted")

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

    if output_format == 'json':
        filename = f'extracted_data_{timestamp}.json'
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)

    elif output_format == 'csv':
        filename = f'extracted_data_{timestamp}.csv'
        with open(filename, 'w', newline='') as f:
            if data:
                writer = csv.DictWriter(f, fieldnames=data[0].keys())
                writer.writeheader()
                writer.writerows(data)

    elif output_format == 'jsonl':
        filename = f'extracted_data_{timestamp}.jsonl'
        with open(filename, 'w') as f:
            for item in data:
                f.write(json.dumps(item) + '\n')

    print(f"Data exported to {filename}")
    return filename
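
For example, to run the whole pipeline and write a CSV file (the URL is illustrative):

# Illustrative end-to-end run: extract, validate, and export as CSV
asyncio.run(extract_and_export("https://example.com/catalog", output_format="csv"))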

Conclusion

MCP resources provide a powerful and standardized approach to structured data extraction. By leveraging MCP servers like Playwright, you can build robust web scraping workflows that handle complex scenarios including dynamic content, authentication, and pagination. The key to success is implementing proper error handling, using appropriate wait strategies, and validating extracted data.

For production deployments, consider combining MCP resources with specialized web scraping APIs that handle proxy rotation, CAPTCHA solving, and rate limiting automatically. This hybrid approach gives you the flexibility of MCP's standardized interface while benefiting from managed infrastructure for reliable, large-scale data extraction.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
