What are the best MCP server tutorials for web scraping?

The Model Context Protocol (MCP) has revolutionized how developers build web scraping tools by providing a standardized way to connect AI assistants with data sources and automation capabilities. If you're looking to master web scraping with MCP servers, this comprehensive guide covers the best tutorials and learning resources available.

Understanding MCP Server Basics

Before diving into web scraping tutorials, it's essential to understand what MCP servers are and how they work. The Model Context Protocol is an open standard that enables seamless integration between AI applications and external data sources. For web scraping, MCP servers act as intermediaries that provide tools, resources, and prompts to AI assistants like Claude.
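
Under the hood, MCP messages are JSON-RPC 2.0. As an illustration, here is roughly what the exchange looks like when a client asks a server which tools it offers (the scrape_url tool in the response is a hypothetical example, not part of any official server):

Request (client to server):

{ "jsonrpc": "2.0", "id": 1, "method": "tools/list" }

Response (server to client):

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "scrape_url",
        "description": "Scrape content from a URL"
      }
    ]
  }
}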

Getting Started with MCP

The best place to start is the official MCP documentation at modelcontextprotocol.io. This resource provides:

  • Architecture Overview: Understanding the client-server relationship
  • Protocol Specifications: How messages are exchanged between components
  • Security Best Practices: Authentication and authorization patterns
  • SDK Documentation: Official TypeScript and Python SDKs (a minimal Python server sketch follows this list)
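
To get a feel for the SDKs before the scraping-specific tutorials, here is a minimal sketch of an MCP server built with the Python SDK's FastMCP helper. The server name and the fetch_title tool are illustrative placeholders, and the tool does a plain HTTP fetch with no JavaScript rendering:

# A minimal MCP server sketch using the official Python SDK ("mcp" package)
import re
import urllib.request

from mcp.server.fastmcp import FastMCP

# The name is what MCP clients will display for this server
mcp = FastMCP("demo-scraper")

@mcp.tool()
def fetch_title(url: str) -> str:
    """Return the <title> of a page (plain HTTP, no JS rendering)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="ignore")
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

if __name__ == "__main__":
    # Serve over stdio so MCP clients can launch this as a subprocess
    mcp.run()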

Essential MCP Server Tutorials for Web Scraping

1. Playwright MCP Server Tutorial

The Playwright MCP Server is one of the most powerful tools for web scraping. Here's a step-by-step tutorial to get started:

Installation

# Install the Playwright MCP server via npm
npm install -g @automatalabs/mcp-server-playwright

# Or install locally in your project
npm install @automatalabs/mcp-server-playwright

Configuration

Create an MCP configuration file (mcp.json):

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@automatalabs/mcp-server-playwright"],
      "env": {
        "PLAYWRIGHT_BROWSER": "chromium"
      }
    }
  }
}
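
If you are connecting through Claude Desktop, the same "mcpServers" block goes into its claude_desktop_config.json file (under ~/Library/Application Support/Claude/ on macOS or %APPDATA%\Claude\ on Windows) rather than a standalone mcp.json; other MCP clients have their own configuration locations.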

Basic Scraping Example

Once configured with Claude Desktop or another MCP client, you can use natural language to control the browser:

// Example workflow - you would describe this to Claude
// 1. Navigate to target website
// 2. Take a snapshot of the page
// 3. Click on specific elements
// 4. Extract data from the page

// The MCP server translates these commands into Playwright actions,
// roughly equivalent to this standalone Playwright script:
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Extract data (optional chaining guards against missing elements)
const data = await page.evaluate(() => {
  return {
    title: document.querySelector('h1')?.textContent,
    paragraphs: Array.from(document.querySelectorAll('p')).map(p => p.textContent)
  };
});

await browser.close();

This approach is similar to how you handle browser sessions in Puppeteer, but with the added benefit of AI-assisted automation through MCP.

2. Puppeteer MCP Server Tutorial

The Puppeteer MCP server provides another excellent option for browser automation and web scraping:

Setup

# Install the Puppeteer MCP server
npm install @executeautomation/puppeteer-mcp-server

Configuration

Add to your MCP configuration:

{
  "mcpServers": {
    "puppeteer": {
      "command": "node",
      "args": ["path/to/puppeteer-mcp-server/dist/index.js"],
      "env": {
        "HEADLESS": "true"
      }
    }
  }
}

Python Example with Puppeteer MCP

# Using the MCP Python SDK to interact with the Puppeteer server
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def scrape_with_puppeteer():
    # Connect to the MCP server over stdio
    server_params = StdioServerParameters(
        command="node",
        args=["puppeteer-mcp-server/dist/index.js"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the MCP handshake
            await session.initialize()

            # Call tools provided by the server
            result = await session.call_tool(
                "navigate",
                arguments={"url": "https://example.com"}
            )

            # Extract data
            data = await session.call_tool(
                "evaluate",
                arguments={
                    "script": "document.querySelector('h1').textContent"
                }
            )

            return data

asyncio.run(scrape_with_puppeteer())
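
The exact tool names ("navigate", "evaluate", and so on) vary between servers, so it is worth discovering them at runtime before hard-coding calls; the Python SDK exposes this via session.list_tools():

# Inside an initialized ClientSession
tools = await session.list_tools()
for tool in tools.tools:
    print(tool.name, "-", tool.description)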

3. Building a Custom MCP Server for Web Scraping

For advanced use cases, you may want to build your own MCP server. Here's a tutorial outline:

TypeScript Implementation

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";
import * as cheerio from "cheerio";

// Create a new MCP server
const server = new Server(
  {
    name: "custom-scraper",
    version: "1.0.0"
  },
  {
    capabilities: {
      tools: {}
    }
  }
);

// Define a scraping tool (the SDK routes requests by typed schema, not by method string)
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_url",
        description: "Scrape content from a URL",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "URL to scrape"
            },
            selector: {
              type: "string",
              description: "CSS selector for content"
            }
          },
          required: ["url"]
        }
      }
    ]
  };
});

// Implement the scraping logic
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "scrape_url") {
    const { url, selector } = request.params.arguments as {
      url: string;
      selector?: string;
    };

    try {
      const response = await axios.get(url);
      const $ = cheerio.load(response.data);

      const content = selector
        ? $(selector).text()
        : $('body').text();

      return {
        content: [
          {
            type: "text",
            text: content
          }
        ]
      };
    } catch (error) {
      return {
        content: [
          {
            type: "text",
            text: `Error: ${error.message}`
          }
        ],
        isError: true
      };
    }
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

// Start the server
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
}

main().catch(console.error);
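
To use this server from an MCP client, compile the TypeScript and point the client configuration at the compiled entry point. A sketch, assuming the build output lands in dist/index.js:

{
  "mcpServers": {
    "custom-scraper": {
      "command": "node",
      "args": ["path/to/custom-scraper/dist/index.js"]
    }
  }
}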

4. WebScraping.AI MCP Integration Tutorial

For developers who want a managed solution, integrating WebScraping.AI with MCP provides powerful capabilities without managing browser infrastructure:

Installation

npm install @webscraping-ai/mcp-server

Configuration Example

{
  "mcpServers": {
    "webscraping-ai": {
      "command": "npx",
      "args": ["-y", "@webscraping-ai/mcp-server"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your-api-key-here"
      }
    }
  }
}

Usage Example

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def scrape_with_api():
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@webscraping-ai/mcp-server"],
        env={"WEBSCRAPING_AI_API_KEY": "your-api-key"}
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Scrape with AI question answering
            result = await session.call_tool(
                "scrape_question",
                arguments={
                    "url": "https://example.com/products",
                    "question": "What are all the product names and prices?"
                }
            )

            print(result)

asyncio.run(scrape_with_api())

Advanced MCP Web Scraping Patterns

Handling Dynamic Content

When scraping JavaScript-heavy websites, MCP servers with browser automation capabilities excel. Here's an advanced pattern for handling AJAX requests using Puppeteer through MCP:

// Describe to your MCP-enabled AI assistant:
// "Navigate to the page, wait for the AJAX request to complete,
// then extract the dynamically loaded data"

// The MCP server executes:
await page.goto('https://example.com/dynamic-content');
await page.waitForSelector('.loaded-content');
await page.waitForTimeout(1000); // Wait for AJAX

const data = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
  }));
});
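
Note that page.waitForTimeout has been deprecated in recent Puppeteer releases; where possible, prefer an explicit condition such as page.waitForSelector or page.waitForResponse over a fixed delay.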

Error Handling and Retry Logic

// Custom MCP tool with robust error handling
// (CallToolRequestSchema is imported from "@modelcontextprotocol/sdk/types.js")
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const maxRetries = 3;
  let lastError;

  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await axios.get(request.params.arguments.url, {
        timeout: 10000,
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; MCPBot/1.0)'
        }
      });

      return {
        content: [{ type: "text", text: response.data }]
      };
    } catch (error) {
      lastError = error;
      await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
    }
  }

  return {
    content: [{ type: "text", text: `Failed after ${maxRetries} retries: ${lastError.message}` }],
    isError: true
  };
});
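
The backoff above grows linearly (1s, 2s, 3s between attempts); for frequently rate-limited targets, exponential backoff with jitter is a common alternative.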

Working with Pagination

# MCP tool call for paginated scraping
async def scrape_all_pages(session, base_url, max_pages=10):
    all_data = []

    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"

        result = await session.call_tool(
            "scrape_url",
            arguments={
                "url": url,
                "selector": ".product-list .item"
            }
        )

        if not result.content:
            break  # No more data

        all_data.extend(result.content)

    return all_data
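
A sketch of how this helper plugs into a client session, assuming the custom scrape_url server from the earlier tutorial (the path is illustrative):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server_params = StdioServerParameters(
        command="node",
        args=["path/to/custom-scraper/dist/index.js"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            items = await scrape_all_pages(session, "https://example.com/products", max_pages=5)
            print(f"Collected {len(items)} content blocks")

asyncio.run(main())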

Best Practices for MCP Web Scraping

1. Resource Management

When using browser-based MCP servers, always ensure proper cleanup:

// In your MCP server implementation
let browser = null;

async function cleanupBrowser() {
  if (browser) {
    await browser.close();
    browser = null;
  }
}

// Handle process termination; once a signal handler is installed,
// Node no longer exits by default, so exit explicitly after cleanup
process.on('SIGINT', async () => {
  await cleanupBrowser();
  process.exit(0);
});
process.on('SIGTERM', async () => {
  await cleanupBrowser();
  process.exit(0);
});

2. Rate Limiting

Implement rate limiting in your custom MCP servers:

class RateLimiter {
  private queue: Array<() => Promise<any>> = [];
  private processing = false;

  async add<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push(async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (error) {
          reject(error);
        }
      });

      this.process();
    });
  }

  private async process() {
    if (this.processing || this.queue.length === 0) return;

    this.processing = true;
    const fn = this.queue.shift();

    if (fn) {
      await fn();
      await new Promise(resolve => setTimeout(resolve, 1000)); // 1 second delay
    }

    this.processing = false;
    this.process();
  }
}
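
Because process() re-invokes itself after each job and enforces a fixed one-second gap, requests are strictly serialized no matter how many callers await add() concurrently. For per-domain limits or burst tolerance, a token bucket would be a natural replacement for the fixed delay.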

3. Structured Data Extraction

Use JSON schema to define expected data structures:

{
  "tools": [
    {
      "name": "extract_product_data",
      "description": "Extract structured product information",
      "inputSchema": {
        "type": "object",
        "properties": {
          "url": { "type": "string" }
        }
      },
      "outputSchema": {
        "type": "object",
        "properties": {
          "products": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "price": { "type": "number" },
                "availability": { "type": "boolean" }
              }
            }
          }
        }
      }
    }
  ]
}

Learning Resources and Community

Official Documentation

  • MCP Specification: https://spec.modelcontextprotocol.io/
  • MCP SDK Documentation: GitHub repositories for TypeScript and Python SDKs
  • Example Servers: The official MCP GitHub organization hosts numerous example servers

Community Tutorials

  • Anthropic Developer Forums: Active community discussing MCP implementations
  • GitHub Discussions: Many MCP server repositories have detailed discussions
  • YouTube Channels: Search for "MCP server tutorial" for video walkthroughs

Sample Projects

Clone and study these open-source MCP servers:

# Playwright MCP Server
git clone https://github.com/automatalabs/mcp-server-playwright

# Puppeteer MCP Server
git clone https://github.com/executeautomation/puppeteer-mcp-server

# Fetch MCP Server (HTTP requests) lives in the official servers monorepo;
# clone the whole repo and see src/fetch
git clone https://github.com/modelcontextprotocol/servers

Conclusion

MCP servers provide a powerful, standardized approach to web scraping that combines the flexibility of traditional scraping libraries with AI-assisted automation. Whether you're using pre-built servers like Playwright and Puppeteer MCP or building your own custom solutions, the tutorials and patterns outlined here will help you get started and scale your web scraping operations effectively.

Start with the official MCP documentation, experiment with existing servers, and gradually build your own custom tools as your needs evolve. The combination of browser automation, similar to navigating between pages with Puppeteer, and AI-powered orchestration makes MCP an excellent choice for modern web scraping projects.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
