How do I scrape websites using Puppeteer with MCP?

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers. When combined with the Model Context Protocol (MCP), you can create sophisticated web scraping workflows that leverage AI-assisted browser automation. This guide demonstrates how to use Puppeteer with MCP servers for efficient web scraping.

Understanding Puppeteer and MCP Integration

The Model Context Protocol (MCP) is an open protocol that standardizes how applications provide context and tools to AI models. When you integrate Puppeteer with MCP, you can expose browser automation capabilities as MCP tools, allowing AI assistants like Claude to control browsers and extract data programmatically.

The MCP Puppeteer server provides a standardized interface for browser automation tasks including:

  • Page navigation and interaction
  • DOM element selection and manipulation
  • Screenshot capture
  • Network request monitoring (see the sketch after this list)
  • JavaScript execution in browser context
  • Form filling and submission
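
As an example of the network-monitoring capability, here is a minimal standalone sketch using Puppeteer's request-interception API; the blocked resource types and the example.com URL are illustrative assumptions:

// Sketch: observe responses and block heavy assets during scraping
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();

// Request interception lets us inspect or abort each outgoing request
await page.setRequestInterception(true);
page.on('request', (request) => {
  if (['image', 'font', 'media'].includes(request.resourceType())) {
    request.abort(); // skip heavy assets to speed up scraping
  } else {
    request.continue();
  }
});

// Log each response's status and URL as it arrives
page.on('response', (response) => {
  console.log(response.status(), response.url());
});

await page.goto('https://example.com');
await browser.close();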

Setting Up Puppeteer with MCP

Installation

First, install the necessary dependencies:

# Install Puppeteer
npm install puppeteer

# Install MCP SDK
npm install @modelcontextprotocol/sdk

For Docker environments, you may need additional dependencies. Learn more about using Puppeteer with Docker.

Creating a Basic MCP Puppeteer Server

Here's a minimal MCP server implementation that exposes Puppeteer functionality:

import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { ListToolsRequestSchema, CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import puppeteer from 'puppeteer';

class PuppeteerMCPServer {
  constructor() {
    this.server = new Server({
      name: 'puppeteer-scraper',
      version: '1.0.0'
    }, {
      capabilities: {
        tools: {}
      }
    });

    this.browser = null;
    this.page = null;

    this.setupTools();
  }

  setupTools() {
    // Advertise the available tools
    this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: [
        {
          name: 'navigate',
          description: 'Navigate to a URL',
          inputSchema: {
            type: 'object',
            properties: {
              url: { type: 'string', description: 'URL to navigate to' }
            },
            required: ['url']
          }
        },
        {
          name: 'scrape_content',
          description: 'Extract text content from current page',
          inputSchema: {
            type: 'object',
            properties: {
              selector: { type: 'string', description: 'CSS selector' }
            }
          }
        },
        {
          name: 'screenshot',
          description: 'Take a screenshot of the page',
          inputSchema: {
            type: 'object',
            properties: {
              path: { type: 'string', description: 'File path to save' }
            },
            required: ['path']
          }
        }
      ]
    }));

    // Handle tool execution
    this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const { name, arguments: args } = request.params;

      // Initialize browser if needed
      if (!this.browser) {
        this.browser = await puppeteer.launch({
          headless: 'new',
          args: ['--no-sandbox', '--disable-setuid-sandbox']
        });
        this.page = await this.browser.newPage();
      }

      switch (name) {
        case 'navigate':
          await this.page.goto(args.url, { waitUntil: 'networkidle0' });
          return { content: [{ type: 'text', text: `Navigated to ${args.url}` }] };

        case 'scrape_content': {
          const content = await this.page.evaluate((selector) => {
            const elements = selector
              ? document.querySelectorAll(selector)
              : [document.body];
            return Array.from(elements).map(el => el.textContent.trim());
          }, args.selector);
          return { content: [{ type: 'text', text: JSON.stringify(content) }] };
        }

        case 'screenshot':
          await this.page.screenshot({ path: args.path });
          return { content: [{ type: 'text', text: `Screenshot saved to ${args.path}` }] };

        default:
          throw new Error(`Unknown tool: ${name}`);
      }
    });
  }

  async run() {
    const transport = new StdioServerTransport();
    await this.server.connect(transport);
  }

  async cleanup() {
    if (this.browser) {
      await this.browser.close();
    }
  }
}

// Start the server
const server = new PuppeteerMCPServer();
server.run().catch(console.error);

process.on('SIGINT', async () => {
  await server.cleanup();
  process.exit(0);
});

Advanced Web Scraping with Puppeteer MCP

Handling Dynamic Content

Many modern websites load content dynamically via AJAX. To effectively handle AJAX requests using Puppeteer, you need to wait for content to load:

{
  name: 'wait_and_scrape',
  description: 'Wait for an element and scrape content',
  inputSchema: {
    type: 'object',
    properties: {
      selector: { type: 'string' },
      timeout: { type: 'number', default: 30000 }
    },
    required: ['selector']
  }
}

// In the handler:
case 'wait_and_scrape': {
  await this.page.waitForSelector(args.selector, {
    timeout: args.timeout || 30000
  });
  const data = await this.page.$eval(args.selector, el => ({
    text: el.textContent,
    html: el.innerHTML,
    attributes: Array.from(el.attributes).map(attr => ({
      name: attr.name,
      value: attr.value
    }))
  }));
  return { content: [{ type: 'text', text: JSON.stringify(data) }] };
}

Scraping with Authentication

To handle websites that require login:

{
  name: 'login',
  description: 'Authenticate with username and password',
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string' },
      username: { type: 'string' },
      password: { type: 'string' },
      usernameSelector: { type: 'string' },
      passwordSelector: { type: 'string' },
      submitSelector: { type: 'string' }
    },
    required: ['url', 'username', 'password']
  }
}

// Handler implementation:
case 'login':
  await this.page.goto(args.url);
  await this.page.type(args.usernameSelector || 'input[type="text"]', args.username);
  await this.page.type(args.passwordSelector || 'input[type="password"]', args.password);
  await this.page.click(args.submitSelector || 'button[type="submit"]');
  await this.page.waitForNavigation({ waitUntil: 'networkidle0' });
  return { content: [{ type: 'text', text: 'Login successful' }] };
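
The handler above reports success unconditionally. A hedged variant, sketched below, confirms the login by waiting for an element that should only appear when authenticated; the successSelector parameter and the logout-link fallback are assumptions, not part of the tool schema above:

// Sketch: verify login instead of assuming it succeeded
// args.successSelector is a hypothetical extra parameter
case 'login': {
  await this.page.goto(args.url);
  await this.page.type(args.usernameSelector || 'input[type="text"]', args.username);
  await this.page.type(args.passwordSelector || 'input[type="password"]', args.password);
  // Click and wait for the resulting navigation together to avoid a race
  await Promise.all([
    this.page.waitForNavigation({ waitUntil: 'networkidle0' }),
    this.page.click(args.submitSelector || 'button[type="submit"]')
  ]);
  try {
    await this.page.waitForSelector(args.successSelector || 'a[href*="logout"]', { timeout: 10000 });
    return { content: [{ type: 'text', text: 'Login verified' }] };
  } catch {
    return { content: [{ type: 'text', text: 'Login may have failed' }], isError: true };
  }
}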

Learn more about handling authentication in Puppeteer.

Extracting Structured Data

For extracting structured data like product listings or table data:

{
  name: 'extract_structured_data',
  description: 'Extract structured data from repeating elements',
  inputSchema: {
    type: 'object',
    properties: {
      containerSelector: { type: 'string' },
      fields: {
        type: 'object',
        description: 'Field name to selector mapping'
      }
    },
    required: ['containerSelector', 'fields']
  }
}

// Handler:
case 'extract_structured_data': {
  const results = await this.page.$$eval(
    args.containerSelector,
    (elements, fields) => {
      return elements.map(element => {
        const item = {};
        for (const [fieldName, selector] of Object.entries(fields)) {
          const el = element.querySelector(selector);
          item[fieldName] = el ? el.textContent.trim() : null;
        }
        return item;
      });
    },
    args.fields
  );
  return { content: [{ type: 'text', text: JSON.stringify(results, null, 2) }] };
}

Python Implementation with MCP

You can also create an MCP server using Python with Pyppeteer, an unofficial Python port of Puppeteer (no longer actively maintained, but sufficient for illustration):

from mcp.server import Server
from mcp.server.stdio import stdio_server
import mcp.types as types
from pyppeteer import launch
import asyncio
import json

class PuppeteerMCPServer:
    def __init__(self):
        self.server = Server("puppeteer-scraper")
        self.browser = None
        self.page = None

        # Register tools
        @self.server.list_tools()
        async def list_tools() -> list[types.Tool]:
            return [
                types.Tool(
                    name="navigate",
                    description="Navigate to a URL",
                    inputSchema={
                        "type": "object",
                        "properties": {"url": {"type": "string"}},
                        "required": ["url"]
                    }
                ),
                types.Tool(
                    name="scrape_text",
                    description="Extract text from elements",
                    inputSchema={
                        "type": "object",
                        "properties": {"selector": {"type": "string"}},
                        "required": ["selector"]
                    }
                )
            ]

        @self.server.call_tool()
        async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
            # Launch the browser lazily on first use
            if not self.browser:
                self.browser = await launch(
                    headless=True,
                    args=['--no-sandbox', '--disable-setuid-sandbox']
                )
                self.page = await self.browser.newPage()

            if name == "navigate":
                await self.page.goto(arguments["url"], {'waitUntil': 'networkidle0'})
                return [types.TextContent(type="text", text=f"Navigated to {arguments['url']}")]

            elif name == "scrape_text":
                elements = await self.page.querySelectorAll(arguments["selector"])
                texts = []
                for element in elements:
                    text = await self.page.evaluate('(element) => element.textContent', element)
                    texts.append(text.strip())
                return [types.TextContent(type="text", text=json.dumps(texts))]

            else:
                raise ValueError(f"Unknown tool: {name}")

    async def run(self):
        async with stdio_server() as (read_stream, write_stream):
            await self.server.run(
                read_stream,
                write_stream,
                self.server.create_initialization_options()
            )

    async def cleanup(self):
        if self.browser:
            await self.browser.close()

# Run the server
async def main():
    server = PuppeteerMCPServer()
    try:
        await server.run()
    finally:
        await server.cleanup()

if __name__ == "__main__":
    asyncio.run(main())

Best Practices for Puppeteer MCP Scraping

1. Resource Management

Always properly manage browser resources:

async cleanup() {
  if (this.page) {
    await this.page.close();
  }
  if (this.browser) {
    await this.browser.close();
  }
}

// Register cleanup handlers on the server instance
process.on('SIGTERM', () => server.cleanup());
process.on('SIGINT', () => server.cleanup());

2. Error Handling

Implement robust error handling for network issues and page load failures:

case 'navigate':
  try {
    await this.page.goto(args.url, {
      waitUntil: 'networkidle0',
      timeout: 30000
    });
    return { content: [{ type: 'text', text: `Successfully loaded ${args.url}` }] };
  } catch (error) {
    if (error.name === 'TimeoutError') {
      return {
        content: [{ type: 'text', text: `Timeout loading ${args.url}` }],
        isError: true
      };
    }
    throw error;
  }

3. Rate Limiting

Implement delays between requests to avoid overwhelming servers:

async function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

case 'scrape_multiple': {
  const results = [];
  for (const url of args.urls) {
    await this.page.goto(url);
    const data = await this.page.content();
    results.push(data);
    await delay(2000); // 2-second delay between requests
  }
  return { content: [{ type: 'text', text: JSON.stringify(results) }] };
}

4. Session Management

Maintain browser sessions efficiently to reuse cookies and reduce overhead. For complex scenarios, review techniques for handling browser sessions in Puppeteer.
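
For instance, a minimal sketch for persisting cookies to disk between runs (the cookies.json path is an assumed convention):

import fs from 'fs/promises';

// Save the current session's cookies so a later run can reuse them
async function saveCookies(page, path = 'cookies.json') {
  const cookies = await page.cookies();
  await fs.writeFile(path, JSON.stringify(cookies, null, 2));
}

// Restore saved cookies before navigating, resuming the previous session
async function loadCookies(page, path = 'cookies.json') {
  try {
    const cookies = JSON.parse(await fs.readFile(path, 'utf8'));
    await page.setCookie(...cookies);
  } catch {
    // No saved session yet; start fresh
  }
}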

Configuring MCP Client to Use Puppeteer Server

To use your Puppeteer MCP server with Claude Desktop or other MCP clients, add this configuration to your MCP settings file:

{
  "mcpServers": {
    "puppeteer-scraper": {
      "command": "node",
      "args": ["/path/to/your/puppeteer-mcp-server.js"]
    }
  }
}

For Claude Desktop on macOS, this file is located at: ~/Library/Application Support/Claude/claude_desktop_config.json

Example Use Cases

E-commerce Product Scraping

// Tool definition for product scraping
{
  name: 'scrape_product',
  description: 'Extract product information from e-commerce page',
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string' },
      selectors: {
        type: 'object',
        properties: {
          title: { type: 'string' },
          price: { type: 'string' },
          description: { type: 'string' },
          images: { type: 'string' }
        }
      }
    },
    required: ['url', 'selectors']
  }
}
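
The tool definition above omits the handler; the sketch below is one plausible implementation that treats missing selectors as nulls:

// Sketch of a handler for the scrape_product tool defined above
case 'scrape_product': {
  await this.page.goto(args.url, { waitUntil: 'networkidle0' });
  const product = await this.page.evaluate((selectors) => {
    const text = (sel) => sel ? document.querySelector(sel)?.textContent.trim() ?? null : null;
    return {
      title: text(selectors.title),
      price: text(selectors.price),
      description: text(selectors.description),
      images: selectors.images
        ? Array.from(document.querySelectorAll(selectors.images)).map(img => img.src)
        : []
    };
  }, args.selectors);
  return { content: [{ type: 'text', text: JSON.stringify(product, null, 2) }] };
}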

News Article Extraction

case 'scrape_article': {
  await this.page.goto(args.url);
  const article = await this.page.evaluate(() => {
    return {
      title: document.querySelector('h1')?.textContent,
      author: document.querySelector('[rel="author"]')?.textContent,
      date: document.querySelector('time')?.getAttribute('datetime'),
      content: document.querySelector('article')?.textContent,
      images: Array.from(document.querySelectorAll('article img'))
        .map(img => img.src)
    };
  });
  return { content: [{ type: 'text', text: JSON.stringify(article) }] };
}

Troubleshooting Common Issues

Browser Launch Failures

If the browser fails to launch, ensure you have the necessary dependencies:

# Linux
sudo apt-get install -y \
  chromium-browser \
  libnss3 \
  libatk-bridge2.0-0 \
  libdrm2 \
  libxkbcommon0 \
  libgbm1

# macOS (via Homebrew)
brew install --cask chromium

Memory Leaks

Close pages and contexts when done to prevent memory leaks:

async rotatePages() {
  if (this.page) {
    await this.page.close();
  }
  this.page = await this.browser.newPage();
}

Timeout Issues

Adjust timeouts based on page complexity:

await this.page.goto(url, {
  waitUntil: ['load', 'domcontentloaded', 'networkidle0'],
  timeout: 60000 // 60 seconds for slow pages
});
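
If most pages on a site are slow, you can also raise the page-wide defaults once instead of passing a timeout on every call:

// Raise default timeouts once per page instead of per call
this.page.setDefaultNavigationTimeout(60000); // goto and waitForNavigation
this.page.setDefaultTimeout(60000);           // waitForSelector and similar waits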

Conclusion

Integrating Puppeteer with MCP creates a powerful framework for AI-assisted web scraping. By exposing browser automation capabilities through the Model Context Protocol, you enable AI assistants to intelligently navigate websites, extract data, and handle complex scraping scenarios. This approach combines the flexibility of Puppeteer with the intelligence of AI models, making web scraping more accessible and maintainable.

Remember to always respect robots.txt files, terms of service, and implement appropriate rate limiting when scraping websites. For production environments, consider implementing proxy rotation, CAPTCHA handling, and comprehensive error recovery mechanisms.
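
As a starting point for proxy rotation, the sketch below launches the browser with a proxy picked from a pool; the proxy URLs and credentials are placeholders, not real endpoints:

// Sketch: route each browser launch through a different proxy (placeholders)
const proxies = ['http://proxy1.example:8080', 'http://proxy2.example:8080'];
const proxy = proxies[Math.floor(Math.random() * proxies.length)];

const browser = await puppeteer.launch({
  headless: 'new',
  args: [`--proxy-server=${proxy}`]
});

const page = await browser.newPage();
// Puppeteer supports HTTP proxy credentials via page.authenticate
await page.authenticate({ username: 'user', password: 'pass' });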

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
