How do I scrape websites using Puppeteer with MCP?
Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers. When combined with the Model Context Protocol (MCP), you can create sophisticated web scraping workflows that leverage AI-assisted browser automation. This guide demonstrates how to use Puppeteer with MCP servers for efficient web scraping.
Understanding Puppeteer and MCP Integration
The Model Context Protocol (MCP) is an open protocol that standardizes how applications expose context and tools to AI models. When you integrate Puppeteer with MCP, you expose browser automation capabilities as MCP tools, allowing AI assistants like Claude to control a browser and extract data programmatically.
An MCP Puppeteer server provides a standardized interface for browser automation tasks, including:
- Page navigation and interaction
- DOM element selection and manipulation
- Screenshot capture
- Network request monitoring
- JavaScript execution in browser context
- Form filling and submission
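Once such a server is running, an MCP client drives it with ordinary JSON-RPC requests. As a rough illustration, a call to a navigation tool (matching the navigate tool defined later in this guide) looks like this:
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "navigate",
    "arguments": { "url": "https://example.com" }
  }
}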
Setting Up Puppeteer with MCP
Installation
First, install the necessary dependencies:
# Install Puppeteer
npm install puppeteer
# Install MCP SDK
npm install @modelcontextprotocol/sdk
For Docker environments, you may need additional dependencies. Learn more about using Puppeteer with Docker.
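In containerized environments a couple of extra launch flags are usually needed as well; a minimal sketch (the exact flag set depends on your base image), where --disable-dev-shm-usage works around Docker's small default /dev/shm:
import puppeteer from 'puppeteer';

// Launch options commonly needed inside a container; adjust for your image.
const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--no-sandbox',            // often required when Chrome runs as root in a container
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage'  // write shared memory to /tmp instead of the small /dev/shm
  ]
});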
Creating a Basic MCP Puppeteer Server
Here's a minimal MCP server implementation that exposes Puppeteer functionality:
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { ListToolsRequestSchema, CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import puppeteer from 'puppeteer';
class PuppeteerMCPServer {
constructor() {
this.server = new Server({
name: 'puppeteer-scraper',
version: '1.0.0'
}, {
capabilities: {
tools: {}
}
});
this.browser = null;
this.page = null;
this.setupTools();
}
setupTools() {
// Advertise the available tools
this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: 'navigate',
description: 'Navigate to a URL',
inputSchema: {
type: 'object',
properties: {
url: { type: 'string', description: 'URL to navigate to' }
},
required: ['url']
}
},
{
name: 'scrape_content',
description: 'Extract text content from current page',
inputSchema: {
type: 'object',
properties: {
selector: { type: 'string', description: 'CSS selector' }
}
}
},
{
name: 'screenshot',
description: 'Take a screenshot of the page',
inputSchema: {
type: 'object',
properties: {
path: { type: 'string', description: 'File path to save' }
},
required: ['path']
}
}
]
}));
// Handle tool execution
this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
// Initialize browser if needed
if (!this.browser) {
this.browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
this.page = await this.browser.newPage();
}
switch (name) {
case 'navigate':
await this.page.goto(args.url, { waitUntil: 'networkidle0' });
return { content: [{ type: 'text', text: `Navigated to ${args.url}` }] };
case 'scrape_content':
const content = await this.page.evaluate((selector) => {
const elements = selector
? document.querySelectorAll(selector)
: [document.body];
return Array.from(elements).map(el => el.textContent.trim());
}, args.selector);
return { content: [{ type: 'text', text: JSON.stringify(content) }] };
case 'screenshot':
await this.page.screenshot({ path: args.path });
return { content: [{ type: 'text', text: `Screenshot saved to ${args.path}` }] };
default:
throw new Error(`Unknown tool: ${name}`);
}
});
}
async run() {
const transport = new StdioServerTransport();
await this.server.connect(transport);
}
async cleanup() {
if (this.browser) {
await this.browser.close();
}
}
}
// Start the server
const server = new PuppeteerMCPServer();
server.run().catch(console.error);
process.on('SIGINT', async () => {
await server.cleanup();
process.exit(0);
});
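To try the server before wiring it into a client, you can drive it interactively with the MCP Inspector (assuming the file above is saved as puppeteer-mcp-server.js, the same name used in the client configuration later in this guide):
npx @modelcontextprotocol/inspector node puppeteer-mcp-server.js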
Advanced Web Scraping with Puppeteer MCP
Handling Dynamic Content
Many modern websites load content dynamically via AJAX. To effectively handle AJAX requests using Puppeteer, you need to wait for content to load:
{
name: 'wait_and_scrape',
description: 'Wait for an element and scrape content',
inputSchema: {
type: 'object',
properties: {
selector: { type: 'string' },
timeout: { type: 'number', default: 30000 }
},
required: ['selector']
}
}
// In the handler:
case 'wait_and_scrape':
await this.page.waitForSelector(args.selector, {
timeout: args.timeout || 30000
});
const data = await this.page.$eval(args.selector, el => ({
text: el.textContent,
html: el.innerHTML,
attributes: Array.from(el.attributes).map(attr => ({
name: attr.name,
value: attr.value
}))
}));
return { content: [{ type: 'text', text: JSON.stringify(data) }] };
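When no single element signals that the page is ready (for example, results arrive over several AJAX calls), page.waitForFunction lets you wait on an arbitrary predicate instead. A sketch, assuming a hypothetical .result selector for the rendered items:
// Wait until at least `minCount` result elements have rendered, then scrape them.
await this.page.waitForFunction(
  (selector, minCount) => document.querySelectorAll(selector).length >= minCount,
  { timeout: 30000 },
  '.result', // hypothetical selector for rendered items
  5          // hypothetical minimum number of items to wait for
);
const items = await this.page.$$eval('.result',
  els => els.map(el => el.textContent.trim()));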
Scraping with Authentication
To handle websites that require login:
{
name: 'login',
description: 'Authenticate with username and password',
inputSchema: {
type: 'object',
properties: {
url: { type: 'string' },
username: { type: 'string' },
password: { type: 'string' },
usernameSelector: { type: 'string' },
passwordSelector: { type: 'string' },
submitSelector: { type: 'string' }
},
required: ['url', 'username', 'password']
}
}
// Handler implementation:
case 'login':
await this.page.goto(args.url);
await this.page.type(args.usernameSelector || 'input[type="text"]', args.username);
await this.page.type(args.passwordSelector || 'input[type="password"]', args.password);
await this.page.click(args.submitSelector || 'button[type="submit"]');
await this.page.waitForNavigation({ waitUntil: 'networkidle0' });
return { content: [{ type: 'text', text: 'Login successful' }] };
Learn more about handling authentication in Puppeteer.
Extracting Structured Data
For extracting structured data like product listings or table data:
{
name: 'extract_structured_data',
description: 'Extract structured data from repeating elements',
inputSchema: {
type: 'object',
properties: {
containerSelector: { type: 'string' },
fields: {
type: 'object',
description: 'Field name to selector mapping'
}
},
required: ['containerSelector', 'fields']
}
}
// Handler:
case 'extract_structured_data':
const results = await this.page.$$eval(
args.containerSelector,
(elements, fields) => {
return elements.map(element => {
const item = {};
for (const [fieldName, selector] of Object.entries(fields)) {
const el = element.querySelector(selector);
item[fieldName] = el ? el.textContent.trim() : null;
}
return item;
});
},
args.fields
);
return { content: [{ type: 'text', text: JSON.stringify(results, null, 2) }] };
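As a usage illustration, a client call to this tool for a hypothetical product grid might pass arguments like the following (the selectors are placeholders for whatever the target site actually uses):
{
  "name": "extract_structured_data",
  "arguments": {
    "containerSelector": ".product-card",
    "fields": {
      "title": ".product-title",
      "price": ".price",
      "rating": ".rating"
    }
  }
}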
Python Implementation with MCP
You can also create an MCP server using Python with Pyppeteer (a Python port of Puppeteer).
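Assuming you use the official Python SDK (published on PyPI as mcp) alongside Pyppeteer, install both first:
pip install mcp pyppeteer
With the dependencies in place, a minimal server looks like this: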
from mcp.server import Server
from mcp.server.stdio import stdio_server
import mcp.types as types
from pyppeteer import launch
import asyncio
import json
class PuppeteerMCPServer:
def __init__(self):
self.server = Server("puppeteer-scraper")
self.browser = None
self.page = None
# Register tools
@self.server.list_tools()
async def list_tools():
return [
{
"name": "navigate",
"description": "Navigate to a URL",
"inputSchema": {
"type": "object",
"properties": {
"url": {"type": "string"}
},
"required": ["url"]
}
},
{
"name": "scrape_text",
"description": "Extract text from elements",
"inputSchema": {
"type": "object",
"properties": {
"selector": {"type": "string"}
},
"required": ["selector"]
}
}
]
@self.server.call_tool()
async def call_tool(name: str, arguments: dict):
# Initialize browser if needed
if not self.browser:
self.browser = await launch(
headless=True,
args=['--no-sandbox', '--disable-setuid-sandbox']
)
self.page = await self.browser.newPage()
if name == "navigate":
await self.page.goto(arguments["url"], {
'waitUntil': 'networkidle0'
})
return f"Navigated to {arguments['url']}"
elif name == "scrape_text":
elements = await self.page.querySelectorAll(arguments["selector"])
texts = []
for element in elements:
text = await self.page.evaluate('(element) => element.textContent', element)
texts.append(text.strip())
return [types.TextContent(type="text", text=json.dumps(texts))]
else:
raise ValueError(f"Unknown tool: {name}")
async def run(self):
async with stdio_server() as (read_stream, write_stream):
await self.server.run(
read_stream,
write_stream,
self.server.create_initialization_options()
)
async def cleanup(self):
if self.browser:
await self.browser.close()
# Run the server
async def main():
server = PuppeteerMCPServer()
try:
await server.run()
finally:
await server.cleanup()
if __name__ == "__main__":
asyncio.run(main())
Best Practices for Puppeteer MCP Scraping
1. Resource Management
Always properly manage browser resources:
async cleanup() {
if (this.page) {
await this.page.close();
}
if (this.browser) {
await this.browser.close();
}
}
// Set up cleanup handlers (call cleanup on the server instance)
process.on('SIGTERM', () => server.cleanup());
process.on('SIGINT', () => server.cleanup());
2. Error Handling
Implement robust error handling for network issues and page load failures:
case 'navigate':
try {
await this.page.goto(args.url, {
waitUntil: 'networkidle0',
timeout: 30000
});
return { content: [{ type: 'text', text: `Successfully loaded ${args.url}` }] };
} catch (error) {
if (error.name === 'TimeoutError') {
return {
content: [{ type: 'text', text: `Timeout loading ${args.url}` }],
isError: true
};
}
throw error;
}
3. Rate Limiting
Implement delays between requests to avoid overwhelming servers:
async function delay(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
case 'scrape_multiple':
const results = [];
for (const url of args.urls) {
await this.page.goto(url);
const data = await this.page.content();
results.push(data);
await delay(2000); // 2-second delay between requests
}
return { content: [{ type: 'text', text: JSON.stringify(results) }] };
4. Session Management
Maintain browser sessions efficiently to reuse cookies and reduce overhead. For complex scenarios, review techniques for handling browser sessions in Puppeteer.
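A minimal sketch of cookie persistence between runs (the cookies.json path is just an example):
import fs from 'node:fs/promises';

// Persist the current session's cookies so later runs can skip the login form.
async function saveSession(page, path = 'cookies.json') {
  const cookies = await page.cookies();
  await fs.writeFile(path, JSON.stringify(cookies, null, 2));
}

// Restore previously saved cookies before visiting authenticated pages.
async function restoreSession(page, path = 'cookies.json') {
  try {
    const cookies = JSON.parse(await fs.readFile(path, 'utf8'));
    await page.setCookie(...cookies);
  } catch {
    // No saved session yet; fall back to a fresh login.
  }
}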
Configuring MCP Client to Use Puppeteer Server
To use your Puppeteer MCP server with Claude Desktop or other MCP clients, add this configuration to your MCP settings file:
{
"mcpServers": {
"puppeteer-scraper": {
"command": "node",
"args": ["/path/to/your/puppeteer-mcp-server.js"]
}
}
}
For Claude Desktop on macOS, this file is located at:
~/Library/Application Support/Claude/claude_desktop_config.json
Example Use Cases
E-commerce Product Scraping
// Tool definition for product scraping
{
name: 'scrape_product',
description: 'Extract product information from e-commerce page',
inputSchema: {
type: 'object',
properties: {
url: { type: 'string' },
selectors: {
type: 'object',
properties: {
title: { type: 'string' },
price: { type: 'string' },
description: { type: 'string' },
images: { type: 'string' }
}
}
},
required: ['url', 'selectors']
}
}
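A possible handler for this tool, following the same switch structure as the handlers above (the field names mirror the selectors object defined in the schema, so nothing is hard-coded here):
case 'scrape_product': {
  await this.page.goto(args.url, { waitUntil: 'networkidle0' });
  const product = await this.page.evaluate((selectors) => {
    // Helper: trimmed text for a selector, or null if the selector is absent or unmatched.
    const text = (sel) => {
      if (!sel) return null;
      const el = document.querySelector(sel);
      return el ? el.textContent.trim() : null;
    };
    return {
      title: text(selectors.title),
      price: text(selectors.price),
      description: text(selectors.description),
      images: selectors.images
        ? Array.from(document.querySelectorAll(selectors.images)).map(img => img.src)
        : []
    };
  }, args.selectors);
  return { content: [{ type: 'text', text: JSON.stringify(product, null, 2) }] };
}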
News Article Extraction
case 'scrape_article':
await this.page.goto(args.url);
const article = await this.page.evaluate(() => {
return {
title: document.querySelector('h1')?.textContent,
author: document.querySelector('[rel="author"]')?.textContent,
date: document.querySelector('time')?.getAttribute('datetime'),
content: document.querySelector('article')?.textContent,
images: Array.from(document.querySelectorAll('article img'))
.map(img => img.src)
};
});
return { content: [{ type: 'text', text: JSON.stringify(article) }] };
Troubleshooting Common Issues
Browser Launch Failures
If the browser fails to launch, ensure you have the necessary dependencies:
# Linux
sudo apt-get install -y \
chromium-browser \
libnss3 \
libatk-bridge2.0-0 \
libdrm2 \
libxkbcommon0 \
libgbm1
# macOS (via Homebrew)
brew install chromium
Memory Leaks
Close pages and contexts when done to prevent memory leaks:
async rotatePages() {
if (this.page) {
await this.page.close();
}
this.page = await this.browser.newPage();
}
Timeout Issues
Adjust timeouts based on page complexity:
await this.page.goto(url, {
waitUntil: ['load', 'domcontentloaded', 'networkidle0'],
timeout: 60000 // 60 seconds for slow pages
});
Conclusion
Integrating Puppeteer with MCP creates a powerful framework for AI-assisted web scraping. By exposing browser automation capabilities through the Model Context Protocol, you enable AI assistants to intelligently navigate websites, extract data, and handle complex scraping scenarios. This approach combines the flexibility of Puppeteer with the intelligence of AI models, making web scraping more accessible and maintainable.
Remember to always respect robots.txt files, terms of service, and implement appropriate rate limiting when scraping websites. For production environments, consider implementing proxy rotation, CAPTCHA handling, and comprehensive error recovery mechanisms.