What are the best MCP server tutorials for web scraping?
The Model Context Protocol (MCP) has revolutionized how developers build web scraping tools by providing a standardized way to connect AI assistants with data sources and automation capabilities. If you're looking to master web scraping with MCP servers, this comprehensive guide covers the best tutorials and learning resources available.
Understanding MCP Server Basics
Before diving into web scraping tutorials, it's essential to understand what MCP servers are and how they work. The Model Context Protocol is an open standard that enables seamless integration between AI applications and external data sources. For web scraping, MCP servers act as intermediaries that provide tools, resources, and prompts to AI assistants like Claude.
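Under the hood, every exchange between client and server is a JSON-RPC 2.0 message. The sketch below shows the shape of a tool invocation as a TypeScript object literal; the scrape_url tool name is hypothetical, since each server advertises its own tools.

// Shape of an MCP "tools/call" request (JSON-RPC 2.0), simplified for illustration.
// "scrape_url" is a hypothetical tool name -- each server defines its own.
const toolCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "scrape_url",
    arguments: { url: "https://example.com" }
  }
};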
Getting Started with MCP
The best place to start is the official MCP documentation at modelcontextprotocol.io. This resource provides the following (a minimal client sketch appears after the list):
- Architecture Overview: Understanding the client-server relationship
- Protocol Specifications: How messages are exchanged between components
- Security Best Practices: Authentication and authorization patterns
- SDK Documentation: Official TypeScript and Python SDKs
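With those resources in hand, a minimal client built on the official TypeScript SDK looks roughly like this; the server command and args below are placeholders for whichever MCP server you choose to run.

// Minimal MCP client sketch using the official TypeScript SDK
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@automatalabs/mcp-server-playwright"] // placeholder server
});

const client = new Client({ name: "scraper-client", version: "1.0.0" });
await client.connect(transport);

// Discover what the server can do
const { tools } = await client.listTools();
console.log(tools.map(t => t.name));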
Essential MCP Server Tutorials for Web Scraping
1. Playwright MCP Server Tutorial
The Playwright MCP Server is one of the most powerful tools for web scraping. Here's a step-by-step tutorial to get started:
Installation
# Install the Playwright MCP server via npm
npm install -g @automatalabs/mcp-server-playwright
# Or install locally in your project
npm install @automatalabs/mcp-server-playwright
Configuration
Create an MCP configuration file (mcp.json):
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@automatalabs/mcp-server-playwright"],
      "env": {
        "PLAYWRIGHT_BROWSER": "chromium"
      }
    }
  }
}
Basic Scraping Example
Once configured with Claude Desktop or another MCP client, you can use natural language to control the browser:
// Example workflow - you would describe this to Claude
// 1. Navigate to target website
// 2. Take a snapshot of the page
// 3. Click on specific elements
// 4. Extract data from the page
// The MCP server translates these commands into Playwright actions:
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Extract data
const data = await page.evaluate(() => {
  return {
    title: document.querySelector('h1')?.textContent,
    paragraphs: Array.from(document.querySelectorAll('p')).map(p => p.textContent)
  };
});

await browser.close();
This approach is similar to how you handle browser sessions in Puppeteer, but with the added benefit of AI-assisted automation through MCP.
2. Puppeteer MCP Server Tutorial
The Puppeteer MCP server provides another excellent option for browser automation and web scraping:
Setup
# Install the Puppeteer MCP server
npm install @executeautomation/puppeteer-mcp-server
Configuration
Add to your MCP configuration:
{
  "mcpServers": {
    "puppeteer": {
      "command": "node",
      "args": ["path/to/puppeteer-mcp-server/dist/index.js"],
      "env": {
        "HEADLESS": "true"
      }
    }
  }
}
Python Example with Puppeteer MCP
# Using the MCP Python SDK to interact with the Puppeteer server
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def scrape_with_puppeteer():
    # Connect to the MCP server
    server_params = StdioServerParameters(
        command="node",
        args=["puppeteer-mcp-server/dist/index.js"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the connection
            await session.initialize()
            # Call tools provided by the server
            # (tool names vary by server; discover them with session.list_tools())
            result = await session.call_tool(
                "navigate",
                arguments={"url": "https://example.com"}
            )
            # Extract data
            data = await session.call_tool(
                "evaluate",
                arguments={
                    "script": "document.querySelector('h1').textContent"
                }
            )
            return data

asyncio.run(scrape_with_puppeteer())
3. Building a Custom MCP Server for Web Scraping
For advanced use cases, you may want to build your own MCP server. Here's a tutorial outline:
TypeScript Implementation
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import axios from "axios";
import * as cheerio from "cheerio";

// Create a new MCP server
const server = new Server(
  {
    name: "custom-scraper",
    version: "1.0.0"
  },
  {
    capabilities: {
      tools: {}
    }
  }
);

// Advertise the scraping tool (the SDK dispatches on request schemas, not raw strings)
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "scrape_url",
        description: "Scrape content from a URL",
        inputSchema: {
          type: "object",
          properties: {
            url: {
              type: "string",
              description: "URL to scrape"
            },
            selector: {
              type: "string",
              description: "CSS selector for content"
            }
          },
          required: ["url"]
        }
      }
    ]
  };
});

// Implement the scraping logic
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name !== "scrape_url") {
    throw new Error(`Unknown tool: ${request.params.name}`);
  }
  const { url, selector } = request.params.arguments as {
    url: string;
    selector?: string;
  };
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    const content = selector ? $(selector).text() : $("body").text();
    return {
      content: [
        {
          type: "text",
          text: content
        }
      ]
    };
  } catch (error) {
    return {
      content: [
        {
          type: "text",
          text: `Error: ${(error as Error).message}`
        }
      ],
      isError: true
    };
  }
});

// Start the server over stdio
async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
}

main().catch(console.error);
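Before wiring the server into Claude Desktop, you can smoke-test it from a short script using the SDK's client half. A sketch, assuming the compiled entry point is dist/index.js:

// Smoke test for the custom scraper server (the dist/index.js path is an assumption)
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/index.js"]
});
const client = new Client({ name: "scraper-test", version: "1.0.0" });
await client.connect(transport);

// Exercise the scrape_url tool defined above
const result = await client.callTool({
  name: "scrape_url",
  arguments: { url: "https://example.com", selector: "h1" }
});
console.log(result.content);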
4. WebScraping.AI MCP Integration Tutorial
For developers who want a managed solution, integrating WebScraping.AI with MCP provides powerful capabilities without managing browser infrastructure:
Installation
npm install @webscraping-ai/mcp-server
Configuration Example
{
  "mcpServers": {
    "webscraping-ai": {
      "command": "npx",
      "args": ["-y", "@webscraping-ai/mcp-server"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your-api-key-here"
      }
    }
  }
}
Usage Example
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def scrape_with_api():
    server_params = StdioServerParameters(
        command="npx",
        args=["-y", "@webscraping-ai/mcp-server"],
        env={"WEBSCRAPING_AI_API_KEY": "your-api-key"}
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Scrape with AI question answering
            result = await session.call_tool(
                "scrape_question",
                arguments={
                    "url": "https://example.com/products",
                    "question": "What are all the product names and prices?"
                }
            )
            print(result)

asyncio.run(scrape_with_api())
Advanced MCP Web Scraping Patterns
Handling Dynamic Content
When scraping JavaScript-heavy websites, MCP servers with browser automation capabilities excel. Here's an advanced pattern for handling AJAX requests using Puppeteer through MCP:
// Describe to your MCP-enabled AI assistant:
// "Navigate to the page, wait for the AJAX request to complete,
// then extract the dynamically loaded data"
// The MCP server executes:
await page.goto('https://example.com/dynamic-content');
await page.waitForSelector('.loaded-content'); // wait for the AJAX-rendered nodes

const data = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
  }));
});
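When you control the underlying automation code, a more reliable pattern than any fixed delay is to wait for the specific network response that carries the data. Here's a sketch using Puppeteer's waitForResponse; the /api/items fragment is an assumption about the target site's endpoint.

// Wait for the XHR/fetch response itself rather than sleeping (Puppeteer)
// '/api/items' is a made-up endpoint fragment -- inspect the site's network tab
const [response] = await Promise.all([
  page.waitForResponse(res => res.url().includes('/api/items') && res.ok()),
  page.goto('https://example.com/dynamic-content')
]);
const payload = await response.json(); // the same JSON the page received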
Error Handling and Retry Logic
// Custom MCP tool with robust error handling
// (CallToolRequestSchema is imported from "@modelcontextprotocol/sdk/types.js" as above)
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { url } = request.params.arguments as { url: string };
  const maxRetries = 3;
  let lastError;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await axios.get(url, {
        timeout: 10000,
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; MCPBot/1.0)'
        }
      });
      return {
        content: [{ type: "text", text: response.data }]
      };
    } catch (error) {
      lastError = error as Error;
      // Linear backoff: wait 1s, 2s, 3s between attempts
      await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
    }
  }
  return {
    content: [{ type: "text", text: `Failed after ${maxRetries} retries: ${lastError.message}` }],
    isError: true
  };
});
Working with Pagination
# MCP tool call for paginated scraping
async def scrape_all_pages(session, base_url, max_pages=10):
    all_data = []
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        result = await session.call_tool(
            "scrape_url",
            arguments={
                "url": url,
                "selector": ".product-list .item"
            }
        )
        if not result.content:
            break  # No more data
        all_data.extend(result.content)
    return all_data
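Not every site uses page-number query strings. When the markup exposes a rel="next" link, following it until it disappears is a useful alternative. Here's a sketch with axios and cheerio, where the item selector is an assumption about the target markup.

// Follow rel="next" links until none remain (selectors are site-specific assumptions)
import axios from "axios";
import * as cheerio from "cheerio";

async function scrapeByNextLink(startUrl: string, maxPages = 10): Promise<string[]> {
  const items: string[] = [];
  let url: string | undefined = startUrl;
  for (let page = 0; url && page < maxPages; page++) {
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    $(".product-list .item").each((_, el) => items.push($(el).text().trim()));
    // Resolve the next link relative to the current page, if one exists
    const next = $('a[rel="next"]').attr("href");
    url = next ? new URL(next, url).toString() : undefined;
  }
  return items;
}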
Best Practices for MCP Web Scraping
1. Resource Management
When using browser-based MCP servers, always ensure proper cleanup:
// In your MCP server implementation, where `browser` holds the
// long-lived Puppeteer/Playwright instance (see the lazy-init sketch below)
async function cleanupBrowser() {
  if (browser) {
    await browser.close();
    browser = null;
  }
}

// Handle process termination
process.on('SIGINT', cleanupBrowser);
process.on('SIGTERM', cleanupBrowser);
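A natural companion to cleanup is lazy initialization: launch one browser on first use and share it across tool calls instead of paying startup cost per request. A minimal sketch with Puppeteer:

import puppeteer, { Browser } from "puppeteer";

let browser: Browser | null = null;

// Launch once on first use; subsequent tool calls reuse the same instance
async function getBrowser(): Promise<Browser> {
  if (!browser) {
    browser = await puppeteer.launch({ headless: true });
  }
  return browser;
}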
2. Rate Limiting
Implement rate limiting in your custom MCP servers:
class RateLimiter {
  private queue: Array<() => Promise<any>> = [];
  private processing = false;

  async add<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push(async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (error) {
          reject(error);
        }
      });
      this.process();
    });
  }

  private async process() {
    if (this.processing || this.queue.length === 0) return;
    this.processing = true;
    const fn = this.queue.shift();
    if (fn) {
      await fn();
      await new Promise(resolve => setTimeout(resolve, 1000)); // 1-second delay between requests
    }
    this.processing = false;
    this.process();
  }
}
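Usage is a thin wrapper around whatever request function you already have, for example with axios:

const limiter = new RateLimiter();

// Every request now flows through the queue -- at most one per second
const page1 = await limiter.add(() => axios.get("https://example.com/page/1"));
const page2 = await limiter.add(() => axios.get("https://example.com/page/2"));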
3. Structured Data Extraction
Use JSON schema to define expected data structures:
{
  "tools": [
    {
      "name": "extract_product_data",
      "description": "Extract structured product information",
      "inputSchema": {
        "type": "object",
        "properties": {
          "url": { "type": "string" }
        }
      },
      "outputSchema": {
        "type": "object",
        "properties": {
          "products": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string" },
                "price": { "type": "number" },
                "availability": { "type": "boolean" }
              }
            }
          }
        }
      }
    }
  ]
}
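On the implementation side, mirroring the output schema with a TypeScript interface keeps the extraction code and the advertised contract in sync. A sketch with cheerio; the CSS selectors are assumptions about the target site's markup.

import * as cheerio from "cheerio";

// Mirrors the outputSchema above
interface Product {
  name: string;
  price: number;
  availability: boolean;
}

function extractProducts(html: string): Product[] {
  const $ = cheerio.load(html);
  // Selectors below are assumptions about the target markup
  return $(".product").toArray().map(el => ({
    name: $(el).find(".name").text().trim(),
    price: parseFloat($(el).find(".price").text().replace(/[^0-9.]/g, "")),
    availability: $(el).find(".stock").text().includes("In stock")
  }));
}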
Learning Resources and Community
Official Documentation
- MCP Specification: https://spec.modelcontextprotocol.io/
- MCP SDK Documentation: GitHub repositories for TypeScript and Python SDKs
- Example Servers: The official MCP GitHub organization hosts numerous example servers
Community Tutorials
- Anthropic Developer Forums: Active community discussing MCP implementations
- GitHub Discussions: Many MCP server repositories have detailed discussions
- YouTube Channels: Search for "MCP server tutorial" for video walkthroughs
Sample Projects
Clone and study these open-source MCP servers:
# Playwright MCP Server
git clone https://github.com/automatalabs/mcp-server-playwright
# Puppeteer MCP Server
git clone https://github.com/executeautomation/puppeteer-mcp-server
# Fetch MCP Server (HTTP requests) is part of the official servers monorepo
git clone https://github.com/modelcontextprotocol/servers
# then look in servers/src/fetch
Conclusion
MCP servers provide a powerful, standardized approach to web scraping that combines the flexibility of traditional scraping libraries with AI-assisted automation. Whether you're using pre-built servers like Playwright and Puppeteer MCP or building your own custom solutions, the tutorials and patterns outlined here will help you get started and scale your web scraping operations effectively.
Start with the official MCP documentation, experiment with existing servers, and gradually build your own custom tools as your needs evolve. The combination of Puppeteer-style browser automation with AI-powered orchestration makes MCP an excellent choice for modern web scraping projects.