How do I use MCP server authentication in my scraper?
Integrating MCP server authentication into your web scraping workflows requires understanding how to securely pass credentials, manage authenticated sessions, and handle multi-service authentication patterns. Unlike standalone scrapers where you might hardcode API keys or manage credentials manually, MCP servers provide a structured approach to credential management that enhances security and maintainability.
This guide demonstrates practical implementation patterns for integrating authentication into your scrapers, whether you're building custom MCP servers or using existing ones to handle authenticated web scraping tasks.
Understanding MCP Authentication in Scraping Context
When you use MCP servers for web scraping, authentication operates at multiple levels:
- MCP Server Access: The server itself runs as a trusted process with access to environment variables containing credentials
- Scraping API Authentication: Your scraper authenticates with services like WebScraping.AI using API keys
- Target Website Authentication: Scrapers may need to pass cookies, tokens, or credentials to access protected content
- Proxy Authentication: Requests may route through authenticated proxy services
The key advantage of using MCP servers is centralizing credential management while allowing your scraping tools to access them securely.
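As a minimal sketch of this flow (using the environment variable names assumed throughout this guide), the MCP client injects the env block from its configuration into the server process, and the server reads credentials from its environment rather than accepting them as tool arguments:

```python
import os

# Injected by the MCP client from the "env" block of its server configuration.
API_KEY = os.environ.get("WEBSCRAPING_AI_API_KEY")
PROXY_USERNAME = os.environ.get("PROXY_USERNAME")  # optional
PROXY_PASSWORD = os.environ.get("PROXY_PASSWORD")  # optional

if not API_KEY:
    # Fail fast: refusing to start beats silently scraping unauthenticated.
    raise RuntimeError("WEBSCRAPING_AI_API_KEY is not set in the MCP server environment")
```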
Basic MCP Server Authentication Setup
Python Scraper with MCP Authentication
Here's a complete example of a Python-based MCP server that implements authentication for web scraping:
import os
import json
import asyncio
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import httpx
from typing import Dict, Optional
class AuthenticatedScraper:
"""Web scraper with MCP-managed authentication"""
def __init__(self):
# Load credentials from environment (set via MCP config)
self.api_key = os.environ.get("WEBSCRAPING_AI_API_KEY")
self.proxy_username = os.environ.get("PROXY_USERNAME")
self.proxy_password = os.environ.get("PROXY_PASSWORD")
if not self.api_key:
raise ValueError(
"WEBSCRAPING_AI_API_KEY must be set in MCP server config"
)
async def scrape_html(
self,
url: str,
use_proxy: bool = False,
wait_for: Optional[str] = None,
headers: Optional[Dict[str, str]] = None
) -> str:
"""
Scrape HTML with authenticated API access
Args:
url: Target URL to scrape
use_proxy: Whether to use authenticated proxy
wait_for: CSS selector to wait for
headers: Custom headers for request
Returns:
Scraped HTML content
"""
params = {
"url": url,
"api_key": self.api_key,
"js": "true"
}
# Add proxy authentication if enabled
if use_proxy and self.proxy_username and self.proxy_password:
params["proxy"] = "residential"
params["proxy_username"] = self.proxy_username
params["proxy_password"] = self.proxy_password
# Add wait condition
if wait_for:
params["wait_for"] = wait_for
        # Add custom headers (JSON-encoded, matching the JavaScript example below)
        if headers:
            params["headers"] = json.dumps(headers)
async with httpx.AsyncClient(timeout=60.0) as client:
response = await client.get(
"https://api.webscraping.ai/html",
params=params
)
response.raise_for_status()
return response.text
async def scrape_with_cookies(
self,
url: str,
cookies: Dict[str, str]
) -> str:
"""
Scrape authenticated pages using session cookies
Args:
url: Target URL requiring authentication
cookies: Session cookies for authentication
Returns:
Scraped HTML content
"""
async with httpx.AsyncClient(timeout=60.0) as client:
response = await client.post(
"https://api.webscraping.ai/html",
params={
"url": url,
"api_key": self.api_key,
"js": "true"
},
json={"cookies": cookies}
)
response.raise_for_status()
return response.text
# Initialize MCP server with authentication
app = Server("authenticated-scraper")
scraper = AuthenticatedScraper()
@app.list_tools()
async def list_tools() -> list[Tool]:
"""Define available scraping tools"""
return [
Tool(
name="scrape_page",
description="Scrape any webpage with API authentication",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "URL to scrape"
},
"use_proxy": {
"type": "boolean",
"description": "Use authenticated proxy (default: false)"
},
"wait_for": {
"type": "string",
"description": "CSS selector to wait for before extraction"
}
},
"required": ["url"]
}
),
Tool(
name="scrape_authenticated_page",
description="Scrape pages requiring login/session cookies",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "URL requiring authentication"
},
"cookies": {
"type": "object",
"description": "Session cookies as key-value pairs"
}
},
"required": ["url", "cookies"]
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
"""Handle tool execution with authentication"""
try:
if name == "scrape_page":
html = await scraper.scrape_html(
url=arguments["url"],
use_proxy=arguments.get("use_proxy", False),
wait_for=arguments.get("wait_for")
)
return [TextContent(
type="text",
text=f"Successfully scraped {arguments['url']}\n\n{html}"
)]
elif name == "scrape_authenticated_page":
html = await scraper.scrape_with_cookies(
url=arguments["url"],
cookies=arguments["cookies"]
)
return [TextContent(
type="text",
text=f"Successfully scraped authenticated page\n\n{html}"
)]
else:
raise ValueError(f"Unknown tool: {name}")
except httpx.HTTPStatusError as e:
return [TextContent(
type="text",
text=f"Scraping failed: {e.response.status_code} - {e.response.text}"
)]
except Exception as e:
return [TextContent(
type="text",
text=f"Error: {str(e)}"
)]
async def main():
"""Start the authenticated MCP server"""
async with stdio_server() as (read_stream, write_stream):
await app.run(read_stream, write_stream)
if __name__ == "__main__":
asyncio.run(main())
JavaScript/TypeScript Scraper with MCP Authentication
For Node.js environments, implement authentication similarly:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
CallToolRequestSchema,
ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import axios, { AxiosError } from "axios";
interface ScraperConfig {
apiKey: string;
proxyUsername?: string;
proxyPassword?: string;
}
class AuthenticatedScraper {
private config: ScraperConfig;
constructor() {
// Load credentials from environment
const apiKey = process.env.WEBSCRAPING_AI_API_KEY;
if (!apiKey) {
throw new Error(
"WEBSCRAPING_AI_API_KEY must be configured in MCP server environment"
);
}
this.config = {
apiKey,
proxyUsername: process.env.PROXY_USERNAME,
proxyPassword: process.env.PROXY_PASSWORD,
};
}
async scrapeHtml(
url: string,
options: {
useProxy?: boolean;
waitFor?: string;
timeout?: number;
headers?: Record<string, string>;
} = {}
): Promise<string> {
const params: any = {
url,
api_key: this.config.apiKey,
js: true,
timeout: options.timeout || 30000,
};
// Configure proxy authentication
if (options.useProxy && this.config.proxyUsername && this.config.proxyPassword) {
params.proxy = "residential";
params.proxy_username = this.config.proxyUsername;
params.proxy_password = this.config.proxyPassword;
}
// Add wait condition
if (options.waitFor) {
params.wait_for = options.waitFor;
}
// Add custom headers
if (options.headers) {
params.headers = JSON.stringify(options.headers);
}
try {
const response = await axios.get(
"https://api.webscraping.ai/html",
{
params,
timeout: 60000,
}
);
return response.data;
} catch (error) {
if (axios.isAxiosError(error)) {
throw new Error(
`Scraping failed: ${error.response?.status} - ${error.response?.data}`
);
}
throw error;
}
}
async scrapeWithAuth(
url: string,
cookies: Record<string, string>
): Promise<string> {
try {
const response = await axios.post(
"https://api.webscraping.ai/html",
{
cookies,
},
{
params: {
url,
api_key: this.config.apiKey,
js: true,
},
timeout: 60000,
}
);
return response.data;
} catch (error) {
if (axios.isAxiosError(error)) {
throw new Error(
`Authentication failed: ${error.response?.status} - ${error.response?.data}`
);
}
throw error;
}
}
async extractData(url: string, fields: Record<string, string>): Promise<any> {
try {
const response = await axios.post(
"https://api.webscraping.ai/fields",
{
fields,
},
{
params: {
url,
api_key: this.config.apiKey,
},
timeout: 60000,
}
);
return response.data;
} catch (error) {
if (axios.isAxiosError(error)) {
throw new Error(
`Data extraction failed: ${error.response?.status}`
);
}
throw error;
}
}
}
// Initialize server and scraper
const server = new Server(
{
name: "authenticated-web-scraper",
version: "1.0.0",
},
{
capabilities: {
tools: {},
},
}
);
const scraper = new AuthenticatedScraper();
// Define available tools
server.setRequestHandler(ListToolsRequestSchema, async () => {
return {
tools: [
{
name: "scrape_webpage",
description: "Scrape any webpage with authenticated API access",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "Target URL to scrape",
},
use_proxy: {
type: "boolean",
description: "Use authenticated residential proxy",
},
wait_for: {
type: "string",
description: "CSS selector to wait for",
},
timeout: {
type: "number",
description: "Request timeout in milliseconds",
},
},
required: ["url"],
},
},
{
name: "scrape_with_cookies",
description: "Scrape authenticated pages using session cookies",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "URL requiring authentication",
},
cookies: {
type: "object",
description: "Session cookies as key-value pairs",
},
},
required: ["url", "cookies"],
},
},
{
name: "extract_fields",
description: "Extract specific data fields using AI",
inputSchema: {
type: "object",
properties: {
url: {
type: "string",
description: "URL to extract data from",
},
fields: {
type: "object",
description: "Field names and extraction instructions",
},
},
required: ["url", "fields"],
},
},
],
};
});
// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
try {
switch (name) {
case "scrape_webpage": {
const html = await scraper.scrapeHtml(args.url, {
useProxy: args.use_proxy,
waitFor: args.wait_for,
timeout: args.timeout,
});
return {
content: [
{
type: "text",
text: `Successfully scraped ${args.url}\n\n${html}`,
},
],
};
}
case "scrape_with_cookies": {
const html = await scraper.scrapeWithAuth(args.url, args.cookies);
return {
content: [
{
type: "text",
text: `Successfully scraped authenticated page\n\n${html}`,
},
],
};
}
case "extract_fields": {
const data = await scraper.extractData(args.url, args.fields);
return {
content: [
{
type: "text",
text: JSON.stringify(data, null, 2),
},
],
};
}
default:
throw new Error(`Unknown tool: ${name}`);
}
} catch (error: any) {
return {
content: [
{
type: "text",
text: `Error: ${error.message}`,
},
],
isError: true,
};
}
});
// Start the server
async function main() {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Authenticated web scraping MCP server running");
}
main().catch((error) => {
console.error("Failed to start server:", error);
process.exit(1);
});
MCP Server Configuration with Credentials
To use your authenticated scraper, configure the MCP client with necessary credentials:
macOS Configuration
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"web-scraper": {
"command": "python",
"args": ["/path/to/authenticated_scraper.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here",
"PROXY_USERNAME": "your_proxy_username",
"PROXY_PASSWORD": "your_proxy_password"
}
}
}
}
Windows Configuration
Edit %APPDATA%\Claude\claude_desktop_config.json:
{
"mcpServers": {
"web-scraper": {
"command": "node",
"args": ["C:\\path\\to\\authenticated_scraper.js"],
"env": {
"WEBSCRAPING_AI_API_KEY": "your_api_key_here",
"PROXY_USERNAME": "your_proxy_username",
"PROXY_PASSWORD": "your_proxy_password"
}
}
}
}
Using System Environment Variables
For better security, reference system environment variables instead of storing raw values in the config file (note that support for ${VAR} expansion depends on your MCP client):
{
"mcpServers": {
"web-scraper": {
"command": "python",
"args": ["/path/to/scraper.py"],
"env": {
"WEBSCRAPING_AI_API_KEY": "${WEBSCRAPING_AI_API_KEY}",
"PROXY_USERNAME": "${PROXY_USERNAME}",
"PROXY_PASSWORD": "${PROXY_PASSWORD}"
}
}
}
}
Set system variables:
# macOS/Linux - Add to ~/.bashrc or ~/.zshrc
export WEBSCRAPING_AI_API_KEY="your_key"
export PROXY_USERNAME="your_username"
export PROXY_PASSWORD="your_password"
# Windows PowerShell
[System.Environment]::SetEnvironmentVariable('WEBSCRAPING_AI_API_KEY', 'your_key', 'User')
Advanced Authentication Patterns for Scrapers
Multi-Target Scraping with Session Management
When scraping multiple authenticated sites, manage separate session credentials:
from dataclasses import dataclass
from typing import Dict
import httpx
@dataclass
class SessionCredentials:
"""Manage credentials for different target sites"""
cookies: Dict[str, str]
headers: Dict[str, str]
auth_token: str = ""
class MultiSiteScraper:
def __init__(self, api_key: str):
self.api_key = api_key
self.sessions: Dict[str, SessionCredentials] = {}
def add_session(self, site_name: str, credentials: SessionCredentials):
"""Register authentication credentials for a specific site"""
self.sessions[site_name] = credentials
async def scrape_with_session(self, site_name: str, url: str) -> str:
"""Scrape using stored session credentials"""
if site_name not in self.sessions:
raise ValueError(f"No session found for {site_name}")
creds = self.sessions[site_name]
async with httpx.AsyncClient() as client:
# Combine session cookies with headers
headers = {**creds.headers}
if creds.auth_token:
headers["Authorization"] = f"Bearer {creds.auth_token}"
response = await client.post(
"https://api.webscraping.ai/html",
params={
"url": url,
"api_key": self.api_key,
"js": "true"
},
json={
"cookies": creds.cookies,
"headers": headers
}
)
return response.text
# Usage in MCP server
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "scrape_with_session":
# Add session before scraping
scraper.add_session(
site_name=arguments["site_name"],
credentials=SessionCredentials(
cookies=arguments["cookies"],
headers=arguments.get("headers", {}),
auth_token=arguments.get("auth_token", "")
)
)
html = await scraper.scrape_with_session(
site_name=arguments["site_name"],
url=arguments["url"]
)
return [TextContent(type="text", text=html)]
Rotating Proxy Authentication
Implement proxy rotation with authentication for large-scale scraping:
class ProxyRotator {
private proxyList: Array<{
host: string;
username: string;
password: string;
}>;
private currentIndex: number = 0;
constructor() {
// Load proxy credentials from environment
this.proxyList = JSON.parse(
process.env.PROXY_LIST || "[]"
);
}
getNextProxy() {
if (this.proxyList.length === 0) {
return null;
}
const proxy = this.proxyList[this.currentIndex];
this.currentIndex = (this.currentIndex + 1) % this.proxyList.length;
return proxy;
}
async scrapeWithRotation(url: string, apiKey: string): Promise<string> {
const proxy = this.getNextProxy();
if (!proxy) {
throw new Error("No proxies configured");
}
const response = await axios.get("https://api.webscraping.ai/html", {
params: {
url,
api_key: apiKey,
proxy: "custom",
proxy_url: `http://${proxy.username}:${proxy.password}@${proxy.host}`,
},
});
return response.data;
}
}
const proxyRotator = new ProxyRotator();
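// Note: registering a second CallToolRequestSchema handler replaces the earlier one,
// so in a real server fold this case into your main tool handler instead.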
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "scrape_with_rotating_proxy") {
const html = await proxyRotator.scrapeWithRotation(
request.params.arguments.url,
process.env.WEBSCRAPING_AI_API_KEY!
);
return {
content: [{ type: "text", text: html }],
};
}
});
Browser Automation with Authentication
For complex authentication flows, similar to handling authentication in Puppeteer, you can drive the login step in the browser and reuse the resulting session cookies:
@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "scrape_with_browser_auth":
        # scraper is the AuthenticatedScraper instance created earlier
        async with httpx.AsyncClient(timeout=60.0) as client:
            # First, perform the login flow to obtain session cookies
            login_response = await client.post(
                "https://api.webscraping.ai/html",
                params={
                    "url": arguments["login_url"],
                    "api_key": scraper.api_key,
                    "js": "true"
                },
                json={
                    "js_script": f"""
                        document.querySelector('#username').value = '{arguments['username']}';
                        document.querySelector('#password').value = '{arguments['password']}';
                        document.querySelector('form').submit();
                    """
                }
            )
            login_response.raise_for_status()
            # Collect the session cookies returned with the login response
            session_cookies = dict(login_response.cookies)
            # Use the session cookies to scrape protected pages
            protected_response = await client.post(
                "https://api.webscraping.ai/html",
                params={
                    "url": arguments["target_url"],
                    "api_key": scraper.api_key
                },
                json={"cookies": session_cookies}
            )
            protected_response.raise_for_status()
            return [TextContent(type="text", text=protected_response.text)]
Security Best Practices for Authenticated Scrapers
1. Credential Validation at Startup
Fail fast if a required credential is missing rather than discovering it mid-scrape:
def validate_credentials():
    """Validate that all required credentials are present"""
    required = {
        "WEBSCRAPING_AI_API_KEY": "WebScraping.AI API key",
        "PROXY_USERNAME": "Proxy username (optional)",
        "PROXY_PASSWORD": "Proxy password (optional)"
    }
    missing = []
    for var, description in required.items():
        # Skip credentials whose description marks them as optional
        if description.endswith("(optional)"):
            continue
        if not os.environ.get(var):
            missing.append(f"{var} ({description})")
    if missing:
        raise ValueError(
            "Missing required credentials:\n" +
            "\n".join(f"  - {item}" for item in missing)
        )

# Validate before starting the server
validate_credentials()
2. Secure Logging
Never log sensitive authentication data:
import winston from "winston";
const logger = winston.createLogger({
level: "info",
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.File({ filename: "scraper.log" }),
],
});
// Good: Log without exposing credentials
logger.info("Scraping request", {
url: targetUrl,
useProxy: true,
timestamp: new Date().toISOString(),
});
// Bad: Never log credentials
// logger.info("Request", { apiKey: API_KEY }); // DON'T DO THIS
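The same principle applies on the Python side. Here is a small sketch using the standard logging module (the variable names are the ones assumed throughout this guide): a filter masks any known credential value before records reach the log file:

```python
import logging
import os

# Collect the secret values this process knows about.
SECRET_VALUES = [
    value for value in (
        os.environ.get("WEBSCRAPING_AI_API_KEY"),
        os.environ.get("PROXY_PASSWORD"),
    )
    if value
]

class RedactSecretsFilter(logging.Filter):
    """Mask known credential values before log records are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for secret in SECRET_VALUES:
            message = message.replace(secret, "***")
        record.msg, record.args = message, None
        return True

logger = logging.getLogger("scraper")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("scraper.log")
handler.addFilter(RedactSecretsFilter())
logger.addHandler(handler)

# Even if a message accidentally contains the key, it is masked on the way out.
logger.info("Scraping request for %s", "https://example.com")
```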
3. Rate Limiting with Authentication
Protect your API quota:
from datetime import datetime, timedelta
from collections import deque
class AuthenticatedRateLimiter:
def __init__(self, max_requests_per_minute: int):
self.max_requests = max_requests_per_minute
self.requests = deque()
async def check_limit(self):
"""Enforce rate limiting"""
now = datetime.now()
# Remove requests older than 1 minute
while self.requests and self.requests[0] < now - timedelta(minutes=1):
self.requests.popleft()
if len(self.requests) >= self.max_requests:
wait_time = (self.requests[0] + timedelta(minutes=1) - now).total_seconds()
raise ValueError(f"Rate limit exceeded. Wait {wait_time:.0f} seconds")
self.requests.append(now)
rate_limiter = AuthenticatedRateLimiter(max_requests_per_minute=60)
@app.call_tool()
async def call_tool(name: str, arguments: dict):
await rate_limiter.check_limit()
# Proceed with authenticated scraping
html = await scraper.scrape_html(arguments["url"])
return [TextContent(type="text", text=html)]
4. Error Handling for Authentication Failures
Distinguish credential errors (fail immediately) from rate limiting (retry with backoff):
async function scrapeWithRetry(
url: string,
maxRetries: number = 3
): Promise<string> {
let lastError: Error;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const response = await axios.get("https://api.webscraping.ai/html", {
params: {
url,
api_key: process.env.WEBSCRAPING_AI_API_KEY,
},
});
return response.data;
} catch (error: any) {
lastError = error;
if (error.response?.status === 401) {
throw new Error("Authentication failed: Invalid API key");
}
if (error.response?.status === 429) {
const waitTime = Math.pow(2, attempt) * 1000;
console.error(`Rate limited. Waiting ${waitTime}ms before retry ${attempt}/${maxRetries}`);
await new Promise((resolve) => setTimeout(resolve, waitTime));
continue;
}
if (attempt === maxRetries) {
throw new Error(`Scraping failed after ${maxRetries} attempts: ${error.message}`);
}
}
}
throw lastError!;
}
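A comparable sketch on the Python side, assuming httpx and the same endpoint: treat 401 as fatal, since retrying with a bad key never helps, and back off exponentially on 429 before giving up:

```python
import asyncio
import os
import httpx

async def scrape_with_retry(url: str, max_retries: int = 3) -> str:
    """Retry transient failures; fail fast on credential errors."""
    api_key = os.environ["WEBSCRAPING_AI_API_KEY"]  # raises KeyError if unset
    async with httpx.AsyncClient(timeout=60.0) as client:
        for attempt in range(1, max_retries + 1):
            response = await client.get(
                "https://api.webscraping.ai/html",
                params={"url": url, "api_key": api_key},
            )
            if response.status_code == 401:
                # Bad credentials never recover on retry
                raise RuntimeError("Authentication failed: invalid API key")
            if response.status_code == 429 and attempt < max_retries:
                # Exponential backoff: 2s, 4s, 8s...
                await asyncio.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            return response.text
    raise RuntimeError(f"Scraping failed after {max_retries} attempts")
```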
Testing Your Authenticated Scraper
Create a test suite to verify authentication works correctly:
import asyncio
import os
async def test_scraper_authentication():
"""Test MCP scraper authentication"""
print("Testing MCP scraper authentication...\n")
# Test 1: Environment variables
api_key = os.environ.get("WEBSCRAPING_AI_API_KEY")
if not api_key:
print("❌ FAIL: API key not found")
return False
print("✅ PASS: API key loaded from environment")
# Test 2: Basic scraping
try:
scraper = AuthenticatedScraper()
html = await scraper.scrape_html("https://example.com")
if html and len(html) > 0:
print("✅ PASS: Basic scraping works")
else:
print("❌ FAIL: Empty response")
return False
except Exception as e:
print(f"❌ FAIL: Scraping error: {e}")
return False
# Test 3: Proxy authentication (if configured)
proxy_user = os.environ.get("PROXY_USERNAME")
if proxy_user:
try:
html = await scraper.scrape_html(
"https://httpbin.org/ip",
use_proxy=True
)
print("✅ PASS: Proxy authentication works")
except Exception as e:
print(f"⚠️ WARNING: Proxy test failed: {e}")
print("\n✅ All tests passed!")
return True
if __name__ == "__main__":
asyncio.run(test_scraper_authentication())
Run tests:
python test_scraper.py
Troubleshooting Common Issues
Issue: "API key not configured"
Solution: Verify MCP configuration includes environment variables:
# Check MCP config file
cat ~/Library/Application\ Support/Claude/claude_desktop_config.json
# Ensure env section exists
{
"mcpServers": {
"scraper": {
"env": {
"WEBSCRAPING_AI_API_KEY": "your_key"
}
}
}
}
Issue: "401 Unauthorized" errors
Causes:
- Expired or invalid API key
- API key not properly passed to scraping service
- Incorrect parameter format
Solution:
# Add debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Verify API key is being sent
print(f"Using API key: {API_KEY[:10]}...{API_KEY[-4:]}") # Partial display only
Issue: Authentication works locally but fails in MCP
Solution: Restart MCP client after configuration changes:
- Quit Claude Desktop completely
- Update claude_desktop_config.json
- Relaunch Claude Desktop
- Test scraper tools
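If the problem persists after a restart, confirm that the server process actually sees the variables. A minimal check you can temporarily drop into the top of the server script (it writes to stderr, which MCP clients record in their server logs without disturbing the stdio protocol):

```python
import os
import sys

# Temporary startup check: confirm credentials reached the server process.
for var in ("WEBSCRAPING_AI_API_KEY", "PROXY_USERNAME", "PROXY_PASSWORD"):
    print(f"{var} set: {bool(os.environ.get(var))}", file=sys.stderr)
```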
Conclusion
Integrating authentication into MCP-based scrapers provides a secure, maintainable approach to credential management while enabling powerful web scraping capabilities. By centralizing authentication in your MCP server configuration and implementing proper error handling, rate limiting, and security practices, you can build robust scraping tools that handle both public and authenticated content reliably.
When building your scraper, remember to validate credentials at startup, implement comprehensive logging without exposing secrets, and test authentication thoroughly before deploying to production. For more complex scenarios requiring browser session handling or navigating multi-step authentication flows, consider combining MCP server authentication with browser automation tools.
The patterns demonstrated here work with any web scraping API that uses API key authentication, making them broadly applicable whether you're scraping social media platforms, e-commerce sites, or internal web applications that require authenticated access.