How do I use JavaScript for web scraping with MCP servers?

JavaScript is a powerful language for web scraping, especially when dealing with dynamic websites that rely heavily on client-side rendering. Model Context Protocol (MCP) servers provide a standardized way to integrate JavaScript-based scraping tools like Puppeteer and Playwright into your AI-powered workflows. This guide explores how to leverage JavaScript for web scraping through MCP servers.

Understanding MCP Servers and JavaScript

MCP servers act as bridges between AI assistants (like Claude) and external tools or data sources. When it comes to web scraping, MCP servers can expose JavaScript-based scraping capabilities through a standardized protocol, allowing you to:

  • Execute JavaScript code in headless browsers
  • Manipulate DOM elements programmatically
  • Handle dynamic content and AJAX requests
  • Automate complex user interactions
  • Extract data from modern web applications
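
Under the hood, each of these capabilities is exposed as a named tool that the assistant invokes over JSON-RPC. As a rough sketch of what a tool invocation looks like on the wire (assuming a Playwright-based server that exposes a browser_navigate tool), the client sends:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "browser_navigate",
    "arguments": {
      "url": "https://example.com"
    }
  }
}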

Setting Up a JavaScript MCP Server

Using the Playwright MCP Server

The Playwright MCP server is one of the most popular options for JavaScript-based web scraping. Here's how to set it up:

# Install the Playwright MCP server globally
npm install -g @playwright/mcp

# Or install it locally in your project
npm install @playwright/mcp

Configure the MCP server in your Claude desktop configuration file (claude_desktop_config.json):

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": [
        "-y",
        "@playwright/mcp@latest"
      ]
    }
  }
}
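
On macOS this file lives at ~/Library/Application Support/Claude/claude_desktop_config.json; on Windows it is %APPDATA%\Claude\claude_desktop_config.json. Restart Claude Desktop after editing it so the new server is picked up.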

Using the Puppeteer MCP Server

Alternatively, you can use a Puppeteer-based MCP server for similar functionality:

npm install @modelcontextprotocol/server-puppeteer

Configuration example:

{
  "mcpServers": {
    "puppeteer": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-puppeteer"
      ]
    }
  }
}

Executing JavaScript Code Through MCP

Basic Page Navigation and Scraping

Once your MCP server is configured, you can use JavaScript to navigate and scrape websites. Here's a practical example using the Playwright MCP server:

// The browser.* calls in these examples are JavaScript-style shorthand for
// MCP tool invocations (browser_navigate, browser_snapshot, browser_evaluate, etc.)

// Navigate to a webpage
await browser.navigate({ url: "https://example.com" });

// Take a snapshot to see the page structure
const snapshot = await browser.snapshot();

// Execute JavaScript to extract data
const result = await browser.evaluate({
  function: `() => {
    return {
      title: document.title,
      headings: Array.from(document.querySelectorAll('h1, h2, h3'))
        .map(el => el.textContent.trim()),
      links: Array.from(document.querySelectorAll('a'))
        .map(a => ({ text: a.textContent, href: a.href }))
    };
  }`
});
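
A note on the snapshot call: it returns an accessibility-tree view of the page rather than raw HTML, and the element references it contains are what you pass as ref values to the interaction tools (click, fill_form) shown later in this guide.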

Advanced Data Extraction with Custom JavaScript

For more complex scraping scenarios, you can execute sophisticated JavaScript code to extract structured data:

// Extract product information from an e-commerce page
const productData = await browser.evaluate({
  function: `() => {
    const product = {
      name: document.querySelector('.product-title')?.textContent?.trim(),
      price: document.querySelector('.price')?.textContent?.trim(),
      description: document.querySelector('.description')?.textContent?.trim(),
      images: [],
      specifications: {},
      reviews: []
    };

    // Extract images
    document.querySelectorAll('.product-image img').forEach(img => {
      product.images.push(img.src);
    });

    // Extract specifications
    document.querySelectorAll('.specs-table tr').forEach(row => {
      const key = row.querySelector('th')?.textContent?.trim();
      const value = row.querySelector('td')?.textContent?.trim();
      if (key && value) {
        product.specifications[key] = value;
      }
    });

    // Extract review data
    document.querySelectorAll('.review').forEach(review => {
      product.reviews.push({
        author: review.querySelector('.author')?.textContent?.trim(),
        rating: review.querySelector('.rating')?.getAttribute('data-rating'),
        text: review.querySelector('.review-text')?.textContent?.trim(),
        date: review.querySelector('.date')?.textContent?.trim()
      });
    });

    return product;
  }`
});

Handling Dynamic Content

JavaScript excels at handling dynamic content that loads after the initial page load. When handling AJAX requests using Puppeteer or Playwright through MCP, you can wait for specific elements or network activity:

// Wait for a specific element to appear
await browser.wait_for({
  text: "Loading complete"
});

// Execute JavaScript after content loads
const dynamicData = await browser.evaluate({
  function: `() => {
    // The wait_for call above ensures the dynamic content has rendered
    const container = document.querySelector('.dynamic-content');
    if (!container) return [];
    return Array.from(container.querySelectorAll('.item')).map(item => ({
      id: item.dataset.id,
      name: item.querySelector('.name')?.textContent,
      value: item.querySelector('.value')?.textContent
    }));
  }`
});

Working with Complex Interactions

MCP servers allow you to automate complex user interactions using JavaScript:

// Fill and submit a form
await browser.fill_form({
  fields: [
    {
      name: "Search input",
      ref: "input[name='q']",
      type: "textbox",
      value: "web scraping"
    }
  ]
});

// Click a button and wait for navigation
await browser.click({
  element: "Search button",
  ref: "button[type='submit']"
});

// Execute custom JavaScript after interaction
await browser.evaluate({
  function: `() => {
    // Scroll to load more content
    window.scrollTo(0, document.body.scrollHeight);

    // Return the loaded items
    return document.querySelectorAll('.search-result').length;
  }`
});
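
For infinite-scroll pages, a single scroll is rarely enough. Below is a minimal sketch that repeats the scroll-and-count pattern from the example above until the result count stops growing (the .search-result selector is carried over and is illustrative):

// Keep scrolling until no new results appear
let previousCount = 0;
while (true) {
  const count = await browser.evaluate({
    function: `() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.querySelectorAll('.search-result').length;
    }`
  });

  if (count === previousCount) break; // no new items loaded
  previousCount = count;

  await browser.wait_for({ time: 1 }); // give lazy-loaded content time to render
}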

Building a Custom JavaScript MCP Server

For specialized scraping needs, you can create a custom MCP server that exposes JavaScript scraping capabilities:

// custom-scraper-mcp.js
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import puppeteer from "puppeteer";

class ScraperMCP {
  constructor() {
    this.server = new Server(
      {
        name: "custom-scraper",
        version: "1.0.0"
      },
      {
        capabilities: {
          tools: {}
        }
      }
    );

    this.browser = null;
    this.setupToolHandlers();
  }

  setupToolHandlers() {
    this.server.setRequestHandler("tools/list", async () => ({
      tools: [
        {
          name: "scrape_page",
          description: "Scrape a webpage using custom JavaScript",
          inputSchema: {
            type: "object",
            properties: {
              url: { type: "string", description: "URL to scrape" },
              script: { type: "string", description: "JavaScript to execute" }
            },
            required: ["url", "script"]
          }
        }
      ]
    }));

    this.server.setRequestHandler("tools/call", async (request) => {
      if (request.params.name === "scrape_page") {
        return await this.scrapePage(
          request.params.arguments.url,
          request.params.arguments.script
        );
      }
    });
  }

  async scrapePage(url, script) {
    if (!this.browser) {
      this.browser = await puppeteer.launch({ headless: true });
    }

    const page = await this.browser.newPage();

    try {
      await page.goto(url, { waitUntil: "networkidle0" });

      // page.evaluate treats a string as an expression, so callers should
      // pass an expression that produces a value (e.g. an IIFE)
      const result = await page.evaluate(script);

      return {
        content: [
          {
            type: "text",
            text: JSON.stringify(result, null, 2)
          }
        ]
      };
    } finally {
      await page.close();
    }
  }

  async run() {
    const transport = new StdioServerTransport();
    await this.server.connect(transport);
  }
}

const scraper = new ScraperMCP();
scraper.run().catch(console.error);
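
To use this custom server from Claude Desktop, register it in claude_desktop_config.json just like the pre-built servers (the path below is a placeholder for wherever you saved the script):

{
  "mcpServers": {
    "custom-scraper": {
      "command": "node",
      "args": ["/path/to/custom-scraper-mcp.js"]
    }
  }
}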

Best Practices for JavaScript Scraping with MCP

1. Error Handling

Always implement robust error handling when executing JavaScript through MCP servers:

try {
  const result = await browser.evaluate({
    function: `() => {
      try {
        return {
          data: document.querySelector('.data')?.textContent,
          error: null
        };
      } catch (error) {
        return {
          data: null,
          error: error.message
        };
      }
    }`
  });
} catch (error) {
  console.error("Scraping failed:", error);
}
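
Transient failures such as timeouts and flaky selectors are common, so it also helps to wrap scraping calls in a small retry helper. A minimal sketch (the helper name and backoff schedule are illustrative):

// Retry an async operation with exponential backoff: 1s, 2s, 4s, ...
async function withRetry(operation, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    }
  }
}

// Usage:
// const result = await withRetry(() => browser.evaluate({ function: `...` }));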

2. Respecting Rate Limits

Implement delays between requests to avoid overwhelming servers:

// Add delays between navigation and scraping
await browser.navigate({ url: "https://example.com" });
await browser.wait_for({ time: 2 }); // Wait 2 seconds

// Execute scraping logic
const data = await browser.evaluate({
  function: `() => { /* extraction logic */ }`
});
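
When working through a list of URLs, randomizing the delay keeps the traffic pattern less uniform. A sketch using the same navigate and wait_for calls (the URLs are placeholders):

const urls = ["https://example.com/page1", "https://example.com/page2"];

for (const url of urls) {
  await browser.navigate({ url });

  // Random delay between 1 and 3 seconds
  await browser.wait_for({ time: 1 + Math.random() * 2 });

  const data = await browser.evaluate({
    function: `() => { /* extraction logic */ }`
  });
}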

3. Handling Pagination

When scraping multiple pages, use JavaScript to detect and navigate through pagination:

const scrapeAllPages = async () => {
  const allData = [];
  let hasNextPage = true;

  while (hasNextPage) {
    // Extract data from current page
    const pageData = await browser.evaluate({
      function: `() => {
        return Array.from(document.querySelectorAll('.item')).map(item => ({
          title: item.querySelector('.title')?.textContent,
          content: item.querySelector('.content')?.textContent
        }));
      }`
    });

    allData.push(...pageData);

    // Check if next page exists
    const nextButton = await browser.evaluate({
      function: `() => {
        return document.querySelector('.next-page') !== null;
      }`
    });

    if (nextButton) {
      await browser.click({
        element: "Next page button",
        ref: ".next-page"
      });
      await browser.wait_for({ time: 1 });
    } else {
      hasNextPage = false;
    }
  }

  return allData;
};
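
In practice, add a safety cap (for example, a maximum page count) to this loop so a pagination control that never disappears cannot trap the scraper in an infinite loop.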

Debugging JavaScript Scraping Code

When your scraping logic isn't working as expected, use these debugging techniques:

// Log page content for debugging
const debugInfo = await browser.evaluate({
  function: `() => {
    return {
      url: window.location.href,
      title: document.title,
      bodyText: document.body.textContent.substring(0, 500),
      selectors: {
        hasTargetElement: !!document.querySelector('.target'),
        allClasses: Array.from(document.querySelectorAll('[class]'))
          .map(el => el.className)
          .slice(0, 10)
      }
    };
  }`
});

console.log("Debug info:", debugInfo);
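
Screenshots are often the fastest way to see what the headless browser actually rendered. Assuming your MCP server exposes a screenshot tool (the Playwright server does), a call along these lines captures the current page state:

// Capture the rendered page for visual inspection (filename is illustrative)
await browser.take_screenshot({ filename: "debug-page.png" });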

Integration with Other Tools

JavaScript scraping through MCP servers can be combined with other tools for enhanced functionality. For example, you can use browser automation techniques to inject custom scripts that modify page behavior before extraction.
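
As a minimal sketch of that idea, the snippet below forces lazily loaded images to resolve before you read their src attributes, assuming the site keeps the real URL in a data-src attribute (a common but not universal convention):

// Promote data-src URLs so lazy-loaded images become scrapable
await browser.evaluate({
  function: `() => {
    const lazyImages = document.querySelectorAll('img[data-src]');
    lazyImages.forEach(img => { img.src = img.dataset.src; });
    return lazyImages.length; // how many images were promoted
  }`
});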

Performance Optimization

Optimize your JavaScript scraping code for better performance:

// Use efficient selectors
const fastData = await browser.evaluate({
  function: `() => {
    // Use getElementById when possible (fastest)
    const header = document.getElementById('header');

    // Use querySelectorAll with specific selectors
    const items = document.querySelectorAll('div.item[data-active="true"]');

    // Minimize DOM manipulation
    return Array.from(items).map(item => ({
      id: item.dataset.id,
      name: item.querySelector('.name')?.textContent
    }));
  }`
});

Conclusion

JavaScript is an excellent choice for web scraping through MCP servers, especially when dealing with modern, dynamic websites. By leveraging tools like Puppeteer and Playwright through the MCP protocol, you can create powerful, AI-assisted scraping workflows that handle complex scenarios with ease.

The key to successful JavaScript scraping with MCP servers is understanding how to properly execute code in the browser context, handle asynchronous operations, and structure your extraction logic for maintainability. Whether you're using pre-built MCP servers or building custom solutions, JavaScript provides the flexibility and power needed for sophisticated web scraping tasks.

For more advanced scenarios, consider exploring how to interact with DOM elements in Puppeteer or learning about handling browser sessions to maintain state across multiple scraping operations.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
