How do I optimize memory usage when scraping large datasets with JavaScript?

When scraping large datasets with JavaScript, memory optimization is crucial to prevent crashes, keep performance predictable, and let your scrapers scale. Memory leaks and excessive memory consumption are common issues that can severely impact scraping operations. This guide covers practical strategies for optimizing memory usage in JavaScript web scraping projects.

Understanding Memory Consumption in Web Scraping

JavaScript applications, particularly those using headless browsers like Puppeteer or Playwright, can consume significant memory when processing large datasets. Common memory bottlenecks include:

  • DOM objects accumulating in memory
  • Large response bodies stored in variables
  • Unclosed browser instances and pages
  • Event listeners not being properly removed
  • Circular references preventing garbage collection
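Before optimizing, it helps to confirm where memory actually goes. A minimal sketch using Node's built-in `process.memoryUsage()` to log heap growth around a unit of work (the helper names here are illustrative, not from any library):

```javascript
// Sample heap usage so leaks show up as steady growth between batches
function heapSnapshotMB() {
  const { heapUsed, heapTotal } = process.memoryUsage();
  return {
    heapUsedMB: Math.round(heapUsed / 1024 / 1024),
    heapTotalMB: Math.round(heapTotal / 1024 / 1024)
  };
}

// Log the heap delta around a unit of work to see what it really costs
async function measure(label, fn) {
  const before = heapSnapshotMB();
  const result = await fn();
  const after = heapSnapshotMB();
  console.log(`${label}: heap ${before.heapUsedMB}MB -> ${after.heapUsedMB}MB`);
  return result;
}
```

If the "after" number keeps climbing across batches that should be independent, something is being retained.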

1. Implement Streaming and Chunked Processing

Instead of loading entire datasets into memory, process data in smaller chunks using streams and pagination.

Example: Chunked Data Processing

const puppeteer = require('puppeteer');

class MemoryOptimizedScraper {
  constructor(chunkSize = 100) {
    this.chunkSize = chunkSize;
    this.processedCount = 0;
  }

  async scrapeInChunks(urls) {

    for (let i = 0; i < urls.length; i += this.chunkSize) {
      const chunk = urls.slice(i, i + this.chunkSize);

      // Process chunk and immediately write to file/database
      const chunkResults = await this.processChunk(chunk);

      // Write results immediately instead of accumulating
      await this.writeResults(chunkResults);

      // Clear chunk results from memory
      chunkResults.length = 0;

      // Force garbage collection (only works when Node runs with --expose-gc)
      if (global.gc) {
        global.gc();
      }

      this.processedCount += chunk.length;
      console.log(`Processed ${this.processedCount}/${urls.length} URLs`);
    }
  }

  async processChunk(urls) {
    const browser = await puppeteer.launch({ headless: true });
    const results = [];

    try {
      for (const url of urls) {
        const page = await browser.newPage();

        try {
          await page.goto(url, { waitUntil: 'networkidle0' });
          const data = await page.evaluate(() => {
            // Extract only necessary data
            return {
              title: document.title,
              description: document.querySelector('meta[name="description"]')?.content
            };
          });

          results.push(data);
        } finally {
          // Always close page to free memory
          await page.close();
        }
      }
    } finally {
      // Always close browser
      await browser.close();
    }

    return results;
  }
}

2. Proper Resource Management with Puppeteer

When using Puppeteer for large-scale scraping, proper resource management is essential to prevent memory leaks.

Browser and Page Management

const puppeteer = require('puppeteer');

class ResourceManagedScraper {
  constructor() {
    this.browser = null;
    this.activePage = null;
  }

  async initialize() {
    this.browser = await puppeteer.launch({
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--disable-gpu'
      ]
    });
  }

  async scrapePage(url) {
    if (!this.browser) {
      await this.initialize();
    }

    // Reuse page instance but clear previous content
    if (!this.activePage) {
      this.activePage = await this.browser.newPage();

      // Set memory-efficient page settings
      await this.activePage.setViewport({ width: 1024, height: 768 });
      await this.activePage.setRequestInterception(true);

      // Block unnecessary resources to save memory
      this.activePage.on('request', (req) => {
        const resourceType = req.resourceType();
        if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
          req.abort();
        } else {
          req.continue();
        }
      });
    }

    try {
      // Clear previous page content
      await this.activePage.goto('about:blank');

      await this.activePage.goto(url, {
        waitUntil: 'domcontentloaded',
        timeout: 30000
      });

      // Extract data efficiently
      const data = await this.activePage.evaluate(() => {
        // Remove unnecessary DOM elements to free memory
        const scripts = document.querySelectorAll('script');
        scripts.forEach(script => script.remove());

        const styles = document.querySelectorAll('style, link[rel="stylesheet"]');
        styles.forEach(style => style.remove());

        // Extract only required data
        return {
          title: document.title.trim(),
          headings: Array.from(document.querySelectorAll('h1, h2')).map(h => h.textContent.trim())
        };
      });

      return data;
    } catch (error) {
      console.error(`Error scraping ${url}:`, error.message);
      return null;
    }
  }

  async cleanup() {
    if (this.activePage) {
      await this.activePage.close();
      this.activePage = null;
    }

    if (this.browser) {
      await this.browser.close();
      this.browser = null;
    }
  }
}

3. Implement Connection Pooling

For API-based scraping, implement connection pooling to reuse HTTP connections and reduce memory overhead.

const axios = require('axios');
const http = require('http');
const https = require('https');

// Create HTTP agents with connection pooling
const httpAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 60000,
  freeSocketTimeout: 30000
});

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 60000,
  freeSocketTimeout: 30000
});

const apiClient = axios.create({
  httpAgent,
  httpsAgent,
  timeout: 30000,
  maxContentLength: 10 * 1024 * 1024, // 10MB limit
  maxBodyLength: 10 * 1024 * 1024
});

class PooledAPIScraper {
  async scrapeAPI(urls) {
    const batchSize = 10;

    for (let i = 0; i < urls.length; i += batchSize) {
      const batch = urls.slice(i, i + batchSize);

      const promises = batch.map(async (url) => {
        try {
          const response = await apiClient.get(url);

          // Process response immediately and extract only needed data
          const processedData = this.extractRelevantData(response.data);

          // Clear response from memory
          response.data = null;

          return processedData;
        } catch (error) {
          console.error(`Error fetching ${url}:`, error.message);
          return null;
        }
      });

      const batchResults = await Promise.all(promises);

      // Process results immediately instead of accumulating
      await this.processBatchResults(batchResults.filter(Boolean));

      // Clear batch results from memory
      batchResults.length = 0;
    }
  }

  extractRelevantData(data) {
    // Extract only necessary fields to minimize memory usage
    return {
      id: data.id,
      title: data.title,
      timestamp: data.created_at
    };
  }
}

4. Optimize Data Storage and Processing

Use streaming JSON parsers and efficient data structures to handle large responses.

const fs = require('fs');
const { Transform } = require('stream');
const StreamValues = require('stream-json/streamers/StreamValues');

class StreamingDataProcessor {
  constructor(outputFile) {
    this.outputStream = fs.createWriteStream(outputFile);
    this.processedCount = 0;
  }

  async processLargeJSONFile(inputFile) {
    return new Promise((resolve, reject) => {
      // StreamValues.withParser() bundles its own parser, so the raw file
      // stream is piped into it directly.
      const pipeline = fs.createReadStream(inputFile)
        .pipe(StreamValues.withParser())
        .pipe(new Transform({
          objectMode: true,
          // Arrow function keeps `this` bound to the StreamingDataProcessor;
          // a regular function would rebind it to the Transform stream.
          transform: (chunk, encoding, callback) => {
            try {
              // Process each JSON value individually
              const processedItem = this.processItem(chunk.value);

              if (processedItem) {
                this.outputStream.write(JSON.stringify(processedItem) + '\n');
                this.processedCount++;

                if (this.processedCount % 1000 === 0) {
                  console.log(`Processed ${this.processedCount} items`);
                }
              }

              callback();
            } catch (error) {
              callback(error);
            }
          }
        }));

      pipeline.on('finish', () => {
        this.outputStream.end();
        resolve(this.processedCount);
      });

      pipeline.on('error', reject);
    });
  }

  processItem(item) {
    // Transform and filter data as needed
    if (!item.title || item.title.length < 3) {
      return null;
    }

    return {
      id: item.id,
      title: item.title.substring(0, 100), // Limit string length
      category: item.category
    };
  }
}

5. Monitor and Control Memory Usage

Implement memory monitoring and automatic cleanup mechanisms.

class MemoryMonitor {
  constructor(maxMemoryMB = 1024) {
    this.maxMemoryBytes = maxMemoryMB * 1024 * 1024;
    this.checkInterval = null;
  }

  startMonitoring() {
    this.checkInterval = setInterval(() => {
      const memUsage = process.memoryUsage();
      const usedMB = Math.round(memUsage.heapUsed / 1024 / 1024);
      const maxMB = Math.round(this.maxMemoryBytes / 1024 / 1024);

      console.log(`Memory usage: ${usedMB}MB / ${maxMB}MB`);

      if (memUsage.heapUsed > this.maxMemoryBytes) {
        console.warn('Memory limit exceeded, forcing garbage collection');

        // Only available when Node is started with --expose-gc
        if (global.gc) {
          global.gc();
        }

        // Optionally pause processing to allow memory cleanup
        this.onMemoryLimitExceeded();
      }
    }, 5000);
  }

  stopMonitoring() {
    if (this.checkInterval) {
      clearInterval(this.checkInterval);
      this.checkInterval = null;
    }
  }

  onMemoryLimitExceeded() {
    // Implement custom cleanup logic
    console.log('Implementing memory cleanup strategies...');
  }
}

// Usage with scraper
class OptimizedScraper {
  constructor() {
    this.memoryMonitor = new MemoryMonitor(512); // 512MB limit
  }

  async startScraping(urls) {
    this.memoryMonitor.startMonitoring();

    try {
      await this.scrapeWithOptimization(urls);
    } finally {
      this.memoryMonitor.stopMonitoring();
    }
  }
}

6. Advanced Memory Optimization Techniques

Worker Threads for CPU-Intensive Tasks

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  class WorkerBasedScraper {
    async processLargeDataset(data) {
      const chunkSize = 100;
      const numWorkers = require('os').cpus().length;
      const results = [];

      for (let i = 0; i < data.length; i += chunkSize * numWorkers) {
        const workers = [];

        // Create workers for parallel processing
        for (let j = 0; j < numWorkers && i + j * chunkSize < data.length; j++) {
          const chunk = data.slice(i + j * chunkSize, i + (j + 1) * chunkSize);

          const worker = new Worker(__filename, {
            workerData: chunk
          });

          workers.push(new Promise((resolve, reject) => {
            worker.on('message', resolve);
            worker.on('error', reject);
            worker.on('exit', (code) => {
              if (code !== 0) {
                reject(new Error(`Worker stopped with exit code ${code}`));
              }
            });
          }));
        }

        // Wait for all workers to complete
        const workerResults = await Promise.all(workers);
        results.push(...workerResults.flat());

        // Clear worker results to free memory
        workerResults.length = 0;
      }

      return results;
    }
  }
} else {
  // Worker thread code
  const processDataChunk = (chunk) => {
    return chunk.map(item => ({
      id: item.id,
      processed: item.value * 2,
      timestamp: Date.now()
    }));
  };

  const result = processDataChunk(workerData);
  parentPort.postMessage(result);
}

7. Memory-Efficient Puppeteer Configuration

When working with Puppeteer for large datasets, configure it for optimal memory usage:

const puppeteer = require('puppeteer');

class EfficientPuppeteerScraper {
  static async createBrowser() {
    return await puppeteer.launch({
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--disable-background-timer-throttling',
        '--disable-backgrounding-occluded-windows',
        '--disable-renderer-backgrounding',
        '--js-flags=--max-old-space-size=4096', // V8 flag, so pass it via --js-flags
        '--memory-pressure-off'
      ]
    });
  }

  async scrapeWithMemoryOptimization(urls) {
    const browser = await EfficientPuppeteerScraper.createBrowser();
    let context = await browser.createIncognitoBrowserContext(); // renamed createBrowserContext() in Puppeteer v22+

    try {
      for (let i = 0; i < urls.length; i++) {
        const page = await context.newPage();

        // Configure page for memory efficiency
        await page.setRequestInterception(true);
        page.on('request', (req) => {
          if (req.resourceType() === 'image' || req.resourceType() === 'font') {
            req.abort();
          } else {
            req.continue();
          }
        });

        try {
          await page.goto(urls[i], { waitUntil: 'domcontentloaded' });

          // Extract data and process immediately
          const data = await page.evaluate(() => {
            return {
              title: document.title,
              url: window.location.href
            };
          });

          // Process data immediately
          await this.processData(data);

        } finally {
          await page.close();
        }

        // Periodic cleanup: recycle the context every 50 pages
        if (i > 0 && i % 50 === 0) {
          await context.close();
          context = await browser.createIncognitoBrowserContext();
        }
      }
    } finally {
      await browser.close();
    }
  }
}

8. Memory-Conscious Error Handling

Implement error handling that doesn't accumulate error objects in memory:

class MemoryAwareErrorHandler {
  constructor(maxErrors = 100) {
    this.errors = [];
    this.maxErrors = maxErrors;
    this.errorCount = 0;
  }

  handleError(error, context) {
    this.errorCount++;

    // Keep only recent errors to prevent memory buildup
    if (this.errors.length >= this.maxErrors) {
      this.errors.shift(); // Remove oldest error
    }

    this.errors.push({
      message: error.message,
      context,
      timestamp: Date.now()
    });

    // Log critical information immediately
    console.error(`Error ${this.errorCount}: ${error.message}`);

    // Clear stack trace from memory
    error.stack = null;
  }

  getErrorSummary() {
    return {
      totalErrors: this.errorCount,
      recentErrors: this.errors.slice(-10)
    };
  }

  clearErrors() {
    this.errors.length = 0;
  }
}

Best Practices Summary

  1. Process data in chunks instead of loading everything into memory
  2. Close browser pages and instances properly to prevent memory leaks
  3. Limit concurrent operations to control memory usage
  4. Use streaming APIs when available for large datasets
  5. Implement proper error handling with cleanup mechanisms
  6. Monitor memory usage and implement automatic cleanup
  7. Block unnecessary resources when using headless browsers
  8. Use worker threads for CPU-intensive processing tasks
  9. Implement connection pooling for HTTP requests
  10. Clear variables and arrays explicitly when no longer needed
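Several of these practices reduce to one reusable primitive: iterating a large list in fixed-size chunks so each slice can be processed and then dropped. A small sketch:

```javascript
// Generator that yields fixed-size slices of an array, so callers can
// process one chunk at a time and let each one be garbage collected.
function* chunked(items, size) {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Usage: for (const chunk of chunked(urls, 100)) { await processChunk(chunk); }
```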

When handling browser sessions in Puppeteer, these memory optimization techniques become even more critical for maintaining stable, long-running scraping operations. For complex parallel processing scenarios, consider running multiple pages in parallel with Puppeteer while applying these memory management strategies.

Monitoring Tools and Commands

Use these Node.js commands to monitor memory usage during development:

# Run with garbage collection logs
node --expose-gc --trace-gc your-scraper.js

# Monitor memory usage with heap snapshots
node --inspect your-scraper.js

# Set memory limits
node --max-old-space-size=4096 your-scraper.js

By implementing these memory optimization strategies, you can build robust JavaScript scrapers capable of handling large datasets without running into memory constraints. Regular monitoring and proactive resource management are key to maintaining optimal performance in production environments.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
