What is the Best Way to Handle Memory Usage When Processing Large HTML Documents?

Processing large HTML documents with Cheerio can quickly consume significant amounts of memory, leading to performance issues or even out-of-memory errors. This comprehensive guide covers proven strategies and techniques to efficiently manage memory usage when working with large HTML documents in Node.js applications.

Understanding Memory Challenges with Large HTML Documents

When Cheerio parses HTML, it creates a complete DOM representation in memory. For large documents (several megabytes or larger), this can result in memory usage that's 3-10 times the original document size due to the overhead of DOM objects, references, and metadata.
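
You can get a rough sense of this overhead for your own documents with a quick measurement. The sketch below is illustrative only: the file path is a placeholder, and the heap delta is approximate because garbage collection may run between the two readings.

const fs = require('fs');
const cheerio = require('cheerio');

const html = fs.readFileSync('large-document.html', 'utf8'); // placeholder path

const before = process.memoryUsage().heapUsed;
const $ = cheerio.load(html); // keep the reference so the parsed DOM stays in memory for the measurement
const after = process.memoryUsage().heapUsed;

console.log(`Document size: ${(html.length / 1024 / 1024).toFixed(1)} MB`);
console.log(`Approximate heap growth: ${((after - before) / 1024 / 1024).toFixed(1)} MB`);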

Common Memory Issues

  • Memory leaks: Retaining references to large DOM trees
  • Peak memory usage: Loading entire documents into memory at once
  • Garbage collection pressure: Frequent allocation and deallocation of large objects
  • Process crashes: Exceeding Node.js memory limits

Core Memory Management Strategies

1. Use Streaming and Chunked Processing

Instead of loading entire HTML documents into memory, process them in smaller chunks:

const fs = require('fs');
const cheerio = require('cheerio');
const { Transform } = require('stream');

class HTMLChunkProcessor extends Transform {
  constructor(options = {}) {
    super({ objectMode: true });
    this.buffer = '';
    this.chunkSize = options.chunkSize || 64 * 1024; // 64KB chunks
  }

  _transform(chunk, encoding, callback) {
    this.buffer += chunk.toString();

    // Process complete HTML elements
    let lastTagEnd = this.buffer.lastIndexOf('>');
    if (lastTagEnd > 0) {
      const processableChunk = this.buffer.substring(0, lastTagEnd + 1);
      this.buffer = this.buffer.substring(lastTagEnd + 1);

      // Process the chunk with Cheerio
      try {
        let $ = cheerio.load(processableChunk, {
          withDomLvl1: false, // Reduce memory overhead
          normalizeWhitespace: true,
          decodeEntities: false
        });

        // Extract data and clear references
        const data = this.extractData($);

        // Clear the $ reference to help GC
        $ = null;

        this.push(data);
      } catch (error) {
        return callback(error);
      }
    }

    callback();
  }

  _flush(callback) {
    // Process any HTML still left in the buffer when the input ends
    if (this.buffer.trim()) {
      try {
        const $ = cheerio.load(this.buffer);
        this.push(this.extractData($));
      } catch (error) {
        return callback(error);
      }
    }
    callback();
  }

  extractData($) {
    // Your data extraction logic here
    return {
      titles: $('h1, h2, h3').map((i, el) => $(el).text()).get(),
      links: $('a[href]').map((i, el) => $(el).attr('href')).get()
    };
  }
}

// Usage
const processor = new HTMLChunkProcessor();
fs.createReadStream('large-document.html')
  .pipe(processor)
  .on('data', (data) => {
    console.log('Processed chunk:', data);
  })
  .on('end', () => {
    console.log('Processing complete');
  });

2. Optimize Cheerio Configuration

Configure Cheerio with memory-efficient options. Note that some of the options shown here (such as withDomLvl1 and normalizeWhitespace) are only recognized by older Cheerio releases (pre-1.0); newer versions handle parser options differently, so check what your installed version supports:

const cheerio = require('cheerio');

function loadHTMLEfficiently(html) {
  const $ = cheerio.load(html, {
    // Memory optimization options
    withDomLvl1: false,           // Reduces DOM Level 1 compatibility overhead
    normalizeWhitespace: true,    // Reduces text node memory usage
    xmlMode: false,               // HTML mode is more memory efficient than XML
    decodeEntities: false,        // Skip entity decoding to save memory
    lowerCaseAttributeNames: false, // Skip attribute name normalization

    // Parser options for large documents
    recognizeSelfClosing: true,
    recognizeCDATA: false,        // Skip CDATA processing if not needed
  });

  return $;
}

// For very large documents, consider limiting the parsed content
function loadHTMLWithLimits(html, maxSize = 10 * 1024 * 1024) { // 10MB limit
  if (html.length > maxSize) {
    console.warn(`HTML document truncated from ${html.length} to ${maxSize} bytes`);
    html = html.substring(0, maxSize);
  }

  return loadHTMLEfficiently(html);
}

3. Implement Selective Parsing

Parse only the parts of the document you need:

const fs = require('fs');
const cheerio = require('cheerio');

function selectivelyParseHTML(html, targetSelectors = []) {
  // Pre-filter HTML to include only relevant sections
  const relevantSections = [];

  for (const selector of targetSelectors) {
    // Use regex to extract sections containing target elements
    const selectorRegex = new RegExp(`<[^>]*class\\s*=\\s*[^>]*${selector}[^>]*>.*?</[^>]+>`, 'gis');
    const matches = html.match(selectorRegex);

    if (matches) {
      relevantSections.push(...matches);
    }
  }

  // Create minimal HTML with only relevant sections
  const minimalHTML = `<html><body>${relevantSections.join('')}</body></html>`;

  return cheerio.load(minimalHTML, {
    withDomLvl1: false,
    normalizeWhitespace: true,
    decodeEntities: false
  });
}

// Usage
const html = fs.readFileSync('large-document.html', 'utf8');
const $ = selectivelyParseHTML(html, ['article', 'product-card', 'news-item']);

// Extract data from the filtered document
const articles = $('.article').map((i, el) => {
  return {
    title: $(el).find('h1').text(),
    content: $(el).find('.content').text()
  };
}).get();

4. Implement Proper Cleanup and Garbage Collection

Actively manage memory by cleaning up references and triggering garbage collection:

const fs = require('fs');
const cheerio = require('cheerio');

function processLargeHTML(htmlPath) {
  return new Promise((resolve, reject) => {
    let results = [];

    const processChunk = (html) => {
      try {
        // Load with Cheerio
        let $ = cheerio.load(html, {
          withDomLvl1: false,
          normalizeWhitespace: true,
          decodeEntities: false
        });

        // Extract data
        const chunkData = $('.target-element').map((i, el) => {
          const data = {
            text: $(el).text().trim(),
            href: $(el).attr('href')
          };

          // Clear element reference
          $(el).remove();
          return data;
        }).get();

        results.push(...chunkData);

        // Explicitly clear references
        $ = null;
        html = null;

        // Force garbage collection in development (not recommended for production)
        if (process.env.NODE_ENV === 'development' && global.gc) {
          global.gc();
        }

      } catch (error) {
        reject(error);
      }
    };

    // Process file in chunks
    const stream = fs.createReadStream(htmlPath, { 
      encoding: 'utf8',
      highWaterMark: 64 * 1024 // 64KB chunks
    });

    let buffer = '';

    stream.on('data', (chunk) => {
      buffer += chunk;

      // Process complete HTML tags
      let lastTagEnd = buffer.lastIndexOf('>');
      if (lastTagEnd > 0) {
        const processableHTML = buffer.substring(0, lastTagEnd + 1);
        buffer = buffer.substring(lastTagEnd + 1);

        processChunk(processableHTML);
      }
    });

    stream.on('end', () => {
      if (buffer.trim()) {
        processChunk(buffer);
      }
      resolve(results);
    });

    stream.on('error', reject);
  });
}

Advanced Memory Optimization Techniques

1. Use Worker Threads for Parallel Processing

Distribute processing across multiple worker threads to keep the main thread responsive and spread the work across separate V8 heaps. Note that splitting a file by raw byte offsets, as in the sketch below, can cut elements in half at chunk boundaries, so in practice you may need to align boundaries with complete elements:

// main.js
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const path = require('path');

if (isMainThread) {
  // Main thread - coordinate workers
  async function processLargeHTMLWithWorkers(htmlPath, numWorkers = 4) {
    const fs = require('fs');
    const stats = fs.statSync(htmlPath);
    const chunkSize = Math.ceil(stats.size / numWorkers);

    const workers = [];
    const results = [];

    for (let i = 0; i < numWorkers; i++) {
      const start = i * chunkSize;
      const end = Math.min(start + chunkSize, stats.size);

      const worker = new Worker(__filename, {
        workerData: { htmlPath, start, end }
      });

      workers.push(new Promise((resolve, reject) => {
        worker.on('message', resolve);
        worker.on('error', reject);
        worker.on('exit', (code) => {
          if (code !== 0) {
            reject(new Error(`Worker stopped with exit code ${code}`));
          }
        });
      }));
    }

    const workerResults = await Promise.all(workers);
    return workerResults.flat();
  }

  // Export the function
  module.exports = { processLargeHTMLWithWorkers };

} else {
  // Worker thread - process HTML chunk
  const { htmlPath, start, end } = workerData;
  const fs = require('fs');
  const cheerio = require('cheerio');

  const stream = fs.createReadStream(htmlPath, { start, end, encoding: 'utf8' });
  let html = '';

  stream.on('data', (chunk) => {
    html += chunk;
  });

  stream.on('end', () => {
    try {
      const $ = cheerio.load(html, {
        withDomLvl1: false,
        normalizeWhitespace: true,
        decodeEntities: false
      });

      const data = $('.target').map((i, el) => $(el).text()).get();

      // Clean up
      html = null;

      parentPort.postMessage(data);
    } catch (error) {
      parentPort.postMessage({ error: error.message });
    }
  });
}
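
Assuming the file above is saved as main.js, a separate script could invoke it along these lines (the input path and worker count are illustrative):

// run.js (sketch)
const { processLargeHTMLWithWorkers } = require('./main');

processLargeHTMLWithWorkers('large-document.html', 4)
  .then((results) => console.log(`Extracted ${results.length} items`))
  .catch(console.error);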

2. Implement Memory Monitoring

Monitor memory usage to prevent crashes and optimize performance:

const v8 = require('v8');
const process = require('process');

class MemoryMonitor {
  constructor(options = {}) {
    this.maxHeapUsed = options.maxHeapUsed || 1.5 * 1024 * 1024 * 1024; // 1.5GB
    this.checkInterval = options.checkInterval || 1000; // 1 second
    this.callbacks = new Set();
  }

  start() {
    this.intervalId = setInterval(() => {
      const memoryUsage = process.memoryUsage();
      const heapStats = v8.getHeapStatistics();

      const stats = {
        heapUsed: memoryUsage.heapUsed,
        heapTotal: memoryUsage.heapTotal,
        external: memoryUsage.external,
        heapSizeLimit: heapStats.heap_size_limit,
        usedPercent: (memoryUsage.heapUsed / heapStats.heap_size_limit) * 100
      };

      // Trigger callbacks if memory usage is high
      if (stats.heapUsed > this.maxHeapUsed || stats.usedPercent > 80) {
        this.callbacks.forEach(callback => callback(stats));
      }
    }, this.checkInterval);
  }

  stop() {
    if (this.intervalId) {
      clearInterval(this.intervalId);
    }
  }

  onHighMemory(callback) {
    this.callbacks.add(callback);
  }

  getMemoryStats() {
    const memoryUsage = process.memoryUsage();
    const heapStats = v8.getHeapStatistics();

    return {
      heapUsed: this.formatBytes(memoryUsage.heapUsed),
      heapTotal: this.formatBytes(memoryUsage.heapTotal),
      external: this.formatBytes(memoryUsage.external),
      heapSizeLimit: this.formatBytes(heapStats.heap_size_limit),
      usedPercent: ((memoryUsage.heapUsed / heapStats.heap_size_limit) * 100).toFixed(2) + '%'
    };
  }

  formatBytes(bytes) {
    const sizes = ['Bytes', 'KB', 'MB', 'GB'];
    if (bytes === 0) return '0 Bytes';
    const i = Math.floor(Math.log(bytes) / Math.log(1024));
    return Math.round(bytes / Math.pow(1024, i) * 100) / 100 + ' ' + sizes[i];
  }
}

// Usage with HTML processing
const monitor = new MemoryMonitor();

monitor.onHighMemory((stats) => {
  console.warn('High memory usage detected:', stats);

  // Trigger garbage collection if available
  if (global.gc) {
    global.gc();
    console.log('Garbage collection triggered');
  }
});

monitor.start();

async function processWithMonitoring(htmlPath) {
  console.log('Starting processing. Memory stats:', monitor.getMemoryStats());

  // Your HTML processing logic here
  const results = await processLargeHTML(htmlPath);

  console.log('Processing complete. Memory stats:', monitor.getMemoryStats());
  monitor.stop();

  return results;
}

Best Practices for Large HTML Document Processing

1. Set Node.js Memory Limits

Configure appropriate memory limits for your Node.js process:

# Increase heap size to 4GB
node --max-old-space-size=4096 your-script.js

# Monitor garbage collection
node --expose-gc --trace-gc your-script.js
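
After raising the limit, it can be worth verifying at runtime that the flag took effect. A minimal sketch using v8.getHeapStatistics(), whose heap_size_limit roughly reflects the configured --max-old-space-size value:

const v8 = require('v8');

// heap_size_limit is reported in bytes and roughly tracks --max-old-space-size
const limitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap size limit: ${Math.round(limitMB)} MB`);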

2. Use Efficient Data Structures

Choose memory-efficient data structures for storing extracted data:

// Instead of storing entire DOM elements
const inefficientData = [];
$('.item').each((i, el) => {
  inefficientData.push($(el)); // Stores entire Cheerio object
});

// Store only the data you need
const efficientData = [];
$('.item').each((i, el) => {
  efficientData.push({
    text: $(el).text(),
    href: $(el).attr('href'),
    class: $(el).attr('class')
  });
});
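
When the extracted results themselves grow large, even an array of plain objects can become a memory problem. One option, sketched below using the same $ handle as the example above, is to append each record to a newline-delimited JSON file as it is extracted rather than accumulating everything in memory (the output path is a placeholder):

const fs = require('fs');

const out = fs.createWriteStream('extracted-items.jsonl'); // placeholder output path

$('.item').each((i, el) => {
  const record = {
    text: $(el).text(),
    href: $(el).attr('href')
  };
  // One JSON object per line; nothing is retained in a growing array
  out.write(JSON.stringify(record) + '\n');
});

out.end();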

3. Implement Backpressure Handling

Manage processing speed to prevent memory overflow:

const { Transform, pipeline } = require('stream');
const { promisify } = require('util');
const cheerio = require('cheerio');

const pipelineAsync = promisify(pipeline);

class BackpressureHTMLProcessor extends Transform {
  constructor(options = {}) {
    super({ 
      objectMode: true,
      highWaterMark: options.highWaterMark || 16 // Limit concurrent processing
    });
    this.processing = 0;
    this.maxConcurrent = options.maxConcurrent || 4;
  }

  async _transform(chunk, encoding, callback) {
    // Wait if too many chunks are being processed
    while (this.processing >= this.maxConcurrent) {
      await new Promise(resolve => setTimeout(resolve, 10));
    }

    this.processing++;

    try {
      const result = await this.processChunk(chunk);
      this.push(result);
    } catch (error) {
      return callback(error);
    } finally {
      this.processing--;
    }

    callback();
  }

  async processChunk(html) {
    // Your Cheerio processing logic
    const $ = cheerio.load(html, { withDomLvl1: false });
    const data = $('.target').map((i, el) => $(el).text()).get();
    return data;
  }
}
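
One way to wire this up is with the promisified pipeline from the imports above, using an object-mode Writable as the sink; the input path here is a placeholder:

const fs = require('fs');
const { Writable } = require('stream');

async function run() {
  await pipelineAsync(
    fs.createReadStream('large-document.html', { encoding: 'utf8' }),
    new BackpressureHTMLProcessor({ maxConcurrent: 4 }),
    new Writable({
      objectMode: true,
      write(result, encoding, callback) {
        // Handle each extracted result as it arrives
        console.log('Extracted:', result);
        callback();
      }
    })
  );
  console.log('Processing complete');
}

run().catch(console.error);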

Performance Monitoring and Debugging

Memory Profiling Tools

Use these tools to identify memory bottlenecks:

# Profile memory usage
node --inspect your-script.js

# Generate a V8 heap profile on exit
node --inspect --heap-prof your-script.js

# Use clinic.js for comprehensive profiling
npm install -g clinic
clinic doctor -- node your-script.js

Debugging Memory Leaks

// Add periodic memory reporting
setInterval(() => {
  const usage = process.memoryUsage();
  console.log('Memory usage:', {
    rss: Math.round(usage.rss / 1024 / 1024) + 'MB',
    heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB',
    heapTotal: Math.round(usage.heapTotal / 1024 / 1024) + 'MB',
    external: Math.round(usage.external / 1024 / 1024) + 'MB'
  });
}, 5000);
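
If those numbers keep climbing, comparing heap snapshots taken at different points in the run can show what is being retained. A small sketch using v8.writeHeapSnapshot (the resulting .heapsnapshot files can be loaded and diffed in Chrome DevTools):

const v8 = require('v8');

function captureSnapshot(label) {
  // Writes a .heapsnapshot file to the working directory and returns its path
  const file = v8.writeHeapSnapshot(`${label}-${Date.now()}.heapsnapshot`);
  console.log('Heap snapshot written to', file);
  return file;
}

// Example: capture before and after a processing run, then diff in DevTools
captureSnapshot('before-processing');
// ... run your HTML processing here ...
captureSnapshot('after-processing');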

Integration with Modern Scraping Tools

For more advanced scenarios involving dynamic content, you can combine these memory-efficient techniques with a headless browser such as Puppeteer, for example when pages load data via AJAX or when you need to manage timeouts while rendering large documents.
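
One pattern, sketched below with a placeholder URL and selector, is to let Puppeteer render the page and hand only the fragment you need to Cheerio, so the full rendered document never has to be re-parsed in Node:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeRendered(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });

    // Grab only the container we care about instead of the whole document
    const fragment = await page.$eval('#content', el => el.outerHTML); // '#content' is a placeholder selector

    const $ = cheerio.load(fragment);
    return $('a[href]').map((i, el) => $(el).attr('href')).get();
  } finally {
    await browser.close();
  }
}

scrapeRendered('https://example.com').then(console.log).catch(console.error);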

Conclusion

Efficiently handling memory usage when processing large HTML documents with Cheerio requires a combination of streaming, selective parsing, proper configuration, and active memory management. By implementing these strategies, you can process documents of virtually any size while maintaining stable memory usage and optimal performance.

The key is to avoid loading entire documents into memory when possible, configure Cheerio for minimal overhead, and actively manage object lifecycles to prevent memory leaks. Combined with proper monitoring and debugging tools, these techniques enable robust processing of large-scale HTML data extraction tasks.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
