How do you use Cheerio with streaming HTML parsers?

Cheerio is a powerful server-side implementation of jQuery that's perfect for parsing HTML documents. However, when dealing with large HTML files or real-time data streams, loading the entire document into memory can be inefficient or even impossible. This is where streaming HTML parsers come into play, allowing you to process HTML content as it arrives while maintaining Cheerio's familiar API.

Understanding Streaming vs Traditional Parsing

Traditional HTML parsing with Cheerio loads the entire document into memory before processing:

const cheerio = require('cheerio');
const axios = require('axios');

// Traditional approach - loads entire HTML into memory
// (assumes this code runs inside an async function)
const response = await axios.get('https://example.com');
const $ = cheerio.load(response.data);
$('div.content').each((i, el) => {
  console.log($(el).text());
});

While this works well for most use cases, it can be problematic when dealing with:

  • Large HTML documents (several MB or larger)
  • Real-time data streams
  • Memory-constrained environments
  • Processing multiple documents simultaneously

Setting Up Streaming HTML Parsing with Cheerio

Method 1: Using htmlparser2 with Cheerio

The most efficient approach is to use htmlparser2 (which Cheerio uses internally) with a streaming parser:

const cheerio = require('cheerio');
const { Parser } = require('htmlparser2');
const axios = require('axios');

async function streamParseWithCheerio(url) {
  let currentElement = '';
  let isTargetElement = false;
  let depth = 0; // nesting depth inside the matched element

  const parser = new Parser({
    onopentag(name, attributes) {
      if (!isTargetElement && name === 'div' && attributes.class === 'content') {
        isTargetElement = true;
        depth = 0;
      }

      if (isTargetElement) {
        depth++;
        // Rebuild the opening tag (the target itself and any nested children)
        currentElement += `<${name}`;
        for (const [key, value] of Object.entries(attributes)) {
          currentElement += ` ${key}="${value}"`;
        }
        currentElement += '>';
      }
    },

    ontext(text) {
      if (isTargetElement) {
        currentElement += text;
      }
    },

    onclosetag(tagname) {
      if (isTargetElement) {
        currentElement += `</${tagname}>`;
        depth--;

        // Only process once the matched element itself closes,
        // not a nested child with the same tag name
        if (depth === 0) {
          const $ = cheerio.load(currentElement);
          console.log($('div.content').text().trim());

          // Reset for the next matching element
          currentElement = '';
          isTargetElement = false;
        }
      }
    }
  });

  const response = await axios({
    method: 'get',
    url: url,
    responseType: 'stream'
  });

  return new Promise((resolve, reject) => {
    response.data.on('data', (chunk) => {
      parser.write(chunk.toString());
    });

    response.data.on('end', () => {
      parser.end();
      resolve();
    });

    response.data.on('error', reject);
  });
}
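
Because streamParseWithCheerio resolves once the response stream has been fully parsed, it can be awaited or chained like any other promise. A minimal usage sketch (the URL is a placeholder, not a real endpoint):

streamParseWithCheerio('https://example.com/large-page')
  .then(() => console.log('Finished streaming parse'))
  .catch((error) => console.error('Streaming parse failed:', error));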

Method 2: Using Transform Streams

For more complex scenarios, you can create a transform stream that processes HTML chunks:

const { Transform } = require('stream');
const cheerio = require('cheerio');
const axios = require('axios');

class CheerioStreamParser extends Transform {
  constructor(selector, options = {}) {
    super({ objectMode: true });
    this.selector = selector;
    this.buffer = '';
    this.elementStack = [];
    this.targetDepth = 0;
    this.currentDepth = 0;
  }

  _transform(chunk, encoding, callback) {
    this.buffer += chunk.toString();

    // Simple tag matching (for production, use a proper HTML parser --
    // this sketch ignores comments, scripts and self-closing tags)
    const tagRegex = /<\/?[^>]+>/g;
    let match;
    let lastIndex = 0;

    while ((match = tagRegex.exec(this.buffer)) !== null) {
      const tag = match[0];
      const isClosing = tag.startsWith('</');

      // Text between the previous tag and this one belongs to the
      // element currently being captured
      if (this.targetDepth > 0) {
        this.elementStack.push(this.buffer.substring(lastIndex, match.index));
      }

      if (!isClosing) {
        this.currentDepth++;
        if (this.targetDepth === 0 && this.matchesSelector(tag)) {
          this.targetDepth = this.currentDepth;
          this.elementStack = [tag];
        } else if (this.targetDepth > 0) {
          this.elementStack.push(tag);
        }
      } else {
        if (this.targetDepth > 0 && this.currentDepth === this.targetDepth) {
          // Complete element found
          this.elementStack.push(tag);
          const completeElement = this.elementStack.join('');

          const $ = cheerio.load(completeElement);
          this.push($(this.selector));

          this.elementStack = [];
          this.targetDepth = 0;
        }
        this.currentDepth--;
      }

      lastIndex = match.index + match[0].length;
    }

    // Keep the unprocessed tail of the buffer for the next chunk
    this.buffer = this.buffer.substring(lastIndex);
    callback();
  }

  matchesSelector(tag) {
    // Simple selector matching - extend for more complex selectors
    return tag.includes('class="content"');
  }
}

// Usage
async function useTransformStream(url) {
  const response = await axios({
    method: 'get',
    url: url,
    responseType: 'stream'
  });

  const parser = new CheerioStreamParser('div.content');

  parser.on('data', ($element) => {
    console.log($element.text().trim());
  });

  response.data.pipe(parser);
}

Advanced Streaming Techniques

Processing Large Tables

When dealing with large HTML tables, you can stream-process rows individually:

const { Parser } = require('htmlparser2');
const cheerio = require('cheerio');

function streamTableRows(htmlStream) {
  let currentRow = '';
  let inTableRow = false;
  let rowCount = 0;

  const parser = new Parser({
    onopentag(name, attributes) {
      if (name === 'tr') {
        inTableRow = true;
        currentRow = '<tr>';
      } else if (inTableRow) {
        currentRow += `<${name}`;
        for (const [key, value] of Object.entries(attributes)) {
          currentRow += ` ${key}="${value}"`;
        }
        currentRow += '>';
      }
    },

    ontext(text) {
      if (inTableRow) {
        currentRow += text;
      }
    },

    onclosetag(tagname) {
      if (inTableRow) {
        currentRow += `</${tagname}>`;

        if (tagname === 'tr') {
          // Process the complete row with Cheerio
          // Wrap in <table> so Cheerio keeps the row markup intact
          const $ = cheerio.load(`<table>${currentRow}</table>`);
          const cells = [];

          // Capture both data cells and header cells
          $('td, th').each((i, cell) => {
            cells.push($(cell).text().trim());
          });

          console.log(`Row ${++rowCount}:`, cells);

          // Reset for next row
          currentRow = '';
          inTableRow = false;
        }
      }
    }
  });

  // Wire the incoming HTML stream into the parser
  htmlStream.on('data', (chunk) => parser.write(chunk.toString()));
  htmlStream.on('end', () => parser.end());

  return parser;
}
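
Any readable stream of HTML works as input; a local file is just as valid as an HTTP response. A minimal usage sketch, assuming a hypothetical local file named table.html:

const fs = require('fs');

// 'table.html' is a hypothetical file used only for illustration
streamTableRows(fs.createReadStream('table.html', 'utf8'));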

Handling Nested Elements

For processing nested structures while streaming:

function streamNestedElements(htmlStream, targetSelector) {
  const elementBuffer = new Map();
  let elementId = 0;
  let currentParent = null;

  // Simple selector matching - extend for more complex selectors
  function matchesSelector(element) {
    return element.name === 'div' &&
           element.attributes.class &&
           element.attributes.class.includes('content');
  }

  const parser = new Parser({
    onopentag(name, attributes) {
      const id = ++elementId;
      const element = {
        id,
        name,
        attributes,
        parent: currentParent,
        content: `<${name}`,
        isTarget: false
      };

      // Build opening tag
      for (const [key, value] of Object.entries(attributes)) {
        element.content += ` ${key}="${value}"`;
      }
      element.content += '>';

      elementBuffer.set(id, element);
      currentParent = id;

      // Check if this matches our target selector
      if (matchesSelector(element)) {
        element.isTarget = true;
      }
    },

    ontext(text) {
      if (currentParent) {
        elementBuffer.get(currentParent).content += text;
      }
    },

    onclosetag(tagname) {
      if (currentParent) {
        const element = elementBuffer.get(currentParent);
        element.content += `</${tagname}>`;

        // If this is a target element, process it
        if (element.isTarget) {
          const $ = cheerio.load(element.content);
          console.log('Found target:', $(targetSelector).text().trim());
        }

        // Merge the completed element's markup into its parent so nested
        // children are preserved, then drop it to prevent memory leaks
        if (element.parent) {
          elementBuffer.get(element.parent).content += element.content;
        }
        currentParent = element.parent;
        elementBuffer.delete(element.id);
      }
    }
  });

  // Wire the incoming HTML stream into the parser
  htmlStream.on('data', (chunk) => parser.write(chunk.toString()));
  htmlStream.on('end', () => parser.end());

  return parser;
}
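
As with the table helper, streamNestedElements accepts any readable HTML stream. A minimal usage sketch with an axios response stream (the URL is a placeholder, and axios is assumed to be required as in the earlier examples):

async function findContentDivs(url) {
  const response = await axios({ method: 'get', url, responseType: 'stream' });
  streamNestedElements(response.data, 'div.content');
}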

Performance Considerations and Best Practices

Memory Management

When using streaming parsers, it's crucial to manage memory effectively:

class MemoryEfficientParser {
  constructor(options = {}) {
    this.maxBufferSize = options.maxBufferSize || 1024 * 1024; // 1MB
    this.buffer = '';
    this.processedCount = 0;
  }

  processChunk(chunk) {
    this.buffer += chunk;

    // Process complete elements and clear buffer periodically
    if (this.buffer.length > this.maxBufferSize) {
      this.flushBuffer();
    }
  }

  flushBuffer() {
    // Process any complete elements in buffer
    // Then clear processed portions
    this.buffer = this.buffer.substring(this.getProcessedLength());
  }

  getProcessedLength() {
    // Calculate how much of the buffer has been processed
    // This is implementation-specific
    return 0;
  }
}
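
A concrete way to keep memory bounded with the htmlparser2 approach from Method 1 is to cap the size of the fragment being rebuilt and discard elements that outgrow it. This is a minimal sketch, assuming the same currentElement and isTargetElement variables used earlier; the 512 KB limit is an arbitrary example value:

const MAX_FRAGMENT_SIZE = 512 * 1024; // arbitrary example cap (512 KB)

function appendToFragment(text) {
  // Stop accumulating once a single element grows past the cap,
  // so one oversized element cannot exhaust memory
  if (currentElement.length + text.length > MAX_FRAGMENT_SIZE) {
    console.warn('Skipping oversized element');
    currentElement = '';
    isTargetElement = false;
    return;
  }
  currentElement += text;
}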

Error Handling

Implement robust error handling for streaming scenarios:

function createRobustStreamParser(url) {
  const parser = new Parser({
    onerror(error) {
      console.error('Parser error:', error);
      // Implement recovery logic
    }
  });

  return axios({
    method: 'get',
    url: url,
    responseType: 'stream',
    timeout: 30000
  })
  .then(response => new Promise((resolve, reject) => {
    response.data.on('error', (error) => {
      console.error('Stream error:', error);
      reject(error);
    });

    response.data.on('data', (chunk) => {
      try {
        parser.write(chunk.toString());
      } catch (error) {
        console.error('Processing error:', error);
      }
    });

    response.data.on('end', () => {
      parser.end();
      resolve();
    });
  }))
  .catch(error => {
    console.error('Request error:', error);
  });
}

Integration with Other Tools

Combining with Puppeteer for Dynamic Content

For JavaScript-heavy sites, you might need to combine streaming parsing with tools like Puppeteer. While handling AJAX requests using Puppeteer can help with dynamic content, you can also stream the final rendered HTML:

const puppeteer = require('puppeteer');

async function streamFromPuppeteer(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Let the page finish rendering its JavaScript-driven content
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Get the final rendered HTML and process it as a stream
  // (createStreamParser() stands for a streaming parser such as the
  // htmlparser2 setup from Method 1 above)
  const html = await page.content();
  const streamParser = createStreamParser();

  // Process in chunks to mirror how a true network stream would arrive
  const chunkSize = 1024;
  for (let i = 0; i < html.length; i += chunkSize) {
    const chunk = html.substring(i, i + chunkSize);
    streamParser.write(chunk);
  }

  streamParser.end();
  await browser.close();
}

Use Cases and Examples

Real-time Data Processing

Streaming parsers are particularly useful for real-time data feeds:

const WebSocket = require('ws');
const { Parser } = require('htmlparser2');
const cheerio = require('cheerio');

function setupRealtimeParser(wsUrl) {
  const ws = new WebSocket(wsUrl);
  let captureUpdate = false;
  let updateContent = '';

  const parser = new Parser({
    onopentag(name, attributes) {
      if (name === 'update' && attributes.type === 'price') {
        captureUpdate = true;
        updateContent = '';
      } else if (captureUpdate) {
        // Keep nested tags such as <price> and <symbol> so Cheerio can select them
        updateContent += `<${name}>`;
      }
    },

    ontext(text) {
      if (captureUpdate) {
        updateContent += text;
      }
    },

    onclosetag(tagname) {
      if (!captureUpdate) return;

      if (tagname === 'update') {
        const $ = cheerio.load(`<update>${updateContent}</update>`);
        const price = $('price').text();
        const symbol = $('symbol').text();

        console.log(`Price update: ${symbol} = ${price}`);

        captureUpdate = false;
      } else {
        updateContent += `</${tagname}>`;
      }
    }
  });

  ws.on('message', (data) => {
    parser.write(data.toString());
  });
}

Log File Processing

Stream parsing is excellent for processing large log files:

const fs = require('fs');
const readline = require('readline');

async function parseLogStream(filePath) {
  const fileStream = fs.createReadStream(filePath);
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    if (line.includes('<entry>')) {
      // Process HTML-formatted log entries
      const $ = cheerio.load(line);
      const timestamp = $('timestamp').text();
      const level = $('level').text();
      const message = $('message').text();

      console.log(`[${timestamp}] ${level}: ${message}`);
    }
  }
}

Conclusion

Using Cheerio with streaming HTML parsers provides a powerful combination of familiar jQuery-like syntax with efficient memory usage for large documents and real-time data processing. The key is to break down the HTML into manageable chunks while maintaining the parsing context needed for accurate element extraction.

The streaming approach is particularly beneficial when:

  • Processing large HTML documents that might not fit in memory
  • Handling real-time data feeds
  • Building scalable applications that process multiple documents simultaneously
  • Working in memory-constrained environments

When implementing streaming parsers, always consider memory management, error handling, and the specific requirements of your use case. For scenarios involving complex JavaScript-rendered content, combining these techniques with browser automation tools can provide the best of both worlds.

Remember to test your streaming implementation thoroughly with various HTML structures and edge cases to ensure robust performance in production environments.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
