How to Handle Infinite Scroll Pages in Puppeteer

Infinite scroll pages present unique challenges for web scraping, as content loads dynamically as users scroll down the page. This guide provides comprehensive techniques for handling infinite scroll pages using Puppeteer, including auto-scrolling strategies, content detection methods, and efficient data extraction approaches.

Understanding Infinite Scroll Pages

Infinite scroll pages load content progressively as users scroll down, replacing traditional pagination. Social media platforms like Twitter and Instagram, as well as e-commerce sites like Amazon, use this pattern to improve user experience. When scraping these pages, you need to simulate scrolling behavior to trigger content loading.

Basic Auto-Scrolling Implementation

Simple Scroll-to-Bottom Approach

The most straightforward method involves scrolling to the bottom of the page repeatedly until no new content loads:

const puppeteer = require('puppeteer');

async function scrapeInfiniteScroll(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  let previousHeight;
  let currentHeight = await page.evaluate(() => document.body.scrollHeight);

  while (previousHeight !== currentHeight) {
    previousHeight = currentHeight;

    // Scroll to bottom
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    // Wait for new content to load (page.waitForTimeout was removed in Puppeteer v22+)
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Get new scroll height
    currentHeight = await page.evaluate(() => document.body.scrollHeight);
  }

  console.log('Finished loading all content');
  await browser.close();
}
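The loop's exit condition compares consecutive scrollHeight readings. Factored into a pure helper (the function name below is illustrative, not part of Puppeteer's API), the termination rule can be unit-tested without launching a browser:

```javascript
// Given successive scrollHeight readings taken after each scroll, return
// how many scroll passes run before the height stops growing (i.e. two
// equal readings in a row, matching the loop's exit condition above).
function scrollPassesUntilStable(heights) {
  let passes = 0;
  for (let i = 1; i < heights.length; i++) {
    passes++;
    if (heights[i] === heights[i - 1]) break; // height unchanged: done
  }
  return passes;
}
```

Keeping this logic separate makes it easy to experiment with stricter exit rules (for example, requiring several equal readings) without touching the browser code.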

Progressive Scrolling with Element Counting

A more reliable approach involves counting specific elements and scrolling until no new elements appear:

async function scrapeWithElementCounting(url, elementSelector) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  let previousCount = 0;
  let currentCount = await page.$$eval(elementSelector, elements => elements.length);
  let stableCount = 0;

  while (stableCount < 3) { // Wait for 3 stable iterations
    previousCount = currentCount;

    // Scroll down
    await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight);
    });

    // Wait for potential new content
    await new Promise(resolve => setTimeout(resolve, 1500));

    // Count elements again
    currentCount = await page.$$eval(elementSelector, elements => elements.length);

    if (currentCount === previousCount) {
      stableCount++;
    } else {
      stableCount = 0;
    }

    console.log(`Elements loaded: ${currentCount}`);
  }

  await browser.close();
}
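The stable-iteration bookkeeping is easy to get subtly wrong, so it can help to isolate it. Here is a minimal sketch (the class name and API are hypothetical) encapsulating the "unchanged for N observations" rule used above:

```javascript
// Tracks consecutive observations where the element count did not change.
// `threshold` mirrors the "3 stable iterations" rule in the loop above.
class StabilityTracker {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.previousCount = -1;
    this.stableCount = 0;
  }

  // Feed the latest element count; returns true once the count has been
  // unchanged for `threshold` consecutive observations.
  observe(count) {
    this.stableCount = count === this.previousCount ? this.stableCount + 1 : 0;
    this.previousCount = count;
    return this.stableCount >= this.threshold;
  }
}
```

In the scraping loop, the previousCount/stableCount variables would collapse into a single `tracker.observe(currentCount)` call.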

Advanced Scrolling Techniques

Scroll with Network Monitoring

Monitor network requests to detect when new content stops loading:

async function scrapeWithNetworkMonitoring(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let requestCount = 0;
  let responseCount = 0;

  // Monitor network activity
  page.on('request', () => requestCount++);
  page.on('response', () => responseCount++);

  await page.goto(url, { waitUntil: 'networkidle2' });

  let scrollAttempts = 0;
  const maxScrollAttempts = 50;

  while (scrollAttempts < maxScrollAttempts) {
    const initialRequests = requestCount;

    // Scroll down
    await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight);
    });

    // Wait for potential requests
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Check if new requests were made
    if (requestCount === initialRequests) {
      console.log('No new requests detected, assuming end of content');
      break;
    }

    scrollAttempts++;
  }

  await browser.close();
}
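One caveat with raw counters: failed requests never fire a response event, so requestCount and responseCount can drift apart. A tracker that pairs each request with its completion gives a stricter idleness signal. This is a hand-rolled sketch; you would wire it to page.on('request'), page.on('response'), and page.on('requestfailed'):

```javascript
// Pairs each started request with its completion (success or failure),
// so "idle" means no outstanding requests at all.
class NetworkIdleTracker {
  constructor() {
    this.inFlight = 0;
  }
  onRequest() {
    this.inFlight++;
  }
  // Call for both 'response' and 'requestfailed' events.
  onSettled() {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
  isIdle() {
    return this.inFlight === 0;
  }
}
```

Polling `tracker.isIdle()` after a scroll is a more robust end-of-content check than comparing request counts alone.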

Smooth Scrolling with Viewport-Based Detection

Implement smooth scrolling that mimics human behavior:

async function smoothScrollInfinite(url, targetItemCount = 100) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Register the scroll helper before navigating: evaluateOnNewDocument only
  // applies to documents loaded after it is called
  await page.evaluateOnNewDocument(() => {
    window.smoothScrollBy = (distance, duration) => {
      return new Promise(resolve => {
        const startTime = performance.now();
        const startScrollTop = window.pageYOffset;

        function scrollStep(timestamp) {
          const elapsed = timestamp - startTime;
          const progress = Math.min(elapsed / duration, 1);
          const easeProgress = 0.5 - Math.cos(progress * Math.PI) / 2;

          window.scrollTo(0, startScrollTop + distance * easeProgress);

          if (progress < 1) {
            requestAnimationFrame(scrollStep);
          } else {
            resolve();
          }
        }

        requestAnimationFrame(scrollStep);
      });
    };
  });

  await page.goto(url, { waitUntil: 'networkidle2' });

  let itemCount = 0;
  const itemSelector = '.item'; // Adjust selector as needed

  while (itemCount < targetItemCount) {
    // Smooth scroll
    await page.evaluate(() => window.smoothScrollBy(window.innerHeight, 1000));

    // Wait for content to load
    await new Promise(resolve => setTimeout(resolve, 1500));

    // Check current item count
    const newItemCount = await page.$$eval(itemSelector, items => items.length);

    if (newItemCount === itemCount) {
      console.log('No new items loaded, reached end of content');
      break;
    }

    itemCount = newItemCount;
    console.log(`Loaded ${itemCount} items`);
  }

  await browser.close();
}
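The expression `0.5 - Math.cos(progress * Math.PI) / 2` inside scrollStep is a standard cosine ease-in-out curve. Extracted on its own, its boundary behavior can be verified directly:

```javascript
// Ease-in-out curve used by smoothScrollBy: maps linear progress in [0, 1]
// to eased progress in [0, 1], slow at both ends and fastest midway.
function easeInOut(progress) {
  return 0.5 - Math.cos(progress * Math.PI) / 2;
}
```

Because the eased value starts and ends exactly at 0 and 1, the scroll always covers the full requested distance, just with human-like acceleration and deceleration.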

Data Extraction During Scrolling

Incremental Data Collection

Extract data while scrolling to avoid memory issues with large datasets:

async function extractDataWhileScrolling(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  let allData = [];
  let processedItems = new Set();

  while (true) {
    // Extract new items
    const newItems = await page.evaluate((processedIds) => {
      // processedIds arrives as a plain array (Sets don't serialize across
      // the evaluate boundary), so rebuild a Set for O(1) membership checks
      const processedSet = new Set(processedIds);
      const items = Array.from(document.querySelectorAll('.item'));
      return items
        .filter(item => !processedSet.has(item.id))
        .map(item => ({
          id: item.id,
          title: item.querySelector('.title')?.textContent,
          description: item.querySelector('.description')?.textContent,
          price: item.querySelector('.price')?.textContent
        }));
    }, Array.from(processedItems));

    if (newItems.length === 0) {
      // Scroll and try again
      await page.evaluate(() => window.scrollBy(0, window.innerHeight));
      await new Promise(resolve => setTimeout(resolve, 2000));

      // Check if we've reached the end
      const moreItems = await page.evaluate(() => {
        return document.querySelectorAll('.item').length;
      });

      if (moreItems === allData.length) {
        break; // No new items after scrolling
      }
    } else {
      // Add new items to our collection
      allData = allData.concat(newItems);
      newItems.forEach(item => processedItems.add(item.id));

      console.log(`Extracted ${newItems.length} new items. Total: ${allData.length}`);
    }
  }

  await browser.close();
  return allData;
}
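The dedup-and-merge step is pure bookkeeping and can be pulled out of the scraping loop. A sketch (the function name is illustrative), under the assumption that every item exposes a unique id:

```javascript
// Merge a batch of freshly scraped items into the running collection,
// skipping any ids already seen. Returns the items that were actually new.
function mergeNewItems(allData, seenIds, batch) {
  const added = [];
  for (const item of batch) {
    if (seenIds.has(item.id)) continue; // already collected on a prior pass
    seenIds.add(item.id);
    allData.push(item);
    added.push(item);
  }
  return added;
}
```

Returning only the newly added items makes progress logging trivial and keeps the caller from re-counting the whole collection.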

Handling Load More Buttons

Some infinite scroll implementations use "Load More" buttons instead of automatic scrolling:

async function handleLoadMoreButton(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  const loadMoreSelector = '.load-more-btn, [data-testid="load-more"], .show-more';

  while (true) {
    try {
      // Wait for load more button
      await page.waitForSelector(loadMoreSelector, { timeout: 5000 });

      // Click the button
      await page.click(loadMoreSelector);

      // Wait for content to load
      await new Promise(resolve => setTimeout(resolve, 2000));

      console.log('Clicked load more button');

    } catch (error) {
      console.log('No more load more buttons found');
      break;
    }
  }

  await browser.close();
}

Performance Optimization

Memory Management

For large datasets, implement memory-efficient scrolling:

async function memoryEfficientScrolling(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Reduce memory usage
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    if (req.resourceType() === 'image' || req.resourceType() === 'stylesheet') {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto(url, { waitUntil: 'networkidle2' });

  let batchCount = 0;
  const batchSize = 20;

  while (true) {
    // Process items in batches
    const items = await page.evaluate((batchSize, batchCount) => {
      const allItems = document.querySelectorAll('.item');
      const start = batchCount * batchSize;
      const end = start + batchSize;

      return Array.from(allItems).slice(start, end).map(item => ({
        title: item.querySelector('.title')?.textContent,
        url: item.querySelector('a')?.href
      }));
    }, batchSize, batchCount);

    if (items.length === 0) {
      // Try scrolling for more content
      await page.evaluate(() => window.scrollBy(0, window.innerHeight));
      await new Promise(resolve => setTimeout(resolve, 2000));

      // Check if scrolling loaded new items
      const totalItems = await page.$$eval('.item', items => items.length);
      if (totalItems <= batchCount * batchSize) {
        break;
      }
    } else {
      // Process batch
      console.log(`Processing batch ${batchCount + 1}: ${items.length} items`);
      batchCount++;
    }
  }

  await browser.close();
}
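The interception handler's blocklist can live in one place as a predicate. This sketch extends the original list with 'font' and 'media', a common but site-dependent assumption; note that 'xhr' and 'fetch' must never be blocked, since they carry the infinite scroll data itself:

```javascript
// Resource types to abort during scraping to save bandwidth and memory.
// 'font' and 'media' are additions beyond the handler above; tune per site.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

// Decide whether a request should be aborted, given its resourceType() value.
function shouldBlockResource(resourceType) {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}
```

The interception callback then reduces to `shouldBlockResource(req.resourceType()) ? req.abort() : req.continue()`.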

Error Handling and Retry Logic

Implement robust error handling for unstable infinite scroll pages:

async function robustInfiniteScrolling(url, maxRetries = 3) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let retryCount = 0;

  while (retryCount < maxRetries) {
    try {
      await page.goto(url, { waitUntil: 'networkidle2' });

      let scrollAttempts = 0;
      const maxScrollAttempts = 100;

      while (scrollAttempts < maxScrollAttempts) {
        const beforeScrollHeight = await page.evaluate(() => document.body.scrollHeight);

        // Scroll with error handling
        await page.evaluate(() => {
          window.scrollTo(0, document.body.scrollHeight);
        });

        // Wait with timeout
        try {
          await page.waitForFunction(
            (prevHeight) => document.body.scrollHeight > prevHeight,
            { timeout: 5000 },
            beforeScrollHeight
          );
        } catch (timeoutError) {
          console.log('No new content loaded within timeout');
          break;
        }

        scrollAttempts++;
      }

      break; // Success, exit retry loop

    } catch (error) {
      console.log(`Attempt ${retryCount + 1} failed: ${error.message}`);
      retryCount++;

      if (retryCount < maxRetries) {
        console.log('Retrying in 5 seconds...');
        await new Promise(resolve => setTimeout(resolve, 5000));
      }
    }
  }

  await browser.close();
}
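The retry scaffolding around page.goto generalizes to any flaky step. A minimal wrapper (name and signature are illustrative) that mirrors the loop above, waiting a fixed delay between attempts and rethrowing the last error once the budget is exhausted:

```javascript
// Run `task` up to `maxRetries` times, pausing `delayMs` between attempts.
// Resolves with the first successful result; rethrows the last error if
// every attempt fails.
async function withRetries(task, maxRetries = 3, delayMs = 5000) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

With this helper, the function above becomes `await withRetries(() => scrollToEnd(page), maxRetries)`, keeping the scraping logic free of retry plumbing.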

Best Practices

  1. Always implement timeouts to prevent infinite loops
  2. Monitor network activity to detect when content stops loading
  3. Use element counting rather than scroll height for more reliable detection
  4. Implement smooth scrolling to mimic human behavior and avoid detection
  5. Extract data incrementally to avoid memory issues with large datasets
  6. Handle errors gracefully with retry logic and fallback strategies
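
Practice 1 deserves a concrete guard: a scroll loop should stop at whichever comes first, an attempt budget or a wall-clock deadline. A small factory sketch (the injectable `now` parameter exists only to make the deadline testable):

```javascript
// Hard stop for scroll loops: combines a maximum attempt count with a
// wall-clock deadline so neither a slow site nor a buggy end-detector
// can hang the scraper.
function makeScrollGuard(maxAttempts, maxDurationMs, now = Date.now) {
  const startedAt = now();
  let attempts = 0;
  // Call once per scroll pass; returns false when the loop should stop.
  return () => {
    attempts++;
    return attempts <= maxAttempts && now() - startedAt <= maxDurationMs;
  };
}
```

A loop then reads `while (guard()) { ... }` instead of juggling separate counters and timers.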

For particularly complex infinite scroll implementations, Playwright is a viable alternative: it offers its own methods for handling timeouts, and its approach to handling AJAX calls supports more advanced network monitoring techniques.

Conclusion

Handling infinite scroll pages in Puppeteer requires a combination of scrolling automation, content detection, and efficient data extraction. The key is to implement robust detection mechanisms that can identify when all content has loaded while maintaining performance and reliability. Choose the approach that best fits your specific use case and always include proper error handling and retry logic for production applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
