How Do I Handle Pagination in n8n Web Scraping Workflows?

Pagination is one of the most common challenges when scraping websites at scale. Whether you're extracting product listings, blog posts, or search results, understanding how to navigate through multiple pages efficiently in n8n is essential for successful data extraction workflows.

This guide covers multiple approaches to handling pagination in n8n, from simple loop-based methods to advanced browser automation techniques.

Understanding Pagination Types

Before diving into implementation, it's important to recognize the different types of pagination you'll encounter:

  1. URL-based pagination: Pages accessible via URL parameters (e.g., ?page=2)
  2. Button-based pagination: "Next" buttons that trigger page loads
  3. Infinite scroll: Content that loads dynamically as you scroll
  4. API pagination: REST APIs with pagination tokens or offsets

Method 1: Loop-Based URL Pagination

The simplest pagination method works when pages follow a predictable URL pattern. This approach uses n8n's loop functionality to iterate through multiple page numbers.

Basic Loop Setup

  1. Generate the page range with a Function (or Code) node:
// In a Function node to generate page numbers
const items = [];
const startPage = 1;
const endPage = 10;

for (let page = startPage; page <= endPage; page++) {
  items.push({ json: { page } }); // n8n items must wrap their data in a json key
}

return items;
  2. Configure an HTTP Request node to fetch each page:
URL: https://example.com/products?page={{$json["page"]}}
Method: GET
  3. Parse the HTML using the HTML Extract node or a Code node with Cheerio:
// Using Cheerio in a Code node
const cheerio = require('cheerio');
// HTML body from the HTTP Request node (the exact field depends on its response settings)
const html = $input.item.json.data;
const $ = cheerio.load(html);

const products = [];
$('.product-item').each((i, el) => {
  products.push({
    title: $(el).find('.product-title').text().trim(),
    price: $(el).find('.product-price').text().trim(),
    url: $(el).find('a').attr('href')
  });
});

return products.map(product => ({ json: product }));

Dynamic Page Detection

Often you don't know the total number of pages upfront. Here's how to scrape until no more data is found:

// Function node: check for a next page
const response = $node["HTTP Request"].json;
const hasResults = response.products && response.products.length > 0;
const currentPage = $json.page || 1;

if (hasResults) {
  return {
    json: {
      page: currentPage + 1,
      continue: true
    }
  };
}

return {
  json: {
    continue: false
  }
};

Connect this to an IF node that continues the loop only when continue is true.
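In the IF node, one way to set the condition (the field name follows the Function node output above):

Condition type: Boolean
Value 1: {{ $json["continue"] }}
Value 2: true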

Method 2: Browser Automation with Puppeteer

For JavaScript-rendered content and complex pagination, using Puppeteer with n8n provides more control. This is particularly useful for sites that rely on JavaScript for navigation.

Click-Based Pagination

// In n8n Puppeteer node or Code node with Puppeteer
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

let allData = [];
let hasNextPage = true;
let pageNum = 1;

await page.goto('https://example.com/listings', {
  waitUntil: 'networkidle2'
});

while (hasNextPage && pageNum <= 50) {
  // Wait for content to load
  await page.waitForSelector('.listing-item', { timeout: 5000 });

  // Extract data from current page
  const pageData = await page.evaluate(() => {
    const items = [];
    document.querySelectorAll('.listing-item').forEach(item => {
      items.push({
        title: item.querySelector('.title')?.textContent.trim(),
        description: item.querySelector('.description')?.textContent.trim(),
        link: item.querySelector('a')?.href
      });
    });
    return items;
  });

  allData.push(...pageData);

  // Check if next button exists
  const nextButton = await page.$('.next-page-button:not(.disabled)');

  if (nextButton) {
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      nextButton.click()
    ]);
    pageNum++;
  } else {
    hasNextPage = false;
  }
}

await browser.close();

return allData.map(item => ({ json: item }));

Handling Infinite Scroll

For infinite scroll pagination, you need to simulate scrolling behavior:

// Puppeteer node: Infinite scroll handler
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto('https://example.com/feed', {
  waitUntil: 'networkidle2'
});

let previousHeight = 0;
let scrollAttempts = 0;
const maxScrolls = 20;

while (scrollAttempts < maxScrolls) {
  // Scroll to bottom
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight);
  });

  // Wait for new content to load (page.waitForTimeout was removed in
  // recent Puppeteer versions, so use a plain timeout instead)
  await new Promise(resolve => setTimeout(resolve, 2000));

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);

  if (currentHeight === previousHeight) {
    // No new content loaded
    break;
  }

  previousHeight = currentHeight;
  scrollAttempts++;
}

// Extract all loaded data
const allItems = await page.evaluate(() => {
  const items = [];
  document.querySelectorAll('.feed-item').forEach(item => {
    items.push({
      content: item.querySelector('.content')?.textContent.trim(),
      author: item.querySelector('.author')?.textContent.trim(),
      timestamp: item.querySelector('.timestamp')?.textContent.trim()
    });
  });
  return items;
});

await browser.close();

return allItems.map(item => ({ json: item }));

Method 3: API Pagination

Many modern websites load data via API calls. Intercepting these calls often provides the cleanest scraping approach.
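To find those calls, watch the network traffic while the page loads, either in your browser's developer tools or programmatically. Below is a minimal sketch using Puppeteer's response listener; the target URL and the '/api/' filter are assumptions, so adjust them to whatever endpoints the site actually uses.

// Sketch: capture JSON API calls a page makes while loading
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

const apiCalls = [];
page.on('response', (response) => {
  const url = response.url();
  const type = response.headers()['content-type'] || '';
  // '/api/' is an assumed pattern -- match it to the real endpoints
  if (url.includes('/api/') && type.includes('application/json')) {
    apiCalls.push({ url, status: response.status() });
  }
});

await page.goto('https://example.com/listings', { waitUntil: 'networkidle2' });
await browser.close();

// apiCalls now lists the JSON endpoints the page used, including any
// pagination parameters -- these can be called directly as shown below
return apiCalls.map(call => ({ json: call }));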

Offset-Based Pagination

// Code node: API pagination with offset
// (this.helpers.httpRequest is n8n's built-in HTTP helper; $http is not available here)
const pageSize = 50;
let offset = 0;
let allResults = [];
let hasMore = true;

while (hasMore) {
  // httpRequest returns the parsed response body directly when json is true
  const data = await this.helpers.httpRequest({
    method: 'GET',
    url: `https://api.example.com/items?limit=${pageSize}&offset=${offset}`,
    headers: {
      'Accept': 'application/json'
    },
    json: true
  });

  allResults.push(...data.items);

  hasMore = data.items.length === pageSize;
  offset += pageSize;

  // Safety limit
  if (offset > 1000) break;
}

return allResults.map(item => ({ json: item }));

Cursor-Based Pagination

Some APIs use cursor tokens instead of offsets:

// Code node: cursor-based API pagination
let cursor = null;
let allResults = [];
let pageCount = 0;
const maxPages = 20;

do {
  const url = cursor
    ? `https://api.example.com/data?cursor=${cursor}`
    : 'https://api.example.com/data';

  const data = await this.helpers.httpRequest({
    method: 'GET',
    url: url,
    headers: {
      'Authorization': 'Bearer YOUR_TOKEN',
      'Accept': 'application/json'
    },
    json: true
  });

  allResults.push(...data.results);

  cursor = data.next_cursor;
  pageCount++;

} while (cursor && pageCount < maxPages);

return allResults.map(item => ({ json: item }));

Method 4: Using WebScraping.AI API with n8n

For production workflows, using a dedicated scraping API can simplify pagination handling significantly:

// Code node: fetch a rendered page through WebScraping.AI
const pageNum = $json.page || 1;

const html = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://api.webscraping.ai/html',
  qs: {
    url: `https://example.com/products?page=${pageNum}`,
    api_key: 'YOUR_API_KEY',
    js: true,
    proxy: 'residential'
  }
});

// Parse the returned HTML
const cheerio = require('cheerio');
const $ = cheerio.load(html);

const products = [];
$('.product').each((i, el) => {
  products.push({
    name: $(el).find('.name').text(),
    price: $(el).find('.price').text()
  });
});

return [{ json: { products, page: pageNum } }];

Best Practices for Pagination in n8n

1. Implement Rate Limiting

Avoid overwhelming target servers by adding delays between requests:

// In a Code node (or use n8n's built-in Wait node between requests)
await new Promise(resolve => setTimeout(resolve, 2000)); // 2-second delay
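A fixed interval is easy to fingerprint. Randomizing the delay makes the traffic look more organic; a small sketch, with an arbitrary 1-3 second range:

// Random delay between 1 and 3 seconds (tune the range to the target site)
const minMs = 1000;
const maxMs = 3000;
const delayMs = minMs + Math.random() * (maxMs - minMs);
await new Promise(resolve => setTimeout(resolve, delayMs));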

2. Handle Errors Gracefully

Wrap your pagination logic in try-catch blocks:

// Error handling in pagination loop
try {
  const data = await fetchPage(pageNum);
  return { json: data };
} catch (error) {
  console.error(`Failed to fetch page ${pageNum}:`, error.message);
  return {
    json: {
      error: true,
      page: pageNum,
      message: error.message
    }
  };
}

3. Store Progress

For long-running scrapes, save progress periodically:

// After each page, persist progress (e.g. to a webhook, spreadsheet, or database)
await this.helpers.httpRequest({
  method: 'POST',
  url: 'YOUR_WEBHOOK_URL',
  body: {
    lastProcessedPage: currentPage,
    totalItems: allData.length,
    timestamp: new Date().toISOString()
  },
  json: true
});

4. Use Conditional Logic

Implement smart stopping conditions to avoid infinite loops:

// Stop conditions
const shouldContinue = (
  currentPage < maxPages &&
  newItemsFound > 0 &&
  !rateLimitDetected
);

Advanced Techniques

Parallel Page Processing

For faster scraping, process multiple pages simultaneously using n8n's Split In Batches node:

// Generate batch of page URLs
const pages = Array.from({ length: 10 }, (_, i) => ({
  url: `https://example.com/items?page=${i + 1}`,
  pageNum: i + 1
}));

return pages.map(page => ({ json: page }));

Then use Split In Batches with a batch size of 3-5 to work through pages in small groups while respecting rate limits. For true concurrency within a single execution, you can also fetch inside a Code node, as sketched below.
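Here is a sketch of that Code-node approach, assuming the URL pattern above and a chunk size of 3:

// Code node: fetch pages in chunks of 3 so at most 3 requests run at once
const pageUrls = Array.from({ length: 10 }, (_, i) =>
  `https://example.com/items?page=${i + 1}`
);

const chunkSize = 3;
const allBodies = [];

for (let i = 0; i < pageUrls.length; i += chunkSize) {
  const chunk = pageUrls.slice(i, i + chunkSize);
  // this.helpers.httpRequest is n8n's built-in HTTP helper for Code nodes
  const bodies = await Promise.all(
    chunk.map(url => this.helpers.httpRequest({ method: 'GET', url }))
  );
  allBodies.push(...bodies);
}

return allBodies.map(html => ({ json: { html } }));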

Detecting Pagination Patterns

Automatically detect pagination structure:

// Code node: auto-detect pagination type from fetched HTML
const cheerio = require('cheerio');
const html = $input.item.json.data; // HTML from the previous HTTP Request node
const $ = cheerio.load(html);

const paginationInfo = {
  hasNumberedLinks: $('.pagination a[href*="page="]').length > 0,
  hasNextButton: $('.next, .pagination-next, a:contains("Next")').length > 0,
  hasLoadMore: $('button:contains("Load More")').length > 0,
  pageLinks: []
};

$('.pagination a').each((i, el) => {
  const href = $(el).attr('href');
  if (href && href.includes('page=')) {
    paginationInfo.pageLinks.push(href);
  }
});

return { json: paginationInfo };

Troubleshooting Common Issues

Issue: Duplicate Data

Solution: Deduplicate using a Set or database checks:

const seen = new Set();
const uniqueItems = allItems.filter(item => {
  const key = item.id || item.url;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});

Issue: Pagination Loop Never Ends

Solution: Always implement maximum page limits and timeout conditions:

const MAX_PAGES = 100;
const START_TIME = Date.now();
const TIMEOUT_MS = 300000; // 5 minutes

while (hasNextPage && pageCount < MAX_PAGES) {
  if (Date.now() - START_TIME > TIMEOUT_MS) {
    console.log('Timeout reached, stopping pagination');
    break;
  }
  // ... pagination logic
}

Issue: Dynamic Content Not Loading

Solution: Use proper wait conditions in Puppeteer to ensure content is fully loaded before extraction:

await page.waitForSelector('.product-list', { timeout: 10000 });
await page.waitForFunction(() => {
  return document.querySelectorAll('.product-item').length > 0;
});

Conclusion

Handling pagination in n8n requires understanding both the pagination mechanism of your target website and choosing the right n8n nodes and techniques. Start with simple URL-based pagination for basic sites, leverage browser automation for complex JavaScript-heavy pages, and consider dedicated scraping APIs for production use cases.

Remember to always respect robots.txt, implement rate limiting, and handle errors gracefully to build robust and reliable scraping workflows in n8n.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
