How Do I Handle Pagination in n8n Web Scraping Workflows?
Pagination is one of the most common challenges when scraping websites at scale. Whether you're extracting product listings, blog posts, or search results, understanding how to navigate through multiple pages efficiently in n8n is essential for successful data extraction workflows.
This guide covers multiple approaches to handling pagination in n8n, from simple loop-based methods to advanced browser automation techniques.
Understanding Pagination Types
Before diving into implementation, it's important to recognize the different types of pagination you'll encounter:
- URL-based pagination: Pages accessible via URL parameters (e.g., ?page=2)
- Button-based pagination: "Next" buttons that trigger page loads
- Infinite scroll: Content that loads dynamically as you scroll
- API pagination: REST APIs with pagination tokens or offsets
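For quick reference, here is how each type typically surfaces in practice (the URLs and endpoints below are illustrative, not from any specific site):
// URL-based:       GET https://example.com/products?page=2
// Button-based:    a "Next" link or button that loads the following page
// Infinite scroll: XHR/fetch calls fired as the user scrolls down
// API offset:      GET https://api.example.com/items?limit=50&offset=100
// API cursor:      GET https://api.example.com/data?cursor=abc123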
Method 1: Loop-Based URL Pagination
The simplest pagination method works when pages follow a predictable URL pattern. This approach uses n8n's loop functionality to iterate through multiple page numbers.
Basic Loop Setup
- Set up a Code node to generate your page range, then feed its items into a Loop Over Items node:
// Code node: emit one item per page number
const items = [];
const startPage = 1;
const endPage = 10;

for (let page = startPage; page <= endPage; page++) {
  // n8n items must wrap their data in a json key
  items.push({ json: { page } });
}

return items;
- Configure an HTTP Request node to fetch each page:
URL: https://example.com/products?page={{$json["page"]}}
Method: GET
- Parse the HTML using the HTML Extract node or a Code node with Cheerio:
// Code node: parse the fetched HTML with Cheerio
const cheerio = require('cheerio');
const html = $input.item.json.data;
const $ = cheerio.load(html);

const products = [];
$('.product-item').each((i, el) => {
  products.push({
    title: $(el).find('.product-title').text().trim(),
    price: $(el).find('.product-price').text().trim(),
    url: $(el).find('a').attr('href')
  });
});

return products.map(product => ({ json: product }));
Dynamic Page Detection
Often you don't know the total number of pages upfront. Here's how to scrape until no more data is found:
// Function node: check whether the last fetch returned results
const response = $node["HTTP Request"].json;
const hasResults = Array.isArray(response.products) && response.products.length > 0;

if (hasResults) {
  return {
    json: {
      nextPage: ($json.page || 1) + 1,
      continue: true
    }
  };
}

return {
  json: {
    continue: false
  }
};
Connect this to an IF node that continues the loop only when continue is true.
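A minimal IF node configuration for that check, using the field names from the example above:
Condition: Boolean
Value 1: {{$json["continue"]}}
Value 2: true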
Method 2: Browser Automation with Puppeteer
For JavaScript-rendered content and complex pagination, using Puppeteer with n8n provides more control. This is particularly useful for sites that rely on JavaScript for navigation.
Click-Based Pagination
// In an n8n Puppeteer node or Code node with Puppeteer
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

let allData = [];
let hasNextPage = true;
let pageNum = 1;

await page.goto('https://example.com/listings', {
  waitUntil: 'networkidle2'
});

while (hasNextPage && pageNum <= 50) {
  // Wait for content to load
  await page.waitForSelector('.listing-item', { timeout: 5000 });

  // Extract data from the current page
  const pageData = await page.evaluate(() => {
    const items = [];
    document.querySelectorAll('.listing-item').forEach(item => {
      items.push({
        title: item.querySelector('.title')?.textContent.trim(),
        description: item.querySelector('.description')?.textContent.trim(),
        link: item.querySelector('a')?.href
      });
    });
    return items;
  });
  allData.push(...pageData);

  // Follow the next button if it exists and is not disabled
  const nextButton = await page.$('.next-page-button:not(.disabled)');
  if (nextButton) {
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      nextButton.click()
    ]);
    pageNum++;
  } else {
    hasNextPage = false;
  }
}

await browser.close();
return allData.map(item => ({ json: item }));
Handling Infinite Scroll
For infinite scroll pagination, you need to simulate scrolling behavior:
// Puppeteer node: infinite scroll handler
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto('https://example.com/feed', {
  waitUntil: 'networkidle2'
});

let previousHeight = 0;
let scrollAttempts = 0;
const maxScrolls = 20;

while (scrollAttempts < maxScrolls) {
  // Scroll to the bottom of the page
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight);
  });

  // Wait for new content to load (page.waitForTimeout was removed in
  // recent Puppeteer versions, so use a plain setTimeout promise)
  await new Promise(resolve => setTimeout(resolve, 2000));

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) {
    // No new content loaded
    break;
  }

  previousHeight = currentHeight;
  scrollAttempts++;
}

// Extract all loaded data
const allItems = await page.evaluate(() => {
  const items = [];
  document.querySelectorAll('.feed-item').forEach(item => {
    items.push({
      content: item.querySelector('.content')?.textContent.trim(),
      author: item.querySelector('.author')?.textContent.trim(),
      timestamp: item.querySelector('.timestamp')?.textContent.trim()
    });
  });
  return items;
});

await browser.close();
return allItems.map(item => ({ json: item }));
Method 3: API Pagination
Many modern websites load data via API calls. Intercepting these calls often provides the cleanest scraping approach.
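One way to discover those endpoints is to watch network traffic while the page loads. A minimal Puppeteer sketch that logs every JSON response (the target URL is a placeholder):
// Puppeteer sketch: log JSON API calls made by the page
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Print the URL of every JSON response the page receives
page.on('response', (response) => {
  const contentType = response.headers()['content-type'] || '';
  if (contentType.includes('application/json')) {
    console.log('API call:', response.url());
  }
});

await page.goto('https://example.com/products', { waitUntil: 'networkidle2' });
await browser.close();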
Offset-Based Pagination
// Code node: API pagination with offset
// this.helpers.httpRequest is n8n's built-in request helper; it returns the response body
const pageSize = 50;
let offset = 0;
let allResults = [];
let hasMore = true;

while (hasMore) {
  const data = await this.helpers.httpRequest({
    method: 'GET',
    url: `https://api.example.com/items?limit=${pageSize}&offset=${offset}`,
    headers: {
      Accept: 'application/json'
    },
    json: true
  });

  allResults.push(...data.items);
  hasMore = data.items.length === pageSize;
  offset += pageSize;

  // Safety limit
  if (offset > 1000) break;
}

return allResults.map(item => ({ json: item }));
Cursor-Based Pagination
Some APIs use cursor tokens instead of offsets:
// Code node: cursor-based API pagination
let cursor = null;
let allResults = [];
let pageCount = 0;
const maxPages = 20;

do {
  const url = cursor
    ? `https://api.example.com/data?cursor=${cursor}`
    : 'https://api.example.com/data';

  const data = await this.helpers.httpRequest({
    method: 'GET',
    url: url,
    headers: {
      Authorization: 'Bearer YOUR_TOKEN',
      Accept: 'application/json'
    },
    json: true
  });

  allResults.push(...data.results);
  cursor = data.next_cursor;
  pageCount++;
} while (cursor && pageCount < maxPages);

return allResults.map(item => ({ json: item }));
Method 4: Using WebScraping.AI API with n8n
For production workflows, using a dedicated scraping API can simplify pagination handling significantly:
// Code node: fetch a rendered page through the WebScraping.AI HTML API
const pageNum = $json.page || 1;

const html = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://api.webscraping.ai/html',
  qs: {
    url: `https://example.com/products?page=${pageNum}`,
    api_key: 'YOUR_API_KEY',
    js: true,
    proxy: 'residential'
  }
});

// Parse the HTML response
const cheerio = require('cheerio');
const $ = cheerio.load(html);

const products = [];
$('.product').each((i, el) => {
  products.push({
    name: $(el).find('.name').text(),
    price: $(el).find('.price').text()
  });
});

return { json: { products, page: pageNum } };
Best Practices for Pagination in n8n
1. Implement Rate Limiting
Avoid overwhelming target servers by adding delays between requests:
// Wait node or in Code node
await new Promise(resolve => setTimeout(resolve, 2000)); // 2 second delay
2. Handle Errors Gracefully
Wrap your pagination logic in try-catch blocks:
// Error handling in a pagination loop
// fetchPage is a placeholder for whatever fetch logic your workflow uses
try {
  const data = await fetchPage(pageNum);
  return { json: data };
} catch (error) {
  console.error(`Failed to fetch page ${pageNum}:`, error.message);
  return {
    json: {
      error: true,
      page: pageNum,
      message: error.message
    }
  };
}
3. Store Progress
For long-running scrapes, save progress periodically:
// After each page, update a spreadsheet, database, or webhook with progress
await this.helpers.httpRequest({
  method: 'POST',
  url: 'YOUR_WEBHOOK_URL',
  body: {
    lastProcessedPage: currentPage,
    totalItems: allData.length,
    timestamp: new Date().toISOString()
  },
  json: true
});
4. Use Conditional Logic
Implement smart stopping conditions to avoid infinite loops:
// Stop conditions
const shouldContinue = (
  currentPage < maxPages &&
  newItemsFound > 0 &&
  !rateLimitDetected
);
Advanced Techniques
Parallel Page Processing
For faster scraping, process multiple pages simultaneously using n8n's Split In Batches node:
// Code node: generate a batch of page URLs
const pages = Array.from({ length: 10 }, (_, i) => ({
  url: `https://example.com/items?page=${i + 1}`,
  pageNum: i + 1
}));

return pages.map(page => ({ json: page }));
Then use a Split In Batches node with a batch size of 3-5 to process a few pages at a time while respecting rate limits.
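If you prefer to keep the fan-out in a single Code node, a Promise.all over the batch achieves the same effect; a sketch, assuming the page items generated above:
// Code node sketch: fetch one batch of pages concurrently
const batch = $input.all();

const bodies = await Promise.all(
  batch.map(item =>
    this.helpers.httpRequest({ method: 'GET', url: item.json.url })
  )
);

return bodies.map((html, i) => ({
  json: { pageNum: batch[i].json.pageNum, html }
}));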
Detecting Pagination Patterns
Automatically detect pagination structure:
// Code node: auto-detect pagination type
const cheerio = require('cheerio');
const html = $input.item.json.data;
const $ = cheerio.load(html);

const paginationInfo = {
  hasNumberedLinks: $('.pagination a[href*="page="]').length > 0,
  hasNextButton: $('.next, .pagination-next, a:contains("Next")').length > 0,
  hasLoadMore: $('button:contains("Load More")').length > 0,
  pageLinks: []
};

$('.pagination a').each((i, el) => {
  const href = $(el).attr('href');
  if (href && href.includes('page=')) {
    paginationInfo.pageLinks.push(href);
  }
});

return { json: paginationInfo };
Troubleshooting Common Issues
Issue: Duplicate Data
Solution: Implement deduplication using Set or database checks:
// Deduplicate by a stable key such as id or URL
const seen = new Set();
const uniqueItems = allItems.filter(item => {
  const key = item.id || item.url;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});
Issue: Pagination Loop Never Ends
Solution: Always implement maximum page limits and timeout conditions:
const MAX_PAGES = 100;
const START_TIME = Date.now();
const TIMEOUT_MS = 300000; // 5 minutes

while (hasNextPage && pageCount < MAX_PAGES) {
  if (Date.now() - START_TIME > TIMEOUT_MS) {
    console.log('Timeout reached, stopping pagination');
    break;
  }
  // ... pagination logic
}
Issue: Dynamic Content Not Loading
Solution: Use proper wait conditions in Puppeteer to ensure content is fully loaded before extraction:
await page.waitForSelector('.product-list', { timeout: 10000 });
await page.waitForFunction(() => {
  return document.querySelectorAll('.product-item').length > 0;
});
Conclusion
Handling pagination in n8n requires understanding both the pagination mechanism of your target website and choosing the right n8n nodes and techniques. Start with simple URL-based pagination for basic sites, leverage browser automation for complex JavaScript-heavy pages, and consider dedicated scraping APIs for production use cases.
Remember to always respect robots.txt, implement rate limiting, and handle errors gracefully to build robust and reliable scraping workflows in n8n.
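As a final sketch, here is a very simplified robots.txt check you can run before a crawl (it only reads Disallow rules and ignores user-agent groups, Allow rules, and wildcards; the URL and path are placeholders):
// Code node sketch: naive robots.txt check before scraping
const robotsTxt = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/robots.txt'
});

// Collect Disallow paths (naive parsing; a full parser should handle
// user-agent groups, Allow rules, and wildcards)
const disallowed = robotsTxt
  .split('\n')
  .filter(line => line.toLowerCase().startsWith('disallow:'))
  .map(line => line.slice('disallow:'.length).trim())
  .filter(Boolean);

const targetPath = '/products';
const blocked = disallowed.some(rule => targetPath.startsWith(rule));

return { json: { targetPath, blocked } };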