What is the Best Way to Implement Loops in n8n for Scraping Multiple Pages?

Implementing loops in n8n for scraping multiple pages is essential when you need to extract data from paginated content, multiple URLs, or dynamic websites. n8n provides several powerful methods to handle loops efficiently, including the Loop Over Items node, SplitInBatches node, and custom Code nodes with JavaScript. This guide covers best practices and practical implementations for each approach.

Understanding Loop Patterns in n8n

When scraping multiple pages, you typically encounter three common scenarios:

  1. Paginated content - Scraping multiple pages with sequential page numbers
  2. URL lists - Processing a predefined list of URLs
  3. Dynamic pagination - Following "next page" links until no more pages exist

n8n handles these scenarios through different loop mechanisms, each suited for specific use cases.

Method 1: Loop Over Items Node

The Loop Over Items node is the simplest way to iterate through multiple items in n8n. It processes each item in your input data sequentially, making it ideal for scraping a known list of URLs.

Basic Implementation

// Step 1: Create an array of URLs using a Code node
const urls = [
  'https://example.com/page/1',
  'https://example.com/page/2',
  'https://example.com/page/3',
  'https://example.com/page/4',
  'https://example.com/page/5'
];

return urls.map((url, index) => ({
  json: { url, pageNumber: index + 1 }
}));

After creating your URL list, connect it to a Loop Over Items node, which will process each URL individually. Inside the loop, add your HTTP Request or scraping node:

// Step 2: HTTP Request configuration (inside loop)
// Method: GET
// URL: {{$json.url}}
// Response Format: String (for HTML)

Extracting Data Within the Loop

Once you have the HTML response, use a Code node to parse and extract data. Note that require() of external packages such as cheerio only works where the module is installed and allowed via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable (typically a self-hosted n8n instance):

// Step 3: Parse HTML using Code node
const cheerio = require('cheerio');
// The property that holds the HTML depends on your HTTP Request node's
// response settings; 'body' is assumed here
const html = $input.first().json.body;
const $ = cheerio.load(html);

const data = [];
$('.product-item').each((i, element) => {
  data.push({
    title: $(element).find('.product-title').text().trim(),
    price: $(element).find('.product-price').text().trim(),
    url: $(element).find('a').attr('href'),
    pageNumber: $input.first().json.pageNumber
  });
});

return data.map(item => ({ json: item }));

Method 2: SplitInBatches Node

The SplitInBatches node is more efficient for processing large numbers of pages. It divides your items into batches and processes them in groups, which is excellent for managing rate limits and memory usage. (In current n8n versions, Loop Over Items is the display name of the Split in Batches node, so Methods 1 and 2 use the same underlying node; the practical difference is the batch size you configure.)

Configuration Example

// Step 1: Generate page numbers using Code node
const totalPages = 50;
const pages = [];

for (let i = 1; i <= totalPages; i++) {
  pages.push({
    json: {
      pageNumber: i,
      url: `https://example.com/products?page=${i}`
    }
  });
}

return pages;

Configure the SplitInBatches node:

  • Batch Size: 5 (process 5 pages at a time)
  • Options: Reset after completion

Inside the loop, add your scraping logic with the HTTP Request node configured similarly to Method 1.
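
To respect rate limits between batches, you can also place a small Code node at the end of the loop branch that pauses before the next batch starts. A minimal sketch (the 2-second pause is an illustrative value, not an n8n default):

// Code node at the end of the loop branch: pause briefly, then pass items through
// The 2000 ms pause is illustrative — tune it to the target site
await new Promise(resolve => setTimeout(resolve, 2000));
return $input.all();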

Method 3: Custom Loop with Code Node

For more complex scenarios, such as following "next page" links or implementing dynamic pagination, use a custom JavaScript loop in a Code node. This approach offers maximum flexibility and control.

Dynamic Pagination Example

// Complete pagination scraper using Code node
const cheerio = require('cheerio');
const axios = require('axios');

async function scrapeAllPages(startUrl, maxPages = 100) {
  const allData = [];
  let currentUrl = startUrl;
  let pageCount = 0;

  while (currentUrl && pageCount < maxPages) {
    try {
      // Fetch page content
      const response = await axios.get(currentUrl, {
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
      });

      const $ = cheerio.load(response.data);

      // Extract data from current page
      $('.item').each((i, element) => {
        allData.push({
          title: $(element).find('.title').text().trim(),
          description: $(element).find('.description').text().trim(),
          link: $(element).find('a').attr('href'),
          page: pageCount + 1
        });
      });

      // Find next page link
      const nextPageLink = $('a.next-page').attr('href');
      currentUrl = nextPageLink ? new URL(nextPageLink, currentUrl).href : null;

      pageCount++;

      // Add delay to respect rate limits
      await new Promise(resolve => setTimeout(resolve, 1000));

    } catch (error) {
      console.error(`Error on page ${pageCount + 1}:`, error.message);
      break;
    }
  }

  return allData;
}

// Execute scraping
const startUrl = 'https://example.com/products';
const results = await scrapeAllPages(startUrl, 50);

return results.map(item => ({ json: item }));

Method 4: Loop with Puppeteer for JavaScript-Heavy Sites

For scraping JavaScript-rendered content across multiple pages, combine n8n's Loop Over Items with Puppeteer. This is particularly useful for complex multi-page scraping tasks that involve handling browser sessions in Puppeteer.

Puppeteer Loop Implementation

// Code node with Puppeteer for paginated scraping
const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer(urls) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const allResults = [];

  for (const url of urls) {
    const page = await browser.newPage();

    try {
      // Navigate to page and wait for content
      await page.goto(url, { waitUntil: 'networkidle2' });

      // Wait for dynamic content to load
      await page.waitForSelector('.product-list', { timeout: 5000 });

      // Extract data
      const pageData = await page.evaluate(() => {
        const items = [];
        document.querySelectorAll('.product-item').forEach(item => {
          items.push({
            title: item.querySelector('.title')?.textContent.trim(),
            price: item.querySelector('.price')?.textContent.trim(),
            image: item.querySelector('img')?.src
          });
        });
        return items;
      });

      allResults.push(...pageData.map(item => ({
        ...item,
        sourceUrl: url
      })));

    } catch (error) {
      console.error(`Error scraping ${url}:`, error.message);
    } finally {
      await page.close();
    }
  }

  await browser.close();
  return allResults;
}

// Get URLs from input
const urls = $input.all().map(item => item.json.url);
const results = await scrapeWithPuppeteer(urls);

return results.map(item => ({ json: item }));

Best Practices for Loop-Based Scraping

1. Implement Rate Limiting

Always add delays between requests to avoid overwhelming servers and getting blocked:

// Add delay function in Code node
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Use between iterations
await delay(1000); // 1 second delay
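
To make the request pattern less predictable, you can also randomize the delay. A small sketch (the 500–1500 ms range is an illustrative assumption):

// Random delay between min and max milliseconds (range is illustrative)
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Use between iterations
await randomDelay(500, 1500);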

2. Error Handling

Implement robust error handling to prevent entire workflows from failing:

// Assumes axios is available to the Code node; processData is a placeholder
// for your own parsing logic
const results = [];
const errors = [];

for (const url of urls) {
  try {
    const response = await axios.get(url);
    results.push(processData(response.data));
  } catch (error) {
    errors.push({
      url,
      error: error.message,
      timestamp: new Date().toISOString()
    });
  }
}

// Return a single summary item (n8n Code nodes expect an array of items)
return [{
  json: {
    successful: results,
    failed: errors,
    totalProcessed: urls.length
  }
}];

3. Use Conditional Stopping

Implement logic to stop loops when no more data is available:

// Inside your pagination loop (for example, the while loop from Method 3):
// Check if page has content before continuing
const hasContent = $('.product-item').length > 0;
if (!hasContent) {
  break; // Exit loop if no items found
}

// Check for "next page" button
const hasNextPage = $('a.next-page').length > 0;
if (!hasNextPage) {
  break; // Exit loop if no next page exists
}

4. Memory Management

When scraping many pages, manage memory by processing data in batches:

// Process in chunks to avoid memory issues
const CHUNK_SIZE = 10;
const allResults = [];

for (let i = 0; i < totalUrls.length; i += CHUNK_SIZE) {
  const chunk = totalUrls.slice(i, i + CHUNK_SIZE);
  const chunkResults = await processChunk(chunk);
  allResults.push(...chunkResults);

  // Optional: Save intermediate results
  await saveToDatabase(chunkResults);
}
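
The processChunk and saveToDatabase helpers above stand in for your own logic. As an illustration, a minimal processChunk might fetch every URL in the chunk concurrently with axios (assuming the chunk entries are URL strings and axios is available to the Code node):

// Hypothetical processChunk: fetch each URL in the chunk in parallel,
// skipping requests that fail (assumes chunk entries are URL strings)
async function processChunk(chunk) {
  const responses = await Promise.all(
    chunk.map(url => axios.get(url).catch(() => null))
  );
  return responses
    .filter(response => response !== null)
    .map(response => ({ url: response.config.url, html: response.data }));
}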

Advanced Pattern: Infinite Scroll Handling

Some websites use infinite scroll instead of traditional pagination. Here's how to handle this with Puppeteer in n8n:

// Code node with Puppeteer for infinite scroll handling
const puppeteer = require('puppeteer');

async function scrapeInfiniteScroll(url, maxScrolls = 10) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  const allItems = [];
  let previousHeight = 0;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    // Scroll to bottom
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    // Wait for new content to load (page.waitForTimeout was removed in newer
    // Puppeteer releases, so use a plain Promise-based delay instead)
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Get current scroll height
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);

    // Break if no new content loaded
    if (currentHeight === previousHeight) {
      break;
    }

    previousHeight = currentHeight;
    scrollCount++;
  }

  // Extract all items after scrolling
  const items = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.item')).map(item => ({
      title: item.querySelector('.title')?.textContent.trim(),
      link: item.querySelector('a')?.href
    }));
  });

  await browser.close();
  return items;
}
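
To run this inside a Code node, call the function and map the results back to n8n items. The starting URL and scroll limit below are illustrative assumptions:

// Execute the infinite-scroll scraper and return n8n items
const feedUrl = 'https://example.com/feed';
const scrolledItems = await scrapeInfiniteScroll(feedUrl, 15);

return scrolledItems.map(item => ({ json: item }));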

Combining Loops with WebScraping.AI API

For production-grade scraping at scale, consider using the WebScraping.AI API within your n8n loops. This approach handles JavaScript rendering, proxies, and anti-bot measures automatically:

// HTTP Request node configuration for WebScraping.AI
// Method: GET
// URL: https://api.webscraping.ai/html

// Query Parameters:
const params = {
  api_key: 'YOUR_API_KEY',
  url: '{{$json.targetUrl}}',
  js: 'true',
  proxy: 'datacenter'
};

Then parse the response in a Code node:

const cheerio = require('cheerio');
// Adjust the property name to match the output field of your HTTP Request node
const html = $input.first().json.html;
const $ = cheerio.load(html);

// Extract your data
const results = [];
$('.result-item').each((i, el) => {
  results.push({
    title: $(el).find('.title').text(),
    url: $(el).find('a').attr('href')
  });
});

return results.map(r => ({ json: r }));

Monitoring and Debugging Loops

Add logging to track loop progress and identify issues:

// Add progress tracking (scrapePage stands in for your own scraping function)
const totalPages = urls.length;
console.log(`Starting scrape of ${totalPages} pages`);

for (let i = 0; i < urls.length; i++) {
  const url = urls[i];
  console.log(`Processing page ${i + 1}/${totalPages}: ${url}`);

  try {
    const result = await scrapePage(url);
    console.log(`✓ Successfully scraped ${url}`);
  } catch (error) {
    console.error(`✗ Failed to scrape ${url}: ${error.message}`);
  }
}

console.log('Scraping complete');

Conclusion

Implementing loops in n8n for multi-page scraping requires choosing the right approach based on your specific requirements. Use Loop Over Items for simple iterations, SplitInBatches for efficient large-scale processing, and Code nodes with custom JavaScript for complex pagination logic. When dealing with JavaScript-heavy websites, combining loops with Puppeteer's navigation capabilities provides the most robust solution.

Remember to implement proper rate limiting, error handling, and memory management to ensure your scraping workflows run reliably at scale. For production applications requiring high reliability and sophisticated anti-bot bypassing, consider integrating the WebScraping.AI API into your n8n loops to handle the complexities of modern web scraping automatically.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
