Can I Use Node.js for Web Scraping in n8n Workflows?
Yes, you can absolutely use Node.js for web scraping in n8n workflows! n8n provides powerful built-in functionality through its Code node (formerly Function node) that allows you to execute custom JavaScript/Node.js code directly within your automation workflows. This gives you the flexibility to implement complex scraping logic, parse HTML, process data, and integrate with external libraries.
Understanding n8n's Code Node
The Code node in n8n is your gateway to custom Node.js scripting. It runs in a sandboxed environment with access to several built-in modules and allows you to process data, make HTTP requests, and manipulate workflow results programmatically.
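Before diving into scraping, here's a minimal Code node script (in "Run Once for All Items" mode) showing the input/output contract every example below relies on: read items from the previous node, return an array of objects with a `json` key:

```javascript
// Read every item produced by the previous node
const items = $input.all();

// A Code node must return an array of { json: ... } objects
return items.map((item, index) => ({
  json: {
    index,
    receivedKeys: Object.keys(item.json)
  }
}));
```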
Key Features of the Code Node
- JavaScript/Node.js runtime: Write standard JavaScript code with ES6+ syntax
- Access to workflow data: Read and manipulate data from previous nodes
- Built-in modules: Access to common Node.js modules such as `axios` and `cheerio` (availability depends on your setup; see Module Availability below)
- Multiple items support: Process single or multiple data items
- Error handling: Built-in error management and debugging capabilities
Basic Web Scraping with Node.js in n8n
Method 1: Using the HTTP Request Node + Code Node
The most straightforward approach combines n8n's HTTP Request node with the Code node for HTML parsing:
```javascript
// In the Code node, after fetching HTML with an HTTP Request node
const cheerio = require('cheerio');

// Get the HTML from the previous node (depending on the HTTP Request node's
// response settings, the HTML may arrive under .data instead of .body)
const html = $input.item.json.body;

// Parse with Cheerio
const $ = cheerio.load(html);

// Extract data
const titles = [];
$('h2.product-title').each((i, elem) => {
  titles.push($(elem).text().trim());
});

const prices = [];
$('.price').each((i, elem) => {
  prices.push($(elem).text().trim());
});

// Return structured data
return titles.map((title, index) => ({
  json: {
    title: title,
    price: prices[index] || 'N/A'
  }
}));
```
Method 2: All-in-One Code Node Approach
You can also handle both the HTTP request and parsing in a single Code node:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Define the target URL
const url = 'https://example.com/products';

try {
  // Fetch the HTML
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  });

  // Parse HTML
  const $ = cheerio.load(response.data);

  // Extract product information
  const products = [];
  $('.product-card').each((i, element) => {
    const product = {
      name: $(element).find('.product-name').text().trim(),
      price: $(element).find('.product-price').text().trim(),
      url: $(element).find('a').attr('href'),
      image: $(element).find('img').attr('src')
    };
    products.push(product);
  });

  // Return results as n8n items
  return products.map(product => ({
    json: product
  }));
} catch (error) {
  throw new Error(`Scraping failed: ${error.message}`);
}
```
Advanced Scraping Techniques
Handling Pagination
When scraping multiple pages, you can implement pagination logic within your Code node:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const baseUrl = 'https://example.com/products';
const maxPages = 5;
const allProducts = [];

for (let page = 1; page <= maxPages; page++) {
  const url = `${baseUrl}?page=${page}`;

  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Stop when a page comes back empty
    const productCount = $('.product-item').length;
    if (productCount === 0) break;

    $('.product-item').each((i, elem) => {
      allProducts.push({
        title: $(elem).find('h3').text().trim(),
        price: $(elem).find('.price').text().trim(),
        page: page
      });
    });

    // Rate limiting - wait between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  } catch (error) {
    console.error(`Error on page ${page}:`, error.message);
    break;
  }
}

return allProducts.map(product => ({ json: product }));
```
Handling Dynamic Content and AJAX
For websites that load content dynamically via JavaScript, you'll need a headless browser. The Code node doesn't ship with Puppeteer, and running a full browser inside a workflow execution is resource-intensive, so you'll typically reach for one of these alternatives:
Option 1: Use WebScraping.AI API
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const apiKey = 'YOUR_API_KEY';
const targetUrl = 'https://example.com/dynamic-content';

try {
  const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
      api_key: apiKey,
      url: targetUrl,
      js: true, // Enable JavaScript rendering
      timeout: 10000
    }
  });

  const $ = cheerio.load(response.data);

  // Extract data from the rendered HTML
  const results = [];
  $('.dynamic-content-item').each((i, elem) => {
    results.push({
      title: $(elem).find('.title').text(),
      description: $(elem).find('.description').text()
    });
  });

  return results.map(item => ({ json: item }));
} catch (error) {
  throw new Error(`API request failed: ${error.message}`);
}
```
Option 2: Call External Service
Deploy a separate Node.js service with Puppeteer and call it from n8n:
```javascript
const axios = require('axios');

// Call your external Puppeteer service
const response = await axios.post('https://your-puppeteer-service.com/scrape', {
  url: 'https://example.com',
  waitForSelector: '.loaded-content',
  timeout: 30000
});

const scrapedData = response.data;
return scrapedData.map(item => ({ json: item }));
```
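For reference, here's a minimal sketch of what such a service could look like, assuming Express and Puppeteer on the server side; the `/scrape` endpoint, request fields, and `.loaded-content` selector are hypothetical and simply match the call above:

```javascript
// server.js - minimal Puppeteer scraping service (hypothetical sketch)
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  const { url, waitForSelector, timeout = 30000 } = req.body;
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout });
    if (waitForSelector) {
      await page.waitForSelector(waitForSelector, { timeout });
    }
    // Extract from the fully rendered DOM; adapt selectors to your site
    const data = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.loaded-content')).map(el => ({
        text: el.textContent.trim()
      }))
    );
    res.json(data);
  } catch (error) {
    res.status(500).json({ error: error.message });
  } finally {
    if (browser) await browser.close();
  }
});

app.listen(3000, () => console.log('Scrape service listening on port 3000'));
```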
For more complex browser automation, such as maintaining sessions across pages or waiting on specific AJAX responses, a dedicated external service gives you far more control than the Code node.
Data Cleaning and Transformation
Clean and normalize scraped data within your Code node:
```javascript
// Input data from previous node
const items = $input.all();

const cleanedData = items.map(item => {
  const data = item.json;

  return {
    // Remove currency symbols and convert to number
    price: parseFloat((data.price || '').replace(/[^0-9.]/g, '')),

    // Normalize text
    title: (data.title || '')
      .trim()
      .replace(/\s+/g, ' ')
      .toLowerCase(),

    // Extract domain from URL
    domain: new URL(data.url).hostname,

    // Add timestamp
    scrapedAt: new Date().toISOString(),

    // Convert stock status to boolean (guard against missing fields)
    inStock: (data.availability || '').toLowerCase().includes('in stock')
  };
});

return cleanedData.map(item => ({ json: item }));
```
Best Practices for Node.js Scraping in n8n
1. Error Handling
Always implement robust error handling to prevent workflow failures:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const urls = ['https://site1.com', 'https://site2.com', 'https://site3.com'];
const results = [];
const errors = [];

for (const url of urls) {
  try {
    const response = await axios.get(url, { timeout: 5000 });
    const $ = cheerio.load(response.data);

    results.push({
      url: url,
      title: $('title').text(),
      status: 'success'
    });
  } catch (error) {
    errors.push({
      url: url,
      error: error.message,
      status: 'failed'
    });
  }
}

return [{
  json: {
    results,
    errors,
    summary: {
      total: urls.length,
      successful: results.length,
      failed: errors.length
    }
  }
}];
```
2. Rate Limiting
Implement delays to avoid overwhelming target servers:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithDelay(urls, delayMs = 1000) {
  const results = [];

  for (const url of urls) {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    results.push({
      url: url,
      content: $('body').text().substring(0, 200)
    });

    // Wait before the next request
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  return results;
}

const urls = $input.item.json.urls;
const data = await scrapeWithDelay(urls, 2000);

return data.map(item => ({ json: item }));
```
3. User-Agent Rotation
Set appropriate headers to avoid being blocked:
```javascript
const axios = require('axios');

const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

const randomUserAgent = userAgents[Math.floor(Math.random() * userAgents.length)];

const response = await axios.get('https://example.com', {
  headers: {
    'User-Agent': randomUserAgent,
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
  }
});

return [{ json: { html: response.data, userAgent: randomUserAgent } }];
```
Integrating with WebScraping.AI
For production-grade scraping in n8n, consider using a dedicated API like WebScraping.AI:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const config = {
  apiKey: 'YOUR_API_KEY',
  baseUrl: 'https://api.webscraping.ai'
};

// HTML scraping with JavaScript rendering
async function scrapeHtml(url, enableJs = true) {
  const response = await axios.get(`${config.baseUrl}/html`, {
    params: {
      api_key: config.apiKey,
      url: url,
      js: enableJs,
      proxy: 'datacenter',
      timeout: 15000
    }
  });
  return response.data;
}

// AI-powered question answering
async function askQuestion(url, question) {
  const response = await axios.get(`${config.baseUrl}/question`, {
    params: {
      api_key: config.apiKey,
      url: url,
      question: question
    }
  });
  return response.data;
}

// Main execution
const targetUrl = $input.item.json.url;
const html = await scrapeHtml(targetUrl, true);
const $ = cheerio.load(html);

return [{
  json: {
    title: $('h1').first().text(),
    description: $('meta[name="description"]').attr('content'),
    scrapedAt: new Date().toISOString()
  }
}];
```
Workflow Example: Complete Product Scraper
Here's a complete example that combines multiple techniques:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Configuration
const config = {
  startUrl: 'https://example.com/products',
  maxProducts: 50,
  delayBetweenRequests: 1500
};

// Results storage
const allProducts = [];
let currentPage = 1;
let hasNextPage = true;

// Main scraping loop
while (hasNextPage && allProducts.length < config.maxProducts) {
  try {
    const url = `${config.startUrl}?page=${currentPage}`;
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      },
      timeout: 10000
    });

    const $ = cheerio.load(response.data);

    // Extract products (returning false breaks out of Cheerio's .each())
    $('.product').each((i, elem) => {
      if (allProducts.length >= config.maxProducts) return false;

      allProducts.push({
        id: $(elem).attr('data-product-id'),
        name: $(elem).find('.product-name').text().trim(),
        price: parseFloat($(elem).find('.price').text().replace(/[^0-9.]/g, '')),
        rating: parseFloat($(elem).find('.rating').attr('data-rating')),
        image: $(elem).find('img').attr('src'),
        url: $(elem).find('a').attr('href'),
        inStock: $(elem).find('.stock-status').text().includes('In Stock'),
        scrapedAt: new Date().toISOString(),
        page: currentPage
      });
    });

    // Check for a next page
    hasNextPage = $('.pagination .next').length > 0;
    currentPage++;

    // Rate limiting
    if (hasNextPage) {
      await new Promise(resolve => setTimeout(resolve, config.delayBetweenRequests));
    }
  } catch (error) {
    console.error(`Error on page ${currentPage}:`, error.message);
    hasNextPage = false;
  }
}

// Return results with a summary (guard against dividing by zero)
return [{
  json: {
    products: allProducts,
    summary: {
      totalProducts: allProducts.length,
      pagesScraped: currentPage - 1,
      avgPrice: allProducts.length
        ? allProducts.reduce((sum, p) => sum + p.price, 0) / allProducts.length
        : 0,
      inStockCount: allProducts.filter(p => p.inStock).length
    }
  }
}];
```
Limitations and Considerations
Memory and Execution Time
n8n's Code node has resource limitations:
- Execution timeout: Long-running code can hit the workflow execution timeout (configurable via EXECUTIONS_TIMEOUT on self-hosted instances; n8n Cloud enforces plan-based limits)
- Memory limits: Limited memory allocation per execution
- No persistent state: Each execution starts fresh
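One partial workaround for the lack of persistent state: workflow static data survives between executions of an active (production) workflow, though not manual test runs. A minimal sketch, using a hypothetical `lastScrapedPage` cursor:

```javascript
// Workflow static data persists between executions of an *active* workflow
const staticData = $getWorkflowStaticData('global');

// Resume from where the previous execution stopped (hypothetical cursor)
const lastPage = staticData.lastScrapedPage || 0;
const nextPage = lastPage + 1;

// ... scrape page `nextPage` here ...

staticData.lastScrapedPage = nextPage;
return [{ json: { scrapedPage: nextPage } }];
```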
Module Availability
Which modules you can require() depends on how n8n is deployed, so treat the require calls in the examples above as assumptions about your setup:
- n8n Cloud: External npm packages can't be imported in the Code node (a few bundled helpers such as `crypto` and `moment` are available)
- Self-hosted: External modules like `axios` and `cheerio` can be allowed via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable, and standard Node.js modules (`fs`, `path`, `crypto`, etc.) via NODE_FUNCTION_ALLOW_BUILTIN
For modules not available in n8n, consider using external services or APIs.
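If you're self-hosting, the environment configuration might look like this minimal sketch (assuming the external packages are installed in the n8n container or host):

```bash
# Allow the Code node to require these external npm packages
export NODE_FUNCTION_ALLOW_EXTERNAL=axios,cheerio,lodash
# Allow specific standard Node.js modules
export NODE_FUNCTION_ALLOW_BUILTIN=crypto,fs,path
```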
Alternatives for Complex Scraping
When Node.js in n8n isn't sufficient:
- Use HTTP Request node: For simple API calls and basic scraping
- Deploy external service: Run a dedicated scraping service with full Node.js/Puppeteer
- Use specialized APIs: Services like WebScraping.AI handle complex scenarios
- n8n Execute Command node: Run external scripts from your n8n host
For complex scenarios, such as interacting with iframes or waiting on specific browser events, an external service or a dedicated scraping API is the more reliable choice.
Conclusion
Node.js is a powerful tool for web scraping within n8n workflows through the Code node. It provides the flexibility to implement custom scraping logic, parse HTML, handle pagination, and process data—all within your automation workflows. While there are some limitations compared to a full Node.js environment, combining n8n's Code node with external APIs like WebScraping.AI gives you a robust solution for production-grade web scraping automation.
Whether you're building a simple product price monitor or a complex data aggregation pipeline, Node.js in n8n provides the scripting power you need while maintaining the visual workflow benefits that make n8n so powerful for automation.