How Do I Create a Node.js Scraper with n8n Automation?
Creating a Node.js scraper within n8n automation workflows allows you to leverage the full power of Node.js libraries while benefiting from n8n's visual workflow automation. This approach combines custom scraping logic with n8n's scheduling, data processing, and integration capabilities.
Understanding n8n's Node.js Execution Options
n8n provides several methods to execute Node.js code for web scraping:
- Function Node: Execute custom JavaScript code within n8n workflows
- Execute Command Node: Run Node.js scripts as external processes
- Code Node: Modern replacement for Function node with enhanced capabilities
- HTTP Request Node with JavaScript: Combine API calls with JavaScript processing
Each method has specific use cases, with the Function/Code nodes being ideal for inline scraping logic and Execute Command nodes better suited for complex scrapers requiring external dependencies.
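Whichever option you pick, the inline nodes follow the same contract: they receive input items and must return an array of items. Here is a minimal Code node sketch of that contract; the url field is an assumption about what the previous node produces, not part of n8n itself:

// Code node sketch: read the incoming items and pass each URL along as its own item
// "url" is an illustrative field name from the previous node's output
const urls = $input.all().map(item => item.json.url);

return urls.map(url => ({ json: { url } }));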
Method 1: Using the Function Node for Simple Scraping
The Function node allows you to write JavaScript code directly in your n8n workflow. Here's a basic example using native Node.js modules (note that on self-hosted n8n, requiring built-in modules such as https inside Function/Code nodes may first need to be allowed via the NODE_FUNCTION_ALLOW_BUILTIN environment variable):
// Function node code for simple HTTP scraping
const https = require('https');

async function scrapeWebsite(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (response) => {
      let data = '';

      response.on('data', (chunk) => {
        data += chunk;
      });

      response.on('end', () => {
        // Extract data using regex or string methods
        const titleMatch = data.match(/<title>(.*?)<\/title>/i);
        const title = titleMatch ? titleMatch[1] : 'No title found';

        resolve({
          title: title,
          statusCode: response.statusCode,
          contentLength: data.length
        });
      });
    }).on('error', (error) => {
      reject(error);
    });
  });
}

// Main execution
const targetUrl = items[0].json.url || 'https://example.com';
const result = await scrapeWebsite(targetUrl);

return [{ json: result }];
This approach works well for simple HTML parsing but has limitations with dynamic content and complex DOM manipulation.
Method 2: Advanced Scraping with Execute Command Node
For more sophisticated scraping requirements, use the Execute Command node to run external Node.js scripts. This method requires a self-hosted n8n instance and lets you install and use npm packages like Puppeteer, Cheerio, or Axios.
Step 1: Create Your Node.js Scraper Script
First, create a standalone Node.js scraper script (scraper.js):
// scraper.js
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeProduct(url) {
  try {
    const { data } = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    const $ = cheerio.load(data);

    const product = {
      title: $('h1.product-title').text().trim(),
      price: $('.price').first().text().trim(),
      description: $('.product-description').text().trim(),
      images: []
    };

    // Extract all image URLs
    $('.product-images img').each((i, elem) => {
      product.images.push($(elem).attr('src'));
    });

    return product;
  } catch (error) {
    console.error('Scraping error:', error.message);
    throw error;
  }
}

// Get URL from command line argument
const targetUrl = process.argv[2];

if (!targetUrl) {
  console.error('Please provide a URL as argument');
  process.exit(1);
}

scrapeProduct(targetUrl)
  .then(result => {
    console.log(JSON.stringify(result, null, 2));
  })
  .catch(error => {
    console.error(JSON.stringify({ error: error.message }));
    process.exit(1);
  });
Step 2: Install Dependencies
npm init -y
npm install axios cheerio
Step 3: Configure Execute Command Node in n8n
In your n8n workflow, configure the Execute Command node:
node /path/to/scraper.js {{$json["url"]}}
The script prints its result as JSON to stdout, which the Execute Command node captures and passes to the next node in your workflow.
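To work with the scraped fields individually downstream, a small Function/Code node can parse that output. This is a minimal sketch and assumes the Execute Command node exposes the captured output as a stdout property on each item:

// Code node after Execute Command: turn the script's printed JSON into n8n fields
// Assumes the command output is available as item.json.stdout
return $input.all().map(item => ({
  json: JSON.parse(item.json.stdout)
}));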
Method 3: Using Puppeteer for Dynamic Content
For JavaScript-heavy websites, Puppeteer provides browser automation capabilities. Here's a complete Puppeteer scraper for n8n:
// puppeteer-scraper.js
const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();

    // Set user agent to avoid detection
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');

    // Navigate to page with proper wait conditions
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000
    });

    // Wait for specific elements to load
    await page.waitForSelector('.content', { timeout: 10000 });

    // Extract data from the page
    const data = await page.evaluate(() => {
      const results = [];
      const items = document.querySelectorAll('.item');

      items.forEach(item => {
        results.push({
          title: item.querySelector('h2')?.textContent.trim(),
          description: item.querySelector('.description')?.textContent.trim(),
          link: item.querySelector('a')?.href
        });
      });

      return results;
    });

    return data;
  } finally {
    await browser.close();
  }
}

const url = process.argv[2];

scrapeWithPuppeteer(url)
  .then(data => console.log(JSON.stringify(data)))
  .catch(error => {
    console.error(JSON.stringify({ error: error.message }));
    process.exit(1);
  });
This script demonstrates essential Puppeteer techniques including handling browser sessions and waiting for dynamic content.
Install Puppeteer dependencies:
npm install puppeteer
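The script is then wired into the workflow with the same Execute Command pattern shown earlier:

node /path/to/puppeteer-scraper.js {{$json["url"]}}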
Method 4: Using WebScraping.AI API in n8n
For production-grade scraping without infrastructure overhead, integrate WebScraping.AI API directly into your n8n workflow using the HTTP Request node:
// Function node to prepare the API request parameters
// The API key is a placeholder; in practice, load it from an environment
// variable or from a credential configured on the HTTP Request node
const apiKey = 'YOUR_WEBSCRAPING_AI_API_KEY';
const targetUrl = items[0].json.url;

return [{
  json: {
    url: 'https://api.webscraping.ai/html',
    method: 'GET',
    qs: {
      api_key: apiKey,
      url: targetUrl,
      js: true, // Enable JavaScript rendering
      proxy: 'datacenter'
    }
  }
}];
Then use an HTTP Request node to make the API call, followed by a Function node to parse the HTML (on self-hosted n8n, using an external module like cheerio in Function/Code nodes may require allowing it via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable):
// Function node to parse the API response
const cheerio = require('cheerio');

// Adjust the property name to match how your HTTP Request node returns the page body
const html = items[0].json.html;
const $ = cheerio.load(html);

const scraped_data = {
  title: $('h1').first().text(),
  paragraphs: [],
  links: []
};

$('p').each((i, elem) => {
  scraped_data.paragraphs.push($(elem).text());
});

$('a').each((i, elem) => {
  scraped_data.links.push({
    text: $(elem).text(),
    href: $(elem).attr('href')
  });
});

return [{ json: scraped_data }];
Complete n8n Workflow Example
Here's a complete workflow structure for automated scraping:
- Schedule Trigger: Run scraper daily at 9 AM
- Function Node: Prepare list of URLs to scrape (see the sketch after this list)
- Split In Batches: Process URLs in batches of 5
- Execute Command/HTTP Request: Perform scraping
- Function Node: Parse and transform data
- IF Node: Check for errors or missing data
- PostgreSQL/Google Sheets: Store results
- Send Email: Notify on completion or errors
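As a minimal sketch of step 2, the URL-preparation node could look like the following; the hard-coded URLs are placeholders for whatever source you actually use, such as a database query or an earlier node:

// Function/Code node: emit one item per URL to scrape
// The URL list is illustrative only
const urls = [
  'https://example.com/category/page-1',
  'https://example.com/category/page-2'
];

return urls.map(url => ({ json: { url } }));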
Error Handling and Best Practices
Implement robust error handling in your Node.js scrapers:
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await scrapeWebsite(url);
      return result;
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);

      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
      }

      // Exponential backoff before the next attempt
      await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, attempt)));
    }
  }
}
Best Practices for n8n Node.js Scrapers
- Rate Limiting: Add delays between requests to avoid overwhelming target servers (see the sketch after this list)
- User Agents: Rotate user agents to appear as different browsers
- Error Logging: Use n8n's error workflows to capture and handle failures
- Data Validation: Validate scraped data before storing or processing
- Proxy Rotation: Use proxies for large-scale scraping operations
- Memory Management: Close browser instances and clean up resources
- Timeout Configuration: Set appropriate timeouts for network requests
- Credential Management: Store API keys and credentials securely in n8n
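As an example of the first point, rate limiting can be layered on top of the retry helper above by scraping sequentially with a pause between requests; the 2-second delay is only an illustrative default:

// Minimal rate-limiting sketch: process URLs one at a time with a fixed pause
// delayMs is an illustrative value; tune it to the target site's tolerance
async function scrapeSequentially(urls, delayMs = 2000) {
  const results = [];

  for (const url of urls) {
    results.push(await scrapeWithRetry(url));
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  return results;
}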
Handling Dynamic Content and AJAX
Many modern websites load content dynamically via AJAX. When scraping them with Puppeteer, wait for the dynamic content or the underlying API responses to arrive before extracting data:
// Wait for AJAX content to load
await page.waitForFunction(() => {
  const elements = document.querySelectorAll('.dynamic-content');
  return elements.length > 0;
}, { timeout: 10000 });

// Alternative: wait for a specific network request to complete
await page.waitForResponse(response => {
  return response.url().includes('api/data') && response.status() === 200;
});
Scheduling and Monitoring
Configure your n8n workflow for production use:
- Cron Schedule: Set up recurring execution times
  - Syntax: 0 9 * * * (daily at 9 AM)
- Error Notifications: Add error trigger workflows
  - Send Slack/email alerts on failures
- Execution Logs: Monitor workflow history
  - Review execution times and success rates
- Data Persistence: Store results reliably
  - Use databases or cloud storage
- Webhook Triggers: Enable on-demand scraping
  - Create API endpoints for manual triggers
Performance Optimization
Optimize your Node.js scrapers for better performance:
// Use connection pooling (keep-alive agents) for multiple requests
const http = require('http');
const https = require('https');
const axios = require('axios');

const axiosInstance = axios.create({
  timeout: 10000,
  maxRedirects: 5,
  httpAgent: new http.Agent({ keepAlive: true }),
  httpsAgent: new https.Agent({ keepAlive: true })
});
// Parallel processing with Promise.all
async function scrapeMultipleUrls(urls) {
  const promises = urls.map(url => scrapeWithRetry(url));
  return await Promise.all(promises);
}

// Limit concurrent requests by processing URLs in fixed-size batches
async function scrapeBatch(urls, concurrency = 5) {
  const results = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const batchResults = await Promise.all(
      batch.map(url => scrapeWithRetry(url))
    );
    results.push(...batchResults);
  }

  return results;
}
Conclusion
Creating Node.js scrapers with n8n automation combines the flexibility of custom code with visual workflow management. Choose the Function node for simple scraping tasks, Execute Command for complex scrapers with external dependencies, or integrate APIs like WebScraping.AI for production-grade reliability. With proper error handling, rate limiting, and monitoring, you can build robust automated scraping workflows that scale with your needs.
The key is matching your approach to your requirements: use native n8n nodes for simplicity, custom Node.js scripts for flexibility, or specialized APIs for reliability and compliance with anti-scraping measures.