Can I Use Node.js for Web Scraping in n8n Workflows?
Yes, you can absolutely use Node.js for web scraping in n8n workflows! n8n provides powerful built-in functionality through its Code node (formerly Function node) that allows you to execute custom JavaScript/Node.js code directly within your automation workflows. This gives you the flexibility to implement complex scraping logic, parse HTML, process data, and integrate with external libraries.
Understanding n8n's Code Node
The Code node in n8n is your gateway to custom Node.js scripting. It runs in a sandboxed environment with access to several built-in modules and allows you to process data, make HTTP requests, and manipulate workflow results programmatically.
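Before diving into scraping, here's a minimal Code node script (in "Run Once for All Items" mode) showing the input/output contract every example below relies on: read items from the previous node, return an array of objects with a `json` key:

```javascript
// Read every item produced by the previous node
const items = $input.all();

// A Code node must return an array of { json: ... } objects
return items.map((item, index) => ({
  json: {
    index,
    receivedKeys: Object.keys(item.json)
  }
}));
```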
Key Features of the Code Node
- JavaScript/Node.js runtime: Write standard JavaScript code with ES6+ syntax
- Access to workflow data: Read and manipulate data from previous nodes
- Built-in modules: Access to common Node.js modules such as `axios` and `cheerio` (availability depends on your setup; see Module Availability below)
- Multiple items support: Process single or multiple data items
- Error handling: Built-in error management and debugging capabilities
Basic Web Scraping with Node.js in n8n
Method 1: Using the HTTP Request Node + Code Node
The most straightforward approach combines n8n's HTTP Request node with the Code node for HTML parsing:
```javascript
// In the Code node, after fetching HTML with an HTTP Request node
const cheerio = require('cheerio');

// Get the HTML from the previous node (depending on the HTTP Request node's
// response settings, the HTML may arrive under .data instead of .body)
const html = $input.item.json.body;

// Parse with Cheerio
const $ = cheerio.load(html);

// Extract data
const titles = [];
$('h2.product-title').each((i, elem) => {
  titles.push($(elem).text().trim());
});

const prices = [];
$('.price').each((i, elem) => {
  prices.push($(elem).text().trim());
});

// Return structured data
return titles.map((title, index) => ({
  json: {
    title: title,
    price: prices[index] || 'N/A'
  }
}));
```
Method 2: All-in-One Code Node Approach
You can also handle both the HTTP request and parsing in a single Code node:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Define the target URL
const url = 'https://example.com/products';

try {
  // Fetch the HTML
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  });

  // Parse HTML
  const $ = cheerio.load(response.data);

  // Extract product information
  const products = [];
  $('.product-card').each((i, element) => {
    const product = {
      name: $(element).find('.product-name').text().trim(),
      price: $(element).find('.product-price').text().trim(),
      url: $(element).find('a').attr('href'),
      image: $(element).find('img').attr('src')
    };
    products.push(product);
  });

  // Return results as n8n items
  return products.map(product => ({
    json: product
  }));
} catch (error) {
  throw new Error(`Scraping failed: ${error.message}`);
}
```
Advanced Scraping Techniques
Handling Pagination
When scraping multiple pages, you can implement pagination logic within your Code node:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const baseUrl = 'https://example.com/products';
const maxPages = 5;
const allProducts = [];

for (let page = 1; page <= maxPages; page++) {
  const url = `${baseUrl}?page=${page}`;

  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Stop when a page comes back empty
    const productCount = $('.product-item').length;
    if (productCount === 0) break;

    $('.product-item').each((i, elem) => {
      allProducts.push({
        title: $(elem).find('h3').text().trim(),
        price: $(elem).find('.price').text().trim(),
        page: page
      });
    });

    // Rate limiting - wait between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  } catch (error) {
    console.error(`Error on page ${page}:`, error.message);
    break;
  }
}

return allProducts.map(product => ({ json: product }));
```
Handling Dynamic Content and AJAX
For websites that load content dynamically via JavaScript, you'll need a headless browser. The Code node doesn't ship with Puppeteer, and running a full browser inside a workflow execution is resource-intensive, so you'll typically reach for one of these alternatives:
Option 1: Use WebScraping.AI API
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const apiKey = 'YOUR_API_KEY';
const targetUrl = 'https://example.com/dynamic-content';

try {
  const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
      api_key: apiKey,
      url: targetUrl,
      js: true, // Enable JavaScript rendering
      timeout: 10000
    }
  });

  const $ = cheerio.load(response.data);

  // Extract data from the rendered HTML
  const results = [];
  $('.dynamic-content-item').each((i, elem) => {
    results.push({
      title: $(elem).find('.title').text(),
      description: $(elem).find('.description').text()
    });
  });

  return results.map(item => ({ json: item }));
} catch (error) {
  throw new Error(`API request failed: ${error.message}`);
}
```
Option 2: Call External Service
Deploy a separate Node.js service with Puppeteer and call it from n8n:
```javascript
const axios = require('axios');

// Call your external Puppeteer service
const response = await axios.post('https://your-puppeteer-service.com/scrape', {
  url: 'https://example.com',
  waitForSelector: '.loaded-content',
  timeout: 30000
});

const scrapedData = response.data;
return scrapedData.map(item => ({ json: item }));
```
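For reference, here's a minimal sketch of what such a service could look like, assuming Express and Puppeteer on the server side; the `/scrape` endpoint, request fields, and `.loaded-content` selector are hypothetical and simply match the call above:

```javascript
// server.js - minimal Puppeteer scraping service (hypothetical sketch)
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  const { url, waitForSelector, timeout = 30000 } = req.body;
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout });
    if (waitForSelector) {
      await page.waitForSelector(waitForSelector, { timeout });
    }
    // Extract from the fully rendered DOM; adapt selectors to your site
    const data = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.loaded-content')).map(el => ({
        text: el.textContent.trim()
      }))
    );
    res.json(data);
  } catch (error) {
    res.status(500).json({ error: error.message });
  } finally {
    if (browser) await browser.close();
  }
});

app.listen(3000, () => console.log('Scrape service listening on port 3000'));
```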
For more complex browser automation, such as maintaining sessions across pages or waiting on specific AJAX responses, a dedicated external service gives you far more control than the Code node.
Data Cleaning and Transformation
Clean and normalize scraped data within your Code node:
```javascript
// Input data from previous node
const items = $input.all();

const cleanedData = items.map(item => {
  const data = item.json;

  return {
    // Remove currency symbols and convert to number
    price: parseFloat((data.price || '').replace(/[^0-9.]/g, '')),

    // Normalize text
    title: (data.title || '')
      .trim()
      .replace(/\s+/g, ' ')
      .toLowerCase(),

    // Extract domain from URL
    domain: new URL(data.url).hostname,

    // Add timestamp
    scrapedAt: new Date().toISOString(),

    // Convert stock status to boolean (guard against missing fields)
    inStock: (data.availability || '').toLowerCase().includes('in stock')
  };
});

return cleanedData.map(item => ({ json: item }));
```
Best Practices for Node.js Scraping in n8n
1. Error Handling
Always implement robust error handling to prevent workflow failures:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const urls = ['https://site1.com', 'https://site2.com', 'https://site3.com'];
const results = [];
const errors = [];

for (const url of urls) {
  try {
    const response = await axios.get(url, { timeout: 5000 });
    const $ = cheerio.load(response.data);

    results.push({
      url: url,
      title: $('title').text(),
      status: 'success'
    });
  } catch (error) {
    errors.push({
      url: url,
      error: error.message,
      status: 'failed'
    });
  }
}

return [{
  json: {
    results,
    errors,
    summary: {
      total: urls.length,
      successful: results.length,
      failed: errors.length
    }
  }
}];
```
2. Rate Limiting
Implement delays to avoid overwhelming target servers:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithDelay(urls, delayMs = 1000) {
  const results = [];

  for (const url of urls) {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    results.push({
      url: url,
      content: $('body').text().substring(0, 200)
    });

    // Wait before the next request
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  return results;
}

const urls = $input.item.json.urls;
const data = await scrapeWithDelay(urls, 2000);

return data.map(item => ({ json: item }));
```
3. User-Agent Rotation
Set appropriate headers to avoid being blocked:
```javascript
const axios = require('axios');

const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

const randomUserAgent = userAgents[Math.floor(Math.random() * userAgents.length)];

const response = await axios.get('https://example.com', {
  headers: {
    'User-Agent': randomUserAgent,
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
  }
});

return [{ json: { html: response.data, userAgent: randomUserAgent } }];
```
Integrating with WebScraping.AI
For production-grade scraping in n8n, consider using a dedicated API like WebScraping.AI:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const config = {
  apiKey: 'YOUR_API_KEY',
  baseUrl: 'https://api.webscraping.ai'
};

// HTML scraping with JavaScript rendering
async function scrapeHtml(url, enableJs = true) {
  const response = await axios.get(`${config.baseUrl}/html`, {
    params: {
      api_key: config.apiKey,
      url: url,
      js: enableJs,
      proxy: 'datacenter',
      timeout: 15000
    }
  });
  return response.data;
}

// AI-powered question answering
async function askQuestion(url, question) {
  const response = await axios.get(`${config.baseUrl}/question`, {
    params: {
      api_key: config.apiKey,
      url: url,
      question: question
    }
  });
  return response.data;
}

// Main execution
const targetUrl = $input.item.json.url;
const html = await scrapeHtml(targetUrl, true);
const $ = cheerio.load(html);

return [{
  json: {
    title: $('h1').first().text(),
    description: $('meta[name="description"]').attr('content'),
    scrapedAt: new Date().toISOString()
  }
}];
```
Workflow Example: Complete Product Scraper
Here's a complete example that combines multiple techniques:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Configuration
const config = {
  startUrl: 'https://example.com/products',
  maxProducts: 50,
  delayBetweenRequests: 1500
};

// Results storage
const allProducts = [];
let currentPage = 1;
let hasNextPage = true;

// Main scraping loop
while (hasNextPage && allProducts.length < config.maxProducts) {
  try {
    const url = `${config.startUrl}?page=${currentPage}`;
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      },
      timeout: 10000
    });

    const $ = cheerio.load(response.data);

    // Extract products (returning false breaks out of Cheerio's .each())
    $('.product').each((i, elem) => {
      if (allProducts.length >= config.maxProducts) return false;

      allProducts.push({
        id: $(elem).attr('data-product-id'),
        name: $(elem).find('.product-name').text().trim(),
        price: parseFloat($(elem).find('.price').text().replace(/[^0-9.]/g, '')),
        rating: parseFloat($(elem).find('.rating').attr('data-rating')),
        image: $(elem).find('img').attr('src'),
        url: $(elem).find('a').attr('href'),
        inStock: $(elem).find('.stock-status').text().includes('In Stock'),
        scrapedAt: new Date().toISOString(),
        page: currentPage
      });
    });

    // Check for a next page
    hasNextPage = $('.pagination .next').length > 0;
    currentPage++;

    // Rate limiting
    if (hasNextPage) {
      await new Promise(resolve => setTimeout(resolve, config.delayBetweenRequests));
    }
  } catch (error) {
    console.error(`Error on page ${currentPage}:`, error.message);
    hasNextPage = false;
  }
}

// Return results with a summary (guard against dividing by zero)
return [{
  json: {
    products: allProducts,
    summary: {
      totalProducts: allProducts.length,
      pagesScraped: currentPage - 1,
      avgPrice: allProducts.length
        ? allProducts.reduce((sum, p) => sum + p.price, 0) / allProducts.length
        : 0,
      inStockCount: allProducts.filter(p => p.inStock).length
    }
  }
}];
```
Limitations and Considerations
Memory and Execution Time
n8n's Code node has resource limitations:
- Execution timeout: Long-running code can hit the workflow execution timeout (configurable via EXECUTIONS_TIMEOUT on self-hosted instances; n8n Cloud enforces plan-based limits)
- Memory limits: Limited memory allocation per execution
- No persistent state: Each execution starts fresh
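One partial workaround for the lack of persistent state: workflow static data survives between executions of an active (production) workflow, though not manual test runs. A minimal sketch, using a hypothetical `lastScrapedPage` cursor:

```javascript
// Workflow static data persists between executions of an *active* workflow
const staticData = $getWorkflowStaticData('global');

// Resume from where the previous execution stopped (hypothetical cursor)
const lastPage = staticData.lastScrapedPage || 0;
const nextPage = lastPage + 1;

// ... scrape page `nextPage` here ...

staticData.lastScrapedPage = nextPage;
return [{ json: { scrapedPage: nextPage } }];
```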
Module Availability
Which modules you can require() depends on how n8n is deployed, so treat the require calls in the examples above as assumptions about your setup:
- n8n Cloud: External npm packages can't be imported in the Code node (a few bundled helpers such as `crypto` and `moment` are available)
- Self-hosted: External modules like `axios` and `cheerio` can be allowed via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable, and standard Node.js modules (`fs`, `path`, `crypto`, etc.) via NODE_FUNCTION_ALLOW_BUILTIN
For modules not available in n8n, consider using external services or APIs.
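If you're self-hosting, the environment configuration might look like this minimal sketch (assuming the external packages are installed in the n8n container or host):

```bash
# Allow the Code node to require these external npm packages
export NODE_FUNCTION_ALLOW_EXTERNAL=axios,cheerio,lodash
# Allow specific standard Node.js modules
export NODE_FUNCTION_ALLOW_BUILTIN=crypto,fs,path
```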
Alternatives for Complex Scraping
When Node.js in n8n isn't sufficient:
- Use HTTP Request node: For simple API calls and basic scraping
- Deploy external service: Run a dedicated scraping service with full Node.js/Puppeteer
- Use specialized APIs: Services like WebScraping.AI handle complex scenarios
- n8n Execute Command node: Run external scripts from your n8n host
For complex scenarios, such as interacting with iframes or waiting on specific browser events, an external service or a dedicated scraping API is the more reliable choice.
Conclusion
Node.js is a powerful tool for web scraping within n8n workflows through the Code node. It provides the flexibility to implement custom scraping logic, parse HTML, handle pagination, and process data—all within your automation workflows. While there are some limitations compared to a full Node.js environment, combining n8n's Code node with external APIs like WebScraping.AI gives you a robust solution for production-grade web scraping automation.
Whether you're building a simple product price monitor or a complex data aggregation pipeline, Node.js in n8n provides the scripting power you need while maintaining the visual workflow benefits that make n8n so powerful for automation.