How Do I Perform Web Scraping with JavaScript Using n8n?
Web scraping with JavaScript in n8n combines the power of workflow automation with flexible data extraction capabilities. n8n provides multiple approaches to scrape websites, from simple HTTP requests to custom JavaScript code execution, making it an excellent choice for developers who need to integrate web scraping into their automation workflows.
Understanding n8n's Web Scraping Capabilities
n8n offers several built-in nodes and methods for web scraping:
- HTTP Request Node: Fetches web pages and API endpoints
- Code Node: Executes custom JavaScript for complex scraping logic
- HTML Extract Node: Parses HTML using CSS selectors
- HTML to JSON Node: Converts HTML tables to structured data
These nodes can be combined to create powerful scraping workflows that extract, transform, and store data automatically.
Method 1: Basic Web Scraping with HTTP Request + HTML Extract
The simplest approach uses the HTTP Request node to fetch a page and the HTML Extract node to parse the content.
Step-by-Step Setup
Add an HTTP Request Node
- Set the method to GET
- Enter the target URL
- Configure headers if needed (User-Agent, Accept, etc.)
Add an HTML Extract Node
- Connect it to the HTTP Request output
- Define CSS selectors or XPath expressions
- Map extracted data to output fields
Example Workflow Configuration
{
  "nodes": [
    {
      "parameters": {
        "url": "https://example.com/products",
        "options": {
          "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
          }
        }
      },
      "name": "HTTP Request",
      "type": "n8n-nodes-base.httpRequest"
    },
    {
      "parameters": {
        "sourceData": "html",
        "extractionValues": {
          "values": [
            { "key": "title", "cssSelector": "h1.product-title" },
            { "key": "price", "cssSelector": "span.price" },
            { "key": "description", "cssSelector": "div.product-description" }
          ]
        }
      },
      "name": "HTML Extract",
      "type": "n8n-nodes-base.htmlExtract"
    }
  ]
}
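Downstream nodes can then work with the extracted fields (title, price, description) directly. For instance, a Code node could filter the extracted items by price; a minimal sketch, assuming one item per product and an arbitrary $100 threshold:

// Code Node: keep only extracted items priced under $100
return $input.all().filter(item => {
  // Strip currency symbols and separators before parsing
  const price = parseFloat((item.json.price || '').replace(/[^0-9.]/g, ''));
  return !Number.isNaN(price) && price < 100;
});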
Method 2: Advanced Scraping with JavaScript Code Node
For complex scraping scenarios, the Code node lets you write custom JavaScript. On self-hosted instances it can also use Node.js built-ins and external npm modules, provided they are allowed through the NODE_FUNCTION_ALLOW_BUILTIN and NODE_FUNCTION_ALLOW_EXTERNAL environment variables.
JavaScript Scraping with Cheerio
With Cheerio available as an allowed external module, the Code node can parse HTML using jQuery-like syntax:
// In n8n Code Node
const cheerio = require('cheerio');

// HTML from the previous HTTP Request node
const html = $input.first().json.data;
const $ = cheerio.load(html);

const products = [];
$('.product-item').each((index, element) => {
  const product = {
    title: $(element).find('h2.title').text().trim(),
    price: $(element).find('.price').text().trim(),
    url: $(element).find('a').attr('href'),
    image: $(element).find('img').attr('src'),
    rating: $(element).find('.rating').attr('data-rating'),
    inStock: $(element).find('.stock-status').hasClass('in-stock')
  };
  products.push(product);
});

return products.map(product => ({ json: product }));
Handling Pagination
When scraping multiple pages, you can implement pagination logic:
// n8n Code Node for pagination (conceptual — in a real workflow,
// fetch pages with the HTTP Request node instead of fetch())
const cheerio = require('cheerio');

const baseUrl = 'https://example.com/products';
const currentPage = $input.first().json.page || 1;
const maxPages = 10;
const results = [];

for (let page = currentPage; page <= maxPages; page++) {
  const url = `${baseUrl}?page=${page}`;
  const response = await fetch(url);
  const html = await response.text();
  const $ = cheerio.load(html);

  $('.product-item').each((i, el) => {
    results.push({
      title: $(el).find('h2').text(),
      price: $(el).find('.price').text(),
      page: page
    });
  });

  // Stop when no "next" link exists
  if ($('.pagination .next').length === 0) {
    break;
  }
}

return results.map(item => ({ json: item }));
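Since fetch() inside a Code node bypasses n8n's own HTTP handling, a more idiomatic pattern is to emit one item per page URL and let a downstream HTTP Request node fetch each page (n8n runs a node once per incoming item). A minimal sketch, with a placeholder URL and page count:

// n8n Code Node: emit one item per page URL; connect an HTTP Request
// node downstream and it will fetch each page automatically
const baseUrl = 'https://example.com/products'; // placeholder URL
const maxPages = 10; // adjust to the site's actual page count

return Array.from({ length: maxPages }, (_, i) => ({
  json: { url: `${baseUrl}?page=${i + 1}` }
}));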
Method 3: JavaScript-Rendered Content with Puppeteer
For websites that rely heavily on JavaScript rendering, you'll need a headless browser. While n8n doesn't include Puppeteer by default, you can integrate it through custom Docker installations or external services.
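If you go the custom-image route, a Code node sketch might look like the following. This is a sketch under two assumptions: Puppeteer is installed in the image, and it is allowed via NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer; the input item is assumed to carry a url field.

// Hypothetical Code Node — requires a custom n8n image with Puppeteer
// installed and NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer set
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
try {
  const page = await browser.newPage();
  await page.goto($input.first().json.url, { waitUntil: 'networkidle2' });
  const html = await page.content(); // fully rendered HTML
  return [{ json: { html } }];
} finally {
  // Always release the browser, even if navigation fails
  await browser.close();
}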
Using WebScraping.AI API in n8n
A more practical approach is to use a dedicated web scraping API that handles JavaScript rendering. Here's how to integrate WebScraping.AI with n8n:
// n8n HTTP Request Node configuration (query-string parameters)
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com/dynamic-content",
    "js": true,
    "proxy": "datacenter"
  }
}
Then process the response with a Code node:
// Code Node to extract data from rendered HTML
const cheerio = require('cheerio');

const html = $input.first().json.html;
const $ = cheerio.load(html);

const data = {
  dynamicContent: $('.loaded-by-js').text(),
  items: []
};

$('.dynamic-item').each((i, el) => {
  data.items.push({
    name: $(el).find('.item-name').text(),
    value: $(el).find('.item-value').text()
  });
});

return [{ json: data }];
Handling Common Scraping Challenges
1. Rate Limiting and Delays
Implement delays between requests to avoid overwhelming servers:
// n8n Code Node with delay
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

const urls = $input.all().map(item => item.json.url);
const results = [];

for (const url of urls) {
  // scrapeUrl() is a placeholder for your actual scraping logic
  const data = await scrapeUrl(url);
  results.push(data);

  // Wait 2 seconds between requests
  await delay(2000);
}

return results.map(item => ({ json: item }));
2. Error Handling
Robust error handling prevents workflow failures:
// n8n Code Node with error handling
const cheerio = require('cheerio');

const urls = $input.all();
const results = [];
const errors = [];

for (const item of urls) {
  try {
    const response = await fetch(item.json.url);
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }
    const html = await response.text();
    const $ = cheerio.load(html);
    results.push({
      url: item.json.url,
      title: $('h1').first().text(),
      status: 'success'
    });
  } catch (error) {
    // Record the failure and keep processing the remaining URLs
    errors.push({
      url: item.json.url,
      error: error.message,
      status: 'failed'
    });
  }
}

return [
  { json: { results, errors, total: urls.length } }
];
3. Dynamic Content Loading
When dealing with content that loads after the initial page render, much like handling AJAX requests with Puppeteer, you need to wait for specific elements to appear. With the WebScraping.AI API:
// HTTP Request Node with wait_for parameter
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "js": true,
    "wait_for": ".dynamically-loaded-content",
    "js_timeout": 5000
  }
}
Complete n8n Workflow Example
Here's a complete workflow that scrapes product data, processes it, and saves to a database:
// Workflow Node 1: HTTP Request (fetch page)
// Configuration: GET https://example.com/products

// Workflow Node 2: Code Node (extract data)
const cheerio = require('cheerio');

const html = $input.first().json.data;
const $ = cheerio.load(html);

const products = [];
$('.product').each((i, element) => {
  const $el = $(element);
  products.push({
    id: $el.attr('data-id'),
    name: $el.find('.name').text().trim(),
    price: parseFloat($el.find('.price').text().replace(/[^0-9.]/g, '')),
    currency: $el.find('.price').attr('data-currency') || 'USD',
    url: $el.find('a').attr('href'),
    imageUrl: $el.find('img').attr('src'),
    availability: $el.find('.stock').text().trim(),
    rating: parseFloat($el.find('.rating').attr('data-rating')) || 0,
    reviewCount: parseInt($el.find('.reviews').text().match(/\d+/)?.[0]) || 0,
    scrapedAt: new Date().toISOString()
  });
});

return products.map(product => ({ json: product }));

// Workflow Node 3: Set Node (transform data)
// Add computed fields, filter invalid entries

// Workflow Node 4: Postgres/MySQL Node (save to database)
// INSERT INTO products (name, price, url, ...) VALUES ...
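Nodes 3 and 4 are sketched as comments above. As a concrete stand-in for Node 3, a Code node can drop invalid entries before the database write; a minimal sketch using the field names from the extraction above:

// Code Node alternative for Node 3: drop items missing a name or a valid price
return $input.all().filter(item =>
  item.json.name && Number.isFinite(item.json.price) && item.json.price > 0
);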
Best Practices for Web Scraping in n8n
1. Respect robots.txt
Always check the website's robots.txt file before scraping:
// Check robots.txt compliance (simplified — a full parser should
// group rules by User-agent)
const robotsUrl = new URL('/robots.txt', 'https://example.com').href;
const robotsTxt = await fetch(robotsUrl).then(r => r.text());

// Respect crawl delays
const delayMatch = robotsTxt.match(/Crawl-delay:\s*(\d+)/i);
if (delayMatch) {
  console.log(`Respecting crawl delay of ${delayMatch[1]} seconds`);
}

// Collect disallowed paths so your workflow can skip them
const disallowedPaths = [...robotsTxt.matchAll(/^Disallow:\s*(\S+)/gim)]
  .map(match => match[1]);
2. Use Appropriate User-Agents
Set a descriptive User-Agent header:
// HTTP Request Node headers
{
  "User-Agent": "MyBot/1.0 (+https://mysite.com/bot-info)",
  "Accept": "text/html,application/xhtml+xml",
  "Accept-Language": "en-US,en;q=0.9"
}
3. Implement Caching
Cache responses to avoid redundant requests, similar to handling browser sessions in Puppeteer:
// Code Node with caching logic
// Note: this in-memory cache lasts only for the current execution
const cache = {};

async function fetchWithCache(url, ttl = 3600000) { // default TTL: 1 hour
  const now = Date.now();
  if (cache[url] && (now - cache[url].timestamp) < ttl) {
    return cache[url].data;
  }
  const response = await fetch(url);
  const data = await response.text();
  cache[url] = {
    data: data,
    timestamp: now
  };
  return data;
}
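Because the object above resets on every execution, caching across runs needs n8n's workflow static data (which persists only for trigger-started workflows, not manual test runs). A minimal sketch:

// Cross-execution cache using n8n workflow static data
const staticData = $getWorkflowStaticData('global');
staticData.cache = staticData.cache || {};

const url = $input.first().json.url;
const ttl = 3600000; // 1 hour
const cached = staticData.cache[url];

// Serve from the persisted cache when the entry is still fresh
if (cached && (Date.now() - cached.timestamp) < ttl) {
  return [{ json: { html: cached.data, fromCache: true } }];
}

const html = await fetch(url).then(r => r.text());
staticData.cache[url] = { data: html, timestamp: Date.now() };
return [{ json: { html, fromCache: false } }];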
4. Monitor and Log
Add logging to track scraping performance:
// Code Node with logging
// scrapeWebsite() is a placeholder for your actual scraping logic
const startTime = Date.now();
const url = $input.first().json.url;

console.log(`[${new Date().toISOString()}] Starting scrape: ${url}`);

try {
  const data = await scrapeWebsite(url);
  const duration = Date.now() - startTime;
  console.log(`[${new Date().toISOString()}] Completed in ${duration}ms`);
  return [{
    json: {
      ...data,
      metadata: {
        scrapedAt: new Date().toISOString(),
        duration: duration,
        status: 'success'
      }
    }
  }];
} catch (error) {
  console.error(`[${new Date().toISOString()}] Error: ${error.message}`);
  throw error;
}
Scheduling and Automation
One of n8n's strengths is scheduling automated scraping:
Cron Node: Schedule workflows to run at specific times
0 */6 * * *    // Every 6 hours
0 0 * * 0      // Weekly on Sunday at midnight
0 9 * * 1-5    // Weekdays at 9 AM
Webhook Node: Trigger scraping via HTTP requests
Interval Node: Run workflows at regular intervals
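For the webhook route, any external system can kick off a scrape with a plain HTTP call. A minimal sketch (the webhook path and payload are placeholders — copy the real URL from your Webhook node):

// Trigger an n8n webhook workflow from any JavaScript environment
// (hypothetical webhook path — use the URL shown in your Webhook node)
await fetch('https://your-n8n-instance.com/webhook/scrape-products', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/products' })
});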
Conclusion
Web scraping with JavaScript in n8n provides a powerful, flexible solution for automation workflows. By combining HTTP Request nodes, Code nodes with Cheerio, and external APIs for JavaScript-rendered content, you can build robust scraping pipelines that extract, transform, and deliver data exactly where you need it.
For production use cases requiring JavaScript rendering, consider using dedicated scraping APIs like WebScraping.AI that handle the complexity of headless browsers, proxies, and anti-bot measures while integrating seamlessly with n8n workflows.
Remember to always scrape responsibly by respecting robots.txt, implementing rate limiting, and being mindful of the target website's terms of service and server load.