How can I scrape websites using n8n and JavaScript?
Web scraping with n8n and JavaScript combines the power of workflow automation with flexible scripting capabilities. This guide demonstrates how to extract data from websites using n8n's built-in nodes and custom JavaScript code for complex scenarios.
Understanding n8n Web Scraping Approaches
n8n offers multiple methods for web scraping:
- HTTP Request node: For simple HTML fetching
- HTML Extract node: For parsing HTML content
- Code node: For custom JavaScript logic
- Function node: For transforming data (a legacy node; newer n8n versions fold this into the Code node)
- Third-party integrations: Like WebScraping.AI for advanced scenarios
Method 1: Basic Web Scraping with HTTP Request + HTML Extract
The simplest approach combines HTTP Request and HTML Extract nodes:
Step 1: Fetch the HTML Content
Add an HTTP Request node with these settings:
{
  "method": "GET",
  "url": "https://example.com/products",
  "options": {
    "redirect": {
      "followRedirects": true
    }
  }
}
Step 2: Extract Data with CSS Selectors
Add an HTML Extract node to parse the HTML:
{
  "extractionValues": {
    "title": {
      "cssSelector": "h1.product-title",
      "returnValue": "text"
    },
    "price": {
      "cssSelector": ".product-price",
      "returnValue": "text"
    },
    "image": {
      "cssSelector": "img.product-image",
      "returnValue": "attribute",
      "attribute": "src"
    }
  }
}
Method 2: Advanced Scraping with JavaScript Code Node
For complex scraping scenarios, use the Code node with JavaScript:
// Access the HTML from the previous node
const html = $input.first().json.data;

// Use Cheerio for HTML parsing (on self-hosted n8n, external modules must
// be allowed, e.g. NODE_FUNCTION_ALLOW_EXTERNAL=cheerio). Loading it as $
// shadows n8n's own $ helper inside this node, which is fine here.
const cheerio = require('cheerio');
const $ = cheerio.load(html);

// Extract product data
const products = [];

$('.product-item').each((index, element) => {
  const product = {
    title: $(element).find('.product-title').text().trim(),
    price: parseFloat($(element).find('.price').text().replace(/[^0-9.]/g, '')),
    description: $(element).find('.description').text().trim(),
    url: $(element).find('a').attr('href'),
    inStock: $(element).find('.stock').text().includes('In Stock'),
    rating: parseFloat($(element).find('.rating').attr('data-rating') || '0')
  };
  products.push(product);
});

// Return structured data in n8n's item format
return products.map(product => ({ json: product }));
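Compared with Method 1, the Code node lets you iterate over repeated elements, coerce types (numbers, booleans), and derive several fields in a single pass.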
Method 3: Scraping JavaScript-Rendered Pages
Many modern websites render content with JavaScript. For these sites, you'll need to execute JavaScript in a browser context. While n8n doesn't have built-in browser automation, you can use external services or APIs.
Using WebScraping.AI API with n8n
For JavaScript-heavy sites, integrate WebScraping.AI:
// In a Code node
const targetUrl = 'https://example.com/dynamic-content';

// Make a request to WebScraping.AI with JavaScript rendering enabled
const response = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://api.webscraping.ai/html',
  qs: {
    url: targetUrl,
    js: true,
    proxy: 'datacenter'
  },
  headers: {
    'Api-Key': 'YOUR_API_KEY'
  }
});

// Parse the returned HTML
const cheerio = require('cheerio');
const $ = cheerio.load(response);

// Extract data from the JavaScript-rendered content
const data = [];
$('.dynamic-item').each((i, el) => {
  data.push({
    text: $(el).find('.text').text(),
    value: $(el).attr('data-value')
  });
});

return [{ json: { items: data } }];
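In production, avoid hardcoding the API key in code; calling the same endpoint from an HTTP Request node with a header-auth credential attached keeps the key in n8n's credential store.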
Method 4: Handling Pagination in n8n Workflows
To scrape multiple pages, create a loop in your workflow:
// In a Code node - generate page URLs
const baseUrl = 'https://example.com/products';
const totalPages = 10;

const urls = [];
for (let page = 1; page <= totalPages; page++) {
  urls.push({
    json: {
      url: `${baseUrl}?page=${page}`,
      pageNumber: page
    }
  });
}

return urls;
Then connect this to an HTTP Request node with Split In Batches to process the pages sequentially or in parallel. If the total page count isn't known up front, you can follow the site's "next" links instead, as in the sketch below.
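Here is a minimal sketch of that open-ended approach in a single Code node. It assumes a hypothetical .next-page link selector and the same external-module allowance for Cheerio as in Method 2:

// Open-ended pagination: follow the "next" link until it disappears.
// The .next-page selector and starting URL are assumptions - adjust
// them to the target site.
const cheerio = require('cheerio');

const items = [];
let url = 'https://example.com/products';
let pages = 0;
const maxPages = 50; // safety cap so a broken selector can't loop forever

while (url && pages < maxPages) {
  const html = await this.helpers.httpRequest({ method: 'GET', url });
  const $ = cheerio.load(html);

  $('.product-item').each((i, el) => {
    items.push({ title: $(el).find('.product-title').text().trim() });
  });

  // Resolve the next link relative to the current URL, if present
  const next = $('a.next-page').attr('href');
  url = next ? new URL(next, url).href : null;
  pages++;
}

return items.map(item => ({ json: item }));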
Method 5: Handling Authentication and Headers
Many websites require authentication or specific headers:
// In a Code node
const response = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/api/data',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    // Use either the bearer token here or the basic auth below - not both
    'Authorization': 'Bearer YOUR_TOKEN'
  },
  auth: {
    username: 'your-username',
    password: 'your-password'
  }
});

// Wrap the raw body so the item's json property is an object
return [{ json: { data: response } }];
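Some sites use session cookies rather than tokens. In that case you can log in once and replay the cookie on later requests. A minimal sketch, assuming a hypothetical /login endpoint and that your n8n version supports the returnFullResponse option:

// Cookie-based session handling (the /login endpoint and form fields
// are assumptions - adapt them to the target site)
const loginResponse = await this.helpers.httpRequest({
  method: 'POST',
  url: 'https://example.com/login',
  body: { username: 'your-username', password: 'your-password' },
  json: true,
  returnFullResponse: true // we need the Set-Cookie response header
});

// Keep only the "name=value" part of each cookie
const setCookie = loginResponse.headers['set-cookie'] || [];
const cookie = setCookie.map(c => c.split(';')[0]).join('; ');

// Replay the session cookie on an authenticated page
const html = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/members/data',
  headers: { Cookie: cookie }
});

return [{ json: { html } }];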
Method 6: Extracting Data from APIs
Many websites have underlying APIs that are easier to scrape than HTML:
// In a Code node - fetch JSON data directly
const apiResponse = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/api/products',
  qs: {
    category: 'electronics',
    limit: 100
  },
  json: true
});

// Transform and filter the data
const filteredProducts = apiResponse.products
  .filter(p => p.price < 1000)
  .map(p => ({
    name: p.title,
    price: p.price,
    available: p.stock > 0
  }));

return filteredProducts.map(p => ({ json: p }));
Handling Common Scraping Challenges
1. Rate Limiting and Delays
Add delays between requests to avoid being blocked:
// In a Code node
async function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

const results = [];
const urls = $input.all();

for (const item of urls) {
  const response = await this.helpers.httpRequest({
    method: 'GET',
    url: item.json.url
  });
  // Wrap the raw body so the item's json property is an object
  results.push({ json: { data: response } });

  // Wait 2 seconds between requests
  await delay(2000);
}

return results;
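If you prefer to keep the delay visible in the workflow, the built-in Wait node inside a Split In Batches loop achieves the same pacing without custom code.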
2. Error Handling
Implement robust error handling in your workflows:
// In a Code node
const results = [];
const errors = [];

for (const item of $input.all()) {
  try {
    const response = await this.helpers.httpRequest({
      method: 'GET',
      url: item.json.url,
      timeout: 30000
    });
    results.push({ json: { success: true, data: response } });
  } catch (error) {
    errors.push({
      json: {
        success: false,
        url: item.json.url,
        error: error.message
      }
    });
  }
}

return [...results, ...errors];
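Transient failures such as timeouts or 429 responses often succeed on a second try, so it can pay to retry before recording an error. A minimal sketch with exponential backoff; the three-attempt cap is an arbitrary starting point:

// Retry with exponential backoff: wait 1s, 2s, 4s, ... between attempts
async function fetchWithRetry(helpers, url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await helpers.httpRequest({ method: 'GET', url, timeout: 30000 });
    } catch (error) {
      if (attempt === attempts) throw error; // out of retries
      await new Promise(resolve =>
        setTimeout(resolve, 1000 * 2 ** (attempt - 1))
      );
    }
  }
}

const results = [];
for (const item of $input.all()) {
  const data = await fetchWithRetry(this.helpers, item.json.url);
  results.push({ json: { data } });
}

return results;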
3. Data Cleaning and Transformation
Clean extracted data before storage:
// In a Code node
function cleanText(text) {
  return text
    .trim()
    .replace(/\s+/g, ' ')          // collapse runs of whitespace
    .replace(/\n+/g, ' ')
    .replace(/[^\x20-\x7E]/g, ''); // strip non-ASCII (note: also drops accented characters)
}

function cleanPrice(priceStr) {
  // Pull the first number out of strings like "$1,299.99"
  const match = priceStr.match(/[\d,]+\.?\d*/);
  return match ? parseFloat(match[0].replace(/,/g, '')) : null;
}

const cleanedData = $input.all().map(item => ({
  json: {
    title: cleanText(item.json.title),
    price: cleanPrice(item.json.price),
    description: cleanText(item.json.description),
    scrapedAt: new Date().toISOString()
  }
}));

return cleanedData;
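Before storage, it also helps to enforce a minimal schema so malformed items never reach the sheet or database. A small sketch; the rules (non-empty title, positive price) are assumptions to adapt:

// Minimal validation: keep items with a non-empty title and a positive price
const valid = [];
let rejectedCount = 0;

for (const item of $input.all()) {
  const { title, price } = item.json;
  const ok =
    typeof title === 'string' && title.length > 0 &&
    typeof price === 'number' && price > 0;

  if (ok) {
    valid.push(item);
  } else {
    rejectedCount++;
  }
}

// Pass only valid items downstream; log the rest
console.log(`Rejected ${rejectedCount} invalid items`);
return valid;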
Storing Scraped Data
n8n can save scraped data to various destinations:
Save to Google Sheets
Add a Google Sheets node after your scraping logic:
{
  "operation": "append",
  "sheetName": "Scraped Products",
  "dataMode": "autoMapInputData"
}
Save to Database
Use Postgres or MySQL nodes:
{
  "operation": "insert",
  "table": "products",
  "columns": "title, price, url, scraped_at"
}
Save to JSON File
Use the Write Binary File node:
// Prepare data for file output
const data = $input.all().map(item => item.json);
const jsonContent = JSON.stringify(data, null, 2);

// n8n expects binary payloads in its own format, so convert the buffer
// with the prepareBinaryData helper rather than attaching it directly
return [{
  json: {},
  binary: {
    data: await this.helpers.prepareBinaryData(
      Buffer.from(jsonContent, 'utf-8'),
      'products.json',
      'application/json'
    )
  }
}];
Complete n8n Web Scraping Workflow Example
Here's a complete workflow that scrapes product data:
- Schedule Trigger: Run daily at 9 AM
- Code Node (Generate URLs): Create list of pages to scrape
- Split In Batches: Process 5 URLs at a time
- HTTP Request: Fetch each page
- Code Node (Parse HTML): Extract product data using Cheerio
- Code Node (Clean Data): Clean and validate extracted data
- Filter: Remove items without prices
- Google Sheets: Save to spreadsheet
- Slack: Send notification when complete (message-payload sketch below)
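As one concrete piece of the glue, the notification in step 9 needs only a small Code node that summarizes the run before the Slack node posts it. A minimal sketch of that payload:

// Build a summary message for the Slack node (step 9), assuming the
// incoming items are the rows that were written to the sheet
const items = $input.all();

return [{
  json: {
    text: `Scrape finished: ${items.length} products saved at ${new Date().toISOString()}`
  }
}];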
Best Practices for n8n Web Scraping
- Respect robots.txt: Check the website's robots.txt file before scraping
- Add delays: Use the Wait node or delays in code to avoid overwhelming servers
- Handle errors gracefully: Use the Error Trigger node to catch and log failures
- Use appropriate User-Agent: Identify your scraper properly in headers
- Monitor execution: Set up notifications for failed workflows
- Cache results: Store intermediate results to avoid re-scraping on failures (see the caching sketch after this list)
- Validate data: Always validate extracted data before storage
- Consider legal aspects: Ensure you have permission to scrape the target website
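One way to implement the caching advice within n8n itself is workflow static data, which persists across production executions (though not manual test runs). A minimal de-duplication sketch; the seenUrls key is an arbitrary name:

// Skip URLs that earlier runs already scraped, using workflow static data
const staticData = $getWorkflowStaticData('global');
staticData.seenUrls = staticData.seenUrls || [];

const fresh = $input.all().filter(
  item => !staticData.seenUrls.includes(item.json.url)
);

// Remember the new URLs for the next run
staticData.seenUrls.push(...fresh.map(item => item.json.url));

return fresh;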
When to Use External Scraping Services
For complex scenarios involving JavaScript rendering, CAPTCHAs, or rotating proxies, consider using dedicated scraping APIs like WebScraping.AI. These services provide:
- Automatic proxy rotation
- JavaScript rendering
- CAPTCHA solving
- Anti-bot detection bypass
- Reliable infrastructure
Much as you manage browser sessions in Puppeteer, maintaining a consistent scraping session across n8n workflow runs requires careful state management. Likewise, for dynamic content that must wait for elements to load, techniques similar to handling AJAX requests in Puppeteer can be applied through API-based rendering services integrated into your n8n workflows.
Conclusion
n8n provides powerful capabilities for web scraping through its visual workflow builder and JavaScript support. Start with simple HTTP Request and HTML Extract nodes for static content, then progress to custom JavaScript code for complex scenarios. For JavaScript-heavy websites, integrate external services or APIs to handle browser automation. Always follow ethical scraping practices and respect website terms of service.
With proper error handling, rate limiting, and data validation, you can build robust, automated web scraping workflows that run reliably on schedule or triggered by events.