What is n8n workflow automation and how does it work for scraping?
n8n is an open-source workflow automation platform that enables developers to create sophisticated data pipelines and automated tasks through a visual, node-based interface. For web scraping, n8n provides a powerful alternative to traditional scripting by allowing you to build, manage, and scale scraping workflows without writing extensive code. By connecting different nodes representing various operations, you can create complex scraping systems that handle everything from data extraction to storage and notifications.
Understanding n8n's Architecture
n8n (pronounced "n-eight-n") is built on a node-based architecture where each node represents a specific action or integration. The platform operates on a flow-based programming model, making it intuitive for developers to visualize and construct data pipelines.
Core Components of n8n
Nodes: Individual units of work that perform specific tasks such as HTTP requests, data transformation, or database operations. Each node receives data from the previous node and passes processed data to the next (the item format is illustrated below).
Connections: Links between nodes that define the data flow and execution order. Connections can branch, merge, and loop, allowing for complex workflow logic.
Credentials: Secure storage for API keys, passwords, and authentication tokens used across workflows. Credentials are encrypted and can be reused across multiple nodes.
Executions: Individual runs of a workflow, tracked with logs, input/output data, and performance metrics for debugging and monitoring.
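Every node exchanges data in the same shape: an array of items, each wrapping its payload under a json key. Keeping this structure in mind makes the Code node examples later in this article easier to follow. A hypothetical two-item payload looks like this:
[
  { "json": { "title": "Product A", "price": 19.99 } },
  { "json": { "title": "Product B", "price": 24.50 } }
]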
Why n8n for Web Scraping?
Traditional web scraping requires writing scripts that handle HTTP requests, parse HTML, manage proxies, handle errors, and store data. n8n simplifies this by providing:
- Visual workflow creation - See your entire scraping pipeline at a glance
- Built-in error handling - Configure retries and failure notifications without custom code
- Schedule automation - Run scraping jobs on cron-like schedules
- API integrations - Connect to 350+ services without writing integration code
- Self-hosted control - Keep your data and workflows on your infrastructure
- Version control - Export workflows as JSON for Git versioning (example below)
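The version control point is concrete: every workflow exports to a single JSON file describing its nodes and connections. A heavily trimmed sketch of such an export (node names and parameters here are illustrative):
{
  "name": "Product Scraper",
  "nodes": [
    { "name": "Schedule Trigger", "type": "n8n-nodes-base.scheduleTrigger", "parameters": {} },
    { "name": "HTTP Request", "type": "n8n-nodes-base.httpRequest", "parameters": {} }
  ],
  "connections": {
    "Schedule Trigger": {
      "main": [[{ "node": "HTTP Request", "type": "main", "index": 0 }]]
    }
  }
}
Committing these files to Git gives you diffs, reviews, and rollbacks for your scraping logic, just like application code.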
How n8n Workflows Work for Web Scraping
A typical n8n scraping workflow consists of several stages:
1. Trigger Stage
Workflows start with a trigger node that initiates execution:
{
"name": "Schedule Trigger",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": {
"rule": {
"interval": [{
"field": "hours",
"hoursInterval": 12
}]
}
}
}
Common trigger types for scraping:
- Schedule Trigger: Run workflows on fixed intervals (hourly, daily, weekly)
- Webhook Trigger: Start scraping via HTTP POST requests (example below)
- Manual Trigger: Execute workflows on-demand from the UI
- Cron Trigger: Use cron expressions for complex scheduling
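For example, a Webhook Trigger that lets an external system kick off a scrape via POST might be configured like this (the path value is an arbitrary example):
{
  "name": "Webhook Trigger",
  "type": "n8n-nodes-base.webhook",
  "parameters": {
    "httpMethod": "POST",
    "path": "start-scrape",
    "responseMode": "onReceived"
  }
}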
2. Data Fetching Stage
After triggering, the workflow fetches web content. You have several options:
Option A: Simple HTTP Requests
For static websites without JavaScript:
{
"name": "HTTP Request",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "GET",
"url": "https://example.com/products",
"options": {
"headers": {
"User-Agent": "Mozilla/5.0 (compatible; n8n-bot)"
}
}
}
}
Option B: Web Scraping API Integration
For JavaScript-rendered sites and anti-bot bypass, integrating a dedicated scraping API provides better reliability:
{
"name": "WebScraping.AI",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "GET",
"url": "https://api.webscraping.ai/html",
"authentication": "genericCredentialType",
"genericAuthType": "httpHeaderAuth",
"options": {
"queryParameters": {
"parameters": [
{
"name": "url",
"value": "={{ $json.targetUrl }}"
},
{
"name": "js",
"value": "true"
},
{
"name": "proxy",
"value": "datacenter"
},
{
"name": "timeout",
"value": "10000"
}
]
}
}
}
}
This approach handles JavaScript rendering automatically, similar to how you would handle AJAX requests using Puppeteer, but without managing browser instances.
3. Data Extraction Stage
Once you have the HTML content, extract structured data using various methods:
CSS Selector Extraction
Use the HTML Extract node for simple data extraction:
{
"name": "HTML Extract",
"type": "n8n-nodes-base.html",
"parameters": {
"operation": "extractHtmlContent",
"options": {
"extractionValues": [
{
"key": "title",
"cssSelector": "h1.product-title",
"returnValue": "text"
},
{
"key": "price",
"cssSelector": ".price-value",
"returnValue": "text"
},
{
"key": "image",
"cssSelector": "img.product-image",
"returnValue": "attribute",
"attribute": "src"
}
]
}
}
}
JavaScript Code Extraction
For complex parsing logic, use the Code node with JavaScript:
// Access the HTML from previous node
const html = $input.item.json.html;
// Parse with regex or string methods
const priceMatch = html.match(/Price: \$(\d+\.?\d*)/);
const price = priceMatch ? parseFloat(priceMatch[1]) : null;
// Extract multiple items
const products = [];
const productRegex = /<div class="product">(.*?)<\/div>/gs;
let match;
while ((match = productRegex.exec(html)) !== null) {
const productHtml = match[1];
const titleMatch = productHtml.match(/<h2>(.*?)<\/h2>/);
products.push({
title: titleMatch ? titleMatch[1] : '',
timestamp: new Date().toISOString()
});
}
return products.map(product => ({ json: product }));
AI-Powered Extraction
WebScraping.AI offers a question-based extraction endpoint that uses AI to extract data:
{
"method": "GET",
"url": "https://api.webscraping.ai/question",
"queryParameters": {
"url": "={{ $json.pageUrl }}",
"question": "What is the product name, current price, original price, and discount percentage?",
"js": "true"
}
}
This method eliminates the need to write CSS selectors or parsing code, making it ideal for complex or frequently-changing page structures.
4. Data Transformation Stage
Clean, normalize, and enrich the extracted data:
// Clean and transform product data
const items = $input.all();
items.forEach(item => {
// Clean price: "$1,234.56" -> 1234.56
item.json.price = parseFloat(
item.json.price.replace(/[$,]/g, '')
);
// Standardize availability
const stockText = item.json.stock.toLowerCase();
item.json.inStock = stockText.includes('in stock') ||
stockText.includes('available');
// Add metadata
item.json.scrapedAt = new Date().toISOString();
item.json.source = 'example.com';
// Generate unique ID
item.json.productId = `${item.json.sku}_${Date.now()}`;
});
return items;
5. Storage Stage
Store scraped data in your preferred destination:
Database Storage (PostgreSQL):
{
"name": "Postgres",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "insert",
"table": "products",
"columns": "title,price,url,scraped_at",
"options": {
"skipOnConflict": true
}
}
}
Spreadsheet Storage (Google Sheets):
{
"name": "Google Sheets",
"type": "n8n-nodes-base.googleSheets",
"parameters": {
"operation": "append",
"documentId": "{{ $credentials.sheetId }}",
"sheetName": "Products",
"columns": "title,price,url,scraped_at"
}
}
Cloud Storage (AWS S3):
{
"name": "AWS S3",
"type": "n8n-nodes-base.awsS3",
"parameters": {
"operation": "upload",
"bucket": "scraping-data",
"fileName": "={{ $json.productId }}.json",
"fileContent": "={{ JSON.stringify($json) }}"
}
}
6. Notification Stage
Get alerts when scraping completes or encounters errors:
{
"name": "Send Email",
"type": "n8n-nodes-base.emailSend",
"parameters": {
"fromEmail": "scraper@example.com",
"toEmail": "alerts@example.com",
"subject": "Scraping Complete: {{ $json.itemCount }} items",
"text": "Scraped {{ $json.itemCount }} products at {{ $now.toISO() }}"
}
}
Advanced n8n Scraping Patterns
Pattern 1: Pagination with Loop
Handle multi-page scraping with the Split in Batches node:
{
"workflow": {
"nodes": [
{
"name": "Generate Page Numbers",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": "const pages = [];\nfor(let i = 1; i <= 50; i++) {\n pages.push({page: i});\n}\nreturn pages;"
}
},
{
"name": "Loop Pages",
"type": "n8n-nodes-base.splitInBatches",
"parameters": {
"batchSize": 1,
"options": {}
}
},
{
"name": "Scrape Page",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://api.webscraping.ai/html",
"qs": {
"url": "=https://example.com/products?page={{ $json.page }}",
"js": "true"
}
}
}
]
}
}
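One detail the node list above does not show: Split in Batches only loops if the last node in the branch is wired back into it. In an exported workflow that loop lives in the connections section, roughly like this (output indexes can differ between n8n versions):
{
  "connections": {
    "Generate Page Numbers": {
      "main": [[{ "node": "Loop Pages", "type": "main", "index": 0 }]]
    },
    "Loop Pages": {
      "main": [[{ "node": "Scrape Page", "type": "main", "index": 0 }]]
    },
    "Scrape Page": {
      "main": [[{ "node": "Loop Pages", "type": "main", "index": 0 }]]
    }
  }
}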
Pattern 2: Dynamic URL Lists
Scrape multiple URLs from a CSV or database:
// Read URLs from Google Sheets
const urls = $input.all();
// Process each URL
return urls.map(item => ({
json: {
targetUrl: item.json.url,
category: item.json.category,
priority: item.json.priority
}
}));
Pattern 3: Conditional Logic
Use IF nodes to handle different page types:
{
"name": "Check Page Type",
"type": "n8n-nodes-base.if",
"parameters": {
"conditions": {
"string": [
{
"value1": "={{ $json.html }}",
"operation": "contains",
"value2": "product-detail"
}
]
}
}
}
Pattern 4: Error Recovery
Implement sophisticated error handling:
{
"name": "Scrape with Retry",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://api.webscraping.ai/html",
"options": {
"timeout": 30000
}
},
"continueOnFail": true,
"retryOnFail": true,
"maxTries": 3,
"waitBetweenTries": 5000
}
This retry logic is similar to handling timeouts in Puppeteer, but configured through n8n's visual interface rather than code.
Working with JavaScript-Heavy Websites
Modern websites often rely heavily on JavaScript for content rendering. n8n handles these through API integrations that execute JavaScript:
Configuration for SPA Scraping
{
"httpRequest": {
"url": "https://api.webscraping.ai/html",
"queryParameters": {
"url": "https://spa-website.com",
"js": "true",
"js_timeout": "10000",
"wait_for": ".content-loaded",
"wait_until": "networkidle"
}
}
}
Key parameters:
- js=true: Enable JavaScript execution
- js_timeout: Milliseconds to wait for JavaScript (default 2000, max 30000)
- wait_for: CSS selector to wait for before capturing HTML
- wait_until: Wait condition (load, domcontentloaded, networkidle)
Handling Dynamic Content
For pages that load content asynchronously:
{
"queryParameters": {
"url": "https://example.com/infinite-scroll",
"js": "true",
"js_timeout": "15000",
"js_script": "window.scrollTo(0, document.body.scrollHeight); await new Promise(r => setTimeout(r, 2000));"
}
}
Real-World n8n Scraping Workflow Examples
Example 1: E-commerce Price Monitoring
Complete workflow for tracking competitor prices:
{
"name": "Price Monitor",
"nodes": [
{
"name": "Every 6 Hours",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": {
"rule": {
"interval": [{"field": "hours", "hoursInterval": 6}]
}
},
"position": [250, 300]
},
{
"name": "Get Product URLs",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "executeQuery",
"query": "SELECT url, product_name FROM products WHERE active = true"
},
"position": [450, 300]
},
{
"name": "Scrape Prices",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "GET",
"url": "https://api.webscraping.ai/question",
"qs": {
"url": "={{ $json.url }}",
"question": "What is the current price?",
"js": "true"
}
},
"position": [650, 300]
},
{
"name": "Parse Price",
"type": "n8n-nodes-base.function",
"parameters": {
"functionCode": "const price = parseFloat($json.answer.replace(/[^0-9.]/g, ''));\nreturn [{\n json: {\n product_name: $json.product_name,\n price: price,\n url: $json.url,\n scraped_at: new Date()\n }\n}];"
},
"position": [850, 300]
},
{
"name": "Save to Database",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "insert",
"table": "price_history",
"columns": "product_name,price,url,scraped_at"
},
"position": [1050, 300]
},
{
"name": "Check for Drops",
"type": "n8n-nodes-base.if",
"parameters": {
"conditions": {
"number": [{
"value1": "={{ $json.price }}",
"operation": "smaller",
"value2": "={{ $json.previous_price * 0.9 }}"
}]
}
},
"position": [1250, 300]
},
{
"name": "Send Alert",
"type": "n8n-nodes-base.emailSend",
"parameters": {
"subject": "Price Drop Alert!",
"text": "{{ $json.product_name }} dropped to ${{ $json.price }}"
},
"position": [1450, 250]
}
]
}
Example 2: Job Listing Aggregator
Scrape multiple job boards and consolidate listings:
// In a Function node
const jobBoards = [
'https://jobs.example.com',
'https://careers.another.com',
'https://opportunities.site.com'
];
return jobBoards.map(url => ({
json: {
boardUrl: url,
category: 'software-engineering',
location: 'remote'
}
}));
Then connect to WebScraping.AI to fetch and parse each board.
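A minimal fetch step for this pattern is an HTTP Request node pointed at the scraping API, pulling boardUrl from each incoming item (parameters follow the earlier examples):
{
  "name": "Fetch Job Board",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "method": "GET",
    "url": "https://api.webscraping.ai/html",
    "qs": {
      "url": "={{ $json.boardUrl }}",
      "js": "true"
    }
  }
}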
Example 3: Real Estate Listings Monitor
Track new property listings:
{
"queryParameters": {
"url": "https://realestate-site.com/listings",
"js": "true",
"selector": ".property-card",
"return_multiple": "true"
}
}
Process each listing and compare with database to identify new properties.
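The comparison step can be a Code node that diffs scraped listing IDs against what is already stored. The sketch below assumes a preceding node named Get Known IDs that returns one item per stored listing with an id field:
// Collect IDs already stored (assumes a "Get Known IDs" node upstream)
const knownIds = new Set(
  $('Get Known IDs').all().map(item => item.json.id)
);

// Keep only listings that have not been seen before
return $input.all().filter(item => !knownIds.has(item.json.id));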
Best Practices for n8n Web Scraping
1. Modular Workflow Design
Break complex scraping into reusable sub-workflows:
- One workflow for data extraction
- Another for data transformation
- A separate workflow for storage
- Use the Execute Workflow node to connect them, as in the sketch below
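An Execute Workflow node that calls a stored sub-workflow might look like this (the workflow ID is a placeholder):
{
  "name": "Run Extraction Sub-Workflow",
  "type": "n8n-nodes-base.executeWorkflow",
  "parameters": {
    "source": "database",
    "workflowId": "123"
  }
}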
2. Credential Management
Store API keys securely:
- Use n8n's credential system
- Never hardcode keys in workflows
- Create separate credentials for dev/prod environments
- Rotate keys regularly
3. Error Handling Strategy
Implement comprehensive error handling:
- Set continueOnFail: true on scraping nodes
- Add Error Trigger nodes to catch failures (see the sketch after this list)
- Store failed URLs for retry
- Send notifications for critical failures
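A dedicated error workflow starts with an Error Trigger node, which n8n runs whenever a workflow configured to use it fails. A minimal sketch that forwards failures by email (the exact fields available on the error payload may vary by version):
{
  "nodes": [
    {
      "name": "Error Trigger",
      "type": "n8n-nodes-base.errorTrigger",
      "parameters": {}
    },
    {
      "name": "Notify",
      "type": "n8n-nodes-base.emailSend",
      "parameters": {
        "subject": "=Workflow failed: {{ $json.workflow.name }}",
        "text": "=Error: {{ $json.execution.error.message }}"
      }
    }
  ]
}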
4. Rate Limiting
Respect target websites:
- Add Wait nodes between requests (2-5 seconds), as shown below
- Use Split in Batches with appropriate batch sizes
- Implement exponential backoff for retries
- Consider using proxies for high-volume scraping
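A Wait node that pauses a few seconds between batches might be configured as follows (parameter names reflect current n8n versions):
{
  "name": "Wait Between Requests",
  "type": "n8n-nodes-base.wait",
  "parameters": {
    "resume": "timeInterval",
    "amount": 3,
    "unit": "seconds"
  }
}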
5. Data Quality
Ensure scraped data is accurate:
- Validate extracted data format
- Store raw HTML for later reprocessing
- Add checksums to detect page changes (see the snippet below)
- Log extraction failures for investigation
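One way to implement the checksum idea is a Code node that hashes each page's HTML; if the hash matches the previous run, the page has not changed and can be skipped downstream. A sketch using Node's built-in crypto module (available in Code nodes when built-in modules are permitted via NODE_FUNCTION_ALLOW_BUILTIN):
const crypto = require('crypto');

// Fingerprint each page so unchanged pages can be detected later
const items = $input.all();
items.forEach(item => {
  item.json.htmlChecksum = crypto
    .createHash('sha256')
    .update(item.json.html || '')
    .digest('hex');
});

return items;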
6. Performance Optimization
Scale your scraping workflows:
- Use webhook triggers for on-demand scraping
- Enable workflow concurrency for parallel execution
- Cache frequently accessed data
- Archive old data to keep databases lean
7. Monitoring and Logging
Track workflow performance:
- Enable execution logging
- Set up workflow execution webhooks
- Monitor execution times
- Track success/failure rates
Troubleshooting Common Issues
Issue: Workflow execution times out
- Increase timeout in HTTP Request node settings
- Use a longer js_timeout for slow-loading pages
- Break large workflows into smaller sub-workflows
- Process data in smaller batches
Issue: No data extracted from page
- Verify CSS selectors in browser DevTools
- Check if content is JavaScript-rendered (enable js=true)
- Ensure the page has fully loaded (use the wait_for parameter)
- Inspect raw HTML response for content availability
Issue: Getting blocked or rate limited
- Use WebScraping.AI's proxy rotation
- Add delays between requests
- Rotate User-Agent headers
- Consider residential proxies for stricter sites
Issue: Duplicate data in database
- Implement unique constraints on database tables
- Use upsert operations instead of insert
- Check for existing records before inserting
- Generate consistent unique identifiers
Issue: Workflow runs but produces no output
- Check node connections
- Verify previous nodes returned data
- Review execution logs for errors
- Test each node individually in manual mode
Deployment and Scaling
Self-Hosting Options
n8n can be deployed on various platforms:
Docker Compose:
version: '3.8'
services:
  n8n:
    image: n8nio/n8n
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=password
    volumes:
      - n8n_data:/home/node/.n8n
volumes:
  n8n_data:
Kubernetes: Deploy using Helm charts for production environments with high availability and auto-scaling.
Cloud Hosting
n8n Cloud offers managed hosting with:
- Automatic updates and maintenance
- Built-in monitoring and alerting
- Guaranteed uptime SLAs
- Collaborative workflow editing
Conclusion
n8n workflow automation provides a powerful, visual approach to web scraping that combines ease of use with professional-grade capabilities. By leveraging n8n's node-based architecture alongside specialized scraping APIs like WebScraping.AI, developers can build robust, scalable scraping systems without managing complex codebases.
The platform's strength lies in its flexibility: start with simple HTTP requests for static sites, integrate dedicated scraping APIs for JavaScript-heavy pages, and scale to sophisticated multi-stage pipelines with error handling, data transformation, and storage integration. Whether you're monitoring prices, aggregating content, or extracting structured data, n8n offers the tools to automate and maintain your scraping workflows efficiently.
Start with small, focused workflows and gradually expand as you become familiar with n8n's capabilities. The combination of visual workflow building, extensive integration options, and the ability to incorporate custom JavaScript code provides a balanced solution that works for both quick prototypes and production-scale scraping operations.