# How do I create a workflow automation for web scraping with n8n?
n8n is a powerful workflow automation tool that allows developers to create sophisticated web scraping pipelines without writing extensive code. By combining HTTP requests, data transformation nodes, and integration with web scraping APIs, you can build reliable, scalable scraping workflows that run on schedules or triggers.
## Understanding n8n Web Scraping Workflows
n8n uses a node-based architecture where each node performs a specific task. For web scraping, you'll typically combine several node types:
- Trigger nodes - Start workflows on schedules or events
- HTTP Request nodes - Fetch web pages or call scraping APIs
- Data transformation nodes - Parse and clean extracted data
- Storage nodes - Save results to databases, spreadsheets, or files
- Notification nodes - Alert you when scraping completes or fails
n8n's key advantage is its visual interface for building complex workflows, combined with the flexibility to handle authentication, error handling, and data processing.
## Method 1: Direct HTTP Requests with HTML Parsing
For simple static websites, you can use n8n's built-in HTTP Request node combined with HTML parsing.
### Setting Up a Basic Scraping Workflow

1. **Add a Schedule Trigger**
   - Drag a "Schedule Trigger" node onto your workflow
   - Configure the run frequency (hourly, daily, weekly)

2. **Add an HTTP Request node**
   - Set the method to `GET`
   - Enter the target URL
   - Configure headers if needed:

   ```
   User-Agent: Mozilla/5.0 (compatible; n8n-bot)
   Accept: text/html
   ```

3. **Extract data with an HTML Extract node**
   - Add an "HTML Extract" node
   - Use CSS selectors to extract data, for example selector `.product-title` with attribute `text`
### Example: Extracting Product Information

Here's how to configure the HTML Extract node for e-commerce data:

```json
{
  "selector": {
    "title": ".product-name h1",
    "price": ".price-value",
    "description": ".product-description p",
    "availability": ".stock-status"
  },
  "returnArray": true
}
```
**Limitations:** This approach only works for static HTML and doesn't handle JavaScript-rendered content, which is increasingly common on modern websites.
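To make the selector-based approach concrete, here is a minimal standalone sketch in plain Node.js of what text extraction by class name amounts to. The markup and class names are invented for the example; real pages should go through the HTML Extract node or a proper HTML parser, not regular expressions.

```javascript
// Sample static HTML, simplified for illustration
const html = `
  <div class="product">
    <h1 class="product-name">Espresso Machine</h1>
    <span class="price-value">$199.00</span>
  </div>`;

// Capture the text content of the first element carrying the given class
function extractByClass(html, className) {
  const re = new RegExp(`class="${className}"[^>]*>([^<]*)<`);
  const match = html.match(re);
  return match ? match[1].trim() : null;
}

const product = {
  title: extractByClass(html, 'product-name'),
  price: extractByClass(html, 'price-value'),
};
console.log(product); // { title: 'Espresso Machine', price: '$199.00' }
```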
## Method 2: Using the WebScraping.AI API in n8n
For JavaScript-heavy sites and more reliable scraping, integrating a dedicated web scraping API like WebScraping.AI provides better results with built-in proxy rotation, JavaScript rendering, and anti-bot bypass.
### Setting Up the WebScraping.AI Integration
1. **Create an HTTP Request node**
   - Method: `GET`
   - URL: `https://api.webscraping.ai/html`
   - Authentication: API key (sent in a header)

2. **Configure the request parameters**

   ```json
   {
     "url": "={{ $json.targetUrl }}",
     "js": true,
     "proxy": "datacenter",
     "headers": {
       "Accept": "application/json"
     }
   }
   ```

3. **Add API key authentication**
   - Authentication Type: Generic Credential Type
   - Header Name: `api_key`
   - Header Value: your WebScraping.AI API key
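For reference, the parameters above end up as query-string arguments on the API call. A quick sketch of how the final request URL is assembled (the target URL is a placeholder; the API key travels in the credential header, so it is not part of the query string here):

```javascript
// Build the WebScraping.AI request URL from the configured parameters
const params = new URLSearchParams({
  url: 'https://example.com/products',
  js: 'true',
  proxy: 'datacenter',
});
const requestUrl = `https://api.webscraping.ai/html?${params.toString()}`;

console.log(requestUrl);
// https://api.webscraping.ai/html?url=https%3A%2F%2Fexample.com%2Fproducts&js=true&proxy=datacenter
```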
## Complete n8n Workflow Example

Here's a simplified workflow configuration in JSON format that you can import into n8n (node connections and credentials are omitted for brevity):
```json
{
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "hours",
              "hoursInterval": 6
            }
          ]
        }
      },
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [250, 300]
    },
    {
      "parameters": {
        "url": "https://api.webscraping.ai/html",
        "authentication": "genericCredentialType",
        "genericAuthType": "httpHeaderAuth",
        "sendQuery": true,
        "queryParameters": {
          "parameters": [
            { "name": "url", "value": "https://example.com/products" },
            { "name": "js", "value": "true" },
            { "name": "proxy", "value": "datacenter" }
          ]
        }
      },
      "name": "WebScraping.AI",
      "type": "n8n-nodes-base.httpRequest",
      "position": [450, 300]
    },
    {
      "parameters": {
        "jsCode": "// The Code node runs in a Node.js sandbox without DOMParser,\n// so this example uses cheerio (allow it via NODE_FUNCTION_ALLOW_EXTERNAL=cheerio)\nconst cheerio = require('cheerio');\nconst $ = cheerio.load($input.item.json.html);\nconst products = [];\n\n$('.product-item').each((_, el) => {\n  products.push({\n    title: $(el).find('.title').text().trim(),\n    price: $(el).find('.price').text().trim(),\n    url: $(el).find('a').attr('href')\n  });\n});\n\nreturn products.map(product => ({ json: product }));"
      },
      "name": "Parse HTML",
      "type": "n8n-nodes-base.code",
      "position": [650, 300]
    },
    {
      "parameters": {
        "operation": "append",
        "documentId": "your-google-sheet-id",
        "sheetName": "Products",
        "columns": "title,price,url"
      },
      "name": "Google Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "position": [850, 300]
    }
  ]
}
```
## Method 3: AI-Powered Data Extraction
WebScraping.AI's question-answering endpoint allows you to extract specific information using natural language, which is particularly useful when dealing with complex page structures.
### Using the Question API in n8n

```json
{
  "method": "GET",
  "url": "https://api.webscraping.ai/question",
  "queryParameters": {
    "url": "={{ $json.productUrl }}",
    "question": "What is the product name, price, and shipping time?",
    "js": "true"
  }
}
```
The API returns structured answers that you can use directly in your workflow without complex HTML parsing.
## Handling JavaScript-Rendered Content
Modern websites often load content dynamically with JavaScript. When scraping these sites, you need to ensure JavaScript execution, just as you would when handling AJAX requests using Puppeteer.
### n8n Configuration for JavaScript Sites

```json
{
  "httpRequest": {
    "url": "https://api.webscraping.ai/html",
    "qs": {
      "url": "https://spa-example.com",
      "js": "true",
      "js_timeout": "5000",
      "wait_for": ".content-loaded"
    }
  }
}
```
Key parameters:
- `js=true`: Enables JavaScript rendering
- `js_timeout`: Milliseconds to wait for JavaScript execution
- `wait_for`: CSS selector to wait for before returning content
## Advanced Workflow Patterns
### Pattern 1: Pagination Handling
```json
{
  "nodes": [
    {
      "name": "Loop Over Pages",
      "type": "n8n-nodes-base.splitInBatches",
      "parameters": {
        "batchSize": 1,
        "options": {
          "reset": false
        }
      }
    },
    {
      "name": "Scrape Page",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://api.webscraping.ai/html",
        "qs": {
          "url": "=https://example.com/products?page={{ $json.page }}",
          "js": "true"
        }
      }
    }
  ]
}
```
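The loop needs a list of page items to iterate over. One way to produce them is a Code node placed before the Split In Batches node, sketched below. The page count is an assumption for the sketch; in practice you would derive it from the site's pagination markup, and in n8n the node body would end with `return items;`.

```javascript
// Emit one item per page number for Split In Batches to consume
const totalPages = 5; // assumed; read it from the first scraped page in practice
const items = [];
for (let page = 1; page <= totalPages; page++) {
  items.push({ json: { page } });
}

console.log(items.length); // 5
```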
### Pattern 2: Error Handling and Retries
Configure error handling in the HTTP Request node:
```json
{
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}
```
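The fixed 5000 ms wait can be extended into exponential backoff by doubling the delay on each attempt, which handles rate limits more gracefully. A small sketch of the resulting delay schedule, using the values configured above:

```javascript
// Compute exponential backoff delays from the configured base wait
const baseWaitMs = 5000;
const maxTries = 3;
const delays = Array.from(
  { length: maxTries },
  (_, attempt) => baseWaitMs * 2 ** attempt
);

console.log(delays); // [ 5000, 10000, 20000 ]
```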
### Pattern 3: Rate Limiting
Add a delay between requests to avoid overwhelming servers:
```json
{
  "name": "Wait",
  "type": "n8n-nodes-base.wait",
  "parameters": {
    "amount": 2,
    "unit": "seconds"
  }
}
```
## Data Processing and Storage
### Transforming Scraped Data
Use the Code node (JavaScript) to clean and transform data:

```javascript
// Clean price data: strip currency symbols, keep digits and the decimal point
items.forEach(item => {
  item.json.price = parseFloat(
    item.json.price.replace(/[^0-9.]/g, '')
  );
  // Record when the data was collected
  item.json.scrapedAt = new Date().toISOString();
});

return items;
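A quick standalone check of that cleaning logic on a sample item (the price string is made up):

```javascript
// Sample item shaped like n8n's item structure
const items = [{ json: { price: '$1,299.99' } }];

items.forEach(item => {
  // Strip currency symbols and thousands separators, then parse
  item.json.price = parseFloat(item.json.price.replace(/[^0-9.]/g, ''));
  item.json.scrapedAt = new Date().toISOString();
});

console.log(items[0].json.price); // 1299.99
```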
### Storage Options
- Google Sheets - Simple storage and sharing
- PostgreSQL/MySQL - Structured data storage
- MongoDB - Flexible document storage
- Airtable - Collaborative database
- Webhooks - Send data to external systems
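Whichever destination you choose, scraped objects usually need flattening into rows or records first. A sketch for the Google Sheets case, matching the `title,price,url` columns from the workflow example above (the sample data is invented):

```javascript
// Flatten scraped objects into row arrays for a title,price,url sheet
const scraped = [
  { title: 'Widget', price: 9.99, url: 'https://example.com/widget' },
  { title: 'Gadget', price: 19.5, url: 'https://example.com/gadget' },
];
const rows = scraped.map(({ title, price, url }) => [title, price.toFixed(2), url]);

console.log(rows[0]); // [ 'Widget', '9.99', 'https://example.com/widget' ]
```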
## Monitoring and Notifications
Add monitoring to your scraping workflows:
```json
{
  "name": "Send Alert",
  "type": "n8n-nodes-base.emailSend",
  "parameters": {
    "subject": "Scraping Completed",
    "text": "Scraped {{ $json.itemCount }} items successfully"
  }
}
```
### Error Notifications
Use the Error Trigger node to catch and report failures:
```json
{
  "name": "Error Trigger",
  "type": "n8n-nodes-base.errorTrigger",
  "parameters": {}
}
```
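Downstream of the Error Trigger, a Code node can format an alert from the error payload. A sketch, assuming `workflow.name` and `execution.error.message` fields in the error data (verify the exact payload shape in your n8n version):

```javascript
// Example payload shaped like the data an Error Trigger workflow receives
const errorData = {
  workflow: { name: 'Product Scraper' },
  execution: { error: { message: 'Request timed out' } },
};

// Compose a human-readable alert for the notification node
const alertText = `Workflow "${errorData.workflow.name}" failed: ${errorData.execution.error.message}`;

console.log(alertText); // Workflow "Product Scraper" failed: Request timed out
```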
## Best Practices for n8n Web Scraping
- Use API-based scraping services - More reliable than raw HTTP requests
- Implement exponential backoff - Handle rate limits gracefully
- Store raw HTML - Keep original data for re-processing
- Add timestamps - Track when data was collected
- Monitor for changes - Set up alerts for workflow failures
- Use webhooks for real-time scraping - Trigger workflows on demand
- Test with small batches - Verify workflows before scaling
## Example: Complete E-commerce Monitoring Workflow
Here's a real-world example that monitors product prices:
1. A Schedule Trigger runs every 6 hours
2. Fetch product pages using the WebScraping.AI API
3. Extract prices and availability with CSS selectors
4. Compare with previous prices from the database
5. Send email alerts if prices drop below a threshold
6. Store results in Google Sheets
7. Archive HTML snapshots in cloud storage
This workflow combines multiple nodes to create a production-ready scraping system that runs autonomously.
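The price-comparison step can be sketched as a Code node that flags items falling below a threshold relative to the stored price. The SKUs, prices, and 90% threshold below are illustrative, not part of any real dataset:

```javascript
// Previous prices loaded from the database, keyed by SKU (sample data)
const previous = { 'sku-1': 100, 'sku-2': 50 };

// Freshly scraped prices
const current = [
  { sku: 'sku-1', price: 85 },
  { sku: 'sku-2', price: 49 },
];

// Alert when the new price drops below 90% of the stored price
const threshold = 0.9;
const alerts = current.filter(p => p.price < previous[p.sku] * threshold);

console.log(alerts.map(p => p.sku)); // [ 'sku-1' ]
```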
## Troubleshooting Common Issues
**Problem:** Workflow times out
**Solution:** Increase the timeout in the HTTP Request settings or use the `js_timeout` parameter

**Problem:** No data extracted
**Solution:** Verify CSS selectors using browser DevTools, similar to interacting with DOM elements in Puppeteer

**Problem:** Getting blocked by anti-bot measures
**Solution:** Use WebScraping.AI's proxy rotation and anti-bot bypass features

**Problem:** JavaScript not rendering
**Solution:** Enable the `js=true` parameter and increase `js_timeout`
## Conclusion
n8n provides a powerful platform for building automated web scraping workflows without extensive coding. By combining n8n's visual workflow builder with robust scraping APIs like WebScraping.AI, you can create reliable, maintainable scraping systems that handle JavaScript rendering, bypass anti-bot protections, and scale to large data extraction jobs.
Start with simple workflows and gradually add complexity as you learn the platform. The combination of n8n's automation capabilities and specialized scraping APIs offers the best of both worlds: ease of use and professional-grade scraping power.