What is n8n workflow automation and how does it work for scraping?

n8n is an open-source workflow automation platform that enables developers to create sophisticated data pipelines and automated tasks through a visual, node-based interface. For web scraping, n8n provides a powerful alternative to traditional scripting by allowing you to build, manage, and scale scraping workflows without writing extensive code. By connecting different nodes representing various operations, you can create complex scraping systems that handle everything from data extraction to storage and notifications.

Understanding n8n's Architecture

n8n (pronounced "n-eight-n") is built on a node-based architecture where each node represents a specific action or integration. The platform operates on a flow-based programming model, making it intuitive for developers to visualize and construct data pipelines.

Core Components of n8n

Nodes: Individual units of work that perform specific tasks such as HTTP requests, data transformation, or database operations. Each node receives data from the previous node and passes processed data to the next.

Connections: Links between nodes that define the data flow and execution order. Connections can branch, merge, and loop, allowing for complex workflow logic.

Credentials: Secure storage for API keys, passwords, and authentication tokens used across workflows. Credentials are encrypted and can be reused across multiple nodes.

Executions: Individual runs of a workflow, tracked with logs, input/output data, and performance metrics for debugging and monitoring.
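
To see how these pieces fit together, here is a minimal, hand-written workflow JSON skeleton (not exported from a live instance; node names and positions are illustrative) showing two nodes and the connection between them:

{
  "name": "Minimal Scraper",
  "nodes": [
    {
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "parameters": {},
      "position": [250, 300]
    },
    {
      "name": "Fetch Page",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": { "url": "https://example.com" },
      "position": [450, 300]
    }
  ],
  "connections": {
    "Manual Trigger": {
      "main": [[{ "node": "Fetch Page", "type": "main", "index": 0 }]]
    }
  }
}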

Why n8n for Web Scraping?

Traditional web scraping requires writing scripts that handle HTTP requests, parse HTML, manage proxies, handle errors, and store data. n8n simplifies this by providing:

  1. Visual workflow creation - See your entire scraping pipeline at a glance
  2. Built-in error handling - Configure retries and failure notifications without custom code
  3. Schedule automation - Run scraping jobs on cron-like schedules
  4. API integrations - Connect to 350+ services without writing integration code
  5. Self-hosted control - Keep your data and workflows on your infrastructure
  6. Version control - Export workflows as JSON for Git versioning (CLI example below)
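
For example, on a self-hosted install you can export workflows from the CLI for Git tracking (flags per the n8n CLI; check n8n export:workflow --help on your version):

# Export every workflow into its own JSON file for version control
n8n export:workflow --all --separate --output=./workflows/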

How n8n Workflows Work for Web Scraping

A typical n8n scraping workflow consists of several stages:

1. Trigger Stage

Workflows start with a trigger node that initiates execution:

{
  "name": "Schedule Trigger",
  "type": "n8n-nodes-base.scheduleTrigger",
  "parameters": {
    "rule": {
      "interval": [{
        "field": "hours",
        "hoursInterval": 12
      }]
    }
  }
}

Common trigger types for scraping:

  • Schedule Trigger: Run workflows on fixed intervals (hourly, daily, weekly)
  • Webhook Trigger: Start scraping via HTTP POST requests (example below)
  • Manual Trigger: Execute workflows on-demand from the UI
  • Cron Trigger: Use cron expressions for complex scheduling
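
As an illustration, here is a minimal Webhook trigger that starts a scrape when it receives a POST request; the path is arbitrary:

{
  "name": "Webhook Trigger",
  "type": "n8n-nodes-base.webhook",
  "parameters": {
    "httpMethod": "POST",
    "path": "start-scrape"
  }
}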

2. Data Fetching Stage

After triggering, the workflow fetches web content. You have several options:

Option A: Simple HTTP Requests

For static websites without JavaScript:

{
  "name": "HTTP Request",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "method": "GET",
    "url": "https://example.com/products",
    "options": {
      "headers": {
        "User-Agent": "Mozilla/5.0 (compatible; n8n-bot)"
      }
    }
  }
}

Option B: Web Scraping API Integration

For JavaScript-rendered sites and anti-bot bypass, integrating a dedicated scraping API provides better reliability:

{
  "name": "WebScraping.AI",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "method": "GET",
    "url": "https://api.webscraping.ai/html",
    "authentication": "genericCredentialType",
    "genericAuthType": "httpHeaderAuth",
    "options": {
      "queryParameters": {
        "parameters": [
          {
            "name": "url",
            "value": "={{ $json.targetUrl }}"
          },
          {
            "name": "js",
            "value": "true"
          },
          {
            "name": "proxy",
            "value": "datacenter"
          },
          {
            "name": "timeout",
            "value": "10000"
          }
        ]
      }
    }
  }
}

This approach handles JavaScript rendering automatically, similar to how you would handle AJAX requests using Puppeteer, but without managing browser instances.

3. Data Extraction Stage

Once you have the HTML content, extract structured data using various methods:

CSS Selector Extraction

Use the HTML Extract node for simple data extraction:

{
  "name": "HTML Extract",
  "type": "n8n-nodes-base.html",
  "parameters": {
    "operation": "extractHtmlContent",
    "options": {
      "extractionValues": [
        {
          "key": "title",
          "cssSelector": "h1.product-title",
          "returnValue": "text"
        },
        {
          "key": "price",
          "cssSelector": ".price-value",
          "returnValue": "text"
        },
        {
          "key": "image",
          "cssSelector": "img.product-image",
          "returnValue": "attribute",
          "attribute": "src"
        }
      ]
    }
  }
}

JavaScript Code Extraction

For complex parsing logic, use the Code node with JavaScript:

// Access the HTML from previous node
const html = $input.item.json.html;

// Parse with regex or string methods
const priceMatch = html.match(/Price: \$(\d+\.?\d*)/);
const price = priceMatch ? parseFloat(priceMatch[1]) : null;

// Extract multiple items
const products = [];
const productRegex = /<div class="product">(.*?)<\/div>/gs;
let match;

while ((match = productRegex.exec(html)) !== null) {
  const productHtml = match[1];
  const titleMatch = productHtml.match(/<h2>(.*?)<\/h2>/);

  products.push({
    title: titleMatch ? titleMatch[1] : '',
    timestamp: new Date().toISOString()
  });
}

return products.map(product => ({ json: product }));

AI-Powered Extraction

WebScraping.AI offers a question-based extraction endpoint that uses AI to extract data:

{
  "method": "GET",
  "url": "https://api.webscraping.ai/question",
  "queryParameters": {
    "url": "={{ $json.pageUrl }}",
    "question": "What is the product name, current price, original price, and discount percentage?",
    "js": "true"
  }
}

This method eliminates the need to write CSS selectors or parsing code, making it ideal for complex or frequently-changing page structures.

4. Data Transformation Stage

Clean, normalize, and enrich the extracted data:

// Clean and transform product data
const items = $input.all();

items.forEach(item => {
  // Clean price: "$1,234.56" -> 1234.56
  item.json.price = parseFloat(
    item.json.price.replace(/[$,]/g, '')
  );

  // Standardize availability
  const stockText = item.json.stock.toLowerCase();
  item.json.inStock = stockText.includes('in stock') ||
                      stockText.includes('available');

  // Add metadata
  item.json.scrapedAt = new Date().toISOString();
  item.json.source = 'example.com';

  // Generate unique ID
  item.json.productId = `${item.json.sku}_${Date.now()}`;
});

return items;

5. Storage Stage

Store scraped data in your preferred destination:

Database Storage (PostgreSQL):

{
  "name": "Postgres",
  "type": "n8n-nodes-base.postgres",
  "parameters": {
    "operation": "insert",
    "table": "products",
    "columns": "title,price,url,scraped_at",
    "options": {
      "skipOnConflict": true
    }
  }
}

Spreadsheet Storage (Google Sheets):

{
  "name": "Google Sheets",
  "type": "n8n-nodes-base.googleSheets",
  "parameters": {
    "operation": "append",
    "documentId": "{{ $credentials.sheetId }}",
    "sheetName": "Products",
    "columns": "title,price,url,scraped_at"
  }
}

Cloud Storage (AWS S3):

{
  "name": "AWS S3",
  "type": "n8n-nodes-base.awsS3",
  "parameters": {
    "operation": "upload",
    "bucket": "scraping-data",
    "fileName": "={{ $json.productId }}.json",
    "fileContent": "={{ JSON.stringify($json) }}"
  }
}

6. Notification Stage

Get alerts when scraping completes or encounters errors:

{
  "name": "Send Email",
  "type": "n8n-nodes-base.emailSend",
  "parameters": {
    "fromEmail": "scraper@example.com",
    "toEmail": "alerts@example.com",
    "subject": "Scraping Complete: {{ $json.itemCount }} items",
    "text": "Scraped {{ $json.itemCount }} products at {{ $now.toISO() }}"
  }
}

Advanced n8n Scraping Patterns

Pattern 1: Pagination with Loop

Handle multi-page scraping with the Split in Batches node:

{
  "workflow": {
    "nodes": [
      {
        "name": "Generate Page Numbers",
        "type": "n8n-nodes-base.function",
        "parameters": {
          "functionCode": "const pages = [];\nfor(let i = 1; i <= 50; i++) {\n  pages.push({page: i});\n}\nreturn pages;"
        }
      },
      {
        "name": "Loop Pages",
        "type": "n8n-nodes-base.splitInBatches",
        "parameters": {
          "batchSize": 1,
          "options": {}
        }
      },
      {
        "name": "Scrape Page",
        "type": "n8n-nodes-base.httpRequest",
        "parameters": {
          "url": "https://api.webscraping.ai/html",
          "qs": {
            "url": "=https://example.com/products?page={{ $json.page }}",
            "js": "true"
          }
        }
      }
    ]
  }
}

Pattern 2: Dynamic URL Lists

Scrape multiple URLs from a CSV or database:

// Read URLs from Google Sheets
const urls = $input.all();

// Process each URL
return urls.map(item => ({
  json: {
    targetUrl: item.json.url,
    category: item.json.category,
    priority: item.json.priority
  }
}));

Pattern 3: Conditional Logic

Use IF nodes to handle different page types:

{
  "name": "Check Page Type",
  "type": "n8n-nodes-base.if",
  "parameters": {
    "conditions": {
      "string": [
        {
          "value1": "={{ $json.html }}",
          "operation": "contains",
          "value2": "product-detail"
        }
      ]
    }
  }
}

Pattern 4: Error Recovery

Implement sophisticated error handling:

{
  "name": "Scrape with Retry",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.webscraping.ai/html",
    "options": {
      "timeout": 30000
    }
  },
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}

This retry logic is similar to handling timeouts in Puppeteer, but configured through n8n's visual interface rather than code.

Working with JavaScript-Heavy Websites

Modern websites often rely heavily on JavaScript for content rendering. n8n handles these through API integrations that execute JavaScript:

Configuration for SPA Scraping

{
  "httpRequest": {
    "url": "https://api.webscraping.ai/html",
    "queryParameters": {
      "url": "https://spa-website.com",
      "js": "true",
      "js_timeout": "10000",
      "wait_for": ".content-loaded",
      "wait_until": "networkidle"
    }
  }
}

Key parameters:

  • js=true: Enable JavaScript execution
  • js_timeout: Milliseconds to wait for JavaScript (default 2000, max 30000)
  • wait_for: CSS selector to wait for before capturing HTML
  • wait_until: Wait condition (load, domcontentloaded, networkidle)

Handling Dynamic Content

For pages that load content asynchronously:

{
  "queryParameters": {
    "url": "https://example.com/infinite-scroll",
    "js": "true",
    "js_timeout": "15000",
    "js_script": "window.scrollTo(0, document.body.scrollHeight); await new Promise(r => setTimeout(r, 2000));"
  }
}

Real-World n8n Scraping Workflow Examples

Example 1: E-commerce Price Monitoring

Complete workflow for tracking competitor prices:

{
  "name": "Price Monitor",
  "nodes": [
    {
      "name": "Every 6 Hours",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{"field": "hours", "hoursInterval": 6}]
        }
      },
      "position": [250, 300]
    },
    {
      "name": "Get Product URLs",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "executeQuery",
        "query": "SELECT url, product_name FROM products WHERE active = true"
      },
      "position": [450, 300]
    },
    {
      "name": "Scrape Prices",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "method": "GET",
        "url": "https://api.webscraping.ai/question",
        "qs": {
          "url": "={{ $json.url }}",
          "question": "What is the current price?",
          "js": "true"
        }
      },
      "position": [650, 300]
    },
    {
      "name": "Parse Price",
      "type": "n8n-nodes-base.function",
      "parameters": {
        "functionCode": "const price = parseFloat($json.answer.replace(/[^0-9.]/g, ''));\nreturn [{\n  json: {\n    product_name: $json.product_name,\n    price: price,\n    url: $json.url,\n    scraped_at: new Date()\n  }\n}];"
      },
      "position": [850, 300]
    },
    {
      "name": "Save to Database",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "insert",
        "table": "price_history",
        "columns": "product_name,price,url,scraped_at"
      },
      "position": [1050, 300]
    },
    {
      "name": "Check for Drops",
      "type": "n8n-nodes-base.if",
      "parameters": {
        "conditions": {
          "number": [{
            "value1": "={{ $json.price }}",
            "operation": "smaller",
            "value2": "={{ $json.previous_price * 0.9 }}"
          }]
        }
      },
      "position": [1250, 300]
    },
    {
      "name": "Send Alert",
      "type": "n8n-nodes-base.emailSend",
      "parameters": {
        "subject": "Price Drop Alert!",
        "text": "{{ $json.product_name }} dropped to ${{ $json.price }}"
      },
      "position": [1450, 250]
    }
  ]
}

Example 2: Job Listing Aggregator

Scrape multiple job boards and consolidate listings:

// In a Function node
const jobBoards = [
  'https://jobs.example.com',
  'https://careers.another.com',
  'https://opportunities.site.com'
];

return jobBoards.map(url => ({
  json: {
    boardUrl: url,
    category: 'software-engineering',
    location: 'remote'
  }
}));

Then connect to WebScraping.AI to fetch and parse each board.
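
A sketch of that fetch step, following the same abbreviated pattern as the earlier examples (boardUrl comes from the Function node above):

{
  "name": "Fetch Job Board",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "method": "GET",
    "url": "https://api.webscraping.ai/html",
    "qs": {
      "url": "={{ $json.boardUrl }}",
      "js": "true"
    }
  }
}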

Example 3: Real Estate Listings Monitor

Track new property listings:

{
  "queryParameters": {
    "url": "https://realestate-site.com/listings",
    "js": "true",
    "selector": ".property-card",
    "return_multiple": "true"
  }
}

Process each listing and compare it against your database to identify new properties.
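
One way to do that comparison in a Code node, assuming a previous Postgres node named "Known Listings" returned the IDs already stored (the node name and listing_id field are illustrative):

// Collect the IDs that are already in the database
const knownIds = new Set(
  $('Known Listings').all().map(item => item.json.listing_id)
);

// Keep only listings that have not been seen before
return $input.all().filter(
  item => !knownIds.has(item.json.listing_id)
);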

Best Practices for n8n Web Scraping

1. Modular Workflow Design

Break complex scraping into reusable sub-workflows:

  • One workflow for data extraction
  • Another for data transformation
  • A separate workflow for storage
  • Use the Execute Workflow node to connect them (sketched below)
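
For instance, a parent workflow can hand items to a storage sub-workflow through the Execute Workflow node (the workflow ID is a placeholder):

{
  "name": "Run Storage Workflow",
  "type": "n8n-nodes-base.executeWorkflow",
  "parameters": {
    "workflowId": "YOUR_STORAGE_WORKFLOW_ID"
  }
}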

2. Credential Management

Store API keys securely:

  • Use n8n's credential system
  • Never hardcode keys in workflows
  • Create separate credentials for dev/prod environments
  • Rotate keys regularly

3. Error Handling Strategy

Implement comprehensive error handling:

  • Set continueOnFail: true on scraping nodes
  • Add Error Trigger nodes to catch failures (see the sketch below)
  • Store failed URLs for retry
  • Send notifications for critical failures
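
An Error Trigger node lives in a separate error workflow and fires whenever a workflow that has it assigned as its error workflow fails; a minimal sketch:

{
  "name": "Error Trigger",
  "type": "n8n-nodes-base.errorTrigger",
  "parameters": {}
}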

4. Rate Limiting

Respect target websites:

  • Add Wait nodes between requests (2-5 seconds; example below)
  • Use Split in Batches with appropriate batch sizes
  • Implement exponential backoff for retries
  • Consider using proxies for high-volume scraping
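
For example, a Wait node that pauses a few seconds between batches (parameter names follow n8n's Wait node; the values are a starting point, not a rule):

{
  "name": "Wait Between Requests",
  "type": "n8n-nodes-base.wait",
  "parameters": {
    "amount": 3,
    "unit": "seconds"
  }
}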

5. Data Quality

Ensure scraped data is accurate:

  • Validate extracted data format
  • Store raw HTML for later reprocessing
  • Add checksums to detect page changes (see the sketch below)
  • Log extraction failures for investigation
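
A sketch of the checksum idea in a Code node; it uses Node's built-in crypto module, which may need to be allow-listed with NODE_FUNCTION_ALLOW_BUILTIN on some installs:

// Hash the raw HTML so layout changes are easy to detect later
const crypto = require('crypto');

const items = $input.all();

items.forEach(item => {
  item.json.htmlChecksum = crypto
    .createHash('sha256')
    .update(item.json.html || '')
    .digest('hex');
});

return items;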

6. Performance Optimization

Scale your scraping workflows:

  • Use webhook triggers for on-demand scraping
  • Enable workflow concurrency for parallel execution
  • Cache frequently accessed data
  • Archive old data to keep databases lean

7. Monitoring and Logging

Track workflow performance:

  • Enable execution logging
  • Set up workflow execution webhooks
  • Monitor execution times
  • Track success/failure rates

Troubleshooting Common Issues

Issue: Workflow execution times out

  • Increase the timeout in the HTTP Request node settings
  • Use a longer js_timeout for slow-loading pages
  • Break large workflows into smaller sub-workflows
  • Process data in smaller batches

Issue: No data extracted from page

  • Verify CSS selectors in browser DevTools
  • Check if content is JavaScript-rendered (enable js=true)
  • Ensure the page has fully loaded (use the wait_for parameter)
  • Inspect the raw HTML response for content availability

Issue: Getting blocked or rate limited

  • Use WebScraping.AI's proxy rotation
  • Add delays between requests
  • Rotate User-Agent headers
  • Consider residential proxies for stricter sites

Issue: Duplicate data in database

  • Implement unique constraints on database tables
  • Use upsert operations instead of plain inserts (example below)
  • Check for existing records before inserting
  • Generate consistent unique identifiers
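
For Postgres, an upsert run through an executeQuery operation might look like this, assuming a unique constraint on url (table and columns are illustrative):

INSERT INTO products (url, title, price, scraped_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (url)
DO UPDATE SET price = EXCLUDED.price, scraped_at = EXCLUDED.scraped_at;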

Issue: Workflow runs but produces no output

  • Check node connections
  • Verify previous nodes returned data
  • Review execution logs for errors
  • Test each node individually in manual mode

Deployment and Scaling

Self-Hosting Options

n8n can be deployed on various platforms:

Docker Compose:

version: '3.8'
services:
  n8n:
    image: n8nio/n8n
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=password # change this before deploying
    volumes:
      - n8n_data:/home/node/.n8n # persists workflows and credentials
volumes:
  n8n_data:

Kubernetes: Deploy using Helm charts for production environments with high availability and auto-scaling.

Cloud Hosting

n8n Cloud offers managed hosting with:

  • Automatic updates and maintenance
  • Built-in monitoring and alerting
  • Guaranteed uptime SLAs
  • Collaborative workflow editing

Conclusion

n8n workflow automation provides a powerful, visual approach to web scraping that combines ease of use with professional-grade capabilities. By leveraging n8n's node-based architecture alongside specialized scraping APIs like WebScraping.AI, developers can build robust, scalable scraping systems without managing complex codebases.

The platform's strength lies in its flexibility: start with simple HTTP requests for static sites, integrate dedicated scraping APIs for JavaScript-heavy pages, and scale to sophisticated multi-stage pipelines with error handling, data transformation, and storage integration. Whether you're monitoring prices, aggregating content, or extracting structured data, n8n offers the tools to automate and maintain your scraping workflows efficiently.

Start with small, focused workflows and gradually expand as you become familiar with n8n's capabilities. The combination of visual workflow building, extensive integration options, and the ability to incorporate custom JavaScript code provides a balanced solution that works for both quick prototypes and production-scale scraping operations.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
