How do I create a workflow automation for web scraping with n8n?

n8n is a powerful workflow automation tool that allows developers to create sophisticated web scraping pipelines without writing extensive code. By combining HTTP requests, data transformation nodes, and integration with web scraping APIs, you can build reliable, scalable scraping workflows that run on schedules or triggers.

Understanding n8n Web Scraping Workflows

n8n uses a node-based architecture where each node performs a specific task. For web scraping, you'll typically combine several node types:

  • Trigger nodes - Start workflows on schedules or events
  • HTTP Request nodes - Fetch web pages or call scraping APIs
  • Data transformation nodes - Parse and clean extracted data
  • Storage nodes - Save results to databases, spreadsheets, or files
  • Notification nodes - Alert you when scraping completes or fails

The key advantage of n8n is its visual interface for building complex workflows, combined with the flexibility to handle authentication, error handling, and data processing in code when needed.

Method 1: Direct HTTP Requests with HTML Parsing

For simple static websites, you can use n8n's built-in HTTP Request node combined with HTML parsing.

Setting Up a Basic Scraping Workflow

  1. Add a Schedule Trigger

    • Drag a "Schedule Trigger" node to your workflow
    • Configure run frequency (hourly, daily, weekly)
  2. Add HTTP Request Node

    • Add an "HTTP Request" node
    • Set method to GET
    • Enter target URL
    • Configure headers if needed:
      User-Agent: Mozilla/5.0 (compatible; n8n-bot)
      Accept: text/html
  3. Extract Data with HTML Node

    • Add an "HTML Extract" node
    • Use CSS selectors to extract data:
      Selector: .product-title
      Attribute: text

Example: Extracting Product Information

Here's how to configure the HTML Extract node for e-commerce data:

{
  "selector": {
    "title": ".product-name h1",
    "price": ".price-value",
    "description": ".product-description p",
    "availability": ".stock-status"
  },
  "returnArray": true
}

Limitations: This approach only works for static HTML and doesn't handle JavaScript-rendered content, which is increasingly common on modern websites.

Method 2: Using WebScraping.AI API in n8n

For JavaScript-heavy sites and more reliable scraping, integrating a dedicated web scraping API like WebScraping.AI provides better results with built-in proxy rotation, JavaScript rendering, and anti-bot bypass.

Setting Up WebScraping.AI Integration

  1. Create HTTP Request Node

    • Method: GET
    • URL: https://api.webscraping.ai/html
    • Authentication: API Key (in headers)
  2. Configure Request Parameters

{
  "url": "={{ $json.targetUrl }}",
  "js": true,
  "proxy": "datacenter",
  "headers": {
    "Accept": "application/json"
  }
}
  3. Add API Key Authentication
    • Authentication Type: Generic Credential Type
    • Header Name: api_key
    • Header Value: Your WebScraping.AI API key
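To see what the HTTP Request node sends under the hood, here is a minimal sketch of the request URL built with Node's standard URLSearchParams. In this sketch the api_key is passed as a query parameter, as in the curl examples at the end of this article; the header-based credential configured above achieves the same thing. The target URL and key are placeholders.

```javascript
// Build the WebScraping.AI /html request URL the workflow will call.
const params = new URLSearchParams({
  url: 'https://example.com/products', // placeholder target page
  js: 'true',                          // render JavaScript before returning HTML
  proxy: 'datacenter',                 // proxy pool to route the request through
  api_key: 'YOUR_API_KEY'              // placeholder API key
});

const requestUrl = `https://api.webscraping.ai/html?${params.toString()}`;
console.log(requestUrl);
```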

Complete n8n Workflow Example

Here's a full workflow configuration in JSON format that you can import into n8n:

{
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "hours",
              "hoursInterval": 6
            }
          ]
        }
      },
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [250, 300]
    },
    {
      "parameters": {
        "url": "https://api.webscraping.ai/html",
        "authentication": "genericCredentialType",
        "genericAuthType": "httpHeaderAuth",
        "options": {
          "queryParameters": {
            "parameters": [
              {
                "name": "url",
                "value": "https://example.com/products"
              },
              {
                "name": "js",
                "value": "true"
              },
              {
                "name": "proxy",
                "value": "datacenter"
              }
            ]
          }
        }
      },
      "name": "WebScraping.AI",
      "type": "n8n-nodes-base.httpRequest",
      "position": [450, 300]
    },
    {
      "parameters": {
        "jsCode": "const html = $input.item.json.html;\nconst products = [];\n\n// Parse HTML and extract data\nconst parser = new DOMParser();\nconst doc = parser.parseFromString(html, 'text/html');\n\ndoc.querySelectorAll('.product-item').forEach(item => {\n  products.push({\n    title: item.querySelector('.title')?.textContent.trim(),\n    price: item.querySelector('.price')?.textContent.trim(),\n    url: item.querySelector('a')?.href\n  });\n});\n\nreturn products.map(product => ({ json: product }));"
      },
      "name": "Parse HTML",
      "type": "n8n-nodes-base.code",
      "position": [650, 300]
    },
    {
      "parameters": {
        "operation": "append",
        "documentId": "your-google-sheet-id",
        "sheetName": "Products",
        "columns": "title,price,url"
      },
      "name": "Google Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "position": [850, 300]
    }
  ]
}

Method 3: AI-Powered Data Extraction

WebScraping.AI's question-answering endpoint allows you to extract specific information using natural language, which is particularly useful when dealing with complex page structures.

Using the Question API in n8n

{
  "method": "GET",
  "url": "https://api.webscraping.ai/question",
  "queryParameters": {
    "url": "={{ $json.productUrl }}",
    "question": "What is the product name, price, and shipping time?",
    "js": "true"
  }
}

The API returns plain-language answers that you can use directly in your workflow, with no HTML parsing logic to write or maintain.

Handling JavaScript-Rendered Content

Modern websites often load content dynamically with JavaScript. When scraping these sites, you need to ensure JavaScript execution, just as you would when handling AJAX requests using Puppeteer.

n8n Configuration for JavaScript Sites

{
  "httpRequest": {
    "url": "https://api.webscraping.ai/html",
    "qs": {
      "url": "https://spa-example.com",
      "js": "true",
      "js_timeout": "5000",
      "wait_for": ".content-loaded"
    }
  }
}

Key parameters:

  • js=true - Enables JavaScript rendering
  • js_timeout - Milliseconds to wait for JavaScript execution
  • wait_for - CSS selector to wait for before returning content

Advanced Workflow Patterns

Pattern 1: Pagination Handling

{
  "nodes": [
    {
      "name": "Loop Over Pages",
      "type": "n8n-nodes-base.splitInBatches",
      "parameters": {
        "batchSize": 1,
        "options": {
          "reset": false
        }
      }
    },
    {
      "name": "Scrape Page",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://api.webscraping.ai/html",
        "qs": {
          "url": "=https://example.com/products?page={{ $json.page }}",
          "js": "true"
        }
      }
    }
  ]
}
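The Loop Over Pages node above needs a list of page numbers to iterate over, which a Code node placed before it can generate. A sketch of that Code node body, wrapped in a function so it runs standalone (in the actual node you would just return the array); maxPages is a placeholder you might instead read from a previous node's output:

```javascript
// Seed the pagination loop with one item per page number.
function buildPageItems(maxPages) {
  const pages = [];
  for (let page = 1; page <= maxPages; page++) {
    pages.push({ json: { page } }); // n8n items are wrapped as { json: {...} }
  }
  return pages;
}

console.log(buildPageItems(3)); // three items carrying page 1, 2, 3
```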

Pattern 2: Error Handling and Retries

Configure error handling in the HTTP Request node:

{
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}
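Note that n8n's built-in retry waits a fixed interval between attempts. The exponential backoff recommended under best practices below can be approximated by computing a per-attempt delay in a Code node; this is a hedged sketch assuming the base delay (the waitBetweenTries value above) simply doubles on each retry:

```javascript
// Doubling delay schedule: attempt 0 waits baseMs, attempt 1 waits
// 2 * baseMs, and so on, up to maxTries attempts.
function backoffDelays(baseMs, maxTries) {
  const delays = [];
  for (let attempt = 0; attempt < maxTries; attempt++) {
    delays.push(baseMs * 2 ** attempt);
  }
  return delays;
}

console.log(backoffDelays(5000, 3)); // [ 5000, 10000, 20000 ]
```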

Pattern 3: Rate Limiting

Add a delay between requests to avoid overwhelming servers:

{
  "name": "Wait",
  "type": "n8n-nodes-base.wait",
  "parameters": {
    "amount": 2,
    "unit": "seconds"
  }
}

Data Processing and Storage

Transforming Scraped Data

Use the Code node (JavaScript) to clean and transform data:

// Run once for all items
const items = $input.all();

// Clean price data
items.forEach(item => {
  item.json.price = parseFloat(
    item.json.price.replace(/[^0-9.]/g, '')
  );
  item.json.scrapedAt = new Date().toISOString();
});

return items;
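The same price-cleaning logic as a standalone function, which makes it easy to test outside n8n before pasting it into a Code node. Note the assumption that '.' is the decimal separator; locales that use ',' would need a different regex:

```javascript
// Strip currency symbols and thousands separators, then parse the number.
function cleanPrice(raw) {
  return parseFloat(raw.replace(/[^0-9.]/g, ''));
}

console.log(cleanPrice('$1,299.99')); // 1299.99
```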

Storage Options

  1. Google Sheets - Simple storage and sharing
  2. PostgreSQL/MySQL - Structured data storage
  3. MongoDB - Flexible document storage
  4. Airtable - Collaborative database
  5. Webhooks - Send data to external systems

Monitoring and Notifications

Add monitoring to your scraping workflows:

{
  "name": "Send Alert",
  "type": "n8n-nodes-base.emailSend",
  "parameters": {
    "subject": "Scraping Completed",
    "text": "Scraped {{ $json.itemCount }} items successfully"
  }
}

Error Notifications

Use the Error Trigger node to catch and report failures:

{
  "name": "Error Trigger",
  "type": "n8n-nodes-base.errorTrigger",
  "parameters": {}
}

Best Practices for n8n Web Scraping

  1. Use API-based scraping services - More reliable than raw HTTP requests
  2. Implement exponential backoff - Handle rate limits gracefully
  3. Store raw HTML - Keep original data for re-processing
  4. Add timestamps - Track when data was collected
  5. Monitor for changes - Set up alerts for workflow failures
  6. Use webhooks for real-time scraping - Trigger workflows on demand
  7. Test with small batches - Verify workflows before scaling

Example: Complete E-commerce Monitoring Workflow

Here's a real-world example that monitors product prices:

  1. Schedule trigger runs every 6 hours
  2. Fetch product pages using WebScraping.AI API
  3. Extract prices and availability with CSS selectors
  4. Compare with previous prices from database
  5. Send email alerts if prices drop below threshold
  6. Store results in Google Sheets
  7. Archive HTML snapshots in cloud storage

This workflow combines multiple nodes to create a production-ready scraping system that runs autonomously.
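Steps 4 and 5 above (compare, then alert) boil down to a small predicate you could run in a Code node just before the email node. The threshold semantics here are an assumption on my part (an absolute price drop); adjust to a percentage drop if that fits your monitoring better:

```javascript
// Alert when the price dropped by at least dropThreshold since the last
// run; previousPrice would come from the database lookup in step 4.
function shouldAlert(previousPrice, currentPrice, dropThreshold) {
  return previousPrice - currentPrice >= dropThreshold;
}

console.log(shouldAlert(100, 89.5, 10)); // true: dropped by 10.5
console.log(shouldAlert(100, 95, 10));   // false: only dropped by 5
```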

Troubleshooting Common Issues

Problem: Workflow times out
Solution: Increase the timeout in the HTTP Request node's settings or raise the js_timeout parameter.

Problem: No data extracted
Solution: Verify your CSS selectors using browser DevTools, similar to interacting with DOM elements in Puppeteer.

Problem: Getting blocked by anti-bot measures
Solution: Use WebScraping.AI's proxy rotation and anti-bot bypass features.

Problem: JavaScript not rendering
Solution: Enable the js=true parameter and increase js_timeout.

Conclusion

n8n provides a powerful platform for building automated web scraping workflows without extensive coding. By combining n8n's visual workflow builder with robust scraping APIs like WebScraping.AI, you can create reliable, maintainable scraping systems that handle JavaScript rendering, bypass anti-bot protections, and scale to handle large-scale data extraction needs.

Start with simple workflows and gradually add complexity as you learn the platform. The combination of n8n's automation capabilities and specialized scraping APIs offers the best of both worlds: ease of use and professional-grade scraping power.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
