# How do I create a workflow automation for web scraping with n8n?
n8n is a powerful workflow automation tool that allows developers to create sophisticated web scraping pipelines without writing extensive code. By combining HTTP requests, data transformation nodes, and integration with web scraping APIs, you can build reliable, scalable scraping workflows that run on schedules or triggers.
## Understanding n8n Web Scraping Workflows
n8n uses a node-based architecture where each node performs a specific task. For web scraping, you'll typically combine several node types:
- Trigger nodes - Start workflows on schedules or events
- HTTP Request nodes - Fetch web pages or call scraping APIs
- Data transformation nodes - Parse and clean extracted data
- Storage nodes - Save results to databases, spreadsheets, or files
- Notification nodes - Alert you when scraping completes or fails
n8n's key advantage is its visual interface for building complex workflows, combined with the flexibility to handle authentication, error handling, and data processing.
## Method 1: Direct HTTP Requests with HTML Parsing
For simple static websites, you can use n8n's built-in HTTP Request node combined with HTML parsing.
### Setting Up a Basic Scraping Workflow

1. **Add a Schedule Trigger**
   - Drag a "Schedule Trigger" node onto your workflow
   - Configure the run frequency (hourly, daily, weekly)

2. **Add an HTTP Request node**
   - Set the method to `GET`
   - Enter the target URL
   - Configure headers if needed:

   ```
   User-Agent: Mozilla/5.0 (compatible; n8n-bot)
   Accept: text/html
   ```

3. **Extract data with an HTML Extract node**
   - Add an "HTML Extract" node
   - Use CSS selectors to extract data, for example selector `.product-title` with attribute `text`
### Example: Extracting Product Information

Here's how to configure the HTML Extract node for e-commerce data:

```json
{
  "selector": {
    "title": ".product-name h1",
    "price": ".price-value",
    "description": ".product-description p",
    "availability": ".stock-status"
  },
  "returnArray": true
}
```
**Limitations:** This approach only works for static HTML and doesn't handle JavaScript-rendered content, which is increasingly common on modern websites.
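To make the selector-based approach concrete, here is a minimal standalone sketch in plain Node.js of what text extraction by class name amounts to. The markup and class names are invented for the example; real pages should go through the HTML Extract node or a proper HTML parser, not regular expressions.

```javascript
// Sample static HTML, simplified for illustration
const html = `
  <div class="product">
    <h1 class="product-name">Espresso Machine</h1>
    <span class="price-value">$199.00</span>
  </div>`;

// Capture the text content of the first element carrying the given class
function extractByClass(html, className) {
  const re = new RegExp(`class="${className}"[^>]*>([^<]*)<`);
  const match = html.match(re);
  return match ? match[1].trim() : null;
}

const product = {
  title: extractByClass(html, 'product-name'),
  price: extractByClass(html, 'price-value'),
};
console.log(product); // { title: 'Espresso Machine', price: '$199.00' }
```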
## Method 2: Using the WebScraping.AI API in n8n
For JavaScript-heavy sites and more reliable scraping, integrating a dedicated web scraping API like WebScraping.AI provides better results with built-in proxy rotation, JavaScript rendering, and anti-bot bypass.
### Setting Up the WebScraping.AI Integration
1. **Create an HTTP Request node**
   - Method: `GET`
   - URL: `https://api.webscraping.ai/html`
   - Authentication: API key (sent in a header)

2. **Configure the request parameters**

   ```json
   {
     "url": "={{ $json.targetUrl }}",
     "js": true,
     "proxy": "datacenter",
     "headers": {
       "Accept": "application/json"
     }
   }
   ```

3. **Add API key authentication**
   - Authentication Type: Generic Credential Type
   - Header Name: `api_key`
   - Header Value: your WebScraping.AI API key
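For reference, the parameters above end up as query-string arguments on the API call. A quick sketch of how the final request URL is assembled (the target URL is a placeholder; the API key travels in the credential header, so it is not part of the query string here):

```javascript
// Build the WebScraping.AI request URL from the configured parameters
const params = new URLSearchParams({
  url: 'https://example.com/products',
  js: 'true',
  proxy: 'datacenter',
});
const requestUrl = `https://api.webscraping.ai/html?${params.toString()}`;

console.log(requestUrl);
// https://api.webscraping.ai/html?url=https%3A%2F%2Fexample.com%2Fproducts&js=true&proxy=datacenter
```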
## Complete n8n Workflow Example

Here's a simplified workflow configuration in JSON format that you can import into n8n (node connections and credentials are omitted for brevity):
```json
{
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "hours",
              "hoursInterval": 6
            }
          ]
        }
      },
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "position": [250, 300]
    },
    {
      "parameters": {
        "url": "https://api.webscraping.ai/html",
        "authentication": "genericCredentialType",
        "genericAuthType": "httpHeaderAuth",
        "sendQuery": true,
        "queryParameters": {
          "parameters": [
            { "name": "url", "value": "https://example.com/products" },
            { "name": "js", "value": "true" },
            { "name": "proxy", "value": "datacenter" }
          ]
        }
      },
      "name": "WebScraping.AI",
      "type": "n8n-nodes-base.httpRequest",
      "position": [450, 300]
    },
    {
      "parameters": {
        "jsCode": "// The Code node runs in a Node.js sandbox without DOMParser,\n// so this example uses cheerio (allow it via NODE_FUNCTION_ALLOW_EXTERNAL=cheerio)\nconst cheerio = require('cheerio');\nconst $ = cheerio.load($input.item.json.html);\nconst products = [];\n\n$('.product-item').each((_, el) => {\n  products.push({\n    title: $(el).find('.title').text().trim(),\n    price: $(el).find('.price').text().trim(),\n    url: $(el).find('a').attr('href')\n  });\n});\n\nreturn products.map(product => ({ json: product }));"
      },
      "name": "Parse HTML",
      "type": "n8n-nodes-base.code",
      "position": [650, 300]
    },
    {
      "parameters": {
        "operation": "append",
        "documentId": "your-google-sheet-id",
        "sheetName": "Products",
        "columns": "title,price,url"
      },
      "name": "Google Sheets",
      "type": "n8n-nodes-base.googleSheets",
      "position": [850, 300]
    }
  ]
}
```
## Method 3: AI-Powered Data Extraction
WebScraping.AI's question-answering endpoint allows you to extract specific information using natural language, which is particularly useful when dealing with complex page structures.
### Using the Question API in n8n

```json
{
  "method": "GET",
  "url": "https://api.webscraping.ai/question",
  "queryParameters": {
    "url": "={{ $json.productUrl }}",
    "question": "What is the product name, price, and shipping time?",
    "js": "true"
  }
}
```
The API returns structured answers that you can use directly in your workflow without complex HTML parsing.
## Handling JavaScript-Rendered Content
Modern websites often load content dynamically with JavaScript. When scraping these sites, you need to ensure JavaScript execution, just as you would when handling AJAX requests using Puppeteer.
### n8n Configuration for JavaScript Sites

```json
{
  "httpRequest": {
    "url": "https://api.webscraping.ai/html",
    "qs": {
      "url": "https://spa-example.com",
      "js": "true",
      "js_timeout": "5000",
      "wait_for": ".content-loaded"
    }
  }
}
```
Key parameters:
- `js=true`: Enables JavaScript rendering
- `js_timeout`: Milliseconds to wait for JavaScript execution
- `wait_for`: CSS selector to wait for before returning content
## Advanced Workflow Patterns
### Pattern 1: Pagination Handling
```json
{
  "nodes": [
    {
      "name": "Loop Over Pages",
      "type": "n8n-nodes-base.splitInBatches",
      "parameters": {
        "batchSize": 1,
        "options": {
          "reset": false
        }
      }
    },
    {
      "name": "Scrape Page",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://api.webscraping.ai/html",
        "qs": {
          "url": "=https://example.com/products?page={{ $json.page }}",
          "js": "true"
        }
      }
    }
  ]
}
```
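The loop needs a list of page items to iterate over. One way to produce them is a Code node placed before the Split In Batches node, sketched below. The page count is an assumption for the sketch; in practice you would derive it from the site's pagination markup, and in n8n the node body would end with `return items;`.

```javascript
// Emit one item per page number for Split In Batches to consume
const totalPages = 5; // assumed; read it from the first scraped page in practice
const items = [];
for (let page = 1; page <= totalPages; page++) {
  items.push({ json: { page } });
}

console.log(items.length); // 5
```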
### Pattern 2: Error Handling and Retries
Configure error handling in the HTTP Request node:
```json
{
  "continueOnFail": true,
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}
```
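The fixed 5000 ms wait can be extended into exponential backoff by doubling the delay on each attempt, which handles rate limits more gracefully. A small sketch of the resulting delay schedule, using the values configured above:

```javascript
// Compute exponential backoff delays from the configured base wait
const baseWaitMs = 5000;
const maxTries = 3;
const delays = Array.from(
  { length: maxTries },
  (_, attempt) => baseWaitMs * 2 ** attempt
);

console.log(delays); // [ 5000, 10000, 20000 ]
```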
### Pattern 3: Rate Limiting
Add a delay between requests to avoid overwhelming servers:
```json
{
  "name": "Wait",
  "type": "n8n-nodes-base.wait",
  "parameters": {
    "amount": 2,
    "unit": "seconds"
  }
}
```
## Data Processing and Storage
### Transforming Scraped Data
Use the Code node (JavaScript) to clean and transform data:

```javascript
// Clean price data: strip currency symbols, keep digits and the decimal point
items.forEach(item => {
  item.json.price = parseFloat(
    item.json.price.replace(/[^0-9.]/g, '')
  );
  // Record when the data was collected
  item.json.scrapedAt = new Date().toISOString();
});

return items;
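A quick standalone check of that cleaning logic on a sample item (the price string is made up):

```javascript
// Sample item shaped like n8n's item structure
const items = [{ json: { price: '$1,299.99' } }];

items.forEach(item => {
  // Strip currency symbols and thousands separators, then parse
  item.json.price = parseFloat(item.json.price.replace(/[^0-9.]/g, ''));
  item.json.scrapedAt = new Date().toISOString();
});

console.log(items[0].json.price); // 1299.99
```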
### Storage Options
- Google Sheets - Simple storage and sharing
- PostgreSQL/MySQL - Structured data storage
- MongoDB - Flexible document storage
- Airtable - Collaborative database
- Webhooks - Send data to external systems
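Whichever destination you choose, scraped objects usually need flattening into rows or records first. A sketch for the Google Sheets case, matching the `title,price,url` columns from the workflow example above (the sample data is invented):

```javascript
// Flatten scraped objects into row arrays for a title,price,url sheet
const scraped = [
  { title: 'Widget', price: 9.99, url: 'https://example.com/widget' },
  { title: 'Gadget', price: 19.5, url: 'https://example.com/gadget' },
];
const rows = scraped.map(({ title, price, url }) => [title, price.toFixed(2), url]);

console.log(rows[0]); // [ 'Widget', '9.99', 'https://example.com/widget' ]
```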
## Monitoring and Notifications
Add monitoring to your scraping workflows:
```json
{
  "name": "Send Alert",
  "type": "n8n-nodes-base.emailSend",
  "parameters": {
    "subject": "Scraping Completed",
    "text": "Scraped {{ $json.itemCount }} items successfully"
  }
}
```
### Error Notifications
Use the Error Trigger node to catch and report failures:
```json
{
  "name": "Error Trigger",
  "type": "n8n-nodes-base.errorTrigger",
  "parameters": {}
}
```
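Downstream of the Error Trigger, a Code node can format an alert from the error payload. A sketch, assuming `workflow.name` and `execution.error.message` fields in the error data (verify the exact payload shape in your n8n version):

```javascript
// Example payload shaped like the data an Error Trigger workflow receives
const errorData = {
  workflow: { name: 'Product Scraper' },
  execution: { error: { message: 'Request timed out' } },
};

// Compose a human-readable alert for the notification node
const alertText = `Workflow "${errorData.workflow.name}" failed: ${errorData.execution.error.message}`;

console.log(alertText); // Workflow "Product Scraper" failed: Request timed out
```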
## Best Practices for n8n Web Scraping
- Use API-based scraping services - More reliable than raw HTTP requests
- Implement exponential backoff - Handle rate limits gracefully
- Store raw HTML - Keep original data for re-processing
- Add timestamps - Track when data was collected
- Monitor for changes - Set up alerts for workflow failures
- Use webhooks for real-time scraping - Trigger workflows on demand
- Test with small batches - Verify workflows before scaling
## Example: Complete E-commerce Monitoring Workflow
Here's a real-world example that monitors product prices:
1. A Schedule Trigger runs every 6 hours
2. Fetch product pages using the WebScraping.AI API
3. Extract prices and availability with CSS selectors
4. Compare with previous prices from the database
5. Send email alerts if prices drop below a threshold
6. Store results in Google Sheets
7. Archive HTML snapshots in cloud storage
This workflow combines multiple nodes to create a production-ready scraping system that runs autonomously.
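The price-comparison step can be sketched as a Code node that flags items falling below a threshold relative to the stored price. The SKUs, prices, and 90% threshold below are illustrative, not part of any real dataset:

```javascript
// Previous prices loaded from the database, keyed by SKU (sample data)
const previous = { 'sku-1': 100, 'sku-2': 50 };

// Freshly scraped prices
const current = [
  { sku: 'sku-1', price: 85 },
  { sku: 'sku-2', price: 49 },
];

// Alert when the new price drops below 90% of the stored price
const threshold = 0.9;
const alerts = current.filter(p => p.price < previous[p.sku] * threshold);

console.log(alerts.map(p => p.sku)); // [ 'sku-1' ]
```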
## Troubleshooting Common Issues
**Problem:** Workflow times out
**Solution:** Increase the timeout in the HTTP Request settings or use the `js_timeout` parameter

**Problem:** No data extracted
**Solution:** Verify CSS selectors using browser DevTools, similar to interacting with DOM elements in Puppeteer

**Problem:** Getting blocked by anti-bot measures
**Solution:** Use WebScraping.AI's proxy rotation and anti-bot bypass features

**Problem:** JavaScript not rendering
**Solution:** Enable the `js=true` parameter and increase `js_timeout`
## Conclusion
n8n provides a powerful platform for building automated web scraping workflows without extensive coding. By combining n8n's visual workflow builder with robust scraping APIs like WebScraping.AI, you can create reliable, maintainable scraping systems that handle JavaScript rendering, bypass anti-bot protections, and scale to large data extraction jobs.
Start with simple workflows and gradually add complexity as you learn the platform. The combination of n8n's automation capabilities and specialized scraping APIs offers the best of both worlds: ease of use and professional-grade scraping power.