What are the web scraping best practices when using n8n?

Web scraping with n8n requires careful planning and implementation to ensure reliable, efficient, and ethical data extraction. Following established best practices helps you avoid common pitfalls, prevent website blocks, and create maintainable workflows that scale effectively.

1. Respect Robots.txt and Terms of Service

Before scraping any website, always check the robots.txt file and review the site's Terms of Service. This is not just an ethical consideration but often a legal requirement.

// n8n Code Node - Check robots.txt
const targetUrl = 'https://example.com';
const robotsUrl = new URL('/robots.txt', targetUrl).href;

// this.helpers.httpRequest is the HTTP helper available in the Code node
const robotsTxt = await this.helpers.httpRequest({
  method: 'GET',
  url: robotsUrl,
});

// Parse and check whether your user agent is allowed (simplified check)
if (robotsTxt.includes('Disallow: /')) {
  console.log('Scraping may be restricted');
}

return { robotsTxt };

2. Implement Rate Limiting and Delays

One of the most critical practices is controlling the request rate to avoid overwhelming target servers or triggering anti-bot mechanisms.

Using Wait Node Between Requests

// n8n Code Node - Add random delays
const minDelay = 2000; // 2 seconds
const maxDelay = 5000; // 5 seconds
const delay = Math.floor(Math.random() * (maxDelay - minDelay + 1)) + minDelay;

// Use this value in a Wait node
return { delay };

In your n8n workflow:

  1. Add a Wait node after each HTTP Request
  2. Set it to wait between 2-5 seconds
  3. Use random delays to appear more human-like
  4. For production scraping, consider 10-30 second delays

3. Rotate User Agents and Headers

Websites often detect bots by analyzing HTTP headers. Rotating user agents and setting realistic headers makes your requests appear more natural.

// n8n Code Node - Generate realistic headers
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15'
];

const randomUA = userAgents[Math.floor(Math.random() * userAgents.length)];

return {
  headers: {
    'User-Agent': randomUA,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
  }
};
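
To use these headers in an HTTP Request node, enable header parameters (Send Headers) and reference each value with an expression such as {{ $json.headers['User-Agent'] }}.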

4. Use Proxies for Large-Scale Scraping

For extensive scraping operations, implementing proxy rotation prevents IP bans and distributes requests across multiple addresses.

// n8n Code Node - Proxy configuration
const proxies = [
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
  'http://proxy3.example.com:8080'
];

const randomProxy = proxies[Math.floor(Math.random() * proxies.length)];
const parsed = new URL(randomProxy);

// Pass the full proxy URL to the HTTP Request node's Proxy option,
// or use the individual parts if a downstream node expects host/port/protocol
return {
  proxy: randomProxy,
  proxyConfig: {
    host: parsed.hostname,
    port: parsed.port,
    protocol: parsed.protocol.replace(':', '')
  }
};

For professional applications, consider using specialized scraping APIs like WebScraping.AI that handle proxy rotation automatically.
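
For illustration, here is a minimal Code Node sketch of delegating proxy handling to such a service. It assumes the provider's /html endpoint and its api_key, url, js, and proxy query parameters; verify the exact names against the current API documentation.

// n8n Code Node - Fetch rendered HTML through a scraping API (illustrative sketch)
// Parameter names below are assumptions; check the provider's API docs
const html = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://api.webscraping.ai/html',
  qs: {
    api_key: 'YOUR_API_KEY',     // your API key
    url: 'https://example.com',  // page to scrape
    js: true,                    // render JavaScript in a headless browser
    proxy: 'residential'         // let the service rotate residential proxies
  }
});

return { html };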

5. Implement Robust Error Handling

Web scraping workflows must gracefully handle failures like timeouts, network errors, and unexpected page structures.

// n8n Code Node - Error handling wrapper
const items = $input.all();

try {
  const response = await this.helpers.httpRequest({
    method: 'GET',
    url: items[0].json.targetUrl,
    timeout: 30000,
    returnFullResponse: true
  });

  if (response.statusCode === 200) {
    return {
      success: true,
      data: response.body,
      statusCode: response.statusCode
    };
  } else if (response.statusCode === 429) {
    // Rate limited - need to slow down
    return {
      success: false,
      error: 'Rate limited',
      retryAfter: response.headers['retry-after'] || 60
    };
  } else {
    return {
      success: false,
      error: `HTTP ${response.statusCode}`,
      statusCode: response.statusCode
    };
  }
} catch (error) {
  // Note: depending on the HTTP helper's configuration, non-2xx responses
  // may throw and land here instead of reaching the branches above
  return {
    success: false,
    error: error.message,
    shouldRetry: error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET'
  };
}

n8n Error Workflow Structure

  1. Use IF nodes to check for error conditions
  2. Add Stop and Error nodes for critical failures
  3. Implement retry logic with exponential backoff (see the sketch after this list)
  4. Log errors to a database or file for analysis
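
Here is a minimal sketch of retry with exponential backoff inside a Code Node. It assumes an upstream node supplies targetUrl and that timers (setTimeout) are available in your Code Node environment; if they are not, move the delay into a Wait node instead.

// n8n Code Node - Retry with exponential backoff (illustrative sketch)
const url = $input.first().json.targetUrl; // assumed field from an upstream node
const maxAttempts = 4;
const baseDelayMs = 1000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

let lastError;
for (let attempt = 0; attempt < maxAttempts; attempt++) {
  try {
    const body = await this.helpers.httpRequest({ method: 'GET', url, timeout: 30000 });
    return { success: true, attempts: attempt + 1, data: body };
  } catch (error) {
    lastError = error;
    if (attempt < maxAttempts - 1) {
      // Wait 1s, 2s, 4s, ... plus a little jitter before the next attempt
      await sleep(baseDelayMs * 2 ** attempt + Math.random() * 250);
    }
  }
}

return { success: false, error: lastError.message };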

6. Handle Dynamic Content Properly

Modern websites often load content dynamically with JavaScript. A plain HTTP Request node only sees the initial HTML, so scraping AJAX-driven pages requires a headless browser such as Puppeteer or Playwright, available in n8n through community nodes.

// n8n Puppeteer Node configuration
{
  "url": "{{ $json.targetUrl }}",
  "waitUntil": "networkidle2",
  "waitForSelector": ".product-list",
  "evaluateScript": `
    // Wait for dynamic content
    return new Promise((resolve) => {
      const checkContent = setInterval(() => {
        const elements = document.querySelectorAll('.product-item');
        if (elements.length > 0) {
          clearInterval(checkContent);
          resolve(document.body.innerHTML);
        }
      }, 100);
    });
  `
}

7. Optimize Data Storage and Processing

Efficient data handling prevents memory issues and improves workflow performance.

// n8n Code Node - Batch processing
const items = $input.all();
const batchSize = 50;
const batches = [];

for (let i = 0; i < items.length; i += batchSize) {
  batches.push(items.slice(i, i + batchSize));
}

// Process first batch and return
return batches[0].map(item => ({
  json: item.json,
  pairedItem: item.pairedItem
}));

Storage Best Practices

  • Use databases for structured data (PostgreSQL, MongoDB)
  • Use Google Sheets for small datasets requiring collaboration
  • Use cloud storage (S3, Dropbox) for large files
  • Implement pagination for processing large result sets
  • Clean data immediately to reduce storage needs (see the sketch after this list)
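
As an example of cleaning data before storage, the sketch below keeps only the fields you need and drops duplicate URLs; the field names are illustrative.

// n8n Code Node - Keep only required fields and drop duplicate URLs (illustrative)
const items = $input.all();
const seen = new Set();
const cleaned = [];

for (const item of items) {
  const url = item.json.url;
  if (!url || seen.has(url)) continue; // skip items without a URL or already seen
  seen.add(url);
  cleaned.push({
    json: {
      title: (item.json.title || '').trim(),
      price: parseFloat(item.json.price) || null,
      url
    }
  });
}

return cleaned;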

8. Monitor and Log Your Workflows

Comprehensive logging helps debug issues and track scraping performance.

// n8n Code Node - Logging utility
const items = $input.all();

const logEntry = {
  timestamp: new Date().toISOString(),
  workflowId: $workflow.id,
  executionId: $execution.id,
  url: items[0].json.url,
  status: items[0].json.success ? 'success' : 'failed',
  itemsScraped: items[0].json.items?.length || 0,
  duration: items[0].json.duration,
  errorMessage: items[0].json.error || null
};

// Send to logging service or database
return { log: logEntry };

9. Validate and Clean Scraped Data

Always validate extracted data to ensure quality and consistency.

// n8n Code Node - Data validation
const validateProduct = (product) => {
  const errors = [];

  if (!product.title || product.title.trim() === '') {
    errors.push('Missing title');
  }

  if (!product.price || isNaN(parseFloat(product.price))) {
    errors.push('Invalid price');
  }

  if (!product.url || !product.url.startsWith('http')) {
    errors.push('Invalid URL');
  }

  return {
    valid: errors.length === 0,
    errors,
    data: {
      title: product.title?.trim(),
      price: parseFloat(product.price) || null,
      url: product.url,
      scrapedAt: new Date().toISOString()
    }
  };
};

const items = $input.all();

const results = items.map(item => {
  const validation = validateProduct(item.json);
  return {
    json: validation
  };
});

return results;
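
Downstream, route items with an IF or Filter node on {{ $json.valid }} so that only clean records reach your database, and send the failed ones to a review branch.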

10. Use Caching to Reduce Requests

Implement caching mechanisms to avoid re-scraping unchanged content.

// n8n Code Node - Simple caching check
// Note: requiring built-in modules may need NODE_FUNCTION_ALLOW_BUILTIN on self-hosted n8n
const crypto = require('crypto');
const items = $input.all();

const url = items[0].json.url;
const cacheKey = crypto.createHash('md5').update(url).digest('hex');
const cacheExpiry = 3600; // 1 hour in seconds

// Check if cached version exists and is fresh
const cachedData = items[0].json.cache?.[cacheKey];
const cacheTime = items[0].json.cacheTime?.[cacheKey];

if (cachedData && cacheTime) {
  const age = (Date.now() - cacheTime) / 1000;
  if (age < cacheExpiry) {
    return {
      fromCache: true,
      data: cachedData
    };
  }
}

return {
  fromCache: false,
  needsFetch: true,
  cacheKey
};
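
For a cache that survives between executions, workflow static data is one option. Note that static data only persists for workflows triggered automatically in production, not for manual test runs, and the crypto built-in may need to be allowed as described above. A minimal sketch:

// n8n Code Node - Cache page content in workflow static data (illustrative sketch)
const crypto = require('crypto');
const staticData = $getWorkflowStaticData('global');
staticData.cache = staticData.cache || {};

const url = $input.first().json.url;
const cacheKey = crypto.createHash('md5').update(url).digest('hex');
const cacheExpiryMs = 3600 * 1000; // 1 hour

const entry = staticData.cache[cacheKey];
if (entry && Date.now() - entry.time < cacheExpiryMs) {
  return { fromCache: true, data: entry.data };
}

// Fetch a fresh copy and store it for the next run
const data = await this.helpers.httpRequest({ method: 'GET', url });
staticData.cache[cacheKey] = { data, time: Date.now() };

return { fromCache: false, data };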

11. Handle Pagination Correctly

Many websites split content across multiple pages. Proper pagination handling is essential.

// n8n Code Node - Pagination handler
let currentPage = 1;
const maxPages = 10;
const baseUrl = 'https://example.com/products';
const urls = [];

while (currentPage <= maxPages) {
  urls.push({
    json: {
      url: `${baseUrl}?page=${currentPage}`,
      pageNumber: currentPage
    }
  });
  currentPage++;
}

return urls;
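
When the total number of pages is not known up front, a common pattern is to keep fetching until a page comes back empty. The sketch below assumes the cheerio module is available (external modules must be allowed on self-hosted n8n) and that .product-item marks one result; remember to add delays between pages as described in section 2.

// n8n Code Node - Fetch pages until one comes back empty (illustrative sketch)
const cheerio = require('cheerio'); // external module; must be allowed on self-hosted n8n
const baseUrl = 'https://example.com/products';
const maxPages = 50; // hard safety limit
const results = [];

for (let page = 1; page <= maxPages; page++) {
  const html = await this.helpers.httpRequest({
    method: 'GET',
    url: `${baseUrl}?page=${page}`
  });

  const $ = cheerio.load(html);
  const products = $('.product-item');
  if (products.length === 0) break; // no results means we reached the last page

  products.each((_, el) => {
    results.push({ json: { page, title: $(el).find('h2').text().trim() } });
  });
}

return results;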

12. Respect Website Performance

Consider the impact of your scraping on target websites:

  • Scrape during off-peak hours when possible
  • Limit concurrent connections to 1-3 per domain
  • Use HEAD requests to check if content changed before a full download (see the sketch after this list)
  • Implement exponential backoff on errors
  • Monitor response times and slow down if servers struggle
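
Here is a sketch of the HEAD-request idea from the list above. It assumes the server returns an ETag or Last-Modified header (many do not, in which case fall back to a full download) and that a previous run stored the last seen ETag:

// n8n Code Node - Check freshness with a HEAD request before a full download (sketch)
const url = $input.first().json.url;
const lastSeenEtag = $input.first().json.lastEtag; // assumed field stored by a previous run

const head = await this.helpers.httpRequest({
  method: 'HEAD',
  url,
  returnFullResponse: true
});

const etag = head.headers['etag'];
const lastModified = head.headers['last-modified'];

if (etag && lastSeenEtag && etag === lastSeenEtag) {
  // Content unchanged since the last run: skip the full download
  return { changed: false, etag };
}

return { changed: true, etag, lastModified };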

13. Use Selectors Wisely

Write robust CSS or XPath selectors that won't break easily.

// n8n Code Node - Robust selector strategy
// Note: cheerio is an external module; on self-hosted n8n it must be allowed
// via NODE_FUNCTION_ALLOW_EXTERNAL before it can be required in a Code node
const cheerio = require('cheerio');

const extractData = (html) => {
  const $ = cheerio.load(html);

  // Try multiple selector strategies
  const titleSelectors = [
    'h1.product-title',
    '[data-testid="product-title"]',
    'h1[itemprop="name"]',
    '.product-header h1',
    'h1'
  ];

  let title = null;
  for (const selector of titleSelectors) {
    title = $(selector).first().text().trim();
    if (title) break;
  }

  return { title };
};

// Assumes an upstream node provides the page HTML in the `html` field
return extractData($input.first().json.html);

14. Keep Workflows Modular and Reusable

Design n8n workflows with reusability in mind:

  • Create sub-workflows for common tasks
  • Use variables and expressions for flexibility
  • Document workflow nodes with notes
  • Version control workflow JSON exports
  • Test workflows with sample data before production

Conclusion

Successful web scraping with n8n requires balancing efficiency, reliability, and ethical considerations. By implementing these best practices (respecting rate limits, handling errors gracefully, rotating user agents and proxies, and monitoring your workflows) you'll create robust scraping solutions that extract data reliably while maintaining good relationships with target websites.

Remember that web scraping exists in a gray area legally and ethically. Always prioritize obtaining data through official APIs when available, respect website terms of service, and consider the impact of your scraping activities on target infrastructure. When these best practices aren't sufficient for your needs, consider using specialized web scraping services that handle these complexities professionally.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
