How can I scrape websites using n8n and JavaScript?

Web scraping with n8n and JavaScript combines the power of workflow automation with flexible scripting capabilities. This guide demonstrates how to extract data from websites using n8n's built-in nodes and custom JavaScript code for complex scenarios.

Understanding n8n Web Scraping Approaches

n8n offers multiple methods for web scraping:

  1. HTTP Request node: For simple HTML fetching
  2. HTML Extract node: For parsing HTML content
  3. Code node: For custom JavaScript logic
  4. Function node: Legacy node for transforming data (superseded by the Code node in current n8n versions)
  5. Third-party integrations: Like WebScraping.AI for advanced scenarios

Method 1: Basic Web Scraping with HTTP Request + HTML Extract

The simplest approach combines HTTP Request and HTML Extract nodes:

Step 1: Fetch the HTML Content

Add an HTTP Request node with these settings (in recent n8n versions, also set the Response Format option to Text so the raw HTML reaches the next node as a string):

{
  "method": "GET",
  "url": "https://example.com/products",
  "options": {
    "redirect": {
      "followRedirects": true
    }
  }
}

Step 2: Extract Data with CSS Selectors

Add an HTML Extract node to parse the HTML:

{
  "extractionValues": {
    "title": {
      "cssSelector": "h1.product-title",
      "returnValue": "text"
    },
    "price": {
      "cssSelector": ".product-price",
      "returnValue": "text"
    },
    "image": {
      "cssSelector": "img.product-image",
      "returnValue": "attribute",
      "attribute": "src"
    }
  }
}
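
Assuming the selectors match, the node emits one item per page with the extracted values (the values below are purely illustrative):

{
  "title": "Example Product",
  "price": "$49.99",
  "image": "/images/example-product.jpg"
}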

Method 2: Advanced Scraping with JavaScript Code Node

For complex scraping scenarios, use the Code node with JavaScript. Note that require('cheerio') works on self-hosted n8n only when external modules are allowed via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable:

// Access the HTML from the previous node
const html = $input.first().json.data;

// Load Cheerio for HTML parsing (on self-hosted n8n, enable it via NODE_FUNCTION_ALLOW_EXTERNAL=cheerio)
const cheerio = require('cheerio');

// Name the root loader $html to avoid clashing with n8n's built-in $ helper
const $html = cheerio.load(html);

// Extract product data
const products = [];

$html('.product-item').each((index, element) => {
  const product = {
    title: $html(element).find('.product-title').text().trim(),
    price: parseFloat($html(element).find('.price').text().replace(/[^0-9.]/g, '')),
    description: $html(element).find('.description').text().trim(),
    url: $html(element).find('a').attr('href'),
    inStock: $html(element).find('.stock').text().includes('In Stock'),
    rating: parseFloat($html(element).find('.rating').attr('data-rating') || '0')
  };

  products.push(product);
});

// Return items in the structure n8n expects: one object with a json key per item
return products.map(product => ({ json: product }));

Method 3: Scraping JavaScript-Rendered Pages

Many modern websites render content with JavaScript. For these sites, you'll need to execute JavaScript in a browser context. While n8n doesn't have built-in browser automation, you can use external services or APIs.

Using WebScraping.AI API with n8n

For JavaScript-heavy sites, integrate WebScraping.AI:

// In a Code node
const targetUrl = 'https://example.com/dynamic-content';

// Make request to WebScraping.AI
const response = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://api.webscraping.ai/html',
  qs: {
    url: targetUrl,
    js: true,
    proxy: 'datacenter'
  },
  headers: {
    'Api-Key': 'YOUR_API_KEY'
  }
});

// Parse the returned HTML (again naming the loader $html to avoid n8n's built-in $)
const cheerio = require('cheerio');
const $html = cheerio.load(response);

// Extract data from the JavaScript-rendered content
const data = [];
$html('.dynamic-item').each((i, el) => {
  data.push({
    text: $html(el).find('.text').text(),
    value: $html(el).attr('data-value')
  });
});

return [{ json: { items: data } }];

Method 4: Handling Pagination in n8n Workflows

To scrape multiple pages, create a loop in your workflow:

// In a Code node - Generate page URLs
const baseUrl = 'https://example.com/products';
const totalPages = 10;
const urls = [];

for (let page = 1; page <= totalPages; page++) {
  urls.push({
    json: {
      url: `${baseUrl}?page=${page}`,
      pageNumber: page
    }
  });
}

return urls;

Then connect this to an HTTP Request node with Split In Batches to process pages sequentially or in parallel; the HTTP Request node can read each incoming URL with an expression, as shown below.
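
In the HTTP Request node's settings, the leading = marks the URL value as an n8n expression that reads each incoming item's url field:

{
  "method": "GET",
  "url": "={{ $json.url }}"
}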

Method 5: Handling Authentication and Headers

Many websites require authentication or specific headers:

// In a Code node
const response = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/api/data',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    'Authorization': 'Bearer YOUR_TOKEN'
  },
  auth: {
    username: 'your-username',
    password: 'your-password'
  }
});

return [{ json: response }];
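
If the site uses cookie-based sessions rather than tokens, a minimal sketch is to log in once, capture the session cookie, and reuse it. The login endpoint and field names here are hypothetical, and it assumes your n8n version supports the returnFullResponse option:

// In a Code node - log in, capture the session cookie, and reuse it
const loginResponse = await this.helpers.httpRequest({
  method: 'POST',
  url: 'https://example.com/login', // hypothetical login endpoint
  body: { username: 'your-username', password: 'your-password' },
  json: true,
  returnFullResponse: true // expose response headers so Set-Cookie is readable
});

// Take the first Set-Cookie header as the session cookie
const cookie = (loginResponse.headers['set-cookie'] || [])[0];

// Send the cookie with subsequent requests
const page = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/members/data',
  headers: { 'Cookie': cookie }
});

return [{ json: { html: page } }];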

Method 6: Extracting Data from APIs

Many websites are backed by JSON APIs that are easier to consume than scraped HTML:

// In a Code node - Fetch JSON data
const apiResponse = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/api/products',
  qs: {
    category: 'electronics',
    limit: 100
  },
  json: true
});

// Transform and filter the data
const filteredProducts = apiResponse.products
  .filter(p => p.price < 1000)
  .map(p => ({
    name: p.title,
    price: p.price,
    available: p.stock > 0
  }));

return filteredProducts.map(p => ({ json: p }));

Handling Common Scraping Challenges

1. Rate Limiting and Delays

Add delays between requests to avoid being blocked:

// In a Code node
async function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

const results = [];
const urls = $input.all();

for (const item of urls) {
  const response = await this.helpers.httpRequest({
    method: 'GET',
    url: item.json.url
  });

  results.push({ json: response });

  // Wait 2 seconds between requests
  await delay(2000);
}

return results;

2. Error Handling

Implement robust error handling in your workflows:

// In a Code node
const results = [];
const errors = [];

for (const item of $input.all()) {
  try {
    const response = await this.helpers.httpRequest({
      method: 'GET',
      url: item.json.url,
      timeout: 30000
    });

    results.push({ json: { success: true, data: response } });
  } catch (error) {
    errors.push({
      json: {
        success: false,
        url: item.json.url,
        error: error.message
      }
    });
  }
}

return [...results, ...errors];

3. Data Cleaning and Transformation

Clean extracted data before storage:

// In a Code node
function cleanText(text) {
  return text
    .trim()
    .replace(/\s+/g, ' ')          // collapse whitespace runs (including newlines) into single spaces
    .replace(/[^\x20-\x7E]/g, ''); // strip non-printable and non-ASCII characters
}

function cleanPrice(priceStr) {
  // Grab the first number-like token (e.g. "1,299.99") and drop thousands separators
  const match = priceStr.match(/[\d,]+\.?\d*/);
  return match ? parseFloat(match[0].replace(/,/g, '')) : null;
}

const cleanedData = $input.all().map(item => ({
  json: {
    title: cleanText(item.json.title),
    price: cleanPrice(item.json.price),
    description: cleanText(item.json.description),
    scrapedAt: new Date().toISOString()
  }
}));

return cleanedData;

Storing Scraped Data

n8n can save scraped data to various destinations:

Save to Google Sheets

Add a Google Sheets node after your scraping logic:

{
  "operation": "append",
  "sheetName": "Scraped Products",
  "dataMode": "autoMapInputData"
}

Save to Database

Use Postgres or MySQL nodes:

{
  "operation": "insert",
  "table": "products",
  "columns": "title, price, url, scraped_at"
}
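
Before the database node, make sure each item's JSON keys match the column names. A small mapping sketch in a Code node (assuming upstream items carry the scrapedAt field from the cleaning step):

// In a Code node - align item fields with the table columns
return $input.all().map(item => ({
  json: {
    title: item.json.title,
    price: item.json.price,
    url: item.json.url,
    scraped_at: item.json.scrapedAt // rename to match the scraped_at column
  }
}));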

Save to JSON File

Use the Write Binary File node (in newer n8n versions, the Read/Write Files from Disk node):

// Prepare data for file output
const data = $input.all().map(item => item.json);
const jsonContent = JSON.stringify(data, null, 2);

// n8n expects binary data in its own format; prepareBinaryData converts the Buffer
return [{
  json: {},
  binary: {
    data: await this.helpers.prepareBinaryData(
      Buffer.from(jsonContent, 'utf-8'),
      'products.json',
      'application/json'
    )
  }
}];

Complete n8n Web Scraping Workflow Example

Here's a complete workflow that scrapes product data:

  1. Schedule Trigger: Run daily at 9 AM
  2. Code Node (Generate URLs): Create list of pages to scrape
  3. Split In Batches: Process 5 URLs at a time
  4. HTTP Request: Fetch each page
  5. Code Node (Parse HTML): Extract product data using Cheerio
  6. Code Node (Clean Data): Clean and validate extracted data
  7. Filter: Remove items without prices
  8. Google Sheets: Save to spreadsheet
  9. Slack: Send notification when complete

Best Practices for n8n Web Scraping

  1. Respect robots.txt: Check the website's robots.txt file before scraping
  2. Add delays: Use the Wait node or delays in code to avoid overwhelming servers
  3. Handle errors gracefully: Use the Error Trigger node to catch and log failures
  4. Use appropriate User-Agent: Identify your scraper properly in headers
  5. Monitor execution: Set up notifications for failed workflows
  6. Cache results: Store intermediate results to avoid re-scraping on failures (see the sketch after this list)
  7. Validate data: Always validate extracted data before storage
  8. Consider legal aspects: Ensure you have permission to scrape the target website
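
For point 6, one approach uses n8n's workflow static data to remember which URLs have already been scraped. Note that static data persists only for production (triggered) executions, not manual test runs:

// In a Code node - skip URLs that were scraped in earlier runs
const staticData = $getWorkflowStaticData('global');
staticData.seenUrls = staticData.seenUrls || [];

// Keep only items whose URL has not been seen before
const freshItems = $input.all().filter(
  item => !staticData.seenUrls.includes(item.json.url)
);

// Remember the new URLs for the next run
for (const item of freshItems) {
  staticData.seenUrls.push(item.json.url);
}

return freshItems;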

When to Use External Scraping Services

For complex scenarios involving JavaScript rendering, CAPTCHAs, or rotating proxies, consider using dedicated scraping APIs like WebScraping.AI. These services provide:

  • Automatic proxy rotation
  • JavaScript rendering
  • CAPTCHA solving
  • Anti-bot detection bypass
  • Reliable infrastructure

Similar to how you handle browser sessions in Puppeteer, maintaining consistent scraping sessions in n8n workflows requires careful state management. For dynamic content that requires waiting for elements to load, techniques similar to handling AJAX requests using Puppeteer can be applied through API-based solutions integrated into your n8n workflows.

Conclusion

n8n provides powerful capabilities for web scraping through its visual workflow builder and JavaScript support. Start with simple HTTP Request and HTML Extract nodes for static content, then progress to custom JavaScript code for complex scenarios. For JavaScript-heavy websites, integrate external services or APIs to handle browser automation. Always follow ethical scraping practices and respect website terms of service.

With proper error handling, rate limiting, and data validation, you can build robust, automated web scraping workflows that run reliably on schedule or triggered by events.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
