How can I scrape websites using n8n and JavaScript?
Web scraping with n8n and JavaScript combines the power of workflow automation with flexible scripting capabilities. This guide demonstrates how to extract data from websites using n8n's built-in nodes and custom JavaScript code for complex scenarios.
Understanding n8n Web Scraping Approaches
n8n offers multiple methods for web scraping:
- HTTP Request node: For simple HTML fetching
- HTML Extract node: For parsing HTML content
- Code node: For custom JavaScript logic
- Function node: For transforming data (a legacy node; newer n8n versions fold this into the Code node)
- Third-party integrations: Like WebScraping.AI for advanced scenarios
Method 1: Basic Web Scraping with HTTP Request + HTML Extract
The simplest approach combines HTTP Request and HTML Extract nodes:
Step 1: Fetch the HTML Content
Add an HTTP Request node with these settings:
{
  "method": "GET",
  "url": "https://example.com/products",
  "options": {
    "redirect": {
      "followRedirects": true
    }
  }
}
Step 2: Extract Data with CSS Selectors
Add an HTML Extract node to parse the HTML:
{
  "extractionValues": {
    "title": {
      "cssSelector": "h1.product-title",
      "returnValue": "text"
    },
    "price": {
      "cssSelector": ".product-price",
      "returnValue": "text"
    },
    "image": {
      "cssSelector": "img.product-image",
      "returnValue": "attribute",
      "attribute": "src"
    }
  }
}
Method 2: Advanced Scraping with JavaScript Code Node
For complex scraping scenarios, use the Code node with JavaScript:
// Access the HTML from the previous node
const html = $input.first().json.data;

// Use Cheerio for HTML parsing (on self-hosted n8n, external modules must
// be allowed, e.g. NODE_FUNCTION_ALLOW_EXTERNAL=cheerio). Loading it as $
// shadows n8n's own $ helper inside this node, which is fine here.
const cheerio = require('cheerio');
const $ = cheerio.load(html);

// Extract product data
const products = [];

$('.product-item').each((index, element) => {
  const product = {
    title: $(element).find('.product-title').text().trim(),
    price: parseFloat($(element).find('.price').text().replace(/[^0-9.]/g, '')),
    description: $(element).find('.description').text().trim(),
    url: $(element).find('a').attr('href'),
    inStock: $(element).find('.stock').text().includes('In Stock'),
    rating: parseFloat($(element).find('.rating').attr('data-rating') || '0')
  };
  products.push(product);
});

// Return structured data in n8n's item format
return products.map(product => ({ json: product }));
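Compared with Method 1, the Code node lets you iterate over repeated elements, coerce types (numbers, booleans), and derive several fields in a single pass.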
Method 3: Scraping JavaScript-Rendered Pages
Many modern websites render content with JavaScript. For these sites, you'll need to execute JavaScript in a browser context. While n8n doesn't have built-in browser automation, you can use external services or APIs.
Using WebScraping.AI API with n8n
For JavaScript-heavy sites, integrate WebScraping.AI:
// In a Code node
const targetUrl = 'https://example.com/dynamic-content';

// Make a request to WebScraping.AI with JavaScript rendering enabled
const response = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://api.webscraping.ai/html',
  qs: {
    url: targetUrl,
    js: true,
    proxy: 'datacenter'
  },
  headers: {
    'Api-Key': 'YOUR_API_KEY'
  }
});

// Parse the returned HTML
const cheerio = require('cheerio');
const $ = cheerio.load(response);

// Extract data from the JavaScript-rendered content
const data = [];
$('.dynamic-item').each((i, el) => {
  data.push({
    text: $(el).find('.text').text(),
    value: $(el).attr('data-value')
  });
});

return [{ json: { items: data } }];
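In production, avoid hardcoding the API key in code; calling the same endpoint from an HTTP Request node with a header-auth credential attached keeps the key in n8n's credential store.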
Method 4: Handling Pagination in n8n Workflows
To scrape multiple pages, create a loop in your workflow:
// In a Code node - generate page URLs
const baseUrl = 'https://example.com/products';
const totalPages = 10;

const urls = [];
for (let page = 1; page <= totalPages; page++) {
  urls.push({
    json: {
      url: `${baseUrl}?page=${page}`,
      pageNumber: page
    }
  });
}

return urls;
Then connect this to an HTTP Request node with Split In Batches to process the pages sequentially or in parallel. If the total page count isn't known up front, you can follow the site's "next" links instead, as in the sketch below.
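Here is a minimal sketch of that open-ended approach in a single Code node. It assumes a hypothetical .next-page link selector and the same external-module allowance for Cheerio as in Method 2:

// Open-ended pagination: follow the "next" link until it disappears.
// The .next-page selector and starting URL are assumptions - adjust
// them to the target site.
const cheerio = require('cheerio');

const items = [];
let url = 'https://example.com/products';
let pages = 0;
const maxPages = 50; // safety cap so a broken selector can't loop forever

while (url && pages < maxPages) {
  const html = await this.helpers.httpRequest({ method: 'GET', url });
  const $ = cheerio.load(html);

  $('.product-item').each((i, el) => {
    items.push({ title: $(el).find('.product-title').text().trim() });
  });

  // Resolve the next link relative to the current URL, if present
  const next = $('a.next-page').attr('href');
  url = next ? new URL(next, url).href : null;
  pages++;
}

return items.map(item => ({ json: item }));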
Method 5: Handling Authentication and Headers
Many websites require authentication or specific headers:
// In a Code node
const response = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/api/data',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    // Use either the bearer token here or the basic auth below - not both
    'Authorization': 'Bearer YOUR_TOKEN'
  },
  auth: {
    username: 'your-username',
    password: 'your-password'
  }
});

// Wrap the raw body so the item's json property is an object
return [{ json: { data: response } }];
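Some sites use session cookies rather than tokens. In that case you can log in once and replay the cookie on later requests. A minimal sketch, assuming a hypothetical /login endpoint and that your n8n version supports the returnFullResponse option:

// Cookie-based session handling (the /login endpoint and form fields
// are assumptions - adapt them to the target site)
const loginResponse = await this.helpers.httpRequest({
  method: 'POST',
  url: 'https://example.com/login',
  body: { username: 'your-username', password: 'your-password' },
  json: true,
  returnFullResponse: true // we need the Set-Cookie response header
});

// Keep only the "name=value" part of each cookie
const setCookie = loginResponse.headers['set-cookie'] || [];
const cookie = setCookie.map(c => c.split(';')[0]).join('; ');

// Replay the session cookie on an authenticated page
const html = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/members/data',
  headers: { Cookie: cookie }
});

return [{ json: { html } }];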
Method 6: Extracting Data from APIs
Many websites have underlying APIs that are easier to scrape than HTML:
// In a Code node - fetch JSON data directly
const apiResponse = await this.helpers.httpRequest({
  method: 'GET',
  url: 'https://example.com/api/products',
  qs: {
    category: 'electronics',
    limit: 100
  },
  json: true
});

// Transform and filter the data
const filteredProducts = apiResponse.products
  .filter(p => p.price < 1000)
  .map(p => ({
    name: p.title,
    price: p.price,
    available: p.stock > 0
  }));

return filteredProducts.map(p => ({ json: p }));
Handling Common Scraping Challenges
1. Rate Limiting and Delays
Add delays between requests to avoid being blocked:
// In a Code node
async function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

const results = [];
const urls = $input.all();

for (const item of urls) {
  const response = await this.helpers.httpRequest({
    method: 'GET',
    url: item.json.url
  });
  // Wrap the raw body so the item's json property is an object
  results.push({ json: { data: response } });

  // Wait 2 seconds between requests
  await delay(2000);
}

return results;
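If you prefer to keep the delay visible in the workflow, the built-in Wait node inside a Split In Batches loop achieves the same pacing without custom code.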
2. Error Handling
Implement robust error handling in your workflows:
// In a Code node
const results = [];
const errors = [];

for (const item of $input.all()) {
  try {
    const response = await this.helpers.httpRequest({
      method: 'GET',
      url: item.json.url,
      timeout: 30000
    });
    results.push({ json: { success: true, data: response } });
  } catch (error) {
    errors.push({
      json: {
        success: false,
        url: item.json.url,
        error: error.message
      }
    });
  }
}

return [...results, ...errors];
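Transient failures such as timeouts or 429 responses often succeed on a second try, so it can pay to retry before recording an error. A minimal sketch with exponential backoff; the three-attempt cap is an arbitrary starting point:

// Retry with exponential backoff: wait 1s, 2s, 4s, ... between attempts
async function fetchWithRetry(helpers, url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await helpers.httpRequest({ method: 'GET', url, timeout: 30000 });
    } catch (error) {
      if (attempt === attempts) throw error; // out of retries
      await new Promise(resolve =>
        setTimeout(resolve, 1000 * 2 ** (attempt - 1))
      );
    }
  }
}

const results = [];
for (const item of $input.all()) {
  const data = await fetchWithRetry(this.helpers, item.json.url);
  results.push({ json: { data } });
}

return results;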
3. Data Cleaning and Transformation
Clean extracted data before storage:
// In a Code node
function cleanText(text) {
  return text
    .trim()
    .replace(/\s+/g, ' ')          // collapse runs of whitespace
    .replace(/\n+/g, ' ')
    .replace(/[^\x20-\x7E]/g, ''); // strip non-ASCII (note: also drops accented characters)
}

function cleanPrice(priceStr) {
  // Pull the first number out of strings like "$1,299.99"
  const match = priceStr.match(/[\d,]+\.?\d*/);
  return match ? parseFloat(match[0].replace(/,/g, '')) : null;
}

const cleanedData = $input.all().map(item => ({
  json: {
    title: cleanText(item.json.title),
    price: cleanPrice(item.json.price),
    description: cleanText(item.json.description),
    scrapedAt: new Date().toISOString()
  }
}));

return cleanedData;
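Before storage, it also helps to enforce a minimal schema so malformed items never reach the sheet or database. A small sketch; the rules (non-empty title, positive price) are assumptions to adapt:

// Minimal validation: keep items with a non-empty title and a positive price
const valid = [];
let rejectedCount = 0;

for (const item of $input.all()) {
  const { title, price } = item.json;
  const ok =
    typeof title === 'string' && title.length > 0 &&
    typeof price === 'number' && price > 0;

  if (ok) {
    valid.push(item);
  } else {
    rejectedCount++;
  }
}

// Pass only valid items downstream; log the rest
console.log(`Rejected ${rejectedCount} invalid items`);
return valid;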
Storing Scraped Data
n8n can save scraped data to various destinations:
Save to Google Sheets
Add a Google Sheets node after your scraping logic:
{
  "operation": "append",
  "sheetName": "Scraped Products",
  "dataMode": "autoMapInputData"
}
Save to Database
Use Postgres or MySQL nodes:
{
  "operation": "insert",
  "table": "products",
  "columns": "title, price, url, scraped_at"
}
Save to JSON File
Use the Write Binary File node:
// Prepare data for file output
const data = $input.all().map(item => item.json);
const jsonContent = JSON.stringify(data, null, 2);

// n8n expects binary payloads in its own format, so convert the buffer
// with the prepareBinaryData helper rather than attaching it directly
return [{
  json: {},
  binary: {
    data: await this.helpers.prepareBinaryData(
      Buffer.from(jsonContent, 'utf-8'),
      'products.json',
      'application/json'
    )
  }
}];
Complete n8n Web Scraping Workflow Example
Here's a complete workflow that scrapes product data:
- Schedule Trigger: Run daily at 9 AM
- Code Node (Generate URLs): Create list of pages to scrape
- Split In Batches: Process 5 URLs at a time
- HTTP Request: Fetch each page
- Code Node (Parse HTML): Extract product data using Cheerio
- Code Node (Clean Data): Clean and validate extracted data
- Filter: Remove items without prices
- Google Sheets: Save to spreadsheet
- Slack: Send notification when complete (message-payload sketch below)
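As one concrete piece of the glue, the notification in step 9 needs only a small Code node that summarizes the run before the Slack node posts it. A minimal sketch of that payload:

// Build a summary message for the Slack node (step 9), assuming the
// incoming items are the rows that were written to the sheet
const items = $input.all();

return [{
  json: {
    text: `Scrape finished: ${items.length} products saved at ${new Date().toISOString()}`
  }
}];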
Best Practices for n8n Web Scraping
- Respect robots.txt: Check the website's robots.txt file before scraping
- Add delays: Use the Wait node or delays in code to avoid overwhelming servers
- Handle errors gracefully: Use the Error Trigger node to catch and log failures
- Use appropriate User-Agent: Identify your scraper properly in headers
- Monitor execution: Set up notifications for failed workflows
- Cache results: Store intermediate results to avoid re-scraping on failures (see the caching sketch after this list)
- Validate data: Always validate extracted data before storage
- Consider legal aspects: Ensure you have permission to scrape the target website
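One way to implement the caching advice within n8n itself is workflow static data, which persists across production executions (though not manual test runs). A minimal de-duplication sketch; the seenUrls key is an arbitrary name:

// Skip URLs that earlier runs already scraped, using workflow static data
const staticData = $getWorkflowStaticData('global');
staticData.seenUrls = staticData.seenUrls || [];

const fresh = $input.all().filter(
  item => !staticData.seenUrls.includes(item.json.url)
);

// Remember the new URLs for the next run
staticData.seenUrls.push(...fresh.map(item => item.json.url));

return fresh;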
When to Use External Scraping Services
For complex scenarios involving JavaScript rendering, CAPTCHAs, or rotating proxies, consider using dedicated scraping APIs like WebScraping.AI. These services provide:
- Automatic proxy rotation
- JavaScript rendering
- CAPTCHA solving
- Anti-bot detection bypass
- Reliable infrastructure
Much as you manage browser sessions in Puppeteer, maintaining a consistent scraping session across n8n workflow runs requires careful state management. Likewise, for dynamic content that must wait for elements to load, techniques similar to handling AJAX requests in Puppeteer can be applied through API-based rendering services integrated into your n8n workflows.
Conclusion
n8n provides powerful capabilities for web scraping through its visual workflow builder and JavaScript support. Start with simple HTTP Request and HTML Extract nodes for static content, then progress to custom JavaScript code for complex scenarios. For JavaScript-heavy websites, integrate external services or APIs to handle browser automation. Always follow ethical scraping practices and respect website terms of service.
With proper error handling, rate limiting, and data validation, you can build robust, automated web scraping workflows that run reliably on schedule or triggered by events.