How do I save scraped data to Google Sheets with n8n?
Saving scraped data directly to Google Sheets is one of the most common use cases for n8n automation workflows. This integration allows you to collect data from websites and automatically populate spreadsheets for analysis, reporting, or sharing with your team. In this guide, we'll walk through the complete process of setting up a web scraping workflow that saves data to Google Sheets.
Prerequisites
Before you begin, make sure you have:
- An n8n instance running (cloud or self-hosted)
- A Google account with access to Google Sheets
- Basic understanding of n8n workflows
- The target website URL you want to scrape
Setting Up Google Sheets Authentication
First, you need to connect your Google account to n8n:
- In your n8n workflow, add a Google Sheets node
- Click on Credentials and select Create New
- Choose the authentication method:
- OAuth2 (recommended for most users)
- Service Account (for automated/production environments)
For OAuth2 authentication, you'll be redirected to Google to grant permissions; allow n8n to access your Google Sheets.
Once authenticated, you can reuse these credentials across multiple workflows.
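If you choose the Service Account method instead, n8n typically asks for the service account email and private key from the JSON key file you download from Google Cloud. Below is a minimal sketch of the relevant fields (all values are placeholders); note that the target spreadsheet must be shared with the client_email address, or the Google Sheets node will fail with permission errors:
// Service account key file downloaded from Google Cloud (placeholder values)
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "n8n-sheets@your-project-id.iam.gserviceaccount.com"
}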
Basic Web Scraping to Google Sheets Workflow
Here's a simple workflow structure:
- Trigger Node - Start the workflow (manual, schedule, webhook)
- HTTP Request or Code Node - Scrape the website
- Data Processing Node - Clean and format the data
- Google Sheets Node - Save to spreadsheet
Example: Scraping Product Data
Let's create a workflow that scrapes product information and saves it to Google Sheets.
Step 1: HTTP Request Node
Configure the HTTP Request node to fetch the webpage:
// HTTP Request Node Configuration
{
"method": "GET",
"url": "https://example.com/products",
"options": {
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
}
}
Step 2: Code Node for Data Extraction
Use the Code node to parse HTML and extract data:
// Extract data using Cheerio (available on self-hosted n8n when external modules are allowed via NODE_FUNCTION_ALLOW_EXTERNAL)
const cheerio = require('cheerio');
// Get the HTML from the previous node; depending on the HTTP Request node's response settings, the property may be `data` instead of `body`
const html = $input.first().json.body;
const $ = cheerio.load(html);
// Extract product data
const products = [];
$('.product-item').each((i, element) => {
const product = {
name: $(element).find('.product-name').text().trim(),
price: $(element).find('.product-price').text().trim(),
url: $(element).find('a').attr('href'),
inStock: $(element).find('.stock-status').text().includes('In Stock'),
timestamp: new Date().toISOString()
};
products.push(product);
});
// Return the data in n8n format
return products.map(product => ({ json: product }));
Step 3: Google Sheets Node Configuration
Configure the Google Sheets node to append data:
Node Settings:
- Operation: Append or Update
- Document: Select your spreadsheet
- Sheet: Choose the worksheet (e.g., "Sheet1")
- Columns: Map your data fields
Field Mapping (recent versions of the Google Sheets node map values to your sheet's column header names; column letters are used here for illustration):
{
"A": "={{ $json.name }}",
"B": "={{ $json.price }}",
"C": "={{ $json.url }}",
"D": "={{ $json.inStock }}",
"E": "={{ $json.timestamp }}"
}
Advanced Workflow with Puppeteer
For JavaScript-heavy websites that require browser automation, you can run Puppeteer from a Code node on a self-hosted instance (with puppeteer installed and allowed via NODE_FUNCTION_ALLOW_EXTERNAL), or use the community n8n-nodes-puppeteer node, to handle dynamic content:
// Code Node with Puppeteer
const puppeteer = require('puppeteer');
async function scrapeWithPuppeteer() {
const browser = await puppeteer.launch({
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// Navigate to the page
await page.goto('https://example.com/products', {
waitUntil: 'networkidle2'
});
// Wait for content to load
await page.waitForSelector('.product-item');
// Extract data
const products = await page.evaluate(() => {
const items = document.querySelectorAll('.product-item');
return Array.from(items).map(item => ({
name: item.querySelector('.product-name')?.textContent.trim(),
price: item.querySelector('.product-price')?.textContent.trim(),
image: item.querySelector('img')?.src,
rating: item.querySelector('.rating')?.textContent.trim()
}));
});
await browser.close();
return products;
}
// Execute and return results
const data = await scrapeWithPuppeteer();
return data.map(item => ({ json: item }));
Handling Pagination
When scraping multiple pages, you'll need to loop through results:
// Code Node - Pagination Loop
const cheerio = require('cheerio');
const maxPages = 5;
const results = [];
for (let page = 1; page <= maxPages; page++) {
// Construct URL with page parameter
const url = `https://example.com/products?page=${page}`;
// Fetch each page with the Code node's built-in request helper
const html = await this.helpers.httpRequest({ url });
const $ = cheerio.load(html);
$('.product-item').each((i, element) => {
results.push({
page: page,
name: $(element).find('.product-name').text(),
price: $(element).find('.product-price').text()
});
});
// Respect rate limits
await new Promise(resolve => setTimeout(resolve, 1000));
}
return results.map(item => ({ json: item }));
Data Formatting and Validation
Before saving to Google Sheets, clean and validate your data:
// Code Node for Data Cleaning
const items = $input.all();
const cleanedData = items.map(item => {
const data = item.json;
return {
// Remove currency symbols and convert to number
price: parseFloat(data.price.replace(/[$,]/g, '')),
// Normalize text
name: data.name.trim().replace(/\s+/g, ' '),
// Format dates
scrapedDate: new Date().toLocaleDateString('en-US'),
// Clean URLs
url: data.url.startsWith('http') ? data.url : `https://example.com${data.url}`,
// Boolean conversion
inStock: data.inStock === 'true' || data.inStock === true
};
});
return cleanedData.map(item => ({ json: item }));
Google Sheets Operations
Appending Data
To add new rows to the end of your sheet:
// Google Sheets Node - Append Operation
{
"operation": "append",
"sheetId": "1abc123...",
"range": "Sheet1!A:E",
"options": {
"valueInputOption": "USER_ENTERED"
}
}
Updating Existing Rows
To update data based on a key (like product ID):
// Google Sheets Node - Update Operation
{
"operation": "update",
"sheetId": "1abc123...",
"range": "Sheet1!A:E",
"options": {
"valueInputOption": "USER_ENTERED",
"lookupColumn": "A", // Product ID column
"lookupValue": "={{ $json.productId }}"
}
}
Creating New Sheets
To organize data by date or category:
// Google Sheets Node - Create Sheet
{
"operation": "create",
"title": "Products_{{ $now.format('YYYY-MM-DD') }}"
}
Error Handling and Monitoring
Implement error handling to ensure data reliability:
// Code Node with Try-Catch
const cheerio = require('cheerio');
try {
const html = $input.first().json.body;
if (!html || html.length < 100) {
throw new Error('Invalid HTML response');
}
const $ = cheerio.load(html);
const products = [];
$('.product-item').each((i, element) => {
try {
const product = {
name: $(element).find('.product-name').text().trim(),
price: $(element).find('.product-price').text().trim()
};
// Validate required fields
if (product.name && product.price) {
products.push(product);
}
} catch (err) {
console.error(`Error parsing product ${i}:`, err.message);
}
});
if (products.length === 0) {
throw new Error('No products found');
}
return products.map(p => ({ json: p }));
} catch (error) {
// Return error info for monitoring
return [{
json: {
error: error.message,
timestamp: new Date().toISOString(),
url: $input.first().json.url
}
}];
}
Scheduling Automated Scraping
Set up a Schedule Trigger (or the older Cron node) to run your workflow automatically:
// Cron Node Configuration
{
"mode": "everyHour",
// Or use custom cron expression:
// "cronExpression": "0 */6 * * *" // Every 6 hours
}
Common schedules:
- Every hour: 0 * * * *
- Every day at 9 AM: 0 9 * * *
- Every Monday at 8 AM: 0 8 * * 1
- Every 15 minutes: */15 * * * *
Best Practices
1. Rate Limiting
Respect website resources by adding delays:
// Add delay between requests
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
await delay(2000); // 2 second delay
2. Use Webhooks for Real-Time Updates
Instead of scheduled scraping, use webhooks when available:
// Webhook Trigger Node
// Listens for external events
// URL: https://your-n8n.com/webhook/product-updates
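If the external service pushes data to that URL, a Code node placed after the Webhook Trigger can reshape the incoming payload into rows before the Google Sheets node. This is a minimal sketch assuming the Webhook node exposes the request body under json.body and that the payload contains a products array with name and price fields; those names are hypothetical and depend on what the caller actually sends:
// Code Node - normalize an incoming webhook payload (field names are assumptions)
const body = $input.first().json.body || {};
const products = Array.isArray(body.products) ? body.products : [];
return products.map(p => ({
  json: {
    name: p.name,
    price: p.price,
    receivedAt: new Date().toISOString()
  }
}));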
3. Data Deduplication
Prevent duplicate entries in your spreadsheet:
// Code Node - Check for Duplicates
// $('Node Name') reads another node's output; the names below must match the node names in your workflow
const existingData = $('Google Sheets').all();
const newData = $('Scraper').all();
const duplicates = new Set(existingData.map(item => item.json.id));
const uniqueData = newData.filter(item => !duplicates.has(item.json.id));
return uniqueData;
4. Structured Error Logging
Create a separate error log sheet:
// Google Sheets Node - Append to Error_Log (connected to the error branch)
{
"operation": "append",
"sheetName": "Error_Log",
"data": {
"timestamp": "={{ $now }}",
"error": "={{ $json.error }}",
"workflow": "={{ $workflow.name }}",
"node": "={{ $node.name }}"
}
}
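An alternative to handling errors inside the main workflow is n8n's built-in error routing: set an Error Workflow in the workflow settings and start that workflow with an Error Trigger node, which receives details about the failed execution. The sketch below shows a Code node inside such an error workflow that flattens the trigger output into a single row for a Google Sheets append; the property names (workflow.name, execution.error.message, execution.url) can differ between n8n versions, so verify them against the trigger's actual output:
// Code Node in an error workflow - flatten Error Trigger output for logging
// (property names are assumptions; inspect the Error Trigger output in your n8n version)
const data = $input.first().json;
return [{
  json: {
    timestamp: new Date().toISOString(),
    workflow: data.workflow?.name,
    errorMessage: data.execution?.error?.message,
    executionUrl: data.execution?.url
  }
}];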
Using WebScraping.AI API with n8n
For more reliable scraping of JavaScript-heavy pages and sites with anti-bot protection, you can call the WebScraping.AI API from an HTTP Request node:
// HTTP Request Node - WebScraping.AI
{
"method": "GET",
"url": "https://api.webscraping.ai/html",
"qs": {
"api_key": "YOUR_API_KEY",
"url": "https://example.com/products",
"js": true,
"proxy": "datacenter"
}
}
This approach handles:
- JavaScript rendering
- Anti-bot detection
- Proxy rotation
- CAPTCHA solving
- Automatic retries
Complete Workflow Example
Here's a JSON export of a complete n8n workflow:
{
"nodes": [
{
"name": "Schedule Trigger",
"type": "n8n-nodes-base.cron",
"parameters": {
"triggerTimes": {
"item": [
{
"mode": "everyHour"
}
]
}
}
},
{
"name": "Scrape Website",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://example.com/products",
"options": {}
}
},
{
"name": "Parse HTML",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "// Cheerio parsing code here"
}
},
{
"name": "Save to Google Sheets",
"type": "n8n-nodes-base.googleSheets",
"parameters": {
"operation": "append",
"sheetId": "YOUR_SHEET_ID",
"range": "Sheet1"
}
}
],
"connections": {
"Schedule Trigger": {
"main": [[{"node": "Scrape Website"}]]
},
"Scrape Website": {
"main": [[{"node": "Parse HTML"}]]
},
"Parse HTML": {
"main": [[{"node": "Save to Google Sheets"}]]
}
}
}
Troubleshooting Common Issues
Issue: Authentication Errors
Solution: Re-authenticate your Google Sheets credentials and ensure the OAuth token hasn't expired.
Issue: Rate Limiting
Solution: Add delays between requests and consider using a proxy service or WebScraping.AI API.
Issue: Empty Data
Solution: Verify your CSS selectors or XPath expressions. Use browser DevTools to inspect the page structure.
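A quick way to diagnose this is a temporary Code node that reports how much HTML actually arrived and how many elements your selector matched. The sketch below reuses the .product-item selector from the earlier examples and assumes Cheerio is available as described above:
// Code Node - temporary selector debugging
const cheerio = require('cheerio');
// Depending on the HTTP Request node's response settings, the HTML may be under `data` instead of `body`
const json = $input.first().json;
const html = json.body || json.data || '';
const $ = cheerio.load(html);
return [{
  json: {
    htmlLength: html.length,
    productItemCount: $('.product-item').length
  }
}];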
Issue: Duplicate Data
Solution: Implement the deduplication logic shown above or use the "Update" operation with a unique identifier column.
Conclusion
Saving scraped data to Google Sheets with n8n provides a powerful automation solution for data collection and analysis. By following this guide, you can build robust workflows that extract data from websites and automatically populate your spreadsheets. Remember to implement error handling, respect rate limits, and monitor your workflows for optimal performance.
For production environments, consider using dedicated scraping APIs like WebScraping.AI to handle complex scenarios and reduce maintenance overhead.