How do I save scraped data to Google Sheets with n8n?
Saving scraped data directly to Google Sheets is one of the most common use cases for n8n automation workflows. This integration allows you to collect data from websites and automatically populate spreadsheets for analysis, reporting, or sharing with your team. In this guide, we'll walk through the complete process of setting up a web scraping workflow that saves data to Google Sheets.
Prerequisites
Before you begin, make sure you have:
- An n8n instance running (cloud or self-hosted)
- A Google account with access to Google Sheets
- Basic understanding of n8n workflows
- The target website URL you want to scrape
Setting Up Google Sheets Authentication
First, you need to connect your Google account to n8n:
- In your n8n workflow, add a Google Sheets node
- Click on Credentials and select Create New
- Choose the authentication method:
- OAuth2 (recommended for most users)
- Service Account (for automated/production environments)
For OAuth2 authentication, you'll be redirected to Google to grant permissions; allow n8n to access your Google Sheets.
Once authenticated, you can reuse these credentials across multiple workflows.
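If you choose the Service Account method instead, n8n typically asks for the service account email and private key from the JSON key file you download from Google Cloud. Below is a minimal sketch of the relevant fields (all values are placeholders); note that the target spreadsheet must be shared with the client_email address, or the Google Sheets node will fail with permission errors:
// Service account key file downloaded from Google Cloud (placeholder values)
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "n8n-sheets@your-project-id.iam.gserviceaccount.com"
}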
Basic Web Scraping to Google Sheets Workflow
Here's a simple workflow structure:
- Trigger Node - Start the workflow (manual, schedule, webhook)
- HTTP Request or Code Node - Scrape the website
- Data Processing Node - Clean and format the data
- Google Sheets Node - Save to spreadsheet
Example: Scraping Product Data
Let's create a workflow that scrapes product information and saves it to Google Sheets.
Step 1: HTTP Request Node
Configure the HTTP Request node to fetch the webpage:
// HTTP Request Node Configuration
{
"method": "GET",
"url": "https://example.com/products",
"options": {
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
}
}
Step 2: Code Node for Data Extraction
Use the Code node to parse HTML and extract data:
// Extract data using Cheerio (available on self-hosted n8n when external modules are allowed via NODE_FUNCTION_ALLOW_EXTERNAL)
const cheerio = require('cheerio');
// Get the HTML from the previous node; depending on the HTTP Request node's response settings, the property may be `data` instead of `body`
const html = $input.first().json.body;
const $ = cheerio.load(html);
// Extract product data
const products = [];
$('.product-item').each((i, element) => {
const product = {
name: $(element).find('.product-name').text().trim(),
price: $(element).find('.product-price').text().trim(),
url: $(element).find('a').attr('href'),
inStock: $(element).find('.stock-status').text().includes('In Stock'),
timestamp: new Date().toISOString()
};
products.push(product);
});
// Return the data in n8n format
return products.map(product => ({ json: product }));
Step 3: Google Sheets Node Configuration
Configure the Google Sheets node to append data:
Node Settings:
- Operation: Append or Update
- Document: Select your spreadsheet
- Sheet: Choose the worksheet (e.g., "Sheet1")
- Columns: Map your data fields
Field Mapping (recent versions of the Google Sheets node map values to your sheet's column header names; column letters are used here for illustration):
{
"A": "={{ $json.name }}",
"B": "={{ $json.price }}",
"C": "={{ $json.url }}",
"D": "={{ $json.inStock }}",
"E": "={{ $json.timestamp }}"
}
Advanced Workflow with Puppeteer
For JavaScript-heavy websites that require browser automation, you can run Puppeteer from a Code node on a self-hosted instance (with puppeteer installed and allowed via NODE_FUNCTION_ALLOW_EXTERNAL), or use the community n8n-nodes-puppeteer node, to handle dynamic content:
// Code Node with Puppeteer
const puppeteer = require('puppeteer');
async function scrapeWithPuppeteer() {
const browser = await puppeteer.launch({
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// Navigate to the page
await page.goto('https://example.com/products', {
waitUntil: 'networkidle2'
});
// Wait for content to load
await page.waitForSelector('.product-item');
// Extract data
const products = await page.evaluate(() => {
const items = document.querySelectorAll('.product-item');
return Array.from(items).map(item => ({
name: item.querySelector('.product-name')?.textContent.trim(),
price: item.querySelector('.product-price')?.textContent.trim(),
image: item.querySelector('img')?.src,
rating: item.querySelector('.rating')?.textContent.trim()
}));
});
await browser.close();
return products;
}
// Execute and return results
const data = await scrapeWithPuppeteer();
return data.map(item => ({ json: item }));
Handling Pagination
When scraping multiple pages, you'll need to loop through results:
// Code Node - Pagination Loop
const cheerio = require('cheerio');
const maxPages = 5;
const results = [];
for (let page = 1; page <= maxPages; page++) {
// Construct URL with page parameter
const url = `https://example.com/products?page=${page}`;
// Fetch each page with the Code node's built-in request helper
const html = await this.helpers.httpRequest({ url });
const $ = cheerio.load(html);
$('.product-item').each((i, element) => {
results.push({
page: page,
name: $(element).find('.product-name').text(),
price: $(element).find('.product-price').text()
});
});
// Respect rate limits
await new Promise(resolve => setTimeout(resolve, 1000));
}
return results.map(item => ({ json: item }));
Data Formatting and Validation
Before saving to Google Sheets, clean and validate your data:
// Code Node for Data Cleaning
const items = $input.all();
const cleanedData = items.map(item => {
const data = item.json;
return {
// Remove currency symbols and convert to number
price: parseFloat(data.price.replace(/[$,]/g, '')),
// Normalize text
name: data.name.trim().replace(/\s+/g, ' '),
// Format dates
scrapedDate: new Date().toLocaleDateString('en-US'),
// Clean URLs
url: data.url.startsWith('http') ? data.url : `https://example.com${data.url}`,
// Boolean conversion
inStock: data.inStock === 'true' || data.inStock === true
};
});
return cleanedData.map(item => ({ json: item }));
Google Sheets Operations
Appending Data
To add new rows to the end of your sheet:
// Google Sheets Node - Append Operation
{
"operation": "append",
"sheetId": "1abc123...",
"range": "Sheet1!A:E",
"options": {
"valueInputOption": "USER_ENTERED"
}
}
Updating Existing Rows
To update data based on a key (like product ID):
// Google Sheets Node - Update Operation
{
"operation": "update",
"sheetId": "1abc123...",
"range": "Sheet1!A:E",
"options": {
"valueInputOption": "USER_ENTERED",
"lookupColumn": "A", // Product ID column
"lookupValue": "={{ $json.productId }}"
}
}
Creating New Sheets
To organize data by date or category:
// Google Sheets Node - Create Sheet
{
"operation": "create",
"title": "Products_{{ $now.format('YYYY-MM-DD') }}"
}
Error Handling and Monitoring
Implement error handling to ensure data reliability:
// Code Node with Try-Catch
const cheerio = require('cheerio');
try {
const html = $input.first().json.body;
if (!html || html.length < 100) {
throw new Error('Invalid HTML response');
}
const $ = cheerio.load(html);
const products = [];
$('.product-item').each((i, element) => {
try {
const product = {
name: $(element).find('.product-name').text().trim(),
price: $(element).find('.product-price').text().trim()
};
// Validate required fields
if (product.name && product.price) {
products.push(product);
}
} catch (err) {
console.error(`Error parsing product ${i}:`, err.message);
}
});
if (products.length === 0) {
throw new Error('No products found');
}
return products.map(p => ({ json: p }));
} catch (error) {
// Return error info for monitoring
return [{
json: {
error: error.message,
timestamp: new Date().toISOString(),
url: $input.first().json.url
}
}];
}
Scheduling Automated Scraping
Set up a Schedule Trigger (or the older Cron node) to run your workflow automatically:
// Cron Node Configuration
{
"mode": "everyHour",
// Or use custom cron expression:
// "cronExpression": "0 */6 * * *" // Every 6 hours
}
Common schedules:
- Every hour: 0 * * * *
- Every day at 9 AM: 0 9 * * *
- Every Monday at 8 AM: 0 8 * * 1
- Every 15 minutes: */15 * * * *
Best Practices
1. Rate Limiting
Respect website resources by adding delays:
// Add delay between requests
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
await delay(2000); // 2 second delay
2. Use Webhooks for Real-Time Updates
Instead of scheduled scraping, use webhooks when available:
// Webhook Trigger Node
// Listens for external events
// URL: https://your-n8n.com/webhook/product-updates
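If the external service pushes data to that URL, a Code node placed after the Webhook Trigger can reshape the incoming payload into rows before the Google Sheets node. This is a minimal sketch assuming the Webhook node exposes the request body under json.body and that the payload contains a products array with name and price fields; those names are hypothetical and depend on what the caller actually sends:
// Code Node - normalize an incoming webhook payload (field names are assumptions)
const body = $input.first().json.body || {};
const products = Array.isArray(body.products) ? body.products : [];
return products.map(p => ({
  json: {
    name: p.name,
    price: p.price,
    receivedAt: new Date().toISOString()
  }
}));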
3. Data Deduplication
Prevent duplicate entries in your spreadsheet:
// Code Node - Check for Duplicates
// $('Node Name') reads another node's output; the names below must match the node names in your workflow
const existingData = $('Google Sheets').all();
const newData = $('Scraper').all();
const duplicates = new Set(existingData.map(item => item.json.id));
const uniqueData = newData.filter(item => !duplicates.has(item.json.id));
return uniqueData;
4. Structured Error Logging
Create a separate error log sheet:
// Google Sheets Node - Append to Error_Log (connected to the error branch)
{
"operation": "append",
"sheetName": "Error_Log",
"data": {
"timestamp": "={{ $now }}",
"error": "={{ $json.error }}",
"workflow": "={{ $workflow.name }}",
"node": "={{ $node.name }}"
}
}
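An alternative to handling errors inside the main workflow is n8n's built-in error routing: set an Error Workflow in the workflow settings and start that workflow with an Error Trigger node, which receives details about the failed execution. The sketch below shows a Code node inside such an error workflow that flattens the trigger output into a single row for a Google Sheets append; the property names (workflow.name, execution.error.message, execution.url) can differ between n8n versions, so verify them against the trigger's actual output:
// Code Node in an error workflow - flatten Error Trigger output for logging
// (property names are assumptions; inspect the Error Trigger output in your n8n version)
const data = $input.first().json;
return [{
  json: {
    timestamp: new Date().toISOString(),
    workflow: data.workflow?.name,
    errorMessage: data.execution?.error?.message,
    executionUrl: data.execution?.url
  }
}];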
Using WebScraping.AI API with n8n
For more reliable scraping of JavaScript-heavy pages and sites with anti-bot protection, you can call the WebScraping.AI API from an HTTP Request node:
// HTTP Request Node - WebScraping.AI
{
"method": "GET",
"url": "https://api.webscraping.ai/html",
"qs": {
"api_key": "YOUR_API_KEY",
"url": "https://example.com/products",
"js": true,
"proxy": "datacenter"
}
}
This approach handles:
- JavaScript rendering
- Anti-bot detection
- Proxy rotation
- CAPTCHA solving
- Automatic retries
Complete Workflow Example
Here's a JSON export of a complete n8n workflow:
{
"nodes": [
{
"name": "Schedule Trigger",
"type": "n8n-nodes-base.cron",
"parameters": {
"triggerTimes": {
"item": [
{
"mode": "everyHour"
}
]
}
}
},
{
"name": "Scrape Website",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://example.com/products",
"options": {}
}
},
{
"name": "Parse HTML",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "// Cheerio parsing code here"
}
},
{
"name": "Save to Google Sheets",
"type": "n8n-nodes-base.googleSheets",
"parameters": {
"operation": "append",
"sheetId": "YOUR_SHEET_ID",
"range": "Sheet1"
}
}
],
"connections": {
"Schedule Trigger": {
"main": [[{"node": "Scrape Website"}]]
},
"Scrape Website": {
"main": [[{"node": "Parse HTML"}]]
},
"Parse HTML": {
"main": [[{"node": "Save to Google Sheets"}]]
}
}
}
Troubleshooting Common Issues
Issue: Authentication Errors
Solution: Re-authenticate your Google Sheets credentials and ensure the OAuth token hasn't expired.
Issue: Rate Limiting
Solution: Add delays between requests and consider using a proxy service or WebScraping.AI API.
Issue: Empty Data
Solution: Verify your CSS selectors or XPath expressions. Use browser DevTools to inspect the page structure.
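A quick way to diagnose this is a temporary Code node that reports how much HTML actually arrived and how many elements your selector matched. The sketch below reuses the .product-item selector from the earlier examples and assumes Cheerio is available as described above:
// Code Node - temporary selector debugging
const cheerio = require('cheerio');
// Depending on the HTTP Request node's response settings, the HTML may be under `data` instead of `body`
const json = $input.first().json;
const html = json.body || json.data || '';
const $ = cheerio.load(html);
return [{
  json: {
    htmlLength: html.length,
    productItemCount: $('.product-item').length
  }
}];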
Issue: Duplicate Data
Solution: Implement the deduplication logic shown above or use the "Update" operation with a unique identifier column.
Conclusion
Saving scraped data to Google Sheets with n8n provides a powerful automation solution for data collection and analysis. By following this guide, you can build robust workflows that extract data from websites and automatically populate your spreadsheets. Remember to implement error handling, respect rate limits, and monitor your workflows for optimal performance.
For production environments, consider using dedicated scraping APIs like WebScraping.AI to handle complex scenarios and reduce maintenance overhead.