How do I integrate n8n with other scraping APIs?

Integrating n8n with external scraping APIs allows you to leverage powerful third-party tools while maintaining the flexibility of workflow automation. This guide covers various approaches to connect n8n with popular scraping APIs, including authentication, error handling, and data transformation.

Understanding n8n API Integration

n8n provides multiple methods for integrating with external scraping APIs:

  1. HTTP Request Node - For direct API calls
  2. Webhook Node - For receiving scraping results
  3. Custom Nodes - For frequently used APIs
  4. Function Nodes - For complex data transformations

Using the HTTP Request Node

The HTTP Request node is the primary method for integrating with scraping APIs. Here's how to configure it for common scenarios:

Basic API Integration

// Example: Making a GET request to a scraping API
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "authentication": "headerAuth",
  "qs": {
    "url": "https://example.com",
    "api_key": "{{$credentials.apiKey}}"
  }
}

To set up the HTTP Request node:

  1. Add an HTTP Request node to your workflow
  2. Select the request method (GET, POST, PUT, etc.)
  3. Enter the API endpoint URL
  4. Configure authentication (API key, OAuth, Basic Auth)
  5. Add query parameters or request body as needed
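
Before wiring this into a live workflow, it can help to sanity-check the endpoint and parameters outside n8n. A minimal Node.js sketch (assumes Node 18+ for the built-in fetch; YOUR_API_KEY is a placeholder):

// Standalone sanity check for the scraping endpoint (Node 18+)
const params = new URLSearchParams({
  url: 'https://example.com',
  api_key: 'YOUR_API_KEY' // placeholder, replace with your real key
});

fetch(`https://api.webscraping.ai/html?${params}`)
  .then(res => {
    console.log('Status:', res.status); // expect 200 on success
    return res.text();
  })
  .then(html => console.log(html.slice(0, 200))) // preview the first 200 characters
  .catch(err => console.error('Request failed:', err));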

WebScraping.AI Integration

WebScraping.AI is a developer-friendly scraping API that works seamlessly with n8n:

// HTTP Request Node Configuration
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{$node["Webhook"].json["targetUrl"]}}",
    "api_key": "{{$credentials.webScrapingAI}}",
    "js": true,
    "proxy": "datacenter"
  }
}

Python equivalent for understanding the API structure:

import requests

url = "https://api.webscraping.ai/html"
params = {
    "url": "https://example.com",
    "api_key": "YOUR_API_KEY",
    "js": "true",
    "proxy": "datacenter"
}

response = requests.get(url, params=params)
html_content = response.text

Authentication Methods

API Key Authentication

Most scraping APIs use API key authentication. Configure it in n8n:

  1. Go to Credentials → New Credentials
  2. Select Header Auth or API Key
  3. Add your API key details
  4. Reference in HTTP Request node: {{$credentials.apiName}}

// Header Auth Configuration
{
  "name": "X-API-Key",
  "value": "your_api_key_here"
}
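
For reference, this is roughly what header auth puts on the wire. A hedged JavaScript equivalent (the endpoint is a placeholder; some APIs expect Authorization or a different header name instead of X-API-Key):

// What header-based API key auth sends, as a raw request
// (run inside an async function; Node 18+ for built-in fetch)
const response = await fetch('https://api.example.com/scrape', {
  headers: {
    'X-API-Key': 'your_api_key_here' // same header name/value as the credential above
  }
});
const data = await response.json();
console.log(data);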

OAuth 2.0 Authentication

For APIs requiring OAuth:

// OAuth2 Configuration in n8n
{
  "authUrl": "https://api.example.com/oauth/authorize",
  "accessTokenUrl": "https://api.example.com/oauth/token",
  "clientId": "{{$credentials.clientId}}",
  "clientSecret": "{{$credentials.clientSecret}}",
  "scope": "scraping:read scraping:write"
}
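
For context, here is roughly what a token exchange looks like at the HTTP level, using the client-credentials variant as an example (n8n performs the flow configured above automatically once the credential is saved; endpoints and field names vary by provider):

// Minimal OAuth2 client-credentials token request (illustrative only;
// n8n handles this exchange for you after the credential is configured)
const tokenResponse = await fetch('https://api.example.com/oauth/token', {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: new URLSearchParams({
    grant_type: 'client_credentials',
    client_id: 'YOUR_CLIENT_ID',         // placeholder
    client_secret: 'YOUR_CLIENT_SECRET', // placeholder
    scope: 'scraping:read scraping:write'
  })
});
const { access_token } = await tokenResponse.json();
// Subsequent calls then send: Authorization: Bearer <access_token>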

Handling API Responses

Parsing JSON Responses

Use the Function node to transform API responses:

// Function Node: Parse and extract data
const apiResponse = items[0].json;

// Extract specific fields
const extractedData = {
  title: apiResponse.data.title,
  content: apiResponse.data.content,
  timestamp: new Date().toISOString()
};

return [{ json: extractedData }];

HTML Parsing with Code Node

When working with HTML responses, you can parse data using the Code node:

// Code Node: Parse HTML response
// Note: require() of external modules such as cheerio works only on
// self-hosted n8n with NODE_FUNCTION_ALLOW_EXTERNAL set to include them.
const cheerio = require('cheerio');

for (const item of $input.all()) {
  const html = item.json.html;
  const $ = cheerio.load(html);

  const products = [];
  $('.product').each((i, elem) => {
    products.push({
      name: $(elem).find('.product-name').text(),
      price: $(elem).find('.product-price').text(),
      url: $(elem).find('a').attr('href')
    });
  });

  item.json.parsedData = products;
}

return $input.all();

JavaScript equivalent for external testing:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithAPI() {
  const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
      url: 'https://example.com/products',
      api_key: 'YOUR_API_KEY'
    }
  });

  const $ = cheerio.load(response.data);
  const products = [];

  $('.product').each((i, elem) => {
    products.push({
      name: $(elem).find('.product-name').text(),
      price: $(elem).find('.product-price').text()
    });
  });

  return products;
}

scrapeWithAPI().then(products => console.log(products));

Advanced Integration Patterns

Batch Processing with Loop

Process multiple URLs using the Loop Over Items (Split in Batches) node:

// Split in Batches Node Configuration
{
  "batchSize": 10,
  "options": {}
}

// HTTP Request Node (inside loop)
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{$json["url"]}}",
    "api_key": "{{$credentials.apiKey}}"
  }
}
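
The loop expects one URL per item, so an upstream Code node can fan a URL list out into individual items. A minimal sketch (the urls array is a stand-in for wherever your workflow sources its targets):

// Code Node: turn a URL list into one item per URL for the loop
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

return urls.map(url => ({ json: { url } }));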

Error Handling and Retry Logic

Implement robust error handling using the Error Trigger and IF nodes:

// IF Node: Check for API errors
{
  "conditions": {
    "string": [
      {
        "value1": "={{$json["status"]}}",
        "operation": "notEqual",
        "value2": "success"
      }
    ]
  }
}

// Wait Node: Delay before retry
{
  "unit": "seconds",
  "amount": 5
}
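
The same retry idea in plain JavaScript, useful for testing outside n8n. A sketch assuming exponential backoff and Node 18+; the retry count and delays are illustrative:

// Standalone retry with exponential backoff (Node 18+ for built-in fetch)
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url);
    if (response.ok) return response.text();

    if (attempt === maxRetries) {
      throw new Error(`All ${maxRetries + 1} attempts failed for ${url}`);
    }

    const delayMs = 5000 * 2 ** attempt; // 5s, 10s, 20s, ...
    console.warn(`Attempt ${attempt + 1} returned ${response.status}; retrying in ${delayMs} ms`);
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
}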

Rate Limiting

Prevent API rate limit issues with throttling:

// Function Node: Add delay between requests
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

for (let i = 0; i < items.length; i++) {
  if (i > 0) {
    await delay(1000); // 1 second delay between requests
  }
  items[i].json.processed = true;
}

return items;

Integration with Puppeteer-Based APIs

Many scraping APIs offer browser automation capabilities similar to Puppeteer for handling JavaScript-heavy sites:

// HTTP Request for browser-based scraping
{
  "method": "POST",
  "url": "https://api.webscraping.ai/html",
  "body": {
    "url": "{{$json["targetUrl"]}}",
    "js": true,
    "js_timeout": 5000,
    "proxy": "residential"
  },
  "headers": {
    "Content-Type": "application/json"
  }
}
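
The same rendered-page request as a standalone script, using the query-string style of the earlier GET examples (a sketch; YOUR_API_KEY is a placeholder and parameter support varies by provider):

// Rendered-page request as a standalone script
// (Node 18+; run inside an async function)
const params = new URLSearchParams({
  url: 'https://example.com/spa-page', // a JavaScript-heavy target
  api_key: 'YOUR_API_KEY',             // placeholder
  js: 'true',                          // enable headless-browser rendering
  js_timeout: '5000',                  // give page scripts up to 5 seconds
  proxy: 'residential'
});

const response = await fetch(`https://api.webscraping.ai/html?${params}`);
console.log(await response.text());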

Webhook Integration for Async Scraping

For long-running scraping tasks, use webhooks to receive results:

Step 1: Set Up Webhook Node

// Webhook Node Configuration
{
  "path": "scraping-callback",
  "method": "POST",
  "responseMode": "onReceived"
}

Step 2: Send Webhook URL to API

// HTTP Request: Initiate scraping with callback
{
  "method": "POST",
  "url": "https://api.scraper.com/scrape",
  "body": {
    "url": "https://example.com",
    "callback_url": "{{$node["Webhook"].json["webhookUrl"]}}"
  }
}

Step 3: Process Webhook Data

// Function Node: Process callback data
const webhookData = items[0].json;

return [{
  json: {
    jobId: webhookData.job_id,
    status: webhookData.status,
    results: webhookData.data,
    completedAt: new Date().toISOString()
  }
}];
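
If an API doesn't offer callbacks, polling is the usual fallback. A sketch in which the /status endpoint, job fields, and status values are hypothetical and will differ per provider:

// Polling fallback for APIs without webhook callbacks
// (endpoints, fields, and status values are hypothetical)
async function waitForJob(jobId, apiKey) {
  while (true) {
    const res = await fetch(`https://api.scraper.com/status/${jobId}?api_key=${apiKey}`);
    const job = await res.json();

    if (job.status === 'completed') return job.data;
    if (job.status === 'failed') throw new Error(`Job ${jobId} failed`);

    await new Promise(resolve => setTimeout(resolve, 5000)); // poll every 5s
  }
}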

Proxy Configuration

Many scraping APIs support proxy configuration for avoiding blocks:

// HTTP Request with proxy parameters
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{$json["url"]}}",
    "api_key": "{{$credentials.apiKey}}",
    "proxy": "residential",
    "country": "us",
    "device": "desktop"
  }
}

Data Storage and Export

Save to Database

// PostgreSQL Node Configuration
{
  "operation": "insert",
  "table": "scraped_data",
  "columns": "url,title,content,scraped_at",
  "returnFields": "*"
}
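
The node maps item fields to columns by name, so a Code node just before it can shape each item to match the table. A minimal sketch (field names follow the columns listed above):

// Code Node: shape items to match the url,title,content,scraped_at columns
return $input.all().map(item => ({
  json: {
    url: item.json.url,
    title: item.json.title ?? '',
    content: item.json.content ?? '',
    scraped_at: new Date().toISOString()
  }
}));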

Export to Google Sheets

// Google Sheets Node
{
  "operation": "append",
  "sheetId": "{{$json["sheetId"]}}",
  "range": "Sheet1!A:D",
  "options": {
    "valueInputMode": "USER_ENTERED"
  }
}

Testing and Debugging

Console Logging in Function Nodes

// Function Node: Debug API responses
console.log('API Response:', JSON.stringify(items[0].json, null, 2));
console.log('Status Code:', items[0].json.statusCode);
console.log('Headers:', items[0].json.headers);

return items;

Manual Execution Testing

Use the Execute Node feature to test individual API calls before running the full workflow. Check the execution log for:

  • Request headers and body
  • Response status codes
  • Response data structure
  • Execution time

Best Practices

  1. Credential Management: Store API keys in n8n credentials, never hardcode them
  2. Error Handling: Always implement try-catch logic and error branches
  3. Rate Limiting: Respect API rate limits using Wait nodes
  4. Data Validation: Validate API responses before processing (see the sketch after this list)
  5. Logging: Log important events for debugging and monitoring
  6. Caching: Cache results when possible to reduce API calls
  7. Monitoring: Set up notifications for workflow failures
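
As a concrete example of point 4, a minimal validation Function node might look like this (the expected html field is an assumption; adapt the checks to your API's response shape):

// Function Node: validate the API response before it reaches downstream nodes
const response = items[0].json;

if (!response || typeof response.html !== 'string' || response.html.length === 0) {
  // Throwing here routes the execution to the workflow's error handling
  throw new Error('Invalid API response: missing or empty html field');
}

return items;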

Common Integration Examples

ScraperAPI Integration

{
  "method": "GET",
  "url": "http://api.scraperapi.com",
  "qs": {
    "api_key": "{{$credentials.scraperAPI}}",
    "url": "{{$json["targetUrl"]}}",
    "render": "true"
  }
}

Bright Data (formerly Luminati)

{
  "method": "POST",
  "url": "https://api.brightdata.com/request",
  "authentication": "basicAuth",
  "body": {
    "zone": "scraping_browser",
    "url": "{{$json["url"]}}",
    "format": "raw"
  }
}

Apify Integration

{
  "method": "POST",
  "url": "https://api.apify.com/v2/acts/[ACTOR_ID]/runs",
  "qs": {
    "token": "{{$credentials.apifyToken}}"
  },
  "body": {
    "startUrls": [{"url": "{{$json["url"]}}"}]
  }
}
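
Apify runs are asynchronous, so a common pattern is to start the run, poll its status, then fetch the default dataset. A sketch against Apify's v2 REST API (placeholders as above; error handling kept minimal):

// Start an Apify actor run, poll it, then fetch its dataset (Node 18+)
async function runApifyActor(actorId, token, input) {
  // 1. Start the run
  const startRes = await fetch(
    `https://api.apify.com/v2/acts/${actorId}/runs?token=${token}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input)
    }
  );
  const run = (await startRes.json()).data;

  // 2. Poll until the run leaves the READY/RUNNING states
  let status = run.status;
  while (status === 'READY' || status === 'RUNNING') {
    await new Promise(resolve => setTimeout(resolve, 5000));
    const pollRes = await fetch(
      `https://api.apify.com/v2/actor-runs/${run.id}?token=${token}`
    );
    status = (await pollRes.json()).data.status;
  }
  if (status !== 'SUCCEEDED') {
    throw new Error(`Run finished with status ${status}`);
  }

  // 3. Fetch the scraped items from the run's default dataset
  const itemsRes = await fetch(
    `https://api.apify.com/v2/datasets/${run.defaultDatasetId}/items?token=${token}`
  );
  return itemsRes.json();
}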

Conclusion

Integrating n8n with scraping APIs combines workflow automation with powerful data extraction. By using the HTTP Request node with proper authentication, error handling, and response parsing, you can build robust scraping workflows that scale with your needs.

Whether you're processing single pages or running large-scale data extraction operations with parallel execution patterns, n8n's flexibility makes it an excellent choice for automating web scraping tasks through external APIs.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
