How do I integrate n8n with other scraping APIs?

Integrating n8n with external scraping APIs allows you to leverage powerful third-party tools while maintaining the flexibility of workflow automation. This guide covers various approaches to connect n8n with popular scraping APIs, including authentication, error handling, and data transformation.

Understanding n8n API Integration

n8n provides multiple methods for integrating with external scraping APIs:

  1. HTTP Request Node - For direct API calls
  2. Webhook Node - For receiving scraping results
  3. Custom Nodes - For frequently used APIs
  4. Function Nodes - For complex data transformations

Using the HTTP Request Node

The HTTP Request node is the primary method for integrating with scraping APIs. Here's how to configure it for common scenarios:

Basic API Integration

// Example: Making a GET request to a scraping API
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "authentication": "headerAuth",
  "qs": {
    "url": "https://example.com",
    "api_key": "{{$credentials.apiKey}}"
  }
}

To set up the HTTP Request node:

  1. Add an HTTP Request node to your workflow
  2. Select the request method (GET, POST, PUT, etc.)
  3. Enter the API endpoint URL
  4. Configure authentication (API key, OAuth, Basic Auth)
  5. Add query parameters or request body as needed
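
Before wiring this into a live workflow, it can help to sanity-check the endpoint and parameters outside n8n. A minimal Node.js sketch (assumes Node 18+ for the built-in fetch; YOUR_API_KEY is a placeholder):

// Standalone sanity check for the scraping endpoint (Node 18+)
const params = new URLSearchParams({
  url: 'https://example.com',
  api_key: 'YOUR_API_KEY' // placeholder, replace with your real key
});

fetch(`https://api.webscraping.ai/html?${params}`)
  .then(res => {
    console.log('Status:', res.status); // expect 200 on success
    return res.text();
  })
  .then(html => console.log(html.slice(0, 200))) // preview the first 200 characters
  .catch(err => console.error('Request failed:', err));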

WebScraping.AI Integration

WebScraping.AI is a developer-friendly scraping API that works seamlessly with n8n:

// HTTP Request Node Configuration
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{$node["Webhook"].json["targetUrl"]}}",
    "api_key": "{{$credentials.webScrapingAI}}",
    "js": true,
    "proxy": "datacenter"
  }
}

Python equivalent for understanding the API structure:

import requests

url = "https://api.webscraping.ai/html"
params = {
    "url": "https://example.com",
    "api_key": "YOUR_API_KEY",
    "js": "true",
    "proxy": "datacenter"
}

response = requests.get(url, params=params)
html_content = response.text

Authentication Methods

API Key Authentication

Most scraping APIs use API key authentication. Configure it in n8n:

  1. Go to Credentials → New Credentials
  2. Select Header Auth or API Key
  3. Add your API key details
  4. Reference in HTTP Request node: {{$credentials.apiName}}

// Header Auth Configuration
{
  "name": "X-API-Key",
  "value": "your_api_key_here"
}
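
For reference, this is roughly what header auth puts on the wire. A hedged JavaScript equivalent (the endpoint is a placeholder; some APIs expect Authorization or a different header name instead of X-API-Key):

// What header-based API key auth sends, as a raw request
// (run inside an async function; Node 18+ for built-in fetch)
const response = await fetch('https://api.example.com/scrape', {
  headers: {
    'X-API-Key': 'your_api_key_here' // same header name/value as the credential above
  }
});
const data = await response.json();
console.log(data);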

OAuth 2.0 Authentication

For APIs requiring OAuth:

// OAuth2 Configuration in n8n
{
  "authUrl": "https://api.example.com/oauth/authorize",
  "accessTokenUrl": "https://api.example.com/oauth/token",
  "clientId": "{{$credentials.clientId}}",
  "clientSecret": "{{$credentials.clientSecret}}",
  "scope": "scraping:read scraping:write"
}
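
For context, here is roughly what a token exchange looks like at the HTTP level, using the client-credentials variant as an example (n8n performs the flow configured above automatically once the credential is saved; endpoints and field names vary by provider):

// Minimal OAuth2 client-credentials token request (illustrative only;
// n8n handles this exchange for you after the credential is configured)
const tokenResponse = await fetch('https://api.example.com/oauth/token', {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: new URLSearchParams({
    grant_type: 'client_credentials',
    client_id: 'YOUR_CLIENT_ID',         // placeholder
    client_secret: 'YOUR_CLIENT_SECRET', // placeholder
    scope: 'scraping:read scraping:write'
  })
});
const { access_token } = await tokenResponse.json();
// Subsequent calls then send: Authorization: Bearer <access_token>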

Handling API Responses

Parsing JSON Responses

Use the Function node to transform API responses:

// Function Node: Parse and extract data
const apiResponse = items[0].json;

// Extract specific fields
const extractedData = {
  title: apiResponse.data.title,
  content: apiResponse.data.content,
  timestamp: new Date().toISOString()
};

return [{ json: extractedData }];

HTML Parsing with Code Node

When working with HTML responses, you can parse data using the Code node:

// Code Node: Parse HTML response
// Note: require() of external modules such as cheerio works only on
// self-hosted n8n with NODE_FUNCTION_ALLOW_EXTERNAL set to include them.
const cheerio = require('cheerio');

for (const item of $input.all()) {
  const html = item.json.html;
  const $ = cheerio.load(html);

  const products = [];
  $('.product').each((i, elem) => {
    products.push({
      name: $(elem).find('.product-name').text(),
      price: $(elem).find('.product-price').text(),
      url: $(elem).find('a').attr('href')
    });
  });

  item.json.parsedData = products;
}

return $input.all();

JavaScript equivalent for external testing:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithAPI() {
  const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
      url: 'https://example.com/products',
      api_key: 'YOUR_API_KEY'
    }
  });

  const $ = cheerio.load(response.data);
  const products = [];

  $('.product').each((i, elem) => {
    products.push({
      name: $(elem).find('.product-name').text(),
      price: $(elem).find('.product-price').text()
    });
  });

  return products;
}

scrapeWithAPI().then(products => console.log(products));

Advanced Integration Patterns

Batch Processing with Loop

Process multiple URLs using the Loop Over Items (Split in Batches) node:

// Split in Batches Node Configuration
{
  "batchSize": 10,
  "options": {}
}

// HTTP Request Node (inside loop)
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{$json["url"]}}",
    "api_key": "{{$credentials.apiKey}}"
  }
}
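
The loop expects one URL per item, so an upstream Code node can fan a URL list out into individual items. A minimal sketch (the urls array is a stand-in for wherever your workflow sources its targets):

// Code Node: turn a URL list into one item per URL for the loop
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

return urls.map(url => ({ json: { url } }));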

Error Handling and Retry Logic

Implement robust error handling using the Error Trigger and IF nodes:

// IF Node: Check for API errors
{
  "conditions": {
    "string": [
      {
        "value1": "={{$json["status"]}}",
        "operation": "notEqual",
        "value2": "success"
      }
    ]
  }
}

// Wait Node: Delay before retry
{
  "unit": "seconds",
  "amount": 5
}
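
The same retry idea in plain JavaScript, useful for testing outside n8n. A sketch assuming exponential backoff and Node 18+; the retry count and delays are illustrative:

// Standalone retry with exponential backoff (Node 18+ for built-in fetch)
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url);
    if (response.ok) return response.text();

    if (attempt === maxRetries) {
      throw new Error(`All ${maxRetries + 1} attempts failed for ${url}`);
    }

    const delayMs = 5000 * 2 ** attempt; // 5s, 10s, 20s, ...
    console.warn(`Attempt ${attempt + 1} returned ${response.status}; retrying in ${delayMs} ms`);
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
}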

Rate Limiting

Prevent API rate limit issues with throttling:

// Function Node: Add delay between requests
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

for (let i = 0; i < items.length; i++) {
  if (i > 0) {
    await delay(1000); // 1 second delay between requests
  }
  items[i].json.processed = true;
}

return items;

Integration with Puppeteer-Based APIs

Many scraping APIs offer browser automation capabilities similar to Puppeteer for handling JavaScript-heavy sites:

// HTTP Request for browser-based scraping
{
  "method": "POST",
  "url": "https://api.webscraping.ai/html",
  "body": {
    "url": "{{$json["targetUrl"]}}",
    "js": true,
    "js_timeout": 5000,
    "proxy": "residential"
  },
  "headers": {
    "Content-Type": "application/json"
  }
}
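
The same rendered-page request as a standalone script, using the query-string style of the earlier GET examples (a sketch; YOUR_API_KEY is a placeholder and parameter support varies by provider):

// Rendered-page request as a standalone script
// (Node 18+; run inside an async function)
const params = new URLSearchParams({
  url: 'https://example.com/spa-page', // a JavaScript-heavy target
  api_key: 'YOUR_API_KEY',             // placeholder
  js: 'true',                          // enable headless-browser rendering
  js_timeout: '5000',                  // give page scripts up to 5 seconds
  proxy: 'residential'
});

const response = await fetch(`https://api.webscraping.ai/html?${params}`);
console.log(await response.text());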

Webhook Integration for Async Scraping

For long-running scraping tasks, use webhooks to receive results:

Step 1: Set Up Webhook Node

// Webhook Node Configuration
{
  "path": "scraping-callback",
  "method": "POST",
  "responseMode": "onReceived"
}

Step 2: Send Webhook URL to API

// HTTP Request: Initiate scraping with callback
{
  "method": "POST",
  "url": "https://api.scraper.com/scrape",
  "body": {
    "url": "https://example.com",
    "callback_url": "{{$node["Webhook"].json["webhookUrl"]}}"
  }
}

Step 3: Process Webhook Data

// Function Node: Process callback data
const webhookData = items[0].json;

return [{
  json: {
    jobId: webhookData.job_id,
    status: webhookData.status,
    results: webhookData.data,
    completedAt: new Date().toISOString()
  }
}];
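
If an API doesn't offer callbacks, polling is the usual fallback. A sketch in which the /status endpoint, job fields, and status values are hypothetical and will differ per provider:

// Polling fallback for APIs without webhook callbacks
// (endpoints, fields, and status values are hypothetical)
async function waitForJob(jobId, apiKey) {
  while (true) {
    const res = await fetch(`https://api.scraper.com/status/${jobId}?api_key=${apiKey}`);
    const job = await res.json();

    if (job.status === 'completed') return job.data;
    if (job.status === 'failed') throw new Error(`Job ${jobId} failed`);

    await new Promise(resolve => setTimeout(resolve, 5000)); // poll every 5s
  }
}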

Proxy Configuration

Many scraping APIs support proxy configuration for avoiding blocks:

// HTTP Request with proxy parameters
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{$json["url"]}}",
    "api_key": "{{$credentials.apiKey}}",
    "proxy": "residential",
    "country": "us",
    "device": "desktop"
  }
}

Data Storage and Export

Save to Database

// PostgreSQL Node Configuration
{
  "operation": "insert",
  "table": "scraped_data",
  "columns": "url,title,content,scraped_at",
  "returnFields": "*"
}
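
The node maps item fields to columns by name, so a Code node just before it can shape each item to match the table. A minimal sketch (field names follow the columns listed above):

// Code Node: shape items to match the url,title,content,scraped_at columns
return $input.all().map(item => ({
  json: {
    url: item.json.url,
    title: item.json.title ?? '',
    content: item.json.content ?? '',
    scraped_at: new Date().toISOString()
  }
}));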

Export to Google Sheets

// Google Sheets Node
{
  "operation": "append",
  "sheetId": "{{$json["sheetId"]}}",
  "range": "Sheet1!A:D",
  "options": {
    "valueInputMode": "USER_ENTERED"
  }
}

Testing and Debugging

Console Logging in Function Nodes

// Function Node: Debug API responses
console.log('API Response:', JSON.stringify(items[0].json, null, 2));
console.log('Status Code:', items[0].json.statusCode);
console.log('Headers:', items[0].json.headers);

return items;

Manual Execution Testing

Use the Execute Node feature to test individual API calls before running the full workflow. Check the execution log for:

  • Request headers and body
  • Response status codes
  • Response data structure
  • Execution time

Best Practices

  1. Credential Management: Store API keys in n8n credentials, never hardcode them
  2. Error Handling: Always implement try-catch logic and error branches
  3. Rate Limiting: Respect API rate limits using Wait nodes
  4. Data Validation: Validate API responses before processing (see the sketch after this list)
  5. Logging: Log important events for debugging and monitoring
  6. Caching: Cache results when possible to reduce API calls
  7. Monitoring: Set up notifications for workflow failures
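
As a concrete example of point 4, a minimal validation Function node might look like this (the expected html field is an assumption; adapt the checks to your API's response shape):

// Function Node: validate the API response before it reaches downstream nodes
const response = items[0].json;

if (!response || typeof response.html !== 'string' || response.html.length === 0) {
  // Throwing here routes the execution to the workflow's error handling
  throw new Error('Invalid API response: missing or empty html field');
}

return items;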

Common Integration Examples

ScraperAPI Integration

{
  "method": "GET",
  "url": "http://api.scraperapi.com",
  "qs": {
    "api_key": "{{$credentials.scraperAPI}}",
    "url": "{{$json["targetUrl"]}}",
    "render": "true"
  }
}

Bright Data (formerly Luminati)

{
  "method": "POST",
  "url": "https://api.brightdata.com/request",
  "authentication": "basicAuth",
  "body": {
    "zone": "scraping_browser",
    "url": "{{$json["url"]}}",
    "format": "raw"
  }
}

Apify Integration

{
  "method": "POST",
  "url": "https://api.apify.com/v2/acts/[ACTOR_ID]/runs",
  "qs": {
    "token": "{{$credentials.apifyToken}}"
  },
  "body": {
    "startUrls": [{"url": "{{$json["url"]}}"}]
  }
}
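
Apify runs are asynchronous, so a common pattern is to start the run, poll its status, then fetch the default dataset. A sketch against Apify's v2 REST API (placeholders as above; error handling kept minimal):

// Start an Apify actor run, poll it, then fetch its dataset (Node 18+)
async function runApifyActor(actorId, token, input) {
  // 1. Start the run
  const startRes = await fetch(
    `https://api.apify.com/v2/acts/${actorId}/runs?token=${token}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input)
    }
  );
  const run = (await startRes.json()).data;

  // 2. Poll until the run leaves the READY/RUNNING states
  let status = run.status;
  while (status === 'READY' || status === 'RUNNING') {
    await new Promise(resolve => setTimeout(resolve, 5000));
    const pollRes = await fetch(
      `https://api.apify.com/v2/actor-runs/${run.id}?token=${token}`
    );
    status = (await pollRes.json()).data.status;
  }
  if (status !== 'SUCCEEDED') {
    throw new Error(`Run finished with status ${status}`);
  }

  // 3. Fetch the scraped items from the run's default dataset
  const itemsRes = await fetch(
    `https://api.apify.com/v2/datasets/${run.defaultDatasetId}/items?token=${token}`
  );
  return itemsRes.json();
}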

Conclusion

Integrating n8n with scraping APIs combines workflow automation with powerful data extraction. By using the HTTP Request node with proper authentication, error handling, and response parsing, you can build robust scraping workflows that scale with your needs.

Whether you're processing single pages or running large-scale data extraction operations with parallel execution patterns, n8n's flexibility makes it an excellent choice for automating web scraping tasks through external APIs.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
