How do I integrate n8n with other scraping APIs?
Integrating n8n with external scraping APIs allows you to leverage powerful third-party tools while maintaining the flexibility of workflow automation. This guide covers various approaches to connect n8n with popular scraping APIs, including authentication, error handling, and data transformation.
Understanding n8n API Integration
n8n provides multiple methods for integrating with external scraping APIs:
- HTTP Request Node - For direct API calls
- Webhook Node - For receiving scraping results
- Custom Nodes - For frequently used APIs
- Function Nodes - For complex data transformations
Using the HTTP Request Node
The HTTP Request node is the primary method for integrating with scraping APIs. Here's how to configure it for common scenarios:
Basic API Integration
// Example: Making a GET request to a scraping API
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "authentication": "headerAuth",
  "qs": {
    "url": "https://example.com",
    "api_key": "{{$credentials.apiKey}}"
  }
}
To set up the HTTP Request node:
- Add an HTTP Request node to your workflow
- Select the request method (GET, POST, PUT, etc.)
- Enter the API endpoint URL
- Configure authentication (API key, OAuth, Basic Auth)
- Add query parameters or request body as needed (a standalone sketch of the same request follows this list)
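Before configuring credentials, it can help to sanity-check the endpoint outside n8n. A minimal Node.js sketch of the same GET request, assuming Node 18+ (built-in fetch) and a placeholder API key:
// Quick sanity check of the scraping API outside n8n (Node 18+, built-in fetch).
// YOUR_API_KEY is a placeholder; store the real key in n8n credentials, not in code.
const params = new URLSearchParams({
  url: 'https://example.com',
  api_key: 'YOUR_API_KEY'
});

fetch(`https://api.webscraping.ai/html?${params}`)
  .then(async response => {
    console.log('Status:', response.status);
    console.log('First 200 chars:', (await response.text()).slice(0, 200));
  })
  .catch(console.error);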
WebScraping.AI Integration
WebScraping.AI is a developer-friendly scraping API that works seamlessly with n8n:
// HTTP Request Node Configuration
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{ $node['Webhook'].json['targetUrl'] }}",
    "api_key": "{{$credentials.webScrapingAI}}",
    "js": true,
    "proxy": "datacenter"
  }
}
Python equivalent for understanding the API structure:
import requests

url = "https://api.webscraping.ai/html"
params = {
    "url": "https://example.com",
    "api_key": "YOUR_API_KEY",
    "js": "true",
    "proxy": "datacenter"
}

response = requests.get(url, params=params)
html_content = response.text
Authentication Methods
API Key Authentication
Most scraping APIs use API key authentication. Configure it in n8n:
- Go to Credentials → New Credentials
- Select Header Auth or API Key
- Add your API key details
- Reference in HTTP Request node:
{{$credentials.apiName}}
// Header Auth Configuration
{
  "name": "X-API-Key",
  "value": "your_api_key_here"
}
OAuth 2.0 Authentication
For APIs requiring OAuth:
// OAuth2 Configuration in n8n
{
  "authUrl": "https://api.example.com/oauth/authorize",
  "accessTokenUrl": "https://api.example.com/oauth/token",
  "clientId": "{{$credentials.clientId}}",
  "clientSecret": "{{$credentials.clientSecret}}",
  "scope": "scraping:read scraping:write"
}
Handling API Responses
Parsing JSON Responses
Use the Function node to transform API responses:
// Function Node: Parse and extract data
const apiResponse = items[0].json;

// Extract specific fields
const extractedData = {
  title: apiResponse.data.title,
  content: apiResponse.data.content,
  timestamp: new Date().toISOString()
};

return [{ json: extractedData }];
HTML Parsing with Code Node
When working with HTML responses, you can parse data using the Code node. Note that require() for external modules such as cheerio only works on self-hosted n8n instances where the NODE_FUNCTION_ALLOW_EXTERNAL environment variable permits it:
// Code Node: Parse HTML response
const cheerio = require('cheerio');

for (const item of $input.all()) {
  const html = item.json.html;
  const $ = cheerio.load(html);

  const products = [];
  $('.product').each((i, elem) => {
    products.push({
      name: $(elem).find('.product-name').text(),
      price: $(elem).find('.product-price').text(),
      url: $(elem).find('a').attr('href')
    });
  });

  item.json.parsedData = products;
}

return $input.all();
JavaScript equivalent for external testing:
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithAPI() {
  const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
      url: 'https://example.com/products',
      api_key: 'YOUR_API_KEY'
    }
  });

  const $ = cheerio.load(response.data);
  const products = [];

  $('.product').each((i, elem) => {
    products.push({
      name: $(elem).find('.product-name').text(),
      price: $(elem).find('.product-price').text()
    });
  });

  return products;
}

scrapeWithAPI().then(console.log).catch(console.error);
Advanced Integration Patterns
Batch Processing with Loop
Process multiple URLs using the Split In Batches (Loop Over Items) node:
// Split in Batches Node Configuration
{
  "batchSize": 10,
  "options": {}
}

// HTTP Request Node (inside loop)
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{ $json.url }}",
    "api_key": "{{$credentials.apiKey}}"
  }
}
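The loop processes one item per URL, so whatever feeds it needs to emit individual items. A minimal Code node sketch that fans a URL array out into items, assuming the previous node produced a `urls` field (the field name is a placeholder):
// Code Node: turn an array of URLs into one item per URL for Split In Batches
// Assumes input like { "urls": ["https://example.com/a", "https://example.com/b"] }
const urls = $input.first().json.urls || [];

return urls.map(url => ({ json: { url } }));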
Error Handling and Retry Logic
Implement robust error handling using the Error Trigger and IF nodes:
// IF Node: Check for API errors
{
  "conditions": {
    "string": [
      {
        "value1": "={{ $json.status }}",
        "operation": "notEqual",
        "value2": "success"
      }
    ]
  }
}

// Wait Node: Delay before retry
{
  "unit": "seconds",
  "amount": 5
}
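The IF/Wait pattern above handles retries visually inside the workflow. If you prefer to prototype the retry logic as a standalone script first, here is a sketch using axios with exponential backoff (the retry count and delays are arbitrary choices, not values required by any API):
// Standalone retry sketch with exponential backoff (for testing outside n8n)
const axios = require('axios');

async function fetchWithRetry(targetUrl, apiKey, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await axios.get('https://api.webscraping.ai/html', {
        params: { url: targetUrl, api_key: apiKey },
        timeout: 30000
      });
      return response.data; // HTML string on success
    } catch (error) {
      if (attempt === maxRetries) throw error;
      // Back off: 5s, 10s, 20s before the next attempt
      await new Promise(resolve => setTimeout(resolve, 5000 * 2 ** (attempt - 1)));
    }
  }
}

fetchWithRetry('https://example.com', 'YOUR_API_KEY').then(console.log).catch(console.error);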
Rate Limiting
Prevent API rate limit issues with throttling:
// Function Node: Add delay between requests
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

for (let i = 0; i < items.length; i++) {
  if (i > 0) {
    await delay(1000); // 1 second delay between requests
  }
  items[i].json.processed = true;
}

return items;
Integration with Puppeteer-Based APIs
Many scraping APIs offer browser automation capabilities similar to Puppeteer for handling JavaScript-heavy sites:
// HTTP Request for browser-based scraping
{
  "method": "POST",
  "url": "https://api.webscraping.ai/html",
  "body": {
    "url": "{{ $json.targetUrl }}",
    "js": true,
    "js_timeout": 5000,
    "proxy": "residential"
  },
  "headers": {
    "Content-Type": "application/json"
  }
}
Webhook Integration for Async Scraping
For long-running scraping tasks, use webhooks to receive results:
Step 1: Set Up Webhook Node
// Webhook Node Configuration
{
  "path": "scraping-callback",
  "method": "POST",
  "responseMode": "onReceived"
}
Step 2: Send Webhook URL to API
// HTTP Request: Initiate scraping with callback
{
  "method": "POST",
  "url": "https://api.scraper.com/scrape",
  "body": {
    "url": "https://example.com",
    "callback_url": "{{ $node['Webhook'].json.webhookUrl }}"
  }
}
Step 3: Process Webhook Data
// Function Node: Process callback data
const webhookData = items[0].json;

return [{
  json: {
    jobId: webhookData.job_id,
    status: webhookData.status,
    results: webhookData.data,
    completedAt: new Date().toISOString()
  }
}];
Proxy Configuration
Many scraping APIs support proxy configuration for avoiding blocks:
// HTTP Request with proxy parameters
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "url": "{{ $json.url }}",
    "api_key": "{{$credentials.apiKey}}",
    "proxy": "residential",
    "country": "us",
    "device": "desktop"
  }
}
Data Storage and Export
Save to Database
// PostgreSQL Node Configuration
{
  "operation": "insert",
  "table": "scraped_data",
  "columns": "url,title,content,scraped_at",
  "returnFields": "*"
}
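The PostgreSQL node inserts one row per incoming item, so parsed results usually need reshaping first. A minimal Code node sketch, assuming the products parsed earlier sit under `parsedData` and the table uses the columns listed above:
// Code Node: flatten parsed products into one item (row) per product
return $input.all().flatMap(item =>
  (item.json.parsedData || []).map(product => ({
    json: {
      url: product.url || item.json.url,
      title: product.name,
      content: JSON.stringify(product),
      scraped_at: new Date().toISOString()
    }
  }))
);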
Export to Google Sheets
// Google Sheets Node
{
  "operation": "append",
  "sheetId": "{{ $json.sheetId }}",
  "range": "Sheet1!A:D",
  "options": {
    "valueInputMode": "USER_ENTERED"
  }
}
Testing and Debugging
Console Logging in Function Nodes
// Function Node: Debug API responses
console.log('API Response:', JSON.stringify(items[0].json, null, 2));
console.log('Status Code:', items[0].json.statusCode);
console.log('Headers:', items[0].json.headers);
return items;
Manual Execution Testing
Use the Execute Node feature to test individual API calls before running the full workflow (a response-validation sketch follows the checklist below). Check the execution log for:
- Request headers and body
- Response status codes
- Response data structure
- Execution time
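To automate part of that checking, a small Code node can validate the response structure and fail fast when it looks wrong. A sketch, assuming the API returns an `html` field on success (adjust the field name to your API's response shape):
// Code Node: fail fast if the API response is missing the expected field
const items = $input.all();

for (const item of items) {
  if (typeof item.json.html !== 'string' || item.json.html.length === 0) {
    // Throwing fails the execution so error branches and notifications fire
    throw new Error(`Unexpected API response: ${JSON.stringify(item.json).slice(0, 200)}`);
  }
}

return items;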
Best Practices
- Credential Management: Store API keys in n8n credentials, never hardcode them
- Error Handling: Always implement try-catch logic and error branches
- Rate Limiting: Respect API rate limits using Wait nodes
- Data Validation: Validate API responses before processing
- Logging: Log important events for debugging and monitoring
- Caching: Cache results when possible to reduce API calls (see the sketch after this list)
- Monitoring: Set up notifications for workflow failures
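For the caching point above, one lightweight option is n8n's workflow static data, which persists between production executions (not manual test runs). A minimal sketch; the one-hour lifetime and cache field names are arbitrary choices:
// Code Node: skip re-scraping a URL that was fetched within the last hour
const staticData = $getWorkflowStaticData('global');
staticData.cache = staticData.cache || {};

const url = $input.first().json.url;
const cached = staticData.cache[url];
const oneHour = 60 * 60 * 1000;

if (cached && Date.now() - cached.savedAt < oneHour) {
  return [{ json: { ...cached.result, fromCache: true } }];
}

// Not cached or stale: pass the URL through so the HTTP Request node runs,
// and let a later Code node write the fresh result back into staticData.cache.
return [{ json: { url, fromCache: false } }];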
Common Integration Examples
ScraperAPI Integration
{
  "method": "GET",
  "url": "https://api.scraperapi.com",
  "qs": {
    "api_key": "{{$credentials.scraperAPI}}",
    "url": "{{ $json.targetUrl }}",
    "render": "true"
  }
}
Bright Data (Formerly Luminati)
{
  "method": "POST",
  "url": "https://api.brightdata.com/request",
  "authentication": "basicAuth",
  "body": {
    "zone": "scraping_browser",
    "url": "{{ $json.url }}",
    "format": "raw"
  }
}
Apify Integration
{
  "method": "POST",
  "url": "https://api.apify.com/v2/acts/[ACTOR_ID]/runs",
  "qs": {
    "token": "{{$credentials.apifyToken}}"
  },
  "body": {
    "startUrls": [{ "url": "{{ $json.url }}" }]
  }
}
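Note that an Apify actor run is asynchronous: the POST above starts the run, and the results land in a dataset that has to be fetched afterwards. A standalone sketch of that polling pattern with axios (endpoint paths follow Apify's v2 API as I understand it; verify them against Apify's documentation):
// Poll an Apify actor run until it finishes, then fetch its dataset items
const axios = require('axios');

async function waitForApifyRun(runId, token) {
  const runUrl = `https://api.apify.com/v2/actor-runs/${runId}?token=${token}`;

  while (true) {
    const { data } = await axios.get(runUrl);
    if (data.data.status === 'SUCCEEDED') {
      const datasetId = data.data.defaultDatasetId;
      const items = await axios.get(
        `https://api.apify.com/v2/datasets/${datasetId}/items?token=${token}`
      );
      return items.data; // array of scraped records
    }
    if (['FAILED', 'ABORTED', 'TIMED-OUT'].includes(data.data.status)) {
      throw new Error(`Apify run ended with status ${data.data.status}`);
    }
    await new Promise(resolve => setTimeout(resolve, 10000)); // poll every 10s
  }
}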
Conclusion
Integrating n8n with scraping APIs provides a powerful combination of automation and data extraction capabilities. By using the HTTP Request node with proper authentication, error handling, and monitoring of your requests and responses, you can build robust scraping workflows that scale with your needs.
Whether you're processing single pages or running large-scale data extraction operations with parallel execution patterns, n8n's flexibility makes it an excellent choice for automating web scraping tasks through external APIs.