What is CORS and how does it affect API-based web scraping?

What is CORS?

CORS (Cross-Origin Resource Sharing) is a browser security mechanism that controls how web pages can access resources from different domains. It works alongside the same-origin policy, which prevents scripts from one domain from accessing resources on another domain without explicit permission.

Understanding Origins

An origin consists of three components:

  • Protocol (http/https)
  • Domain (example.com)
  • Port (80, 443, 3000, etc.)

These URLs represent different origins:

  • https://api.example.com vs https://example.com (different subdomain)
  • https://example.com vs http://example.com (different protocol)
  • https://example.com:3000 vs https://example.com (different port)
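To make the comparison concrete, here is a small Python sketch (standard library only) that normalizes a URL to its (protocol, domain, port) tuple and compares two URLs the way a browser compares origins:

```python
from urllib.parse import urlsplit

# Default ports for the common schemes
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) tuple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(a, b):
    return origin(a) == origin(b)

print(same_origin("https://example.com", "https://example.com:443/page"))  # True (443 is the https default)
print(same_origin("https://example.com", "http://example.com"))            # False (different protocol)
print(same_origin("https://example.com", "https://api.example.com"))       # False (different subdomain)
```

Note that an explicit default port (`:443` with https) still yields the same origin, while a path difference never matters.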

How CORS Headers Work

When a browser makes a cross-origin request, the server must include specific headers to allow access:

Access-Control-Allow-Origin: https://mywebsite.com
Access-Control-Allow-Methods: GET, POST, PUT, DELETE
Access-Control-Allow-Headers: Content-Type, Authorization
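On the server side these are ordinary response headers. As a minimal sketch (the allowed origin and port are placeholders), here is a Python http.server handler that sets them for both simple GET requests and preflight OPTIONS requests:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ORIGIN = "https://mywebsite.com"  # placeholder: set to your front-end's origin

class CORSHandler(BaseHTTPRequestHandler):
    def _send_cors_headers(self):
        self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        self.send_header("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE")
        self.send_header("Access-Control-Allow-Headers", "Content-Type, Authorization")

    def do_OPTIONS(self):
        # Preflight request: reply with the CORS headers and no body
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_GET(self):
        self.send_response(200)
        self._send_cors_headers()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

# To run locally:
# HTTPServer(("localhost", 8000), CORSHandler).serve_forever()
```

The browser sends the preflight OPTIONS request automatically for non-simple requests (custom headers, PUT/DELETE, etc.); the server only has to answer it with the headers above.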

CORS Impact on Web Scraping

When CORS Applies

CORS only affects browser-based requests. It does NOT apply to:

  • ✅ Server-side scripts (Python, Node.js, Go, etc.)
  • ✅ Desktop applications
  • ✅ Mobile apps
  • ✅ Command-line tools (curl, wget)

When CORS Blocks Requests

CORS blocks these browser-based scenarios:

  • ❌ JavaScript fetch/XMLHttpRequest from web pages
  • ❌ AJAX calls to external APIs
  • ❌ Client-side web scraping attempts

Common CORS Error Example

// This will trigger a CORS error in the browser if the API
// does not send CORS headers (note: api.github.com actually
// allows cross-origin GETs, so a generic API is used here)
fetch('https://api.example.com/data')
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => {
    // Console: Access to fetch at 'https://api.example.com/data'
    // from origin 'https://mysite.com' has been blocked by CORS policy
    console.error('CORS Error:', error);
  });

Proven Solutions for CORS in Web Scraping

1. Server-Side Scraping (Recommended)

Python Example:

import requests

# No CORS restrictions on server-side
response = requests.get('https://api.github.com/users/octocat')
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
data = response.json()
print(f"User: {data['login']}, Followers: {data['followers']}")

Node.js Example:

const axios = require('axios');

async function scrapeAPI() {
  try {
    const response = await axios.get('https://api.github.com/users/octocat');
    console.log(`User: ${response.data.login}, Followers: ${response.data.followers}`);
  } catch (error) {
    console.error('Error:', error.message);
  }
}

scrapeAPI();

2. CORS Proxy Services

Public Proxy (Not for Production — the public cors-anywhere demo is heavily rate-limited and requires opt-in):

const proxyUrl = 'https://cors-anywhere.herokuapp.com/';
const targetUrl = 'https://api.example.com/data';

fetch(proxyUrl + targetUrl)
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error('Error:', error));

Self-Hosted Proxy (Node.js/Express):

const express = require('express');
const cors = require('cors');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();

// Enable CORS for all routes
app.use(cors());

// Proxy API requests
app.use('/proxy', createProxyMiddleware({
  target: 'https://api.example.com',
  changeOrigin: true,
  pathRewrite: { '^/proxy': '' }
}));

app.listen(3001, () => {
  console.log('CORS proxy server running on port 3001');
});

3. Browser Extension Approach

Browser extensions have elevated privileges and can bypass CORS:

Manifest.json:

{
  "manifest_version": 3,
  "name": "API Scraper Extension",
  "version": "1.0",
  "host_permissions": [
    "https://api.example.com/*"
  ],
  "background": {
    "service_worker": "background.js"
  }
}

Background.js:

// This works in browser extensions without CORS issues
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'fetchData') {
    fetch('https://api.example.com/data')
      .then(response => response.json())
      .then(data => sendResponse({ success: true, data }))
      .catch(error => sendResponse({ success: false, error: error.message }));

    return true; // Keeps message channel open for async response
  }
});

4. Web Scraping APIs

Use services that handle CORS and anti-bot measures:

// Using WebScraping.AI API
const apiKey = 'your-api-key';
const targetUrl = 'https://api.example.com/data';

fetch(`https://api.webscraping.ai/html?api_key=${apiKey}&url=${encodeURIComponent(targetUrl)}`)
  .then(response => response.text())
  .then(html => {
    // Process the scraped HTML
    console.log(html);
  })
  .catch(error => console.error('Error:', error));

5. Development-Only Browser Flag

For testing only - disable web security in Chrome:

# macOS/Linux
google-chrome --disable-web-security --user-data-dir="/tmp/chrome_dev_test"

# Windows
chrome.exe --disable-web-security --user-data-dir="c:\temp\chrome_dev_test"

⚠️ Warning: Never use this for production or regular browsing as it disables important security features.

Best Practices

  1. Use Server-Side Scraping - Most reliable and performant approach
  2. Respect Rate Limits - Implement delays between requests
  3. Handle Errors Gracefully - APIs can be unreliable
  4. Cache Responses - Reduce API calls and improve performance
  5. Follow Terms of Service - Always comply with API usage policies
The example below applies several of these practices at once: retries with backoff, timeouts, rate limiting, and error handling.

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session_with_retries():
    session = requests.Session()
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Robust API scraping with retries and rate limiting
session = create_session_with_retries()
urls = ['https://api.example.com/data/1', 'https://api.example.com/data/2']

for url in urls:
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()  # surface non-retryable 4xx errors
        data = response.json()
        print(f"Scraped: {data}")
        time.sleep(1)  # Rate limiting
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
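Practice 4 (caching) can be as simple as an in-memory store with a time-to-live. This is a hedged sketch, not a production cache; the commented-out fetch_json helper is hypothetical and shows how it would wrap the requests session above:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

# Hypothetical usage with a requests session (network call omitted here):
# cache = TTLCache(ttl_seconds=300)
# def fetch_json(session, url):
#     cached = cache.get(url)
#     if cached is not None:
#         return cached          # served from cache, no API call
#     data = session.get(url, timeout=10).json()
#     cache.set(url, data)
#     return data
```

For persistent or multi-process caching, a dedicated layer (Redis, or a library such as requests-cache) is the usual next step.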

Summary

CORS only affects browser-based requests, not server-side scraping. For web scraping:

  • Best approach: Use server-side scripts (Python, Node.js, etc.)
  • Alternative: CORS proxies or web scraping APIs
  • Browser-only: Consider browser extensions with proper permissions

Always scrape responsibly and comply with website terms of service and applicable laws.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
