What is CORS?
CORS (Cross-Origin Resource Sharing) is a browser security mechanism that controls how web pages can access resources from different origins. It is the browser's way of selectively relaxing the same-origin policy, which by default prevents scripts loaded from one origin from reading responses fetched from another origin without explicit permission.
Understanding Origins
An origin consists of three components:
- Protocol (http/https)
- Domain (example.com)
- Port (80, 443, 3000, etc.)
These URLs represent different origins:
- https://api.example.com vs https://example.com (different subdomain)
- https://example.com vs http://example.com (different protocol)
- https://example.com:3000 vs https://example.com (different port)
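The three-part rule can be checked mechanically. Here is a short sketch using only Python's standard library (the helper names are ours, not part of any spec):

```python
from urllib.parse import urlsplit

# Default ports per scheme, used when the URL omits an explicit port
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin_of(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(a, b):
    """Two URLs share an origin only if scheme, host, and port all match."""
    return origin_of(a) == origin_of(b)

print(same_origin("https://example.com", "https://api.example.com"))  # False: subdomain
print(same_origin("https://example.com", "http://example.com"))       # False: protocol
print(same_origin("https://example.com:3000", "https://example.com")) # False: port
print(same_origin("https://example.com/a", "https://example.com/b"))  # True: path is ignored
```

Note that the path and query string play no part in the comparison, which is why a page can always fetch its own site's other pages.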
How CORS Headers Work
When a browser makes a cross-origin request, the server must include specific response headers; otherwise the browser blocks the page from reading the response:
Access-Control-Allow-Origin: https://mywebsite.com
Access-Control-Allow-Methods: GET, POST, PUT, DELETE
Access-Control-Allow-Headers: Content-Type, Authorization
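These headers are set by the server, not the client. A minimal sketch of a server that attaches them, using Python's built-in http.server (the handler name and allowed origin are illustrative, matching the header values above):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ORIGIN = "https://mywebsite.com"  # illustrative value from the headers above

class CORSHandler(BaseHTTPRequestHandler):
    def _send_cors_headers(self):
        self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        self.send_header("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE")
        self.send_header("Access-Control-Allow-Headers", "Content-Type, Authorization")

    def do_OPTIONS(self):
        # Preflight request: reply with the CORS headers and no body
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self._send_cors_headers()
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("127.0.0.1", 8000), CORSHandler).serve_forever()
```

The `do_OPTIONS` branch matters because browsers send a preflight OPTIONS request before non-simple requests (e.g. ones with an `Authorization` header) and expect the CORS headers in that reply too.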
CORS Impact on Web Scraping
When CORS Applies
CORS only affects browser-based requests. It does NOT apply to:
- ✅ Server-side scripts (Python, Node.js, Go, etc.)
- ✅ Desktop applications
- ✅ Mobile apps
- ✅ Command-line tools (curl, wget)
When CORS Blocks Requests
CORS blocks these browser-based scenarios:
- ❌ JavaScript fetch/XMLHttpRequest from web pages
- ❌ AJAX calls to external APIs
- ❌ Client-side web scraping attempts
Common CORS Error Example
// This will trigger a CORS error in the browser when the server
// doesn't send an Access-Control-Allow-Origin header
fetch('https://api.example.com/data')
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => {
    // Error: Access to fetch at 'https://api.example.com/data'
    // from origin 'https://mysite.com' has been blocked by CORS policy
    console.error('CORS Error:', error);
  });
Proven Solutions for CORS in Web Scraping
1. Server-Side Scraping (Recommended)
Python Example:
import requests

# No CORS restrictions on server-side
response = requests.get('https://api.github.com/users/octocat')
response.raise_for_status()
data = response.json()
print(f"User: {data['login']}, Followers: {data['followers']}")
Node.js Example:
const axios = require('axios');
async function scrapeAPI() {
  try {
    const response = await axios.get('https://api.github.com/users/octocat');
    console.log(`User: ${response.data.login}, Followers: ${response.data.followers}`);
  } catch (error) {
    console.error('Error:', error.message);
  }
}

scrapeAPI();
2. CORS Proxy Services
Public Proxy (Not for Production — the public cors-anywhere instance is heavily rate-limited and requires opt-in, so treat this as illustration only):
const proxyUrl = 'https://cors-anywhere.herokuapp.com/';
const targetUrl = 'https://api.example.com/data';

fetch(proxyUrl + targetUrl)
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error('Error:', error));
Self-Hosted Proxy (Node.js/Express):
const express = require('express');
const cors = require('cors');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();

// Enable CORS for all routes
app.use(cors());

// Proxy API requests
app.use('/proxy', createProxyMiddleware({
  target: 'https://api.example.com',
  changeOrigin: true,
  pathRewrite: { '^/proxy': '' }
}));

app.listen(3001, () => {
  console.log('CORS proxy server running on port 3001');
});
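The same idea works in any server language. A stripped-down sketch in Python using only the standard library (the target URL is a placeholder, as in the Express example; error handling is omitted for brevity):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

TARGET_BASE = "https://api.example.com"  # placeholder upstream, as in the Express example

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward /proxy/<path> to TARGET_BASE/<path>, mirroring pathRewrite above
        upstream_path = self.path[len("/proxy"):] or "/"
        upstream = urlopen(Request(TARGET_BASE + upstream_path))
        body = upstream.read()
        self.send_response(upstream.status)
        self.send_header("Content-Type",
                         upstream.headers.get("Content-Type", "application/octet-stream"))
        self.send_header("Content-Length", str(len(body)))
        # The header that makes the response readable from any browser origin
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("127.0.0.1", 3001), ProxyHandler).serve_forever()
```

The browser talks only to your proxy (same origin or `*`-allowed), and the proxy makes the real request server-side, where CORS does not apply.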
3. Browser Extension Approach
Browser extensions have elevated privileges and can bypass CORS:
Manifest.json:
{
  "manifest_version": 3,
  "name": "API Scraper Extension",
  "version": "1.0",
  "host_permissions": [
    "https://api.example.com/*"
  ],
  "background": {
    "service_worker": "background.js"
  }
}
Background.js:
// This works in browser extensions without CORS issues
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'fetchData') {
    fetch('https://api.example.com/data')
      .then(response => response.json())
      .then(data => sendResponse({ success: true, data }))
      .catch(error => sendResponse({ success: false, error: error.message }));
    return true; // Keeps message channel open for async response
  }
});
4. Web Scraping APIs
Use services that handle CORS and anti-bot measures:
// Using WebScraping.AI API
const apiKey = 'your-api-key';
const targetUrl = 'https://api.example.com/data';

fetch(`https://api.webscraping.ai/html?api_key=${apiKey}&url=${encodeURIComponent(targetUrl)}`)
  .then(response => response.text())
  .then(html => {
    // Process the scraped HTML
    console.log(html);
  })
  .catch(error => console.error('Error:', error));
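The encoding step matters: the target URL must be percent-encoded so its own query string isn't mistaken for parameters of the API call. A small Python helper illustrating this (the endpoint and parameter names are taken from the snippet above; actually fetching the result would require a valid API key and network access):

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.webscraping.ai/html"  # endpoint used in the snippet above

def build_scrape_url(api_key, target_url, **params):
    """Build the GET URL for the scraping API; urlencode percent-encodes target_url."""
    query = {"api_key": api_key, "url": target_url, **params}
    return f"{API_ENDPOINT}?{urlencode(query)}"

url = build_scrape_url("your-api-key", "https://api.example.com/data?page=2")
print(url)
```

You would then pass the built URL to `requests.get` (server-side) or `fetch` (browser-side, which works here because the scraping API itself sends permissive CORS headers).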
5. Development-Only Browser Flag
For testing only - disable web security in Chrome:
# macOS/Linux
google-chrome --disable-web-security --user-data-dir="/tmp/chrome_dev_test"
# Windows
chrome.exe --disable-web-security --user-data-dir="c:\temp\chrome_dev_test"
⚠️ Warning: Never use this for production or regular browsing as it disables important security features.
Best Practices
- Use Server-Side Scraping - Most reliable and performant approach
- Respect Rate Limits - Implement delays between requests
- Handle Errors Gracefully - APIs can be unreliable
- Cache Responses - Reduce API calls and improve performance
- Follow Terms of Service - Always comply with API usage policies
These practices combined in a Python snippet:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session_with_retries():
    session = requests.Session()
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Robust API scraping with retries and rate limiting
session = create_session_with_retries()
urls = ['https://api.example.com/data/1', 'https://api.example.com/data/2']

for url in urls:
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
        print(f"Scraped: {data}")
        time.sleep(1)  # Rate limiting
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
Summary
CORS only affects browser-based requests, not server-side scraping. For web scraping:
- Best approach: Use server-side scripts (Python, Node.js, etc.)
- Alternative: CORS proxies or web scraping APIs
- Browser-only: Consider browser extensions with proper permissions
Always scrape responsibly and comply with website terms of service and applicable laws.