How can you find hidden or undocumented APIs for scraping?

Hidden APIs are internal endpoints that websites use for their front-end applications but don't publicly document. Finding these APIs can provide cleaner, more efficient data extraction than HTML scraping. Here's a comprehensive guide to discovering them.

Primary Discovery Methods

1. Browser Developer Tools (Most Effective)

The network tab in browser developer tools is your primary weapon for API discovery.

Step-by-Step Process:

  1. Open Developer Tools: Press F12 (or Ctrl+Shift+I on Windows/Linux, Cmd+Opt+I on Mac)
  2. Navigate to Network Tab: Click "Network" and ensure recording is enabled
  3. Clear existing requests: Click the clear button (🚫) to start fresh
  4. Filter by request type: Use filters like XHR, Fetch, or JS to focus on API calls
  5. Interact with the website: Perform actions that load the data you want to scrape
  6. Analyze requests: Look for requests returning JSON/XML data

Pro Tips for Network Analysis:

# Look for these URL patterns in the network tab:
/api/
/v1/
/v2/
/graphql
/ajax/
/json/
/_next/data/
/__data.json

Example Network Request Analysis:

GET /api/v2/products?page=1&limit=20&category=electronics
Host: example.com
Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...
User-Agent: Mozilla/5.0...
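
Once you've identified a request like the one above, you can replay it outside the browser (most browsers offer "Copy as cURL" on the request's context menu). Below is a minimal Python sketch, assuming the hypothetical endpoint and placeholder token captured above:

import requests

# Values copied from the request captured in the Network tab (hypothetical)
url = "https://example.com/api/v2/products"
params = {"page": 1, "limit": 20, "category": "electronics"}
headers = {
    "Authorization": "Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...",  # placeholder token
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())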

2. JavaScript Source Code Analysis

APIs are often hardcoded or dynamically constructed in JavaScript files.

Search Techniques:

// In the browser's DevTools, search for these patterns:
// 1. Global search in Sources tab
// Ctrl+Shift+F (Windows/Linux) or Cmd+Shift+F (Mac)

// 2. Common search terms:
"fetch("
"axios."
"XMLHttpRequest"
"$.ajax"
"endpoint"
"baseURL"
"API_URL"
"/api/"
"graphql"

Example JavaScript API Discovery:

// Found in bundled JavaScript file:
const API_BASE = 'https://api.example.com/v3/';
const endpoints = {
  products: `${API_BASE}products`,
  categories: `${API_BASE}categories`,
  search: `${API_BASE}search/query`
};

// Usage in code:
fetch(`${endpoints.products}?category=${categoryId}`)
  .then(response => response.json())
  .then(data => console.log(data));

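To automate this kind of search across a site's bundled JavaScript, you can download the script files a page references and scan them for endpoint patterns. Here is a minimal Python sketch (the page URL and regular expressions are illustrative assumptions, not tied to any real site):

import re
import requests

page_url = "https://example.com"  # hypothetical target page
html = requests.get(page_url, timeout=30).text

# Collect script URLs referenced by the page
script_urls = re.findall(r'<script[^>]+src="([^"]+)"', html)

# Rough pattern for API-looking paths and URLs inside bundled JS
endpoint_pattern = re.compile(r'["\'](/api/[^"\']+|https?://[^"\']*api[^"\']*)["\']')

for src in script_urls:
    if src.startswith("//"):
        src = "https:" + src
    elif src.startswith("/"):
        src = page_url.rstrip("/") + src
    js = requests.get(src, timeout=30).text
    for match in set(endpoint_pattern.findall(js)):
        print(f"{src}: {match}")
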
3. WebSocket Traffic Inspection

For real-time applications, WebSockets often carry valuable data.

WebSocket Analysis Steps:

  1. Filter by WS: In the Network tab, filter by "WS" (WebSockets)
  2. Monitor frames: Click on a WebSocket connection to see its message frames
  3. Document message structure: Note the JSON message format and what triggers each message

Example WebSocket Message:

{
  "type": "product_update",
  "data": {
    "product_id": 12345,
    "price": 99.99,
    "stock": 15
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
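
Once you've identified a WebSocket endpoint, you can subscribe to it directly from a script. Below is a minimal sketch using the third-party websockets library (the wss:// URL and message type are hypothetical):

import asyncio
import json

import websockets  # pip install websockets

async def listen(url):
    # Connect to the WebSocket endpoint discovered via the browser's WS filter
    async with websockets.connect(url) as ws:
        async for raw in ws:
            message = json.loads(raw)
            if message.get("type") == "product_update":
                print(message["data"])

asyncio.run(listen("wss://example.com/socket"))  # hypothetical endpoint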

Advanced Discovery Techniques

4. Mobile App Traffic Analysis

Mobile apps often use simpler APIs that are easier to reverse-engineer.

Tools for Mobile Analysis:

# Using mitmproxy (cross-platform)
pip install mitmproxy
mitmproxy --listen-port 8080
# Point the mobile device's Wi-Fi proxy at <your-machine-ip>:8080,
# then visit http://mitm.it on the device to install the mitmproxy CA certificate

# Using Charles Proxy (GUI tool)
# Configure the mobile device to use your machine as its HTTP proxy
# Install the Charles SSL certificate on the device to inspect HTTPS traffic

Python script for mitmproxy:

# save as addon_script.py and run with: mitmproxy -s addon_script.py
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    if "api" in flow.request.pretty_url:
        print(f"API Endpoint: {flow.request.method} {flow.request.pretty_url}")
        print(f"Response: {flow.response.status_code}")
        if flow.response.headers.get("content-type", "").startswith("application/json"):
            print(f"JSON Response: {flow.response.text[:200]}...")

5. Subdomain and Path Enumeration

APIs are often hosted on separate subdomains or paths.

Subdomain Discovery:

# Using subfinder
subfinder -d example.com | grep api

# Using amass
amass enum -d example.com | grep -E "(api|v[0-9]+|dev|staging)"

# Common API subdomains to check:
api.example.com
api-v2.example.com
internal-api.example.com
mobile-api.example.com

Path Discovery:

# Using dirb/dirbuster for API path discovery
dirb https://example.com /usr/share/dirb/wordlists/common.txt

# API-specific wordlists:
/api/
/api/v1/
/api/v2/
/rest/
/graphql/
/json/
/ajax/
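
To quickly check which of these paths exist on a target, you can probe them and look for JSON responses. Here is a minimal Python sketch (the domain is a placeholder and the path list mirrors the wordlist above):

import requests

base = "https://example.com"  # placeholder target
candidate_paths = ["/api/", "/api/v1/", "/api/v2/", "/rest/", "/graphql", "/json/", "/ajax/"]

for path in candidate_paths:
    try:
        resp = requests.get(base + path, timeout=10, headers={"Accept": "application/json"})
    except requests.RequestException:
        continue
    content_type = resp.headers.get("content-type", "")
    # Anything that isn't a 404 and returns JSON deserves a closer look
    if resp.status_code != 404 and "json" in content_type:
        print(f"{path} -> {resp.status_code} ({content_type})")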

6. Browser Extension Method

Useful Browser Extensions:

  - Postman Interceptor: Captures all requests automatically
  - HTTP Request/Response Logger: Logs all network activity
  - Developer Tools++: Enhanced network monitoring

Practical Implementation Examples

Python Implementation with Session Handling

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class HiddenAPIClient:
    def __init__(self, base_url, headers=None):
        self.base_url = base_url
        self.session = requests.Session()

        # Common headers that mimic browser behavior
        default_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin'
        }

        if headers:
            default_headers.update(headers)

        self.session.headers.update(default_headers)

        # Setup retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504]
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

    def get_data(self, endpoint, params=None):
        """Fetch data from discovered API endpoint"""
        url = f"{self.base_url}/{endpoint.lstrip('/')}"

        try:
            response = self.session.get(url, params=params, timeout=30)
            response.raise_for_status()

            # Handle different response types
            content_type = response.headers.get('content-type', '')

            if 'application/json' in content_type:
                return response.json()
            elif 'text/' in content_type:
                return response.text
            else:
                return response.content

        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

# Usage example
client = HiddenAPIClient('https://api.example.com')

# Add authentication if discovered
client.session.headers.update({
    'Authorization': 'Bearer your_discovered_token',
    'X-API-Key': 'your_api_key'
})

# Fetch data from discovered endpoints
products = client.get_data('/api/v2/products', params={'category': 'electronics'})
user_data = client.get_data('/api/user/profile')

JavaScript/Node.js Implementation

const axios = require('axios');

class HiddenAPIClient {
    constructor(baseURL, options = {}) {
        this.client = axios.create({
            baseURL,
            timeout: 30000,
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept': 'application/json, text/plain, */*',
                'Accept-Language': 'en-US,en;q=0.9',
                ...options.headers
            }
        });

        // Add request interceptor for debugging
        this.client.interceptors.request.use(
            config => {
                console.log(`Making request to: ${config.method.toUpperCase()} ${config.url}`);
                return config;
            },
            error => Promise.reject(error)
        );

        // Add response interceptor for error handling
        this.client.interceptors.response.use(
            response => response,
            error => {
                console.error(`API Error: ${error.response?.status} ${error.response?.statusText}`);
                return Promise.reject(error);
            }
        );
    }

    async fetchData(endpoint, params = {}) {
        try {
            const response = await this.client.get(endpoint, { params });
            return response.data;
        } catch (error) {
            console.error(`Failed to fetch ${endpoint}:`, error.message);
            throw error;
        }
    }

    async postData(endpoint, data) {
        try {
            const response = await this.client.post(endpoint, data);
            return response.data;
        } catch (error) {
            console.error(`Failed to post to ${endpoint}:`, error.message);
            throw error;
        }
    }
}

// Usage
(async () => {
    const apiClient = new HiddenAPIClient('https://api.example.com', {
        headers: {
            'Authorization': 'Bearer discovered_token',
            'X-Requested-With': 'XMLHttpRequest'
        }
    });

    try {
        const products = await apiClient.fetchData('/api/v2/products', {
            page: 1,
            limit: 50,
            category: 'electronics'
        });

        console.log('Products:', products);
    } catch (error) {
        console.error('Error:', error);
    }
})();

API Authentication and Headers

Common Authentication Methods Found:

# 1. Bearer Token Authentication
headers = {
    'Authorization': 'Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...'
}

# 2. API Key in Header
headers = {
    'X-API-Key': 'your_api_key_here',
    'X-RapidAPI-Key': 'rapid_api_key'
}

# 3. Custom Headers
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'X-CSRF-Token': 'csrf_token_value',
    'Referer': 'https://example.com/page'
}

# 4. Cookies (session-based)
cookies = {
    'sessionid': 'session_value',
    'csrftoken': 'csrf_value'
}
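
For session/CSRF-based APIs, you usually have to load a regular page first so the server sets its cookies, then echo the CSRF token back in a header. Below is a minimal sketch with requests.Session (the cookie name, header names, and endpoint are common conventions used for illustration, not guaranteed for any particular site):

import requests

session = requests.Session()

# Load a normal page first so the server sets session and CSRF cookies
session.get("https://example.com/products", timeout=30)

csrf_token = session.cookies.get("csrftoken", "")  # cookie name varies by framework

response = session.post(
    "https://example.com/api/cart/add",  # hypothetical endpoint
    json={"product_id": 12345, "quantity": 1},
    headers={
        "X-CSRF-Token": csrf_token,
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://example.com/products",
    },
    timeout=30,
)
print(response.status_code, response.text[:200])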

Rate Limiting and Best Practices

import time
import random
from functools import wraps

def rate_limit(calls_per_second=1):
    """Decorator to add rate limiting"""
    min_interval = 1.0 / calls_per_second
    last_called = [0.0]

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            left_to_wait = min_interval - elapsed
            if left_to_wait > 0:
                time.sleep(left_to_wait + random.uniform(0, 0.1))
            ret = func(*args, **kwargs)
            last_called[0] = time.time()
            return ret
        return wrapper
    return decorator

# Usage
@rate_limit(calls_per_second=2)  # Max 2 calls per second
def fetch_api_data(endpoint):
    # Your API call here
    pass

Legal and Ethical Considerations

Critical Guidelines:

  1. Terms of Service: Always review and comply with the website's terms of service
  2. Rate Limiting: Implement reasonable delays between requests to avoid overloading servers
  3. robots.txt: Respect the robots.txt file, even for APIs
  4. Data Privacy: Be aware of GDPR, CCPA, and other privacy regulations
  5. Copyright: Respect intellectual property rights
  6. Attribution: Give credit when required by the data source

Recommended Practices:

# Good: Respectful scraping with delays
import time
import random
import requests

def respectful_api_call(url):
    # Random delay between 1-3 seconds
    time.sleep(random.uniform(1, 3))

    headers = {
        'User-Agent': 'YourBot/1.0 (contact@yoursite.com)',  # Identify yourself
        'Accept': 'application/json'
    }

    return requests.get(url, headers=headers)

Legal Compliance Checklist:

  - [ ] Read and understand the website's Terms of Service
  - [ ] Check for explicit API usage policies
  - [ ] Implement rate limiting (1-2 requests per second maximum)
  - [ ] Use proper User-Agent identification
  - [ ] Respect HTTP status codes (especially 429 Too Many Requests)
  - [ ] Don't scrape personal or sensitive data without consent
  - [ ] Consider reaching out to the website owner for permission

Remember: Hidden APIs are meant for internal use. While discovering them is not inherently illegal, using them may still violate a site's terms of service. Always prioritize ethical scraping practices and consider reaching out to website owners for official API access when possible.

Troubleshooting Common Issues

403 Forbidden Errors:

  - Check whether authentication headers are required
  - Verify the Referer header matches the website
  - Ensure the User-Agent mimics a real browser

429 Too Many Requests:

  - Implement exponential backoff (see the sketch below)
  - Reduce request frequency
  - Use rotating proxies if necessary
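
A simple way to handle 429 responses is exponential backoff that honors the Retry-After header when the server provides one. A minimal sketch:

import time
import requests

def get_with_backoff(url, max_retries=5, **kwargs):
    """Retry GET requests on 429, waiting longer after each attempt."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30, **kwargs)
        if response.status_code != 429:
            return response
        # Prefer the server's Retry-After hint; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
        time.sleep(wait)
    return response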

CORS Issues (Browser-based):

  - APIs may block cross-origin requests from other domains
  - Use server-side scraping instead of browser-based requests
  - Consider a CORS proxy for development only
