How do I scrape data from APIs using Python?

API scraping with Python is one of the most efficient ways to collect structured data from web services. Unlike traditional web scraping that parses HTML, API scraping involves making HTTP requests to dedicated endpoints that return data in formats like JSON or XML. This approach is faster, more reliable, and less likely to break when websites update their interfaces.

Understanding API Scraping vs Web Scraping

API scraping differs from traditional web scraping in several key ways (a short side-by-side sketch follows this list):

  • Structured Data: APIs return data in structured formats (JSON, XML) rather than HTML
  • Better Performance: Direct data access without parsing HTML overhead
  • Rate Limits: APIs often have explicit rate limiting and usage policies
  • Authentication: Many APIs require authentication tokens or API keys
  • Documentation: APIs typically provide comprehensive documentation
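To make the contrast concrete, here is a minimal sketch that fetches a post title both ways. The JSON endpoint is the public JSONPlaceholder test API used later in this article; the HTML page URL and CSS selector are illustrative assumptions only.

import requests
from bs4 import BeautifulSoup  # only needed for the HTML route

# API route: structured JSON, no markup parsing required
api_post = requests.get("https://jsonplaceholder.typicode.com/posts/1").json()
print(api_post["title"])

# HTML route: download a page and parse markup (URL and selector are illustrative)
html = requests.get("https://example.com/blog/post-1").text
soup = BeautifulSoup(html, "html.parser")
title_tag = soup.select_one("h1.post-title")
print(title_tag.get_text(strip=True) if title_tag else "title not found")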

Basic API Scraping with Python Requests

The requests library is the foundation for API scraping in Python. Here's a simple example:

import requests
import json
import time

def scrape_basic_api():
    """Basic API scraping example"""
    url = "https://jsonplaceholder.typicode.com/posts"

    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError for bad responses

        data = response.json()
        print(f"Retrieved {len(data)} posts")

        # Process the data
        for post in data[:5]:  # First 5 posts
            print(f"Post {post['id']}: {post['title']}")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")

scrape_basic_api()

Advanced Request Configuration

For production API scraping, you'll need more sophisticated request handling:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class APIScraper:
    def __init__(self, base_url, timeout=30):
        self.base_url = base_url
        self.session = requests.Session()
        self.timeout = timeout

        # Configure retry strategy
        retry_strategy = Retry(
            total=3,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"],
            backoff_factor=1
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

        # Set common headers
        self.session.headers.update({
            'User-Agent': 'Python API Scraper 1.0',
            'Accept': 'application/json',
            'Content-Type': 'application/json'
        })

    def get(self, endpoint, params=None):
        """Make GET request with error handling"""
        url = f"{self.base_url.rstrip('/')}/{endpoint.lstrip('/')}"

        try:
            response = self.session.get(
                url, 
                params=params, 
                timeout=self.timeout
            )
            response.raise_for_status()
            return response.json()

        except requests.exceptions.Timeout:
            print(f"Timeout error for {url}")
            return None
        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}")
            return None
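Since the class above is not shown in use, here is a short usage sketch against the JSONPlaceholder test API; any JSON API with a compatible response shape would work the same way.

# Instantiate against the JSONPlaceholder test API and fetch a filtered list
scraper = APIScraper("https://jsonplaceholder.typicode.com", timeout=10)
posts = scraper.get("/posts", params={"userId": 1})

if posts is not None:
    print(f"Fetched {len(posts)} posts for user 1")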

Handling API Authentication

Most production APIs require authentication. Here are common methods:

API Key Authentication

class AuthenticatedScraper(APIScraper):
    def __init__(self, base_url, api_key, auth_type='header'):
        super().__init__(base_url)
        self.api_key = api_key

        if auth_type == 'header':
            self.session.headers['Authorization'] = f'Bearer {api_key}'
        elif auth_type == 'param':
            self.default_params = {'api_key': api_key}

    def get(self, endpoint, params=None):
        if hasattr(self, 'default_params'):
            params = {**(params or {}), **self.default_params}
        return super().get(endpoint, params)

# Usage
scraper = AuthenticatedScraper(
    "https://api.example.com", 
    "your-api-key-here"
)
data = scraper.get("/users")

OAuth Authentication

from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session

def oauth_scraper(client_id, client_secret, token_url, api_url):
    """OAuth 2.0 client credentials flow example"""

    # Get an access token using the client credentials grant
    client = BackendApplicationClient(client_id=client_id)
    oauth = OAuth2Session(client=client)
    token = oauth.fetch_token(
        token_url=token_url,
        client_id=client_id,
        client_secret=client_secret
    )

    # Make authenticated requests
    response = oauth.get(api_url)
    return response.json()
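A usage sketch for the helper above; the credentials and URLs are placeholders for whichever OAuth 2.0 provider you are targeting.

# Placeholder credentials and endpoints -- substitute your provider's values
data = oauth_scraper(
    client_id="your-client-id",
    client_secret="your-client-secret",
    token_url="https://auth.example.com/oauth/token",
    api_url="https://api.example.com/v1/reports"
)
print(len(data) if isinstance(data, list) else data)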

Handling Pagination

APIs often paginate results. Here's how to handle different pagination patterns:

Offset-Based Pagination

def scrape_paginated_api(base_url, endpoint, page_size=100):
    """Handle offset-based pagination"""
    all_data = []
    offset = 0

    while True:
        params = {
            'limit': page_size,
            'offset': offset
        }

        response = requests.get(f"{base_url}/{endpoint}", params=params)
        response.raise_for_status()
        data = response.json()

        if not data.get('results'):
            break

        all_data.extend(data['results'])

        # Check if there's more data
        if len(data['results']) < page_size:
            break

        offset += page_size
        time.sleep(0.1)  # Rate limiting

    return all_data

Cursor-Based Pagination

def scrape_cursor_pagination(api_scraper, endpoint):
    """Handle cursor-based pagination"""
    all_data = []
    next_cursor = None

    while True:
        params = {'cursor': next_cursor} if next_cursor else {}
        data = api_scraper.get(endpoint, params)

        if not data or not data.get('items'):
            break

        all_data.extend(data['items'])
        next_cursor = data.get('next_cursor')

        if not next_cursor:
            break

        time.sleep(0.1)  # Rate limiting

    return all_data

Rate Limiting and Throttling

Proper rate limiting prevents API blocking and ensures sustainable scraping:

import time
from datetime import datetime, timedelta

class RateLimitedScraper:
    def __init__(self, requests_per_minute=60):
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def wait_if_needed(self):
        """Implement rate limiting"""
        now = datetime.now()

        # Remove requests older than 1 minute
        cutoff = now - timedelta(minutes=1)
        self.request_times = [t for t in self.request_times if t > cutoff]

        # Check if we need to wait
        if len(self.request_times) >= self.requests_per_minute:
            sleep_time = 60 - (now - self.request_times[0]).total_seconds()
            if sleep_time > 0:
                print(f"Rate limit reached. Sleeping for {sleep_time:.2f} seconds")
                time.sleep(sleep_time)

        self.request_times.append(now)

    def get(self, url, **kwargs):
        self.wait_if_needed()
        return requests.get(url, **kwargs)
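A minimal sketch of the limiter in use, again pointed at the JSONPlaceholder test API; the 30-requests-per-minute budget is an arbitrary choice.

# Throttle to 30 requests per minute while fetching several resources
limited = RateLimitedScraper(requests_per_minute=30)

for post_id in range(1, 6):
    response = limited.get(f"https://jsonplaceholder.typicode.com/posts/{post_id}")
    if response.ok:
        print(response.json()["title"])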

Error Handling and Monitoring

Robust error handling is crucial for reliable API scraping:

import logging
from functools import wraps

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def retry_on_failure(max_retries=3, delay=1):
    """Decorator for retrying failed API calls"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.exceptions.RequestException as e:
                    logger.warning(f"Attempt {attempt + 1} failed: {e}")
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay * (2 ** attempt))  # Exponential backoff
            return None
        return wrapper
    return decorator

class MonitoredAPIScraper:
    def __init__(self):
        self.request_count = 0
        self.error_count = 0
        self.start_time = datetime.now()

    @retry_on_failure(max_retries=3)
    def fetch_data(self, url, params=None):
        """Fetch data with monitoring"""
        self.request_count += 1

        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()

            logger.info(f"Successfully fetched data from {url}")
            return response.json()

        except requests.exceptions.RequestException as e:
            self.error_count += 1
            logger.error(f"Error fetching {url}: {e}")
            raise

    def get_stats(self):
        """Get scraping statistics"""
        runtime = datetime.now() - self.start_time
        return {
            'requests': self.request_count,
            'errors': self.error_count,
            'runtime': str(runtime),
            'success_rate': (self.request_count - self.error_count) / self.request_count if self.request_count > 0 else 0
        }
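Continuing from the classes above, a brief sketch of the monitored scraper in action and the statistics it reports.

# Fetch a couple of test endpoints and report the collected statistics
monitor = MonitoredAPIScraper()

try:
    posts = monitor.fetch_data("https://jsonplaceholder.typicode.com/posts")
    users = monitor.fetch_data("https://jsonplaceholder.typicode.com/users")
except requests.exceptions.RequestException:
    pass  # already logged inside fetch_data

print(monitor.get_stats())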

Data Processing and Storage

Once you've scraped API data, you'll need to process and store it:

import json
import pandas as pd
import sqlite3
from datetime import datetime

class DataProcessor:
    def __init__(self, db_path='scraped_data.db'):
        self.db_path = db_path
        self.setup_database()

    def setup_database(self):
        """Initialize SQLite database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        cursor.execute('''
            CREATE TABLE IF NOT EXISTS api_data (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT,
                data TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')

        conn.commit()
        conn.close()

    def save_data(self, data, source):
        """Save scraped data to database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        cursor.execute(
            'INSERT INTO api_data (source, data) VALUES (?, ?)',
            (source, json.dumps(data))
        )

        conn.commit()
        conn.close()

    def to_dataframe(self, data):
        """Convert API data to pandas DataFrame"""
        if isinstance(data, list) and len(data) > 0:
            return pd.DataFrame(data)
        return pd.DataFrame()
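A short usage sketch for the processor above; the database file name and the JSONPlaceholder source are illustrative.

import requests

# Persist a small API response and load it back into a DataFrame
processor = DataProcessor(db_path="scraped_data.db")

posts = requests.get("https://jsonplaceholder.typicode.com/posts").json()
processor.save_data(posts, source="jsonplaceholder/posts")

df = processor.to_dataframe(posts)
print(df[["id", "title"]].head())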

Complete API Scraping Example

Here's a comprehensive example that combines all concepts:

import requests
import json
import time
import pandas as pd
from datetime import datetime
import logging

class ComprehensiveAPIScraper:
    def __init__(self, base_url, api_key=None):
        self.base_url = base_url
        self.session = requests.Session()
        self.api_key = api_key

        # Setup logging
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)

        # Configure session
        if api_key:
            self.session.headers['Authorization'] = f'Bearer {api_key}'

        self.session.headers.update({
            'User-Agent': 'Python API Scraper',
            'Accept': 'application/json'
        })

    def scrape_endpoint(self, endpoint, params=None, paginated=True):
        """Scrape a specific API endpoint"""
        all_data = []
        page = 1

        while True:
            current_params = params.copy() if params else {}

            if paginated:
                current_params['page'] = page
                current_params['per_page'] = 100

            try:
                response = self.session.get(
                    f"{self.base_url}/{endpoint.lstrip('/')}",
                    params=current_params,
                    timeout=30
                )
                response.raise_for_status()
                data = response.json()

                if isinstance(data, list):
                    all_data.extend(data)
                    if len(data) == 0 or not paginated:
                        break
                elif isinstance(data, dict):
                    items = data.get('data', data.get('items', []))
                    all_data.extend(items)

                    if not items or not paginated:
                        break

                page += 1
                time.sleep(0.1)  # Rate limiting

            except requests.exceptions.RequestException as e:
                self.logger.error(f"Error scraping {endpoint}: {e}")
                break

        self.logger.info(f"Scraped {len(all_data)} items from {endpoint}")
        return all_data

# Usage example
def main():
    scraper = ComprehensiveAPIScraper(
        "https://api.example.com",
        api_key="your-api-key"
    )

    # Scrape different endpoints
    users = scraper.scrape_endpoint("/users")
    posts = scraper.scrape_endpoint("/posts")

    # Convert to DataFrames for analysis
    users_df = pd.DataFrame(users)
    posts_df = pd.DataFrame(posts)

    print(f"Scraped {len(users)} users and {len(posts)} posts")

if __name__ == "__main__":
    main()

Best Practices for API Scraping

  1. Respect Rate Limits: Always implement proper rate limiting to avoid being blocked
  2. Handle Errors Gracefully: Use try-catch blocks and implement retry logic
  3. Cache Responses: Store responses locally to avoid redundant requests (see the caching sketch after this list)
  4. Monitor Usage: Track your API usage to stay within limits
  5. Use Session Objects: Reuse connections for better performance
  6. Follow Documentation: Always read and follow the API's terms of service
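To make item 3 concrete, here is a minimal file-based caching sketch using only the standard library; the cache directory and TTL are arbitrary choices, and a dedicated library such as requests-cache implements the same idea with less code.

import hashlib
import json
import os
import time
import requests

CACHE_DIR = "api_cache"   # arbitrary directory name
CACHE_TTL = 3600          # seconds before a cached response is considered stale

def cached_get(url, params=None):
    """Return cached JSON if still fresh, otherwise fetch and cache it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(f"{url}|{json.dumps(params, sort_keys=True)}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")

    # Serve from cache if the file exists and is newer than the TTL
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
        with open(path) as f:
            return json.load(f)

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()

    with open(path, "w") as f:
        json.dump(data, f)
    return data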

Alternative Approaches for Complex Cases

When APIs aren't available or sufficient, you might need to combine API scraping with other techniques. For JavaScript-heavy applications that load data dynamically, consider using Puppeteer for single page application scraping or monitoring network requests in Puppeteer to capture API calls made by the browser.

Conclusion

API scraping with Python is a powerful technique for collecting structured data efficiently. By using the requests library with proper authentication, pagination handling, rate limiting, and error management, you can build robust scrapers that collect data reliably. Remember to always respect the API's terms of service and implement proper monitoring to ensure your scraping operations run smoothly.

The key to successful API scraping is understanding the specific requirements of each API, implementing proper error handling, and designing your scraper to be resilient and respectful of the service you're accessing.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
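If you prefer Python over curl, the same call can be made with the requests library; this sketch simply mirrors the /ai/fields example above, so the endpoint and parameter names are taken from it.

import requests

# Mirrors the /ai/fields curl example above
response = requests.get(
    "https://api.webscraping.ai/ai/fields",
    params={
        "url": "https://example.com",
        "fields[title]": "Page title",
        "fields[price]": "Product price",
        "api_key": "YOUR_API_KEY",
    },
    timeout=60,
)
print(response.json())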
