How do I scrape data from APIs using Scrapy?

Scrapy is primarily designed for scraping HTML web pages, but it is equally effective at pulling data from REST APIs. By leveraging Scrapy's Request objects and response handling capabilities, you can efficiently extract data from JSON APIs, handle authentication, manage rate limiting, and process large datasets.

Understanding API Scraping with Scrapy

When scraping APIs with Scrapy, you're essentially making HTTP requests to API endpoints and processing JSON responses instead of parsing HTML. This approach offers several advantages:

  • Structured Data: APIs typically return well-structured JSON data
  • Better Performance: No need to parse complex HTML DOM structures
  • Reliability: APIs are designed for programmatic access
  • Rate Limiting: Scrapy's throttling features make it straightforward to respect API limits
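
To make the contrast concrete, here is a minimal sketch of two Scrapy callbacks, one parsing HTML with selectors and one consuming a JSON endpoint (the selectors and field names are illustrative placeholders):

# Parsing HTML: data has to be located inside the DOM with selectors
def parse_html(self, response):
    for row in response.css('div.user'):
        yield {'name': row.css('span.name::text').get()}

# Consuming a JSON API: the same data arrives already structured
def parse_json(self, response):
    for user in response.json().get('users', []):
        yield {'name': user.get('name')}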

Basic API Spider Structure

Here's a fundamental Scrapy spider for API scraping:

import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        # Define your API endpoints
        urls = [
            'https://api.example.com/users',
            'https://api.example.com/posts',
        ]

        for url in urls:
            yield scrapy.Request(
                url=url,
                headers={'Accept': 'application/json'},
                callback=self.parse_api_response
            )

    def parse_api_response(self, response):
        # Parse JSON response
        try:
            data = json.loads(response.text)

            # Process each item in the response
            for item in data.get('results', []):
                yield {
                    'id': item.get('id'),
                    'name': item.get('name'),
                    'email': item.get('email'),
                    'created_at': item.get('created_at')
                }

            # Handle pagination
            next_page = data.get('next')
            if next_page:
                yield scrapy.Request(
                    url=next_page,
                    headers={'Accept': 'application/json'},
                    callback=self.parse_api_response
                )

        except json.JSONDecodeError:
            self.logger.error(f"Failed to parse JSON from {response.url}")

Handling Authentication

Many APIs require authentication. Here are common authentication methods with Scrapy:

API Key Authentication

class AuthenticatedApiSpider(scrapy.Spider):
    name = 'auth_api_spider'
    # Allow 401 responses through to the callback; by default Scrapy's
    # HttpError middleware filters out non-2xx responses
    handle_httpstatus_list = [401]

    def __init__(self, api_key=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_key = api_key or 'your-api-key-here'

    def start_requests(self):
        headers = {
            'Authorization': f'Bearer {self.api_key}',
            'Content-Type': 'application/json'
        }

        yield scrapy.Request(
            url='https://api.example.com/protected-endpoint',
            headers=headers,
            callback=self.parse_protected_data
        )

    def parse_protected_data(self, response):
        if response.status == 401:
            self.logger.error("Authentication failed")
            return

        data = response.json()
        for item in data.get('data', []):
            yield item

OAuth 2.0 Authentication

For OAuth 2.0 flows, you might need to handle token refresh:

import json
import time

import scrapy
from scrapy.http import Request

class OAuthApiSpider(scrapy.Spider):
    name = 'oauth_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.access_token = None
        self.token_expires = 0

    def start_requests(self):
        # First, get the access token
        yield self.build_token_request()

    def build_token_request(self):
        return Request(
            url='https://api.example.com/oauth/token',
            method='POST',
            body=json.dumps({
                'client_id': 'your-client-id',
                'client_secret': 'your-client-secret',
                'grant_type': 'client_credentials'
            }),
            headers={'Content-Type': 'application/json'},
            callback=self.handle_token_response,
            dont_filter=True  # token refreshes would otherwise be dropped as duplicates
        )

    def handle_token_response(self, response):
        token_data = response.json()
        self.access_token = token_data['access_token']
        self.token_expires = time.time() + token_data['expires_in']

        # Now make authenticated requests
        yield self.make_authenticated_request('https://api.example.com/data')

    def make_authenticated_request(self, url):
        if time.time() >= self.token_expires:
            # Token expired, request a fresh one before continuing
            return self.build_token_request()

        return Request(
            url=url,
            headers={'Authorization': f'Bearer {self.access_token}'},
            callback=self.parse_api_data
        )

    def parse_api_data(self, response):
        data = response.json()
        for item in data.get('data', []):
            yield item

Advanced Request Handling

POST Requests with Data

When working with APIs that require POST requests:

def make_post_request(self):
    payload = {
        'query': 'search term',
        'filters': {
            'category': 'technology',
            'date_range': '2024-01-01'
        }
    }

    yield scrapy.Request(
        url='https://api.example.com/search',
        method='POST',
        body=json.dumps(payload),
        headers={
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        },
        callback=self.parse_search_results
    )

def parse_search_results(self, response):
    results = response.json()
    for result in results.get('items', []):
        # Extract detailed information from each result
        detail_url = f"https://api.example.com/items/{result['id']}"
        yield scrapy.Request(
            url=detail_url,
            headers={'Accept': 'application/json'},
            callback=self.parse_item_details,
            meta={'item_data': result}
        )

def parse_item_details(self, response):
    item_data = response.meta['item_data']
    details = response.json()

    # Combine original data with detailed information
    yield {
        **item_data,
        'detailed_description': details.get('description'),
        'specifications': details.get('specs', [])
    }
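
As a side note, Scrapy also ships a JsonRequest helper that serializes a dict payload to JSON, sets the Content-Type header, and defaults to POST when data is given. A minimal sketch of the search request above using it (URL and fields are the same placeholders):

from scrapy.http import JsonRequest

def make_post_request(self):
    payload = {'query': 'search term', 'filters': {'category': 'technology'}}
    # JsonRequest serializes `data` to JSON and switches the method to POST
    yield JsonRequest(
        url='https://api.example.com/search',
        data=payload,
        callback=self.parse_search_results
    )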

Handling Different Response Formats

Some APIs return data in various formats. Handle them appropriately:

def parse_flexible_response(self, response):
    content_type = response.headers.get('Content-Type', b'').decode()

    if 'application/json' in content_type:
        data = response.json()  # if the payload wraps a list (e.g. {'results': [...]}), unwrap it here
    elif 'application/xml' in content_type:
        # Handle XML response
        import xml.etree.ElementTree as ET
        root = ET.fromstring(response.text)
        data = self.xml_to_dict(root)  # xml_to_dict is a helper you define for your schema
    elif 'text/csv' in content_type:
        # Handle CSV response
        import csv
        import io
        reader = csv.DictReader(io.StringIO(response.text))
        data = list(reader)
    else:
        self.logger.warning(f"Unexpected content type: {content_type}")
        return

    # Process the data regardless of original format
    for item in data:
        yield self.process_item(item)

Rate Limiting and Throttling

API scraping requires careful rate limiting to avoid being blocked. Configure Scrapy settings appropriately:

# In settings.py
DOWNLOAD_DELAY = 1  # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5 * and 1.5 * DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

For APIs with specific rate limits, implement custom middleware:

import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class APIRateLimitMiddleware(RetryMiddleware):
    def __init__(self, settings):
        super().__init__(settings)  # initialize the parent's retry settings
        self.last_request_time = 0
        self.min_delay = 1.0  # Minimum delay between requests

    def process_request(self, request, spider):
        # Enforce a minimum delay between requests
        # (time.sleep blocks the reactor, so keep concurrency low with this approach)
        current_time = time.time()
        time_since_last = current_time - self.last_request_time

        if time_since_last < self.min_delay:
            time.sleep(self.min_delay - time_since_last)

        self.last_request_time = time.time()
        return None

    def process_response(self, request, response, spider):
        # Handle rate limit responses
        if response.status == 429:  # Too Many Requests
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                delay = int(retry_after)
                spider.logger.info(f"Rate limited. Retrying after {delay} seconds")
                time.sleep(delay)
                return self._retry(request, 'rate_limited', spider) or response

        return response
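
To activate the middleware, register it in settings.py; the module path below is an assumption, so adjust it to wherever the class lives in your project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.APIRateLimitMiddleware': 543,
}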

Error Handling and Retries

Robust error handling is crucial for API scraping:

class RobustApiSpider(scrapy.Spider):
    name = 'robust_api_spider'

    def parse_api_response(self, response):
        # Check the response status (non-2xx responses only reach this callback if
        # handle_httpstatus_list or HTTPERROR_ALLOW_ALL is set, or an errback is used)
        if response.status != 200:
            self.logger.error(f"API returned status {response.status} for {response.url}")
            return

        try:
            data = response.json()
        except json.JSONDecodeError as e:
            self.logger.error(f"JSON decode error: {e}")
            return

        # Validate expected data structure
        if 'data' not in data:
            self.logger.warning(f"Unexpected response structure from {response.url}")
            return

        # Process data with error handling
        for item in data.get('data', []):
            try:
                processed_item = self.process_item(item)
                if processed_item:
                    yield processed_item
            except Exception as e:
                self.logger.error(f"Error processing item: {e}")
                continue

    def process_item(self, item):
        # Validate required fields
        required_fields = ['id', 'name']
        if not all(field in item for field in required_fields):
            return None

        return {
            'id': item['id'],
            'name': item['name'],
            'description': item.get('description', ''),
            'timestamp': item.get('created_at')
        }
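
Status checks inside callbacks only cover responses that actually arrive; for network-level failures such as DNS errors or timeouts you can attach an errback to each request. A minimal sketch, assuming the same placeholder endpoint:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def start_requests(self):
    yield scrapy.Request(
        url='https://api.example.com/data',
        callback=self.parse_api_response,
        errback=self.handle_error  # called on download errors and filtered non-2xx responses
    )

def handle_error(self, failure):
    if failure.check(HttpError):
        self.logger.error(f"HTTP error on {failure.value.response.url}")
    elif failure.check(DNSLookupError, TimeoutError):
        self.logger.error(f"Network error on {failure.request.url}")
    else:
        self.logger.error(repr(failure))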

Pagination Handling

Many APIs use pagination. Here's how to handle different pagination patterns:

def handle_cursor_pagination(self, response):
    """Handle cursor-based pagination"""
    data = response.json()

    # Process current page items
    for item in data.get('data', []):
        yield item

    # Check for next page
    pagination = data.get('pagination', {})
    if pagination.get('has_next') and pagination.get('next_cursor'):
        # Rebuild from the base URL so query strings don't accumulate across pages
        base_url = response.url.split('?')[0]
        next_url = f"{base_url}?cursor={pagination['next_cursor']}"
        yield scrapy.Request(
            url=next_url,
            callback=self.handle_cursor_pagination
        )

def handle_offset_pagination(self, response):
    """Handle offset/limit pagination"""
    data = response.json()

    # Process current page items
    for item in data.get('results', []):
        yield item

    # Calculate next offset
    current_offset = response.meta.get('offset', 0)
    limit = response.meta.get('limit', 20)
    total_count = data.get('total_count', 0)

    if current_offset + limit < total_count:
        next_offset = current_offset + limit
        # Strip the existing query string before appending the new offset/limit
        base_url = response.url.split('?')[0]
        next_url = f"{base_url}?offset={next_offset}&limit={limit}"
        yield scrapy.Request(
            url=next_url,
            callback=self.handle_offset_pagination,
            meta={'offset': next_offset, 'limit': limit}
        )
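
The offset handler expects offset and limit in request.meta, so the first request needs to seed them. A minimal sketch, with a placeholder endpoint:

def start_requests(self):
    # Seed the first page with explicit offset/limit so the callback can compute the next page
    yield scrapy.Request(
        url='https://api.example.com/items?offset=0&limit=20',
        headers={'Accept': 'application/json'},
        callback=self.handle_offset_pagination,
        meta={'offset': 0, 'limit': 20}
    )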

GraphQL API Support

For GraphQL APIs, you can also use Scrapy:

import json

import scrapy

class GraphQLSpider(scrapy.Spider):
    name = 'graphql_spider'
    graphql_url = 'https://api.example.com/graphql'

    # Storing the query on the class lets the pagination callback reuse it
    query = """
    query GetUsers($limit: Int!, $offset: Int!) {
        users(limit: $limit, offset: $offset) {
            id
            name
            email
            posts {
                title
                content
            }
        }
    }
    """

    def start_requests(self):
        yield self.build_graphql_request(offset=0, limit=50)

    def build_graphql_request(self, offset, limit):
        payload = {
            'query': self.query,
            'variables': {'limit': limit, 'offset': offset}
        }
        return scrapy.Request(
            url=self.graphql_url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse_graphql_response,
            meta={'offset': offset, 'limit': limit}
        )

    def parse_graphql_response(self, response):
        data = response.json()

        if 'errors' in data:
            self.logger.error(f"GraphQL errors: {data['errors']}")
            return

        users = data.get('data', {}).get('users', [])

        for user in users:
            yield {
                'user_id': user['id'],
                'name': user['name'],
                'email': user['email'],
                'post_count': len(user.get('posts', []))
            }

        # Handle pagination: a full page suggests more results remain
        if len(users) == response.meta['limit']:
            new_offset = response.meta['offset'] + response.meta['limit']
            yield self.build_graphql_request(offset=new_offset, limit=response.meta['limit'])
Custom Headers and User Agents

Some APIs require specific headers or user agents:

class CustomHeaderSpider(scrapy.Spider):
    name = 'custom_header_spider'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'application/json',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'User-Agent': 'MyApp/1.0 (contact@example.com)'
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://api.example.com/data',
            headers={
                'X-API-Version': '2024-01-01',
                'X-Client-ID': 'your-client-id'
            },
            callback=self.parse_api_response
        )

    def parse_api_response(self, response):
        # Minimal callback so the snippet is self-contained
        yield from response.json().get('data', [])

Running Your API Spider

Execute your spider with proper settings:

# Basic execution
scrapy crawl api_spider

# With custom settings
scrapy crawl api_spider -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS=1

# With custom arguments
scrapy crawl api_spider -a api_key=your-api-key

# Output to JSON file
scrapy crawl api_spider -o api_data.json

# With specific log level
scrapy crawl api_spider -L INFO

Monitoring and Debugging

Add comprehensive logging to monitor your API scraping:

class MonitoredApiSpider(scrapy.Spider):
    name = 'monitored_api_spider'
    # Let common error statuses reach the callback so failed responses can be counted
    handle_httpstatus_list = [400, 401, 403, 404, 429, 500]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'requests_made': 0,
            'successful_responses': 0,
            'failed_responses': 0,
            'items_scraped': 0
        }

    def parse_api_response(self, response):
        self.stats['requests_made'] += 1

        if response.status == 200:
            self.stats['successful_responses'] += 1
            data = response.json()

            for item in data.get('results', []):
                self.stats['items_scraped'] += 1
                yield item
        else:
            self.stats['failed_responses'] += 1
            self.logger.warning(f"Failed request: {response.status} - {response.url}")

    def closed(self, reason):
        self.logger.info(f"Spider closed: {reason}")
        self.logger.info(f"Final stats: {self.stats}")

Best Practices for API Scraping

  1. Respect Rate Limits: Always implement appropriate delays and respect API rate limits
  2. Handle Errors Gracefully: Implement comprehensive error handling for network issues and API errors
  3. Use Proper Authentication: Securely manage API keys and tokens
  4. Monitor API Changes: APIs can change; implement monitoring for structural changes
  5. Cache Responses: For development, consider caching API responses to avoid unnecessary requests (see the settings sketch after this list)
  6. Validate Data: Always validate the structure and content of API responses
  7. Use Connection Pooling: Leverage Scrapy's built-in connection pooling for better performance
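
For point 5, Scrapy's built-in HTTP cache can be turned on in settings.py during development so repeated runs reuse previously downloaded responses:

# settings.py -- cache responses locally while developing
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # re-fetch after one hour; 0 means never expire
HTTPCACHE_DIR = 'httpcache'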

While Scrapy excels at API scraping, you might also consider handling AJAX requests using Puppeteer for scenarios where you need to interact with dynamic web applications that make API calls, or explore monitoring network requests in Puppeteer to understand API patterns in web applications.

API scraping with Scrapy provides a robust foundation for extracting structured data efficiently. By following these patterns and best practices, you can build reliable, scalable API scraping solutions that handle authentication, pagination, and error scenarios effectively.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
