How to Implement Authentication in Scrapy

Authentication is a crucial aspect of web scraping when dealing with protected content or user-specific data. Scrapy provides several built-in mechanisms and patterns to handle different types of authentication methods. This comprehensive guide covers various authentication techniques you can implement in your Scrapy projects.

Table of Contents

  1. HTTP Basic Authentication
  2. Form-Based Login Authentication
  3. Session Cookies Authentication
  4. Custom Headers Authentication
  5. OAuth Authentication
  6. Advanced Authentication Patterns
  7. Best Practices

HTTP Basic Authentication

HTTP Basic Authentication is the simplest form of authentication where credentials are sent in the request headers. Scrapy supports this through the HttpAuthMiddleware.

Enable HTTP Auth Middleware

The HttpAuthMiddleware ships with Scrapy and is enabled by default, so you normally don't need to change anything. If you have overridden DOWNLOADER_MIDDLEWARES, make sure it is still included:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
}

Implementation Example

import scrapy

class BasicAuthSpider(scrapy.Spider):
    name = 'basic_auth'
    start_urls = ['https://httpbin.org/basic-auth/user/pass']

    # HttpAuthMiddleware reads these spider attributes and adds the
    # Authorization header to matching requests
    http_user = 'user'
    http_pass = 'pass'
    http_auth_domain = 'httpbin.org'  # restrict the credentials to this domain

    def parse(self, response):
        self.logger.info(f"Status: {response.status}")
        yield {
            'authenticated': response.status == 200,
            'content': response.text
        }

Alternative Method with Headers

You can also manually set the Authorization header:

import base64
import scrapy

class ManualBasicAuthSpider(scrapy.Spider):
    name = 'manual_basic_auth'

    def start_requests(self):
        username = 'user'
        password = 'pass'
        credentials = base64.b64encode(f'{username}:{password}'.encode()).decode()

        headers = {
            'Authorization': f'Basic {credentials}'
        }

        yield scrapy.Request(
            'https://httpbin.org/basic-auth/user/pass',
            headers=headers
        )
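
If you prefer not to build the header by hand, the same value can be produced with a helper from w3lib, which Scrapy already depends on:

from w3lib.http import basic_auth_header

# basic_auth_header returns the ready-to-use header value as bytes
headers = {'Authorization': basic_auth_header('user', 'pass')}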

Form-Based Login Authentication

Form-based authentication involves submitting login credentials through an HTML form. This is the most common authentication method for web applications.

Step-by-Step Implementation

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Extract form data and CSRF tokens
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login was successful
        if "Welcome" in response.text:
            # Login successful, proceed to scrape protected content
            yield scrapy.Request(
                url='https://example.com/protected-page',
                callback=self.parse_protected_content
            )
        else:
            self.logger.error("Login failed")

    def parse_protected_content(self, response):
        # Extract data from protected pages
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('.content::text').getall()
        }

Handling CSRF Tokens

Many modern web applications use CSRF tokens for security:

import scrapy

class CSRFLoginSpider(scrapy.Spider):
    name = 'csrf_login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Extract CSRF token
        csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()

        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password',
                'csrf_token': csrf_token
            },
            callback=self.after_login
        )
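
Note that FormRequest.from_response already copies the values of hidden input fields (which is where most CSRF tokens live), so the explicit extraction above is usually only needed when the token sits outside the form. A hedged sketch for the case where the token is exposed in a meta tag and must also be sent as a header (the selector and header name are illustrative):

    def parse(self, response):
        # Illustrative: the token lives in a <meta> tag instead of a hidden input
        csrf_token = response.css('meta[name="csrf-token"]::attr(content)').get()

        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'your_username', 'password': 'your_password'},
            headers={'X-CSRF-Token': csrf_token},
            callback=self.after_login
        )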

Session Cookies Authentication

Some applications require maintaining session cookies across multiple requests. Scrapy automatically handles cookies, but you can also manage them manually.

Automatic Cookie Handling

# settings.py
# Cookie handling is enabled by default; these settings make it explicit
COOKIES_ENABLED = True
COOKIES_DEBUG = True  # Enable for debugging

# The cookies middleware is also part of Scrapy's defaults
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}

Manual Cookie Management

import scrapy

class CookieAuthSpider(scrapy.Spider):
    name = 'cookie_auth'

    def start_requests(self):
        # Set initial cookies
        cookies = {
            'session_id': 'your_session_id',
            'auth_token': 'your_auth_token'
        }

        yield scrapy.Request(
            'https://example.com/protected',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        # Cookies are automatically maintained for subsequent requests
        yield scrapy.Request(
            'https://example.com/another-protected-page',
            callback=self.parse_content
        )
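
If you need several independent sessions in one crawl (for example, logging in with more than one account), Scrapy supports separate cookie jars through the cookiejar request meta key. A minimal sketch; the account names are illustrative:

import scrapy

class MultiSessionSpider(scrapy.Spider):
    name = 'multi_session'

    def start_requests(self):
        # Each account gets its own cookie jar, keyed by an integer
        for jar_id, account in enumerate(['user_a', 'user_b']):
            yield scrapy.Request(
                'https://example.com/login',
                meta={'cookiejar': jar_id, 'account': account},
                dont_filter=True,
                callback=self.login
            )

    def login(self, response):
        # The cookiejar key must be passed along on every follow-up request
        meta = {'cookiejar': response.meta['cookiejar'], 'account': response.meta['account']}
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': response.meta['account'], 'password': 'pass'},
            meta=meta,
            callback=self.after_login
        )

    def after_login(self, response):
        self.logger.info("Logged in as %s", response.meta['account'])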

Persistent Cookie Storage

For long-running scraping sessions, you might want to persist cookies:

# settings.py
COOKIES_ENABLED = True
COOKIES_DEBUG = True

# Custom middleware to save/load cookies (a minimal sketch)
import json, os

class PersistentCookiesMiddleware:
    def __init__(self):
        self.cookies_file = 'cookies.json'
        self.cookies = {}
        if os.path.exists(self.cookies_file):
            with open(self.cookies_file) as f:
                self.cookies = json.load(f)

    def process_request(self, request, spider):
        # Attach previously saved cookies (request.cookies is a dict by default)
        for name, value in self.cookies.items():
            request.cookies.setdefault(name, value)

    def process_response(self, request, response, spider):
        # Capture cookies set by the server and persist them to disk
        for header in response.headers.getlist('Set-Cookie'):
            name, _, rest = header.decode().partition('=')
            self.cookies[name.strip()] = rest.split(';', 1)[0]
        with open(self.cookies_file, 'w') as f:
            json.dump(self.cookies, f)
        return response
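
Enable it with a priority below the built-in CookiesMiddleware (700) so the saved cookies are attached before Scrapy builds the Cookie header; the exact priority value is illustrative:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PersistentCookiesMiddleware': 650,
}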

Custom Headers Authentication

Some APIs use custom headers for authentication, such as API keys or bearer tokens.

API Key Authentication

import scrapy

class APIKeySpider(scrapy.Spider):
    name = 'api_key_auth'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'X-API-Key': 'your-api-key-here',
            'User-Agent': 'MyBot 1.0'
        }
    }

    def start_requests(self):
        yield scrapy.Request('https://api.example.com/data')

Bearer Token Authentication

import scrapy

class BearerTokenSpider(scrapy.Spider):
    name = 'bearer_token'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your-jwt-token-here',
            'Content-Type': 'application/json'
        }

        yield scrapy.Request(
            'https://api.example.com/protected',
            headers=headers
        )

OAuth Authentication

OAuth authentication is more complex and typically requires multiple steps. Here's a simplified example:

import scrapy
import requests
from urllib.parse import parse_qs, urlparse

class OAuthSpider(scrapy.Spider):
    name = 'oauth_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.client_id = 'your_client_id'
        self.client_secret = 'your_client_secret'
        self.access_token = None

    def start_requests(self):
        # Step 1: Get access token (this should be done outside Scrapy)
        token_url = 'https://api.example.com/oauth/token'
        data = {
            'grant_type': 'client_credentials',
            'client_id': self.client_id,
            'client_secret': self.client_secret
        }

        # In practice, use a separate script to get the token
        response = requests.post(token_url, data=data)
        self.access_token = response.json()['access_token']

        # Step 2: Use token for API requests
        headers = {
            'Authorization': f'Bearer {self.access_token}'
        }

        yield scrapy.Request(
            'https://api.example.com/protected-data',
            headers=headers
        )
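
Because the requests.post call above blocks Scrapy's event loop, a common alternative is to obtain the token beforehand and pass it in as a spider argument. A minimal sketch; the access_token argument name is illustrative:

import scrapy

class TokenArgSpider(scrapy.Spider):
    name = 'token_arg'

    def __init__(self, access_token=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not access_token:
            raise ValueError("Pass the token with: scrapy crawl token_arg -a access_token=...")
        self.access_token = access_token

    def start_requests(self):
        yield scrapy.Request(
            'https://api.example.com/protected-data',
            headers={'Authorization': f'Bearer {self.access_token}'}
        )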

Advanced Authentication Patterns

Multi-Step Authentication

import scrapy

class MultiStepAuthSpider(scrapy.Spider):
    name = 'multi_step_auth'

    def start_requests(self):
        # Step 1: Get initial page
        yield scrapy.Request(
            'https://example.com/step1',
            callback=self.step1
        )

    def step1(self, response):
        # Extract data needed for step 2
        token = response.css('input[name="token"]::attr(value)').get()

        # Step 2: Submit first form
        return scrapy.FormRequest.from_response(
            response,
            formdata={'verification_code': 'your_code', 'token': token},
            callback=self.step2
        )

    def step2(self, response):
        # Step 3: Final authentication
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.authenticated
        )

    def authenticated(self, response):
        # Fully authenticated; continue with requests for protected pages here
        self.logger.info("Multi-step authentication complete")

Custom Authentication Middleware

Create a custom middleware for complex authentication logic:

# middlewares.py
class CustomAuthMiddleware:
    def __init__(self):
        self.authenticated = False
        self.auth_token = None

    def process_request(self, request, spider):
        if not self.authenticated:
            # Perform authentication
            self.authenticate()

        if self.auth_token:
            request.headers['Authorization'] = f'Bearer {self.auth_token}'

    def authenticate(self):
        # Custom authentication logic
        pass

Enable the middleware in settings:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomAuthMiddleware': 543,
}
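
If the token is static for the whole crawl, a simpler variant is to load it once from a custom setting (or an environment variable) using the middleware's from_crawler hook. A hedged sketch; the AUTH_TOKEN setting name is illustrative:

# middlewares.py
class SettingsAuthMiddleware:
    def __init__(self, auth_token):
        self.auth_token = auth_token

    @classmethod
    def from_crawler(cls, crawler):
        # AUTH_TOKEN is a custom setting, e.g. defined in settings.py or via -s
        return cls(auth_token=crawler.settings.get('AUTH_TOKEN'))

    def process_request(self, request, spider):
        if self.auth_token:
            request.headers.setdefault('Authorization', f'Bearer {self.auth_token}')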

Best Practices

1. Secure Credential Management

Never hardcode credentials in your code:

import os
import scrapy

class SecureSpider(scrapy.Spider):
    name = 'secure_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.username = os.getenv('SCRAPY_USERNAME')
        self.password = os.getenv('SCRAPY_PASSWORD')

        if not self.username or not self.password:
            raise ValueError("Credentials not found in environment variables")

2. Handle Authentication Failures

def after_login(self, response):
    if response.status == 401:
        self.logger.error("Authentication failed: Invalid credentials")
        return
    elif response.status == 403:
        self.logger.error("Authentication failed: Access forbidden")
        return
    elif "login" in response.url.lower():
        self.logger.error("Login redirect detected: Authentication failed")
        return

    # Authentication successful
    yield scrapy.Request(
        url='https://example.com/protected',
        callback=self.parse_protected
    )
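
Keep in mind that Scrapy's HttpError spider middleware drops non-2xx responses before they reach your callback, so to inspect a 401 or 403 as shown above you must allow those status codes explicitly, for example via the handle_httpstatus_list meta key:

yield scrapy.Request(
    url='https://example.com/login',
    meta={'handle_httpstatus_list': [401, 403]},
    callback=self.after_login
)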

3. Rate Limiting and Respectful Scraping

# settings.py
DOWNLOAD_DELAY = 1  # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Random delay between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 1  # Limit concurrent requests
AUTOTHROTTLE_ENABLED = True  # Enable AutoThrottle
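
AutoThrottle works best when you also review its companion settings; the values below are Scrapy's defaults, shown for reference:

# settings.py
AUTOTHROTTLE_START_DELAY = 5            # Initial download delay
AUTOTHROTTLE_MAX_DELAY = 60             # Highest delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # Average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False              # Set True to log every throttling decision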

4. Session Management

import scrapy

class SessionManagedSpider(scrapy.Spider):
    name = 'session_managed'

    def start_requests(self):
        # Skip the HTTP cache (if enabled) for the login request
        yield scrapy.Request(
            'https://example.com/login',
            meta={'dont_cache': True},
            callback=self.login
        )

    def login(self, response):
        # Submit the login form, then continue with the checks from practice #2
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

Testing Authentication

Create a simple test to verify your authentication:

# Test with scrapy shell (the URL is fetched automatically on startup)
scrapy shell 'https://example.com/login'

# In the shell, test your authentication logic
>>> request = scrapy.FormRequest.from_response(response, formdata={'username': 'test', 'password': 'test'})
>>> fetch(request)   # submit the login form
>>> response.status  # inspect the result

Command Line Usage

Run your authenticated spider:

# Set environment variables for credentials
export SCRAPY_USERNAME=your_username
export SCRAPY_PASSWORD=your_password

# Run the spider
scrapy crawl login_spider

# Run with custom settings
scrapy crawl login_spider -s COOKIES_DEBUG=True

# Save output to file
scrapy crawl login_spider -o authenticated_data.json

Authentication in Scrapy requires careful planning and implementation based on the target website's security mechanisms. While the examples above cover the most common scenarios, similar concepts can be applied to handle authentication in Puppeteer for JavaScript-heavy sites that require browser automation. Always respect the website's terms of service and implement appropriate rate limiting to ensure responsible scraping practices.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
