How do I handle file downloads in Scrapy?

Downloading files in Scrapy is a common requirement when scraping websites that contain documents, images, media files, or other downloadable content. Scrapy provides built-in support for file downloads through its Files Pipeline and Images Pipeline, along with the flexibility to create custom download solutions for specific needs.

Understanding Scrapy's Built-in Download Pipelines

Files Pipeline

The Files Pipeline is Scrapy's general-purpose file download mechanism that can handle any type of file. It provides automatic deduplication, configurable storage paths, and metadata extraction.

Basic setup in settings.py:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = 'downloads'  # Directory to store downloaded files
FILES_URLS_FIELD = 'file_urls'  # Field containing file URLs
FILES_RESULT_FIELD = 'files'    # Field to store download results

Spider implementation:

import scrapy

class DocumentSpider(scrapy.Spider):
    name = 'documents'
    start_urls = ['https://example.com/documents']

    def parse(self, response):
        # Extract file URLs from the page
        file_urls = response.css('a[href$=".pdf"]::attr(href)').getall()

        # Convert relative URLs to absolute
        file_urls = [response.urljoin(url) for url in file_urls]

        yield {
            'title': response.css('h1::text').get(),
            'file_urls': file_urls,  # This triggers the Files Pipeline
        }
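
Once the pipeline has processed the item, it fills the results field ('files' by default) with one entry per downloaded file. Roughly, with illustrative values (exact keys can vary slightly between Scrapy versions):

# Item after the Files Pipeline has run (illustrative values)
{
    'title': 'Quarterly Report',
    'file_urls': ['https://example.com/docs/report.pdf'],
    'files': [{
        'url': 'https://example.com/docs/report.pdf',
        'path': 'full/4d1fa9c2...e7.pdf',   # relative to FILES_STORE, named after the URL hash
        'checksum': '9b5c3f...',            # MD5 of the downloaded body
        'status': 'downloaded',             # or 'uptodate' / 'cached'
    }],
}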

Images Pipeline

For image downloads specifically, Scrapy offers the Images Pipeline, which adds features such as thumbnail generation, minimum-size filtering, and format conversion (images are converted to JPEG by default). It requires the Pillow library to be installed.

Setup for images:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = 'images'
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'

# Optional: Generate thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Skip images smaller than these dimensions
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
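
On the spider side, the Images Pipeline is triggered the same way as the Files Pipeline: yield the configured URLs field. A minimal sketch (the spider name and CSS selectors are assumptions about the target page):

import scrapy

class GallerySpider(scrapy.Spider):
    name = 'gallery'
    start_urls = ['https://example.com/gallery']

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            # Absolute URLs in this field are picked up by the Images Pipeline
            'image_urls': [
                response.urljoin(src)
                for src in response.css('img.photo::attr(src)').getall()
            ],
        }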

Advanced File Download Techniques

Custom File Pipeline

For more control over the download process, you can create a custom pipeline:

# pipelines.py
import os
from urllib.parse import urlparse

from scrapy.http import Request
from scrapy.pipelines.files import FilesPipeline

class CustomFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        """Generate requests for file downloads"""
        urls = item.get(self.files_urls_field, [])  # Field name comes from FILES_URLS_FIELD
        for url in urls:
            yield Request(
                url,
                meta={
                    'filename': self.get_filename(url, item),
                    'item': item
                }
            )

    def get_filename(self, url, item):
        """Generate custom filename based on item data"""
        parsed = urlparse(url)
        filename = os.path.basename(parsed.path)

        # Add item-specific prefix
        title = item.get('title', 'unknown')
        safe_title = "".join(c for c in title if c.isalnum() or c in (' ', '-', '_')).rstrip()

        return f"{safe_title}_{filename}"

    def file_path(self, request, response=None, info=None, *, item=None):
        """Define the file storage path"""
        filename = request.meta.get('filename')
        if filename:
            return filename
        return super().file_path(request, response, info, item=item)
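
To activate a custom pipeline, point ITEM_PIPELINES at it instead of the stock class; the myproject.pipelines path matches the project layout assumed by the test module later in this article:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CustomFilesPipeline': 1,  # replaces the stock FilesPipeline
}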

Downloading Files with Custom Headers

Sometimes you need to add authentication headers or user agents for file downloads:

class AuthenticatedFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Authorization': 'Bearer your-token-here',
            'Referer': item.get('source_url', ''),
        }

        for url in item.get(self.files_urls_field, []):
            yield Request(url, headers=headers)

Handling Large Files and Streaming Downloads

Scrapy loads each response fully into memory before your callback runs, so true streaming is not available out of the box. For large files, the practical approach is to raise the download timeout and size limits (see the settings sketch after the example) and write the body to disk in a callback:

import os

import scrapy
from scrapy.http import Request

class StreamingDownloadSpider(scrapy.Spider):
    name = 'streaming_download'
    start_urls = ['https://example.com/downloads']  # Replace with the real listing page

    def parse(self, response):
        large_file_url = response.css('a.large-file::attr(href)').get()

        if large_file_url:
            yield Request(
                response.urljoin(large_file_url),
                callback=self.download_large_file,
                meta={'download_timeout': 300}  # 5 minutes timeout
            )

    def download_large_file(self, response):
        filename = response.url.split('/')[-1]
        filepath = f'downloads/{filename}'

        os.makedirs('downloads', exist_ok=True)  # Ensure the target directory exists
        with open(filepath, 'wb') as f:
            f.write(response.body)

        yield {
            'filename': filename,
            'size': len(response.body),
            'url': response.url,
            'status': 'downloaded'
        }
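
The size limits mentioned above are ordinary Scrapy settings; the values below are only illustrative:

# settings.py
DOWNLOAD_MAXSIZE = 2 * 1024 * 1024 * 1024  # Abort responses above 2 GB (default is 1 GB)
DOWNLOAD_WARNSIZE = 256 * 1024 * 1024      # Warn above 256 MB (default is 32 MB)
DOWNLOAD_TIMEOUT = 600                     # Give slow transfers time to finish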

Configuration and Best Practices

Optimal Settings for File Downloads

# settings.py - Optimized for file downloads
DOWNLOAD_DELAY = 1  # Be respectful to servers
RANDOMIZE_DOWNLOAD_DELAY = True  # Wait between 0.5x and 1.5x DOWNLOAD_DELAY

# Increase timeouts for large files
DOWNLOAD_TIMEOUT = 180  # 3 minutes

# Enable HTTP caching to avoid re-downloading
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # 1 hour

# Concurrent downloads
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# File pipeline settings
FILES_EXPIRES = 90  # Days before re-downloading files
FILES_STORE_S3_ACL = 'public-read'  # If using S3 storage

Error Handling and Retry Logic

import logging

from scrapy.http import Request
from scrapy.pipelines.files import FilesPipeline

logger = logging.getLogger(__name__)


class RobustFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for url in item.get(self.files_urls_field, []):
            request = Request(url)
            request.meta['download_timeout'] = 300
            request.meta['max_retry_times'] = 3  # Picked up by RetryMiddleware
            yield request

    def media_failed(self, failure, request, info):
        """Handle failed downloads"""
        logger.error(f"Failed to download {request.url}: {failure}")
        return failure

    def media_downloaded(self, response, request, info, *, item=None):
        """Process successful downloads"""
        logger.info(f"Successfully downloaded {request.url}")
        return super().media_downloaded(response, request, info, item=item)
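
If an item should be dropped when none of its files could be fetched, item_completed is the hook to override. A minimal sketch, assuming dict items and the default 'files' result field:

from scrapy.exceptions import DropItem


class StrictFilesPipeline(RobustFilesPipeline):

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info_or_failure) tuples
        downloaded = [data for ok, data in results if ok]
        if not downloaded:
            raise DropItem(f"No files could be downloaded for {item.get('title')}")
        item['files'] = downloaded
        return item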

Filtering and Validation

Add validation to ensure you only download the files you need:

import logging
import os
from urllib.parse import urlparse

from scrapy.http import Request
from scrapy.pipelines.files import FileException, FilesPipeline

logger = logging.getLogger(__name__)


class FilteredFilesPipeline(FilesPipeline):

    ALLOWED_EXTENSIONS = {'.pdf', '.doc', '.docx', '.xls', '.xlsx'}
    MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB

    def get_media_requests(self, item, info):
        for url in item.get(self.files_urls_field, []):
            if self.is_valid_file(url):
                yield Request(url)
            else:
                logger.warning(f"Skipped invalid file: {url}")

    def is_valid_file(self, url):
        """Validate file URL before downloading"""
        parsed = urlparse(url)
        filename = os.path.basename(parsed.path)
        _, ext = os.path.splitext(filename)

        return ext.lower() in self.ALLOWED_EXTENSIONS

    def media_downloaded(self, response, request, info, *, item=None):
        """Validate downloaded file size"""
        if len(response.body) > self.MAX_FILE_SIZE:
            # FileException marks this file as failed without discarding the whole item
            raise FileException(f"File too large: {request.url}")

        return super().media_downloaded(response, request, info, item=item)

Integration with Cloud Storage

Amazon S3 Integration

For production environments, consider storing files in cloud storage:

# settings.py
FILES_STORE = 's3://your-bucket-name/scrapy-files/'
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

# Optional: the built-in FilesPipeline already maps 's3://' URIs to
# S3FilesStore (botocore must be installed); subclass only if you need to
# customise the store, and keep the other schemes intact.
from scrapy.pipelines.files import FilesPipeline, S3FilesStore

class CustomS3FilesPipeline(FilesPipeline):
    STORE_SCHEMES = {
        **FilesPipeline.STORE_SCHEMES,
        's3': S3FilesStore,
    }

Testing File Downloads

Create unit tests to ensure your download pipelines work correctly:

# test_pipelines.py
import tempfile
import unittest

from scrapy.http import Request

from myproject.pipelines import CustomFilesPipeline

class TestFilesPipeline(unittest.TestCase):

    def setUp(self):
        # FilesPipeline subclasses need a storage URI; use a throwaway directory
        self.pipeline = CustomFilesPipeline(store_uri=tempfile.mkdtemp())

    def test_file_requests_generation(self):
        item = {
            'title': 'Test Document',
            'file_urls': ['https://example.com/test.pdf']
        }

        requests = list(self.pipeline.get_media_requests(item, None))
        self.assertEqual(len(requests), 1)
        self.assertIsInstance(requests[0], Request)
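
Run it with python -m unittest test_pipelines (or pytest), adjusting the myproject.pipelines import to wherever your custom pipeline actually lives.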

Performance Optimization Tips

  1. Use appropriate concurrent settings: Balance between speed and server load
  2. Implement caching: Avoid re-downloading identical files
  3. Monitor memory usage: For large files, consider streaming downloads
  4. Set proper timeouts: Prevent hanging downloads from blocking the spider
  5. Use checksums: Verify file integrity after download; the pipeline already records one per file (see the sketch below)
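
On the last point, the Files Pipeline stores an MD5 checksum for every file in the results field, so verification can be a small post-processing step. A minimal sketch; verify_download is a hypothetical helper, and 'downloads' assumes the FILES_STORE used earlier:

import hashlib
import os

def verify_download(file_entry, files_store='downloads'):
    """Re-hash a stored file and compare it with the pipeline's checksum."""
    full_path = os.path.join(files_store, file_entry['path'])
    with open(full_path, 'rb') as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    return local_md5 == file_entry['checksum']

# Usage: all(verify_download(entry) for entry in item['files'])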

Similar to how Puppeteer handles file downloads, Scrapy provides robust mechanisms for managing file downloads at scale. While Puppeteer excels at downloading files from JavaScript-heavy sites, Scrapy's pipeline approach is more efficient for bulk file downloads from traditional web pages.

Conclusion

Scrapy's file download capabilities are both powerful and flexible. The built-in Files and Images pipelines handle most common scenarios, while custom pipelines provide the flexibility needed for complex requirements. By following best practices for configuration, error handling, and performance optimization, you can build robust file download systems that scale effectively.

Remember to always respect robots.txt files, implement appropriate delays, and consider the server load when designing your file download workflows. With proper implementation, Scrapy can efficiently handle file downloads ranging from small documents to large media files across thousands of web pages.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
