How do I handle file downloads in Scrapy?
Downloading files in Scrapy is a common requirement when scraping websites that contain documents, images, media files, or other downloadable content. Scrapy provides built-in support for file downloads through its Files Pipeline and Images Pipeline, along with the flexibility to create custom download solutions for specific needs.
Understanding Scrapy's Built-in Download Pipelines
Files Pipeline
The Files Pipeline is Scrapy's general-purpose file download mechanism and can handle any type of file. It avoids re-downloading files that were fetched recently, supports configurable storage backends and paths, and records download metadata (URL, storage path, and checksum) on the scraped item.
Basic setup in settings.py:
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'downloads' # Directory to store downloaded files
FILES_URLS_FIELD = 'file_urls' # Field containing file URLs
FILES_RESULT_FIELD = 'files' # Field to store download results
Spider implementation:
import scrapy


class DocumentSpider(scrapy.Spider):
    name = 'documents'
    start_urls = ['https://example.com/documents']

    def parse(self, response):
        # Extract file URLs from the page
        file_urls = response.css('a[href$=".pdf"]::attr(href)').getall()

        # Convert relative URLs to absolute
        file_urls = [response.urljoin(url) for url in file_urls]

        yield {
            'title': response.css('h1::text').get(),
            'file_urls': file_urls,  # This triggers the Files Pipeline
        }
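Once the Files Pipeline has run, it writes its results into the field named by FILES_RESULT_FIELD ('files' here); each entry contains at least the original URL, the storage path, and a checksum. Below is a minimal sketch of a follow-up pipeline that reads those results. The ReportDownloadsPipeline name is just an illustration, and it must be registered in ITEM_PIPELINES with a higher number than the Files Pipeline so it runs afterwards.

# pipelines.py - hypothetical follow-up pipeline that logs where each file landed
import logging

logger = logging.getLogger(__name__)


class ReportDownloadsPipeline:
    def process_item(self, item, spider):
        # Each entry in item['files'] has 'url', 'path', and 'checksum' keys
        for result in item.get('files', []):
            logger.info("Saved %s to %s (md5 %s)",
                        result['url'], result['path'], result['checksum'])
        return item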
Images Pipeline
For image downloads specifically, Scrapy offers the Images Pipeline, which adds features such as thumbnail generation, minimum-size filtering, and format normalization. It requires the Pillow library.
Setup for images:
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = 'images'
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'

# Optional: Generate thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Skip images smaller than these dimensions
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
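A minimal spider sketch that feeds the Images Pipeline; the start URL and CSS selectors are placeholders for your target site:

import scrapy


class GallerySpider(scrapy.Spider):
    name = 'gallery'
    start_urls = ['https://example.com/gallery']  # placeholder URL

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            # Absolute image URLs in this field trigger the Images Pipeline
            'image_urls': [response.urljoin(src)
                           for src in response.css('img::attr(src)').getall()],
        }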
Advanced File Download Techniques
Custom File Pipeline
For more control over the download process, you can create a custom pipeline:
# pipelines.py
import os
from urllib.parse import urlparse

from scrapy.http import Request
from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        """Generate requests for file downloads"""
        urls = item.get(self.files_urls_field, [])
        for url in urls:
            yield Request(
                url,
                meta={
                    'filename': self.get_filename(url, item),
                    'item': item,
                }
            )

    def get_filename(self, url, item):
        """Generate a custom filename based on item data"""
        parsed = urlparse(url)
        filename = os.path.basename(parsed.path)

        # Add an item-specific prefix
        title = item.get('title', 'unknown')
        safe_title = "".join(c for c in title if c.isalnum() or c in (' ', '-', '_')).rstrip()
        return f"{safe_title}_{filename}"

    def file_path(self, request, response=None, info=None, *, item=None):
        """Define the file storage path"""
        filename = request.meta.get('filename')
        if filename:
            return filename
        return super().file_path(request, response, info, item=item)
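To use the custom pipeline in place of the stock one, point ITEM_PIPELINES at it. The 'myproject' module path below is a placeholder for your own project package:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CustomFilesPipeline': 1,
}
FILES_STORE = 'downloads'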
Downloading Files with Custom Headers
Sometimes you need to add authentication headers or user agents for file downloads:
class AuthenticatedFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Authorization': 'Bearer your-token-here',
            'Referer': item.get('source_url', ''),
        }
        for url in item.get(self.files_urls_field, []):
            yield Request(url, headers=headers)
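Hard-coding a token is brittle. One option is to read it from the project settings through the spider attached to the pipeline's info object. A sketch continuing the same pipelines.py, assuming a custom API_TOKEN setting defined in settings.py (both the class name and the setting name are illustrative):

class SettingsAuthFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # API_TOKEN is an assumed custom setting, not a built-in Scrapy setting
        token = info.spider.settings.get('API_TOKEN', '')
        headers = {'Authorization': f'Bearer {token}'}
        for url in item.get(self.files_urls_field, []):
            yield Request(url, headers=headers)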
Handling Large Files and Streaming Downloads
Scrapy buffers each response fully in memory before your callback runs, so there is no true streaming out of the box. For large files, give the request a generous timeout, raise the size limits shown after the example, and write the body to disk as soon as the download completes:
import os

import scrapy
from scrapy.http import Request


class StreamingDownloadSpider(scrapy.Spider):
    name = 'streaming_download'

    def parse(self, response):
        large_file_url = response.css('a.large-file::attr(href)').get()
        if large_file_url:
            yield Request(
                response.urljoin(large_file_url),
                callback=self.download_large_file,
                meta={'download_timeout': 300}  # 5 minute timeout
            )

    def download_large_file(self, response):
        filename = response.url.split('/')[-1]
        os.makedirs('downloads', exist_ok=True)
        filepath = os.path.join('downloads', filename)

        with open(filepath, 'wb') as f:
            f.write(response.body)

        yield {
            'filename': filename,
            'size': len(response.body),
            'url': response.url,
            'status': 'downloaded',
        }
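Because the whole response is held in memory, Scrapy also enforces size limits that will abort oversized downloads. These are the relevant settings; the values below are examples, not recommendations:

# settings.py
DOWNLOAD_MAXSIZE = 2 * 1024 ** 3     # Allow responses up to 2 GB (default: 1 GB); 0 disables the limit
DOWNLOAD_WARNSIZE = 256 * 1024 ** 2  # Log a warning above 256 MB (default: 32 MB)
DOWNLOAD_TIMEOUT = 600               # Give slow transfers time to finish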
Configuration and Best Practices
Optimal Settings for File Downloads
# settings.py - Optimized for file downloads
DOWNLOAD_DELAY = 1                  # Be respectful to servers
RANDOMIZE_DOWNLOAD_DELAY = True     # Wait between 0.5x and 1.5x DOWNLOAD_DELAY

# Increase timeouts for large files
DOWNLOAD_TIMEOUT = 180              # 3 minutes

# Enable HTTP caching to avoid re-downloading pages
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600    # 1 hour

# Concurrent downloads
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# File pipeline settings
FILES_EXPIRES = 90                  # Days before re-downloading a file
FILES_STORE_S3_ACL = 'public-read'  # Only relevant when FILES_STORE points at S3
Error Handling and Retry Logic
import logging

logger = logging.getLogger(__name__)


class RobustFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        for url in item.get(self.files_urls_field, []):
            request = Request(url)
            request.meta['download_timeout'] = 300
            request.meta['max_retry_times'] = 3
            yield request

    def media_failed(self, failure, request, info):
        """Log failed downloads, then let the base class record the failure"""
        logger.error("Failed to download %s: %s", request.url, failure)
        return super().media_failed(failure, request, info)

    def media_downloaded(self, response, request, info, *, item=None):
        """Log successful downloads"""
        logger.info("Successfully downloaded %s", request.url)
        return super().media_downloaded(response, request, info, item=item)
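The max_retry_times meta key works together with Scrapy's built-in RetryMiddleware and overrides the project-wide retry count per request. The corresponding settings look like this (the HTTP codes shown are Scrapy's defaults):

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3  # Retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]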
Filtering and Validation
Add validation to ensure you only download the files you need:
from scrapy.pipelines.files import FileException


class FilteredFilesPipeline(FilesPipeline):
    ALLOWED_EXTENSIONS = {'.pdf', '.doc', '.docx', '.xls', '.xlsx'}
    MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB

    def get_media_requests(self, item, info):
        for url in item.get(self.files_urls_field, []):
            if self.is_valid_file(url):
                yield Request(url)
            else:
                logger.warning("Skipped invalid file: %s", url)

    def is_valid_file(self, url):
        """Validate the file URL before downloading"""
        parsed = urlparse(url)
        filename = os.path.basename(parsed.path)
        _, ext = os.path.splitext(filename)
        return ext.lower() in self.ALLOWED_EXTENSIONS

    def media_downloaded(self, response, request, info, *, item=None):
        """Validate the downloaded file"""
        if len(response.body) > self.MAX_FILE_SIZE:
            # Raising here marks this file as failed without affecting the rest of the item
            raise FileException(f"File too large: {request.url}")
        return super().media_downloaded(response, request, info, item=item)
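Files that fail validation show up as unsuccessful entries in the results passed to item_completed. A sketch of dropping items that end up with no files at all; the RequireFilesPipeline name is illustrative:

from scrapy.exceptions import DropItem


class RequireFilesPipeline(FilteredFilesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, file_info_or_failure) tuples
        files = [data for ok, data in results if ok]
        if not files:
            raise DropItem(f"No valid files downloaded for {item.get('title')}")
        item['files'] = files
        return item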
Integration with Cloud Storage
Amazon S3 Integration
For production environments, consider storing files in cloud storage. Scrapy's built-in S3 store is enabled simply by pointing FILES_STORE at an s3:// URI (it requires the botocore package):
# settings.py
FILES_STORE = 's3://your-bucket-name/scrapy-files/'
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

# Custom S3 pipeline - only needed if you want to plug in your own store class;
# the stock FilesPipeline already resolves s3:// URIs to S3FilesStore
from scrapy.pipelines.files import FilesPipeline, S3FilesStore


class CustomS3FilesPipeline(FilesPipeline):
    STORE_SCHEMES = {
        **FilesPipeline.STORE_SCHEMES,
        's3': S3FilesStore,
    }
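The store is not limited to S3; Google Cloud Storage works the same way through the FILES_STORE URI and requires the google-cloud-storage package. The bucket and project ID below are placeholders:

# settings.py - Google Cloud Storage instead of S3
FILES_STORE = 'gs://your-bucket-name/scrapy-files/'
GCS_PROJECT_ID = 'your-project-id'
FILES_STORE_GCS_ACL = 'publicRead'  # Optional ACL for uploaded objects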
Testing File Downloads
Create unit tests to ensure your download pipelines work correctly:
# test_pipelines.py
import tempfile
import unittest

from scrapy.http import Request

from myproject.pipelines import CustomFilesPipeline


class TestFilesPipeline(unittest.TestCase):
    def setUp(self):
        # FilesPipeline requires a store URI, so point it at a temporary directory
        self.pipeline = CustomFilesPipeline(store_uri=tempfile.mkdtemp())

    def test_file_requests_generation(self):
        item = {
            'title': 'Test Document',
            'file_urls': ['https://example.com/test.pdf'],
        }
        requests = list(self.pipeline.get_media_requests(item, None))
        self.assertEqual(len(requests), 1)
        self.assertIsInstance(requests[0], Request)
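A second test can pin down the filename sanitization logic. This method goes inside the TestFilesPipeline class above and exercises the get_filename helper directly:

    def test_filename_generation(self):
        item = {'title': 'Test Document'}
        filename = self.pipeline.get_filename('https://example.com/docs/report.pdf', item)
        self.assertEqual(filename, 'Test Document_report.pdf')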
Performance Optimization Tips
- Use appropriate concurrent settings: Balance between speed and server load
- Implement caching: Avoid re-downloading identical files
- Monitor memory usage: For large files, consider streaming downloads
- Set proper timeouts: Prevent hanging downloads from blocking the spider
- Use checksums: Verify file integrity after download (see the sketch below)
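On the last point, the Files Pipeline already records an MD5 checksum for each download, so verification is a matter of re-hashing the stored file and comparing. A small standalone sketch; verify_download is a hypothetical helper, and file_info is one entry from the item's 'files' field:

import hashlib
import os


def verify_download(files_store, file_info):
    """Re-hash a stored file and compare it to the pipeline's recorded checksum"""
    path = os.path.join(files_store, file_info['path'])
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return digest == file_info['checksum']


# Example usage: verify_download('downloads', item['files'][0])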
Similar to how Puppeteer handles file downloads, Scrapy provides robust mechanisms for managing file downloads at scale. While Puppeteer excels at downloading files from JavaScript-heavy sites, Scrapy's pipeline approach is more efficient for bulk file downloads from traditional web pages.
Conclusion
Scrapy's file download capabilities are both powerful and flexible. The built-in Files and Images pipelines handle most common scenarios, while custom pipelines provide the flexibility needed for complex requirements. By following best practices for configuration, error handling, and performance optimization, you can build robust file download systems that scale effectively.
Remember to always respect robots.txt files, implement appropriate delays, and consider the server load when designing your file download workflows. With proper implementation, Scrapy can efficiently handle file downloads ranging from small documents to large media files across thousands of web pages.