Scrapy provides a powerful built-in ImagesPipeline for downloading images automatically. This guide covers everything from basic setup to advanced image filtering and handling.
Step 1: Configure Scrapy Settings
First, enable the ImagesPipeline in your settings.py file (the pipeline requires the Pillow library, so run pip install Pillow first):
# Enable the Images Pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Configure image storage
IMAGES_STORE = 'images'  # Relative to the directory you run the crawl from
# IMAGES_STORE = '/absolute/path/to/images'  # Or an absolute path

# Optional: skip images below a minimum size and generate thumbnails
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Optional: days before an image is considered stale and re-downloaded (default is 90)
IMAGES_EXPIRES = 30
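If your items already use different field names, the pipeline can be pointed at them via two companion settings; the names below (photo_urls, photos) are purely illustrative:

# Optional: override which item fields the pipeline reads and writes
IMAGES_URLS_FIELD = 'photo_urls'   # default: 'image_urls'
IMAGES_RESULT_FIELD = 'photos'     # default: 'images'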
Step 2: Define Your Item
Create an item class in items.py with the required fields:
import scrapy

class ImageItem(scrapy.Item):
    # Required for ImagesPipeline
    image_urls = scrapy.Field()  # List of image URLs to download
    images = scrapy.Field()      # Pipeline populates this with download metadata

    # Optional: additional metadata
    title = scrapy.Field()
    alt_text = scrapy.Field()
    page_url = scrapy.Field()
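Defining an Item class is optional: Scrapy's item pipelines also accept plain dicts, so a spider could yield the same structure directly:

# Equivalent to ImageItem, as a plain dict
yield {
    'image_urls': ['https://example.com/photo.jpg'],
    'page_url': 'https://example.com/gallery',
}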
Step 3: Create Your Spider
Build a spider that extracts image URLs and handles relative links:
import scrapy
from myproject.items import ImageItem

class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com/gallery']

    def parse(self, response):
        # Extract all image URLs
        img_urls = response.css('img::attr(src)').getall()

        # Convert relative URLs to absolute URLs
        absolute_urls = [response.urljoin(url) for url in img_urls]

        # Create and yield the item
        item = ImageItem()
        item['image_urls'] = absolute_urls
        item['page_url'] = response.url
        yield item

        # Follow pagination links
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Step 4: Advanced Spider with Filtering
For more control over which images to download:
import scrapy
from urllib.parse import urlparse
from myproject.items import ImageItem

class FilteredImageSpider(scrapy.Spider):
    name = 'filtered_image_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract images along with their metadata
        for img in response.css('img'):
            src = img.css('::attr(src)').get()
            alt = img.css('::attr(alt)').get()

            if self.is_valid_image(src):
                item = ImageItem()
                item['image_urls'] = [response.urljoin(src)]
                item['alt_text'] = alt or ''
                item['title'] = response.css('title::text').get()
                item['page_url'] = response.url
                yield item

    def is_valid_image(self, url):
        """Filter images by extension and filename patterns."""
        if not url:
            return False

        # Check the file extension
        parsed = urlparse(url.lower())
        valid_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp']
        if not any(parsed.path.endswith(ext) for ext in valid_extensions):
            return False

        # Skip thumbnails, icons, and similar assets based on the filename
        skip_patterns = ['thumb', 'icon', 'logo', 'avatar']
        if any(pattern in url.lower() for pattern in skip_patterns):
            return False

        return True
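One caveat: many galleries lazy-load images, leaving a placeholder in src and the real URL in an attribute such as data-src. The attribute name varies by site, so treat this as an assumption to verify, but a fallback chain inside the loop above handles both cases:

# Prefer the lazy-load attribute when present, then fall back to src
# NOTE: 'data-src' is a common convention, not a standard; inspect your target site
src = img.attrib.get('data-src') or img.attrib.get('src')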
Step 5: Run Your Spider
Execute the spider from your project directory:
# Basic run
scrapy crawl image_spider

# Run with custom settings
scrapy crawl image_spider -s IMAGES_STORE=/custom/path

# Save scraped items (including image metadata) to a JSON file
scrapy crawl image_spider -o images.json
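Spiders can also be launched from a plain Python script instead of the CLI. A minimal sketch using Scrapy's CrawlerProcess, assuming the script runs inside the project so get_project_settings() can locate settings.py:

# run.py -- launch the spider programmatically
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('image_spider')  # spiders can be referenced by name
process.start()  # blocks until the crawl finishes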
Understanding Image Pipeline Behavior
The ImagesPipeline automatically:
- Downloads images from the URLs in the image_urls field
- Generates unique filenames using a SHA-1 hash of the image URL
- Populates the images field with download metadata:

item['images'] = [
    {
        'url': 'https://example.com/image.jpg',
        'path': 'full/0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg',  # SHA-1 of the URL
        'checksum': '3f4e8c1b9a6d2e7f0c5b8a1d4e7f0c5b'  # MD5 of the image contents (illustrative value)
    }
]

- Skips images downloaded within the last IMAGES_EXPIRES days; deduplication is keyed on the URL, not the checksum
- Creates thumbnails if IMAGES_THUMBS is configured
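Because the images field is filled in before later pipelines run, a pipeline registered with a higher priority number can consume the stored paths. A small sketch with a hypothetical LogImagePathsPipeline in pipelines.py:

import os

class LogImagePathsPipeline:
    """Runs after ImagesPipeline and logs where each image landed on disk."""

    def process_item(self, item, spider):
        store = spider.settings.get('IMAGES_STORE', 'images')
        # Each entry in item['images'] has 'url', 'path', and 'checksum' keys
        for result in item.get('images', []):
            spider.logger.info('Saved %s -> %s', result['url'],
                               os.path.join(store, result['path']))
        return item

Register it in ITEM_PIPELINES with a number greater than the ImagesPipeline's (for example 300) so it runs afterwards.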
Advanced Configuration
Custom Image Pipeline
Create a custom pipeline for additional processing:
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class CustomImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Filter URLs before downloading
        for image_url in item['image_urls']:
            if self.should_download(image_url):
                yield Request(image_url)

    def should_download(self, url):
        # Custom filtering logic
        return url.endswith(('.jpg', '.jpeg', '.png'))

    def file_path(self, request, response=None, info=None, *, item=None):
        # Custom filename generation; Scrapy >= 2.4 passes the item directly
        # NOTE: sanitize the title if it can contain '/' or other unsafe characters
        title = item.get('title') or 'untitled'
        image_name = request.url.split('/')[-1]
        return f"{title}/{image_name}"
Register the custom pipeline in settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.CustomImagesPipeline': 1,
}
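The pipeline also exposes an item_completed hook; the Scrapy documentation uses it to drop items whose downloads all failed. A sketch of that pattern (the class name StrictImagesPipeline is hypothetical):

from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class StrictImagesPipeline(ImagesPipeline):

    def item_completed(self, results, item, info):
        # results is a list of (success, result) tuples, one per requested URL;
        # result is a metadata dict on success and a Failure otherwise
        successes = [res for ok, res in results if ok]
        if not successes:
            raise DropItem('No images could be downloaded for this item')
        item['images'] = successes
        return item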
Troubleshooting Common Issues
Images Not Downloading
- Verify ITEM_PIPELINES is configured correctly; the crawl's startup log should list ImagesPipeline under "Enabled item pipelines"
- Confirm Pillow is installed (pip install Pillow); the pipeline refuses to start without it
- Check that the IMAGES_STORE directory is writable
- Ensure image_urls contains a list of absolute URLs
Relative URL Issues
# Convert relative URLs to absolute
absolute_urls = [response.urljoin(url) for url in relative_urls]
Memory Issues with Large Images
# Add minimum size limits in settings.py
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100
# Note: there is no built-in maximum image size; enforce one in a custom pipeline if needed
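Scrapy does ship global response-size caps that apply to every download, images included, which can stand in for a per-image maximum:

# settings.py -- abort oversized responses before they consume memory
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # hard cap: cancel downloads over 10 MB
DOWNLOAD_WARNSIZE = 5 * 1024 * 1024   # log a warning for downloads over 5 MB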
SSL Certificate Errors
# In settings.py -- force a specific TLS version for servers with
# broken negotiation (test carefully before relying on this in production)
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'
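For diagnosing handshake failures, Scrapy also provides a setting that logs TLS parameters after each HTTPS connection is established:

# settings.py -- verbose TLS logging for debugging
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = True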
The ImagesPipeline is ideal for bulk image downloading with automatic deduplication and thumbnail generation, making it perfect for gallery scraping, product image collection, and content archival projects.