Scrapy provides a powerful built-in ImagesPipeline for downloading images automatically. This guide covers everything from basic setup to advanced image filtering and handling.
Step 1: Configure Scrapy Settings
First, install Pillow, which the ImagesPipeline requires (pip install Pillow), then enable the pipeline in your settings.py file:
# Enable the Images Pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Configure image storage
IMAGES_STORE = 'images'  # Relative to project root
# IMAGES_STORE = '/absolute/path/to/images'  # Or absolute path
# Optional: Skip images smaller than these dimensions and generate thumbnails
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# Optional: Set expiry days (default is 90)
IMAGES_EXPIRES = 30
Step 2: Define Your Item
Create an item class in items.py with required fields:
import scrapy
class ImageItem(scrapy.Item):
    # Required for ImagesPipeline
    image_urls = scrapy.Field()  # List of image URLs to download
    images = scrapy.Field()      # Pipeline populates this with metadata
    # Optional: Additional metadata
    title = scrapy.Field()
    alt_text = scrapy.Field()
    page_url = scrapy.Field()
Step 3: Create Your Spider
Build a spider that extracts image URLs and handles relative links:
import scrapy
from myproject.items import ImageItem
class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com/gallery']
    def parse(self, response):
        # Extract all image URLs
        img_urls = response.css('img::attr(src)').getall()
        # Convert relative URLs to absolute URLs
        absolute_urls = [response.urljoin(url) for url in img_urls]
        # Create and yield item
        item = ImageItem()
        item['image_urls'] = absolute_urls
        item['page_url'] = response.url
        yield item
        # Follow pagination links
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Step 4: Advanced Spider with Filtering
For more control over which images to download:
import scrapy
from urllib.parse import urlparse
from myproject.items import ImageItem
class FilteredImageSpider(scrapy.Spider):
    name = 'filtered_image_spider'
    start_urls = ['https://example.com']
    def parse(self, response):
        # Extract images with additional metadata
        for img in response.css('img'):
            src = img.css('::attr(src)').get()
            alt = img.css('::attr(alt)').get()
            if self.is_valid_image(src):
                item = ImageItem()
                item['image_urls'] = [response.urljoin(src)]
                item['alt_text'] = alt or ''
                item['title'] = response.css('title::text').get()
                item['page_url'] = response.url
                yield item
    def is_valid_image(self, url):
        """Filter images by extension and size indicators"""
        if not url:
            return False
        # Check file extension
        parsed = urlparse(url.lower())
        valid_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp']
        if not any(parsed.path.endswith(ext) for ext in valid_extensions):
            return False
        # Skip thumbnails and small images based on filename
        skip_patterns = ['thumb', 'icon', 'logo', 'avatar']
        if any(pattern in url.lower() for pattern in skip_patterns):
            return False
        return True
Step 5: Run Your Spider
Execute the spider from your project directory:
# Basic run
scrapy crawl image_spider
# Run with custom settings
scrapy crawl image_spider -s IMAGES_STORE=/custom/path
# Save items to JSON file
scrapy crawl image_spider -o images.json
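After a successful run, the downloaded files appear under IMAGES_STORE. With the thumbnail settings from Step 1, the layout looks roughly like this (the hash-based filenames are illustrative):
images/
├── full/
│   └── 0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg
└── thumbs/
    ├── small/
    │   └── 0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg
    └── big/
        └── 0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg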
Understanding Image Pipeline Behavior
The ImagesPipeline automatically:
- Downloads images from URLs in the image_urls field
- Generates unique filenames using a SHA-1 hash of the image URL
- Populates the images field with metadata:
   item['images'] = [
       {
           'url': 'https://example.com/image.jpg',
           'path': 'full/0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg',
            'checksum': 'b0974ea6c88740bed1106f4d5cbf2e8b'  # MD5 hash of the downloaded image content
       }
   ]
- Avoids duplicate downloads: repeated URLs are deduplicated within a run, and images fetched within the last IMAGES_EXPIRES days are not re-downloaded
- Creates thumbnails if IMAGES_THUMBS is configured
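Downstream pipelines can use this metadata once the downloads finish. Here is a minimal sketch of a hypothetical follow-up pipeline, ImagePathLoggerPipeline, that logs where each image was written (register it in ITEM_PIPELINES with a higher priority number than the ImagesPipeline, e.g. 2):
import os

class ImagePathLoggerPipeline:
    """Hypothetical pipeline that runs after the ImagesPipeline and logs saved file locations."""

    def __init__(self, images_store):
        self.images_store = images_store

    @classmethod
    def from_crawler(cls, crawler):
        # Read IMAGES_STORE from the project settings
        return cls(crawler.settings.get('IMAGES_STORE'))

    def process_item(self, item, spider):
        for image in item.get('images', []):
            local_path = os.path.join(self.images_store, image['path'])
            spider.logger.info('Saved %s to %s', image['url'], local_path)
        return item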
Advanced Configuration
Custom Image Pipeline
Create a custom pipeline for additional processing:
from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request

class CustomImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Filter URLs before downloading
        for image_url in item['image_urls']:
            if self.should_download(image_url):
                yield Request(image_url)

    def should_download(self, url):
        # Custom filtering logic
        return url.lower().endswith(('.jpg', '.jpeg', '.png'))

    def file_path(self, request, response=None, info=None, *, item=None):
        # Custom filename generation; Scrapy 2.4+ passes the item directly.
        # Sanitize the title before using it as a directory name in real code.
        title = (item.get('title') or 'untitled') if item else 'untitled'
        image_name = request.url.split('/')[-1]
        return f"{title}/{image_name}"
Register the custom pipeline in settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.CustomImagesPipeline': 1,
}
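Another common override is item_completed, which receives the download results after all requests for an item have finished. Below is a minimal sketch (the StrictImagesPipeline name is illustrative) that drops items for which no image could be downloaded:
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class StrictImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, result) tuples, one per requested URL
        successful = [result for ok, result in results if ok]
        if not successful:
            raise DropItem(f"No images downloaded for {item.get('page_url')}")
        item['images'] = successful
        return item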
Troubleshooting Common Issues
Images Not Downloading
- Verify ITEM_PIPELINES is correctly configured
- Confirm Pillow is installed (pip install Pillow); the ImagesPipeline will not load without it
- Check that the IMAGES_STORE directory is writable
- Ensure image_urls contains a list of absolute URLs
Relative URL Issues
# Convert relative URLs to absolute
absolute_urls = [response.urljoin(url) for url in relative_urls]
Memory Issues with Large Images
# Add size limits in settings.py
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100
# Note: there is no built-in maximum-dimension check; DOWNLOAD_MAXSIZE caps the
# response size in bytes, or enforce dimension limits in a custom pipeline (see the sketch below)
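If you need a hard cap on image dimensions, one option is to override get_images, which is where the built-in minimum-size check happens. A rough sketch, assuming Scrapy 2.4 or later (the MaxSizeImagesPipeline name and the 4000-pixel limits are illustrative, not Scrapy settings):
from scrapy.pipelines.images import ImagesPipeline, ImageException

class MaxSizeImagesPipeline(ImagesPipeline):
    MAX_WIDTH = 4000    # illustrative limits
    MAX_HEIGHT = 4000

    def get_images(self, response, request, info, *, item=None):
        # Let the parent apply the IMAGES_MIN_* checks and build thumbnails,
        # then reject anything that exceeds our maximums.
        for path, image, buf in super().get_images(response, request, info, item=item):
            width, height = image.size
            if width > self.MAX_WIDTH or height > self.MAX_HEIGHT:
                raise ImageException(f'Image too big ({width}x{height}): {request.url}')
            yield path, image, buf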
SSL Certificate Errors
# In settings.py - not recommended for production
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'
The ImagesPipeline is ideal for bulk image downloading with automatic deduplication and thumbnail generation, making it perfect for gallery scraping, product image collection, and content archival projects.