How do I scrape images with Scrapy?

Scrapy provides a powerful built-in ImagesPipeline for downloading images automatically. This guide covers everything from basic setup to advanced image filtering and handling.

Step 1: Configure Scrapy Settings

First, enable the ImagesPipeline in your settings.py file. Note that the pipeline depends on the Pillow imaging library (pip install Pillow):

# Enable the Images Pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Configure image storage
IMAGES_STORE = 'images'  # Relative to the directory you run scrapy from
# IMAGES_STORE = '/absolute/path/to/images'  # Or absolute path

# Optional: Skip images below a minimum size and generate thumbnails
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Optional: Set expiry days (default is 90)
IMAGES_EXPIRES = 30
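
With this configuration, downloaded files land under IMAGES_STORE in a predictable layout: full-size images go in full/, and each configured thumbnail size gets its own thumbs/ subfolder, all named by the SHA1 hash of the source URL. Roughly:

images/
├── full/
│   └── 0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg
└── thumbs/
    ├── small/
    │   └── 0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg
    └── big/
        └── 0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg

IMAGES_STORE also accepts remote URIs such as s3://bucket/images (requires botocore) or gs://bucket/images, if you need cloud storage.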

Step 2: Define Your Item

Create an item class in items.py with required fields:

import scrapy

class ImageItem(scrapy.Item):
    # Required for ImagesPipeline
    image_urls = scrapy.Field()  # List of image URLs to download
    images = scrapy.Field()      # Pipeline populates this with metadata

    # Optional: Additional metadata
    title = scrapy.Field()
    alt_text = scrapy.Field()
    page_url = scrapy.Field()
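
The image_urls and images names are only defaults. If your items already use different field names, you can point the pipeline at them with two settings (the photo_urls and photos names below are just illustrative):

# settings.py: override the default field names used by ImagesPipeline
IMAGES_URLS_FIELD = 'photo_urls'
IMAGES_RESULT_FIELD = 'photos'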

Step 3: Create Your Spider

Build a spider that extracts image URLs and handles relative links:

import scrapy
from myproject.items import ImageItem

class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com/gallery']

    def parse(self, response):
        # Extract all image URLs
        img_urls = response.css('img::attr(src)').getall()

        # Convert relative URLs to absolute URLs
        absolute_urls = [response.urljoin(url) for url in img_urls]

        # Create and yield item
        item = ImageItem()
        item['image_urls'] = absolute_urls
        item['page_url'] = response.url

        yield item

        # Follow pagination links
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
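
Many galleries lazy-load images, leaving src pointing at a placeholder and the real URL in an attribute such as data-src. Here is a sketch of a more defensive extraction; data-src is an assumption, so inspect the target page for the actual attribute name:

def parse(self, response):
    # Prefer the lazy-load attribute, falling back to src
    raw_urls = [
        img.attrib.get('data-src') or img.attrib.get('src')
        for img in response.css('img')
    ]
    item = ImageItem()
    item['image_urls'] = [response.urljoin(u) for u in raw_urls if u]
    yield item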

Step 4: Advanced Spider with Filtering

For more control over which images to download:

import scrapy
from urllib.parse import urlparse
from myproject.items import ImageItem

class FilteredImageSpider(scrapy.Spider):
    name = 'filtered_image_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract images with additional metadata
        for img in response.css('img'):
            src = img.css('::attr(src)').get()
            alt = img.css('::attr(alt)').get()

            if self.is_valid_image(src):
                item = ImageItem()
                item['image_urls'] = [response.urljoin(src)]
                item['alt_text'] = alt or ''
                item['title'] = response.css('title::text').get()
                item['page_url'] = response.url

                yield item

    def is_valid_image(self, url):
        """Filter images by extension and size indicators"""
        if not url:
            return False

        # Check file extension
        parsed = urlparse(url.lower())
        valid_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp']

        if not any(parsed.path.endswith(ext) for ext in valid_extensions):
            return False

        # Skip thumbnails and small images based on filename
        skip_patterns = ['thumb', 'icon', 'logo', 'avatar']
        if any(pattern in url.lower() for pattern in skip_patterns):
            return False

        return True
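
Responsive pages often list additional candidates in a srcset attribute. This sketch extends Step 4's loop to pick the last srcset candidate; candidates are often listed smallest to largest by convention, but that is not guaranteed:

# Inside the for-img loop: also consider srcset candidates
srcset = img.attrib.get('srcset', '')
if srcset:
    candidates = [c.strip().split()[0] for c in srcset.split(',') if c.strip()]
    if candidates and self.is_valid_image(candidates[-1]):
        item = ImageItem()
        item['image_urls'] = [response.urljoin(candidates[-1])]
        yield item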

Step 5: Run Your Spider

Execute the spider from your project directory:

# Basic run
scrapy crawl image_spider

# Run with custom settings
scrapy crawl image_spider -s IMAGES_STORE=/custom/path

# Save items to JSON file
scrapy crawl image_spider -o images.json
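
Note that -o appends to an existing output file; since Scrapy 2.0 a capital -O overwrites it instead:

# Overwrite instead of append (Scrapy 2.0+)
scrapy crawl image_spider -O images.json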

Understanding Image Pipeline Behavior

The ImagesPipeline automatically:

  1. Downloads images from URLs in the image_urls field
  2. Generates unique filenames using SHA1 hash of the image URL
  3. Populates the images field with metadata:
   item['images'] = [
       {
           'url': 'https://example.com/image.jpg',
           'path': 'full/0a79c461a4062ac4dc05c8e7d6d78a2e4e5a7d4b.jpg',
           'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'  # MD5 of the image content
       }
   ]
  4. Skips re-downloading images fetched within the last IMAGES_EXPIRES days
  5. Creates thumbnails if configured
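
A minimal sketch of a downstream pipeline that reads this metadata; the LogImagePathsPipeline name is made up, and it assumes registration after the ImagesPipeline (priority greater than 1) in ITEM_PIPELINES:

class LogImagePathsPipeline:
    """Runs after ImagesPipeline and logs where each image was stored."""

    def process_item(self, item, spider):
        for image in item.get('images', []):
            spider.logger.info('Saved %s as %s', image['url'], image['path'])
        return item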

Advanced Configuration

Custom Image Pipeline

Create a custom pipeline for additional processing:

from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request

class CustomImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Filter URLs before downloading
        for image_url in item['image_urls']:
            if self.should_download(image_url):
                yield Request(image_url)

    def should_download(self, url):
        # Custom filtering logic
        return url.endswith(('.jpg', '.jpeg', '.png'))

    def file_path(self, request, response=None, info=None, *, item=None):
        # Custom filename generation (Scrapy 2.4+ passes the item in directly)
        image_name = request.url.split('/')[-1]
        folder = (item.get('title') or 'untitled').replace('/', '_')
        return f"{folder}/{image_name}"

Register the custom pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.CustomImagesPipeline': 1,
}
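
Another useful override is item_completed, which runs once all downloads for an item finish. The sketch below (StrictImagesPipeline is a hypothetical name) drops items whose images all failed; the image_paths field is an assumption and would need to be declared on ImageItem first:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class StrictImagesPipeline(ImagesPipeline):

    def item_completed(self, results, item, info):
        # results is a list of (success, result_or_failure) tuples
        image_paths = [result['path'] for ok, result in results if ok]
        if not image_paths:
            raise DropItem('No images were downloaded for this item')
        ItemAdapter(item)['image_paths'] = image_paths  # field must exist on the item
        return item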

Troubleshooting Common Issues

Images Not Downloading

  • Verify ITEM_PIPELINES is correctly configured
  • Confirm Pillow is installed (pip install Pillow); the ImagesPipeline will not load without it
  • Check that the IMAGES_STORE directory is writable
  • Ensure image_urls contains a list of absolute URLs

Relative URL Issues

# Convert relative URLs to absolute
absolute_urls = [response.urljoin(url) for url in relative_urls]

Memory Issues with Large Images

# Add size limits in settings.py
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100
# Note: there is no IMAGES_MAX_* setting; cap sizes at the downloader level, as shown below
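
There is no pipeline-level size cap, but the downloader itself can refuse oversized responses. DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE are standard Scrapy settings; the 10 MB figure below is just an illustrative choice:

# settings.py: cap response sizes at the downloader level (applies to every request)
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # abort responses larger than 10 MB
DOWNLOAD_WARNSIZE = 5 * 1024 * 1024   # log a warning for responses above 5 MB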

SSL Certificate Errors

# In settings.py: pin the TLS version as a workaround for handshake failures
# (diagnose the underlying certificate problem before relying on this in production)
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'

With automatic deduplication and thumbnail generation built in, the ImagesPipeline is well suited to bulk downloading tasks such as gallery scraping, product image collection, and content archival.
