How do I scrape data from multiple pages in Scrapy?

Scraping data from multiple pages is one of Scrapy's core strengths. Whether you're dealing with pagination, following links, or crawling entire websites, Scrapy provides several powerful mechanisms to handle multi-page data extraction efficiently. This guide covers the most effective approaches for scraping data across multiple pages.

Understanding Multi-Page Scraping in Scrapy

Multi-page scraping involves navigating through multiple URLs to extract data. This can include:

  • Pagination: Moving through numbered pages of results
  • Link following: Discovering and following links on pages
  • URL generation: Creating URLs programmatically
  • Sitemap crawling: Using XML sitemaps to discover pages (a brief sketch follows this list)
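
Sitemap crawling is not covered by the methods below, so here is a minimal sketch using Scrapy's built-in SitemapSpider. The sitemap URL, the /products/ path filter, and the selectors are placeholder assumptions.

from scrapy.spiders import SitemapSpider

class SitemapProductSpider(SitemapSpider):
    name = 'sitemap_products'
    # Placeholder sitemap location; many sites also list it in robots.txt
    sitemap_urls = ['https://example.com/sitemap.xml']
    # Route only URLs containing /products/ to parse_product
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }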

Method 1: Following Links with response.follow()

The most common approach is to follow links found on pages using Scrapy's response.follow() method:

import scrapy

class MultiPageSpider(scrapy.Spider):
    name = 'multipage'
    start_urls = ['https://example.com/page1']

    def parse(self, response):
        # Extract data from current page
        for item in response.css('.item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get(),
                'url': response.url
            }

        # Follow pagination links
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

        # Follow detail page links
        for detail_link in response.css('.item a::attr(href)').getall():
            yield response.follow(detail_link, self.parse_detail)

    def parse_detail(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'description': response.css('.description::text').getall(),
            'images': response.css('.gallery img::attr(src)').getall(),
        }
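
On Scrapy 2.0 or later, response.follow_all() can condense the link loops above into single calls. A brief sketch, reusing the same hypothetical selectors:

import scrapy

class FollowAllSpider(scrapy.Spider):
    name = 'follow_all_example'
    start_urls = ['https://example.com/page1']

    def parse(self, response):
        # One call yields a Request for every matched detail link
        yield from response.follow_all(css='.item a', callback=self.parse_detail)
        # Pagination links work the same way
        yield from response.follow_all(css='a.next-page', callback=self.parse)

    def parse_detail(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }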

Method 2: Handling Pagination with URL Generation

For numbered pagination, you can generate URLs programmatically:

import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated'

    def start_requests(self):
        base_url = 'https://example.com/products?page={}'
        for page_num in range(1, 101):  # Pages 1-100
            yield scrapy.Request(
                url=base_url.format(page_num),
                callback=self.parse,
                meta={'page': page_num}
            )

    def parse(self, response):
        page_num = response.meta['page']

        # Check if page has content
        items = response.css('.product')
        if not items:
            self.logger.info(f'No items found on page {page_num}')
            return

        for item in items:
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
                'page': page_num
            }
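
If the total page count is unknown, a variation of this pattern requests the next page only while the current one still returns items, so the spider stops on its own at the end of the listing. The URL template and '.product' selector are the same assumptions as above.

import scrapy

class IncrementalPaginationSpider(scrapy.Spider):
    name = 'incremental_pagination'
    base_url = 'https://example.com/products?page={}'

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(1), callback=self.parse, meta={'page': 1})

    def parse(self, response):
        page_num = response.meta['page']
        items = response.css('.product')
        if not items:
            return  # an empty page means we have run out of results

        for item in items:
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
                'page': page_num,
            }

        # Only request the next page after the current one produced items
        yield scrapy.Request(
            self.base_url.format(page_num + 1),
            callback=self.parse,
            meta={'page': page_num + 1},
        )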

Method 3: Advanced Link Following with Rules

For complex crawling patterns, use CrawlSpider with rules:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AdvancedCrawlSpider(CrawlSpider):
    name = 'advanced_crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = (
        # Follow pagination links
        Rule(
            LinkExtractor(restrict_css='.pagination a'),
            callback='parse_listing',
            follow=True
        ),
        # Follow category links
        Rule(
            LinkExtractor(restrict_css='.categories a'),
            callback='parse_listing',
            follow=True
        ),
        # Follow product detail links
        Rule(
            LinkExtractor(restrict_css='.product-link'),
            callback='parse_item',
            follow=False
        ),
    )

    def parse_listing(self, response):
        # Extract items from listing pages
        for item in response.css('.product-summary'):
            yield {
                'title': item.css('.title::text').get(),
                'summary': item.css('.summary::text').get(),
                'listing_url': response.url
            }

    def parse_item(self, response):
        # Extract detailed item information
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').getall(),
            'specifications': {
                spec.css('.label::text').get(): spec.css('.value::text').get()
                for spec in response.css('.spec-row')
            }
        }
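
LinkExtractor also accepts allow and deny regular expressions, which helps keep a CrawlSpider focused on the relevant sections of a site. A short sketch with illustrative URL patterns:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FocusedCrawlSpider(CrawlSpider):
    name = 'focused_crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = (
        Rule(
            LinkExtractor(
                allow=r'/products/',           # only follow product URLs
                deny=(r'/login', r'/cart'),    # skip account and cart pages
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }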

Method 4: Dynamic Pagination Detection

For websites with dynamic pagination, implement smart detection:

import scrapy
import re

class SmartPaginationSpider(scrapy.Spider):
    name = 'smart_pagination'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Extract items from current page
        items = response.css('.product')

        for item in items:
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

        # Smart pagination detection (valid CSS selectors only)
        pagination_selectors = [
            'a[rel="next"]',           # Standard next link
            '.pagination .next',       # Common pagination class
            '.pager-next a',           # Drupal-style pagination
        ]

        next_url = None
        for selector in pagination_selectors:
            next_url = response.css(f'{selector}::attr(href)').get()
            if next_url:
                break

        # Text-based "Next" links need XPath; :contains() is not valid CSS in Scrapy
        if not next_url:
            next_url = response.xpath('//a[contains(normalize-space(.), "Next")]/@href').get()

        # Alternative: Extract from JavaScript
        if not next_url:
            js_next = re.search(r'nextPageUrl["\']:\s*["\']([^"\']+)', response.text)
            if js_next:
                next_url = js_next.group(1)

        if next_url and len(items) > 0:  # Only follow if current page has items
            yield response.follow(next_url, self.parse)

Method 5: Handling AJAX Pagination

For AJAX-loaded content, you can make direct API calls:

import scrapy
import json

class AjaxPaginationSpider(scrapy.Spider):
    name = 'ajax_pagination'
    start_urls = ['https://example.com/api/products?page=1']

    def parse(self, response):
        data = json.loads(response.text)

        # Extract items from API response
        for item in data.get('products', []):
            yield {
                'id': item.get('id'),
                'name': item.get('name'),
                'price': item.get('price'),
            }

        # Check for next page
        current_page = data.get('current_page', 1)
        total_pages = data.get('total_pages', 1)

        if current_page < total_pages:
            next_page_url = f'https://example.com/api/products?page={current_page + 1}'
            yield scrapy.Request(next_page_url, callback=self.parse)
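
Some APIs expect a POST with a JSON body instead of query-string pagination. Scrapy's JsonRequest serializes the payload and sets the Content-Type header for you; the endpoint, payload fields, and has_more flag below are assumptions about such an API.

import scrapy
from scrapy.http import JsonRequest

class JsonApiSpider(scrapy.Spider):
    name = 'json_api'
    api_url = 'https://example.com/api/products/search'

    def start_requests(self):
        # JsonRequest encodes `data` as the JSON request body (and defaults to POST)
        yield JsonRequest(self.api_url, data={'page': 1, 'per_page': 50},
                          callback=self.parse, meta={'page': 1})

    def parse(self, response):
        data = response.json()  # available since Scrapy 2.2

        for item in data.get('products', []):
            yield {'id': item.get('id'), 'name': item.get('name')}

        page = response.meta['page']
        if data.get('has_more'):
            yield JsonRequest(self.api_url, data={'page': page + 1, 'per_page': 50},
                              callback=self.parse, meta={'page': page + 1})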

Best Practices for Multi-Page Scraping

1. Implement Proper Rate Limiting

# In settings.py
DOWNLOAD_DELAY = 1  # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Wait 0.5 * to 1.5 * DOWNLOAD_DELAY (this setting is a boolean)
CONCURRENT_REQUESTS = 16  # Number of concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Per domain limit
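
For long crawls, Scrapy's AutoThrottle extension can adjust the delay dynamically based on server latency instead of relying on a fixed value; the numbers below are a reasonable starting point, not requirements.

# In settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # Initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # Cap for slow responses
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # Average concurrent requests per remote server
AUTOTHROTTLE_DEBUG = False             # Set True to log each throttling decision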

2. Handle Duplicate URLs

# In settings.py (RFPDupeFilter is already Scrapy's default)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Or use a custom duplicate filter by subclassing BaseDupeFilter and
# pointing DUPEFILTER_CLASS at it (e.g. 'myproject.dupefilters.CustomDupeFilter')
from scrapy.dupefilters import BaseDupeFilter

class CustomDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.seen_urls = set()

    def request_seen(self, request):
        # Return True to drop the request as a duplicate
        if request.url in self.seen_urls:
            return True
        self.seen_urls.add(request.url)
        return False

3. Implement Robust Error Handling

import scrapy

class RobustMultiPageSpider(scrapy.Spider):
    name = 'robust_multipage'
    start_urls = ['https://example.com/items']
    # By default, HttpErrorMiddleware drops non-2xx responses before they
    # reach parse(); list the status codes you want to handle yourself
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        # Check for valid response
        if response.status != 200:
            self.logger.warning(f'Non-200 response: {response.status} for {response.url}')
            return

        # Check for content
        if not response.css('.content'):
            self.logger.warning(f'No content found on {response.url}')
            return

        # Extract data with error handling
        try:
            for item in response.css('.item'):
                title = item.css('.title::text').get()
                if title:  # Only yield if we have essential data
                    yield {
                        'title': title.strip(),
                        'price': item.css('.price::text').get(),
                        'url': response.url
                    }
        except Exception as e:
            self.logger.error(f'Error parsing {response.url}: {e}')

        # Follow next page with retry logic
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
                dont_filter=False,
                errback=self.handle_error
            )

    def handle_error(self, failure):
        self.logger.error(f'Request failed: {failure}')
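
The errback above pairs well with Scrapy's built-in retry middleware, which re-schedules requests that fail with network errors or retryable status codes. The status-code list below matches Scrapy's default; the retry count is a tuning choice.

# In settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3                                               # Extra attempts per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]   # Statuses worth retrying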

Advanced Techniques

1. Using Meta Data for State Management

def parse(self, response):
    # Pass data between requests
    category = response.meta.get('category', 'unknown')
    page_num = response.meta.get('page', 1)

    for item in response.css('.item'):
        yield {
            'title': item.css('.title::text').get(),
            'category': category,
            'page': page_num
        }

    # Pass meta data to next request
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse,
            meta={
                'category': category,
                'page': page_num + 1
            }
        )
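
On Scrapy 2.0+, cb_kwargs is often cleaner than meta for passing state, because the values arrive as ordinary keyword arguments of the callback. The same example rewritten as a sketch:

def parse(self, response, category='unknown', page=1):
    for item in response.css('.item'):
        yield {
            'title': item.css('.title::text').get(),
            'category': category,
            'page': page,
        }

    next_page = response.css('.next::attr(href)').get()
    if next_page:
        # cb_kwargs values are passed to the callback as keyword arguments
        yield response.follow(
            next_page,
            callback=self.parse,
            cb_kwargs={'category': category, 'page': page + 1},
        )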

2. Parallel Processing with Multiple Start URLs

class ParallelMultiPageSpider(scrapy.Spider):
    name = 'parallel_multipage'

    def start_requests(self):
        categories = ['electronics', 'books', 'clothing', 'home']

        for category in categories:
            yield scrapy.Request(
                f'https://example.com/{category}',
                callback=self.parse_category,
                meta={'category': category}
            )

    def parse_category(self, response):
        category = response.meta['category']

        # Extract items
        for item in response.css('.product'):
            yield {
                'name': item.css('.name::text').get(),
                'category': category,
                'url': response.url
            }

        # Follow pagination within category
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                callback=self.parse_category,
                meta={'category': category}
            )

Integration with Other Tools

While Scrapy excels at multi-page scraping, you might also consider other tools for specific scenarios. For JavaScript-heavy sites, handling AJAX requests using Puppeteer can be more effective. Additionally, when dealing with complex navigation patterns, running multiple pages in parallel with Puppeteer offers excellent performance for browser automation tasks.

Monitoring and Optimization

1. Track Scraping Progress

# Custom extension for progress tracking
from scrapy import signals

class ProgressExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.pages_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Connect the handlers to Scrapy's signals so they actually fire
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    def response_received(self, response, request, spider):
        self.pages_scraped += 1
        if self.pages_scraped % 100 == 0:
            spider.logger.info(f'Scraped {self.pages_scraped} pages')
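
The extension only runs once it is registered in settings; the module path below is a hypothetical example and should match wherever the class is saved.

# In settings.py ('myproject.extensions' is a placeholder module path)
EXTENSIONS = {
    'myproject.extensions.ProgressExtension': 500,
}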

2. Memory Management

# In settings.py
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1024
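
For very large crawls it also helps to persist crawl state so a run can be paused and resumed, and to set hard stop conditions; the directory name and limits below are arbitrary examples.

# In settings.py
JOBDIR = 'crawls/multipage-run-1'  # Persist the request queue and dupefilter state to disk
CLOSESPIDER_PAGECOUNT = 10000      # Stop after this many responses
CLOSESPIDER_TIMEOUT = 3600         # ...or after one hour, whichever comes first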

Conclusion

Scraping data from multiple pages in Scrapy is highly flexible and can be accomplished through various methods depending on your specific requirements. Whether you're dealing with simple pagination, complex link structures, or AJAX-loaded content, Scrapy provides the tools necessary for efficient multi-page data extraction.

Key takeaways:

  • Use response.follow() for simple link following
  • Implement CrawlSpider with rules for complex crawling patterns
  • Generate URLs programmatically for predictable pagination
  • Always implement proper error handling and rate limiting
  • Monitor your scraping progress and optimize for performance

Remember to respect robots.txt files and implement appropriate delays to avoid overwhelming target servers. With these techniques, you can efficiently scrape data from websites with hundreds or thousands of pages while maintaining reliability and performance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
