How do I scrape data from multiple pages in Scrapy?
Scraping data from multiple pages is one of Scrapy's core strengths. Whether you're dealing with pagination, following links, or crawling entire websites, Scrapy provides several powerful mechanisms to handle multi-page data extraction efficiently. This guide covers the most effective approaches for scraping data across multiple pages.
Understanding Multi-Page Scraping in Scrapy
Multi-page scraping involves navigating through multiple URLs to extract data. This can include:
- Pagination: Moving through numbered pages of results
- Link following: Discovering and following links on pages
- URL generation: Creating URLs programmatically
- Sitemap crawling: Using XML sitemaps to discover pages (see the SitemapSpider sketch after this list)
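Scrapy ships with a built-in SitemapSpider for the last case. A minimal sketch, assuming the target site exposes a sitemap at https://example.com/sitemap.xml and that product URLs contain /products/ (both placeholders):

from scrapy.spiders import SitemapSpider

class ExampleSitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    # Placeholder sitemap URL; point this at the real sitemap of the target site
    sitemap_urls = ['https://example.com/sitemap.xml']
    # Route only URLs containing '/products/' to parse_product
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }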
Method 1: Following Links with response.follow()
The most common approach is to follow links found on pages using Scrapy's response.follow() method:
import scrapy

class MultiPageSpider(scrapy.Spider):
    name = 'multipage'
    start_urls = ['https://example.com/page1']

    def parse(self, response):
        # Extract data from the current page
        for item in response.css('.item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get(),
                'url': response.url
            }

        # Follow pagination links
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

        # Follow detail page links
        for detail_link in response.css('.item a::attr(href)').getall():
            yield response.follow(detail_link, self.parse_detail)

    def parse_detail(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'description': response.css('.description::text').getall(),
            'images': response.css('.gallery img::attr(src)').getall(),
        }
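On Scrapy 2.0 and newer, response.follow_all() can replace the explicit loops above; the sketch below yields the same requests from the same (assumed) selectors:

def parse(self, response):
    # Same item extraction as above, then follow everything in two calls
    yield from response.follow_all(css='.item a::attr(href)', callback=self.parse_detail)
    yield from response.follow_all(css='a.next-page::attr(href)', callback=self.parse)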
Method 2: Handling Pagination with URL Generation
For numbered pagination, you can generate URLs programmatically:
import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated'

    def start_requests(self):
        base_url = 'https://example.com/products?page={}'
        for page_num in range(1, 101):  # Pages 1-100
            yield scrapy.Request(
                url=base_url.format(page_num),
                callback=self.parse,
                meta={'page': page_num}
            )

    def parse(self, response):
        page_num = response.meta['page']

        # Check if the page has content
        items = response.css('.product')
        if not items:
            self.logger.info(f'No items found on page {page_num}')
            return

        for item in items:
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
                'page': page_num
            }
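If the total number of pages is not known ahead of time, a common variant reads the page count from the first response and queues the rest from there. A minimal sketch, assuming a hypothetical '.total-pages' element on the listing page:

import scrapy

class UnknownLengthPaginationSpider(scrapy.Spider):
    name = 'unknown_length_pagination'
    start_urls = ['https://example.com/products?page=1']

    def parse(self, response):
        for item in response.css('.product'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

        # On the first page only, read the advertised page count and queue the rest
        if response.meta.get('page', 1) == 1:
            total = int(response.css('.total-pages::text').get(default='1'))
            for page_num in range(2, total + 1):
                yield scrapy.Request(
                    f'https://example.com/products?page={page_num}',
                    callback=self.parse,
                    meta={'page': page_num},
                )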
Method 3: Advanced Link Following with Rules
For complex crawling patterns, use CrawlSpider with rules:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AdvancedCrawlSpider(CrawlSpider):
    name = 'advanced_crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = (
        # Follow pagination links
        Rule(
            LinkExtractor(restrict_css='.pagination a'),
            callback='parse_listing',
            follow=True
        ),
        # Follow category links
        Rule(
            LinkExtractor(restrict_css='.categories a'),
            callback='parse_listing',
            follow=True
        ),
        # Follow product detail links
        Rule(
            LinkExtractor(restrict_css='.product-link'),
            callback='parse_item',
            follow=False
        ),
    )

    def parse_listing(self, response):
        # Extract items from listing pages
        for item in response.css('.product-summary'):
            yield {
                'title': item.css('.title::text').get(),
                'summary': item.css('.summary::text').get(),
                'listing_url': response.url
            }

    def parse_item(self, response):
        # Extract detailed item information
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').getall(),
            'specifications': {
                spec.css('.label::text').get(): spec.css('.value::text').get()
                for spec in response.css('.spec-row')
            }
        }
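LinkExtractor can also filter by URL pattern instead of page location, which tends to be more robust when the markup changes. A sketch of an alternative rules tuple, assuming hypothetical /category/ and /product/<id> URL schemes:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Crawl listing pages whose URLs contain /category/ and keep following links
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_listing', follow=True),
    # Parse product pages, but skip account and cart URLs entirely
    Rule(
        LinkExtractor(allow=r'/product/\d+', deny=(r'/account/', r'/cart/')),
        callback='parse_item',
    ),
)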
Method 4: Dynamic Pagination Detection
For websites with dynamic pagination, implement smart detection:
import scrapy
import re

class SmartPaginationSpider(scrapy.Spider):
    name = 'smart_pagination'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Extract items from the current page
        items = response.css('.product')
        for item in items:
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

        # Smart pagination detection: try several common selectors in order
        pagination_selectors = [
            'a[rel="next"]',        # Standard next link
            '.pagination .next',    # Common pagination class
            'a:contains("Next")',   # Text-based next link
            '.pager-next a',        # Drupal-style pagination
        ]

        next_url = None
        for selector in pagination_selectors:
            next_url = response.css(f'{selector}::attr(href)').get()
            if next_url:
                break

        # Alternative: extract the next URL from embedded JavaScript
        if not next_url:
            js_next = re.search(r'nextPageUrl["\']:\s*["\']([^"\']+)', response.text)
            if js_next:
                next_url = js_next.group(1)

        if next_url and len(items) > 0:  # Only follow if the current page has items
            yield response.follow(next_url, self.parse)
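The a:contains("Next") entry relies on the non-standard :contains() pseudo-class, which not every CSS engine accepts; an XPath fallback avoids that dependency. A small sketch that could slot into the parse() method above, after the selector loop:

# XPath fallback for text-based "Next" links
if not next_url:
    next_url = response.xpath(
        '//a[contains(normalize-space(.), "Next")]/@href'
    ).get()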
Method 5: Handling AJAX Pagination
For AJAX-loaded content, you can make direct API calls:
import scrapy
import json

class AjaxPaginationSpider(scrapy.Spider):
    name = 'ajax_pagination'
    start_urls = ['https://example.com/api/products?page=1']

    def parse(self, response):
        data = json.loads(response.text)

        # Extract items from the API response
        for item in data.get('products', []):
            yield {
                'id': item.get('id'),
                'name': item.get('name'),
                'price': item.get('price'),
            }

        # Check for the next page
        current_page = data.get('current_page', 1)
        total_pages = data.get('total_pages', 1)
        if current_page < total_pages:
            next_page_url = f'https://example.com/api/products?page={current_page + 1}'
            yield scrapy.Request(next_page_url, callback=self.parse)
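On Scrapy 2.2 and newer, response.json() can replace json.loads(response.text). Some APIs also expect AJAX-style headers; the sketch below assumes a hypothetical X-Requested-With requirement:

import scrapy

class AjaxApiSpider(scrapy.Spider):
    name = 'ajax_api'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/api/products?page=1',
            callback=self.parse,
            # Assumption: the API only answers requests that look like AJAX calls
            headers={'X-Requested-With': 'XMLHttpRequest'},
        )

    def parse(self, response):
        data = response.json()  # Scrapy 2.2+ shortcut for json.loads(response.text)
        for item in data.get('products', []):
            yield {'id': item.get('id'), 'name': item.get('name')}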
Best Practices for Multi-Page Scraping
1. Implement Proper Rate Limiting
# In settings.py
DOWNLOAD_DELAY = 1                    # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # Randomize each delay to 0.5x-1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 16              # Global limit on concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # Per-domain limit
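Scrapy's AutoThrottle extension can adjust delays automatically based on observed server latency; a reasonable starting point in settings.py might be:

# In settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # Initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # Cap on the delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # Average parallel requests per remote server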
2. Handle Duplicate URLs
# In settings.py (RFPDupeFilter is already Scrapy's default)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Or use custom duplicate filtering by subclassing BaseDupeFilter
from scrapy.dupefilters import BaseDupeFilter

class CustomDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.seen_urls = set()

    def request_seen(self, request):
        if request.url in self.seen_urls:
            return True
        self.seen_urls.add(request.url)
        return False
3. Implement Robust Error Handling
import scrapy

class RobustMultiPageSpider(scrapy.Spider):
    name = 'robust_multipage'

    def parse(self, response):
        # Check for a valid response
        if response.status != 200:
            self.logger.warning(f'Non-200 response: {response.status} for {response.url}')
            return

        # Check for content
        if not response.css('.content'):
            self.logger.warning(f'No content found on {response.url}')
            return

        # Extract data with error handling
        try:
            for item in response.css('.item'):
                title = item.css('.title::text').get()
                if title:  # Only yield if we have essential data
                    yield {
                        'title': title.strip(),
                        'price': item.css('.price::text').get(),
                        'url': response.url
                    }
        except Exception as e:
            self.logger.error(f'Error parsing {response.url}: {e}')

        # Follow the next page, with an errback for failed requests
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
                dont_filter=False,
                errback=self.handle_error
            )

    def handle_error(self, failure):
        self.logger.error(f'Request failed: {failure}')
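Alongside errbacks, Scrapy's built-in RetryMiddleware handles transient failures. A sketch of a settings.py configuration (the values are suggestions, not Scrapy requirements):

# In settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3   # Retries in addition to the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]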
Advanced Techniques
1. Using Meta Data for State Management
def parse(self, response):
    # Pass data between requests via response.meta
    category = response.meta.get('category', 'unknown')
    page_num = response.meta.get('page', 1)

    for item in response.css('.item'):
        yield {
            'title': item.css('.title::text').get(),
            'category': category,
            'page': page_num
        }

    # Pass meta data on to the next request
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse,
            meta={
                'category': category,
                'page': page_num + 1
            }
        )
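Since Scrapy 1.7, cb_kwargs is generally preferred over response.meta for passing scraping state, because the values arrive as plain keyword arguments. The same flow rewritten as a sketch:

def parse(self, response, category='unknown', page_num=1):
    for item in response.css('.item'):
        yield {
            'title': item.css('.title::text').get(),
            'category': category,
            'page': page_num
        }

    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse,
            cb_kwargs={'category': category, 'page_num': page_num + 1},
        )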
2. Parallel Processing with Multiple Start URLs
import scrapy

class ParallelMultiPageSpider(scrapy.Spider):
    name = 'parallel_multipage'

    def start_requests(self):
        categories = ['electronics', 'books', 'clothing', 'home']
        for category in categories:
            yield scrapy.Request(
                f'https://example.com/{category}',
                callback=self.parse_category,
                meta={'category': category}
            )

    def parse_category(self, response):
        category = response.meta['category']

        # Extract items
        for item in response.css('.product'):
            yield {
                'name': item.css('.name::text').get(),
                'category': category,
                'url': response.url
            }

        # Follow pagination within the category
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                callback=self.parse_category,
                meta={'category': category}
            )
Integration with Other Tools
While Scrapy excels at multi-page scraping, you might also consider other tools for specific scenarios. For JavaScript-heavy sites, handling AJAX requests using Puppeteer can be more effective. Additionally, when dealing with complex navigation patterns, running multiple pages in parallel with Puppeteer offers excellent performance for browser automation tasks.
Monitoring and Optimization
1. Track Scraping Progress
# Custom extension for progress tracking
from scrapy import signals

class ProgressExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.pages_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    def response_received(self, response, request, spider):
        self.pages_scraped += 1
        if self.pages_scraped % 100 == 0:
            spider.logger.info(f'Scraped {self.pages_scraped} pages')
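An extension only runs if it is registered in settings.py; assuming the class lives in a hypothetical myproject/extensions.py module, the registration looks like this:

# In settings.py
EXTENSIONS = {
    'myproject.extensions.ProgressExtension': 500,
}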
2. Memory Management
# In settings.py
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1024
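For very large crawls, hard limits via the CloseSpider extension can act as a safety net; a sketch with illustrative values:

# In settings.py
CLOSESPIDER_PAGECOUNT = 10000   # Stop after this many responses
CLOSESPIDER_ITEMCOUNT = 50000   # Stop after this many scraped items
CLOSESPIDER_TIMEOUT = 3600      # Stop after one hour of crawling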
Conclusion
Scraping data from multiple pages in Scrapy is highly flexible and can be accomplished through various methods depending on your specific requirements. Whether you're dealing with simple pagination, complex link structures, or AJAX-loaded content, Scrapy provides the tools necessary for efficient multi-page data extraction.
Key takeaways:
- Use response.follow() for simple link following
- Implement CrawlSpider with rules for complex crawling patterns
- Generate URLs programmatically for predictable pagination
- Always implement proper error handling and rate limiting
- Monitor your scraping progress and optimize for performance
Remember to respect robots.txt files and implement appropriate delays to avoid overwhelming target servers. With these techniques, you can efficiently scrape data from websites with hundreds or thousands of pages while maintaining reliability and performance.