Handling pagination in Scrapy usually takes two steps:
- First, scrape the data on the current page.
- Then, follow the link to the next page, either with response.follow or with a LinkExtractor rule.
Let's assume we're scraping a blog where each page contains several posts and there's a 'next' button to go to the next page.
Here is a sample code snippet to illustrate how to handle pagination:

    import scrapy


    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['http://blog.example.com']

        def parse(self, response):
            # Extract the title of every post on the current page.
            for post in response.css('div.post'):
                yield {'title': post.css('h2 a::text').get()}

            # Follow the 'next' link, if there is one, and parse that page the same way.
            next_page = response.css('a.next::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

This code will start at the blog homepage, extract the title of each post, and then follow the link to the next page (the 'next' button), continuing this process until there are no more pages.
The 'next' button is identified by the CSS selector 'a.next', and response.follow is used to create a new request for the next page. It also accepts relative URLs, so the href does not need to be made absolute first. self.parse is passed as the callback for this request, creating a loop: Scrapy calls the parse method for each new page, repeating the extraction for the new set of posts.
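The callback loop can be mimicked without any network access. The sketch below is not Scrapy code: it assumes a hypothetical in-memory FAKE_SITE dictionary in place of real pages, and it uses the standard library's urljoin to resolve the relative 'next' links, which is roughly what response.follow does before issuing a request.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for a three-page blog:
# URL -> (post titles on that page, href of the 'next' link or None).
FAKE_SITE = {
    "http://blog.example.com/": (["Post A", "Post B"], "/page/2/"),
    "http://blog.example.com/page/2/": (["Post C"], "/page/3/"),
    "http://blog.example.com/page/3/": (["Post D"], None),  # last page: no 'next' link
}


def crawl(start_url):
    """Yield one item per post, following 'next' links until none is left."""
    url = start_url
    while url is not None:
        titles, next_href = FAKE_SITE[url]
        for title in titles:
            yield {"title": title}
        # Resolve the relative href against the current page URL.
        url = urljoin(url, next_href) if next_href is not None else None


items = list(crawl("http://blog.example.com/"))
print([item["title"] for item in items])
```

In a real spider, Scrapy schedules the requests asynchronously rather than in a sequential loop; the stopping condition is the same, namely a page without a 'next' link.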
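If you would rather declare which links to follow than write the follow logic yourself, you can use a CrawlSpider with Scrapy's LinkExtractor. It can also match links by URL pattern (allow) or by page region (restrict_css, restrict_xpaths), which helps when the 'next' button has no convenient class. Here's how you can do it: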
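
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class BlogSpider(CrawlSpider):
        name = 'blogspider'
        start_urls = ['http://blog.example.com']

        rules = (
            # Follow links found in elements matching 'a.next' and parse each page reached.
            Rule(LinkExtractor(restrict_css='a.next'), callback='parse_item', follow=True),
        )

        def parse_start_url(self, response, **kwargs):
            # Rule callbacks are not applied to the start URLs, so parse the first page explicitly.
            return self.parse_item(response)

        def parse_item(self, response):
            for post in response.css('div.post'):
                yield {'title': post.css('h2 a::text').get()}
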
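In this version, the LinkExtractor extracts every link found in elements matching the CSS selector 'a.next', and the Rule follows each of them and passes the response to parse_item. Note that a CrawlSpider uses its own parse method internally to apply the rules, so the callback must have a different name.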
Please note that you need to adjust the CSS selectors ('div.post', 'h2 a::text' and 'a.next') to match the actual markup of the website you are scraping.