In Scrapy, you can follow links by using the `LinkExtractor` class from the `scrapy.linkextractors` module. A link extractor scans a response and extracts the links that match the patterns you give it. Here's an example of how to use `LinkExtractor` in a spider:
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'category\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Your parsing code here
        pass
```
In this example, the `LinkExtractor` is configured to follow any link whose URL matches 'category.php'. The `allow` parameter accepts a regular expression (or a list of them), which is used to filter the links that the `LinkExtractor` will extract.
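If you want to check which links an extractor would pick up, you can also call it directly on a response; a minimal sketch (the `debug_links` helper is my own, not part of Scrapy):

```python
from scrapy.linkextractors import LinkExtractor

def debug_links(response):
    # extract_links() returns scrapy.link.Link objects for every
    # URL on the page that matches the allow pattern.
    extractor = LinkExtractor(allow=(r'category\.php',))
    for link in extractor.extract_links(response):
        print(link.url, link.text)
```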
In the `parse_item` method, you can add your parsing code. This method is called with the response of each page the spider reaches through a matching rule.
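For instance, a minimal `parse_item` that records each page's URL and title (the fields are just an illustration) might look like this:

```python
def parse_item(self, response):
    # Anything you yield here is collected as a scraped item.
    yield {
        'url': response.url,
        'title': response.css('title::text').get(),
    }
```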
Note that the `Rule` class does not take a `priority` parameter. Rules are applied in the order they are defined, and if several rules match the same link, the first matching rule wins. What you can prioritize are the requests a rule generates: `Request` objects have a `priority` attribute (higher values are downloaded earlier), which you can set from the rule's `process_request` hook, as sketched below.
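A minimal sketch of that approach (the `set_priority` helper name is my own; since Scrapy 2.0 the hook receives both the request and the response that produced it):

```python
def set_priority(request, response):
    # Requests with higher priority are scheduled first.
    request.priority = 10
    return request

rules = (
    Rule(
        LinkExtractor(allow=(r'category\.php',)),
        callback='parse_item',
        process_request=set_priority,
    ),
)
```

With this rule in place, links matching 'category.php' are scheduled ahead of other pending requests.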
You can also use `response.follow` to follow links:
```python
def parse(self, response):
    for href in response.css('a::attr(href)'):
        yield response.follow(href, self.parse)
```
In this case, the `response.follow` method returns a `Request` instance that you can yield to make Scrapy download the linked page and call the specified callback (`self.parse`) with the response to that page.
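Under the hood, this is roughly equivalent to building the request yourself; a sketch, assuming you resolve relative URLs with `response.urljoin`:

```python
import scrapy

def parse(self, response):
    for href in response.css('a::attr(href)'):
        # response.follow does this for you: resolve the (possibly
        # relative) URL against the current page, then build a Request.
        url = response.urljoin(href.get())
        yield scrapy.Request(url, callback=self.parse)
```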
The `response.follow` method handles relative URLs correctly, and it accepts a selector in addition to a string. For `<a>` elements, it even uses their `href` attribute automatically, which means you can write the previous example more concisely, like this:

```python
def parse(self, response):
    for a in response.css('a'):
        yield response.follow(a, callback=self.parse)
```

In this case, `response.follow` extracts the URL directly from each `<a>` selector.
Remember to be careful when following links and crawling websites, so you don't overload servers or break a site's terms of service. Always check the `robots.txt` file and respect the website's crawling policies.
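Scrapy can enforce some of this for you; a minimal sketch of the relevant settings in your project's `settings.py` (the values are just examples):

```python
# settings.py
ROBOTSTXT_OBEY = True        # skip URLs disallowed by robots.txt
DOWNLOAD_DELAY = 1.0         # pause between requests to the same site
AUTOTHROTTLE_ENABLED = True  # adapt the delay to server responsiveness
```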