How do I follow links in Scrapy?

In Scrapy, you can follow links using the LinkExtractor class from the scrapy.linkextractors module. A link extractor scans a response and returns the links that match its filters; a CrawlSpider's rules then turn those links into new requests.

Here's an example of how to use LinkExtractor in a spider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'category\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # Your parsing code here, e.g. yield data extracted from response
        pass

In this example, the LinkExtractor is configured to follow any link whose URL matches 'category.php'. The allow parameter accepts a regular expression (or a list of them), which is used to filter the links that the LinkExtractor will extract; using a raw string such as r'category\.php' avoids invalid-escape warnings in modern Python.
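The allow filter is applied with re.search semantics, so the pattern only needs to appear somewhere in the URL. As a stand-alone sketch (using plain re rather than Scrapy itself), here is which URLs such a pattern would let through:

```python
import re

# The same pattern you would pass to LinkExtractor(allow=...)
pattern = re.compile(r'category\.php')

urls = [
    'http://www.example.com/category.php?id=1',
    'http://www.example.com/item.php?id=2',
    'http://www.example.com/shop/category.php',
]

# Keep only the URLs the pattern matches, as the extractor's filter would
allowed = [u for u in urls if pattern.search(u)]
```

Note that the unescaped form 'category.php' would also match, since '.' matches any character; escaping the dot makes the intent explicit.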

In the parse_item method, you add your parsing code. This callback is invoked for each page whose URL matched the rule's link extractor.

The Rule class itself does not take a priority parameter; priority is a property of Scrapy's Request objects, and requests with a higher priority value are scheduled first. If you want the links matched by a rule to be fetched before others, one option (a sketch for recent Scrapy versions, where process_request receives both the request and the response) is a process_request hook on the rule:

    rules = (
        Rule(LinkExtractor(allow=(r'category\.php', )), callback='parse_item',
             process_request='boost_priority'),
    )

    def boost_priority(self, request, response):
        # Matching links jump ahead of default-priority (0) requests
        return request.replace(priority=1)

In this example, the spider schedules links that include 'category.php' ahead of other, default-priority requests.

You can also use response.follow to follow links:

def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.parse)

In this case, the response.follow method returns a Request instance that you can yield to make Scrapy download the linked page and call the specified callback method (self.parse) with the response to that page.

The response.follow method handles relative URLs automatically, and it accepts a selector in addition to a string. That means you can write the previous example more concisely, like this:

def parse(self, response):
    for a in response.css('a'):
        yield response.follow(a, callback=self.parse)

In this case, response.follow accepts each <a> selector directly and uses its href attribute, so you don't have to extract the URL yourself.
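Under the hood, relative URLs are resolved against the page's own URL, the same way urllib.parse.urljoin does. A quick stand-alone illustration:

```python
from urllib.parse import urljoin

# response.follow('category.php?id=1') on a page at this URL
# would request the resolved absolute URL below.
base = 'http://www.example.com/catalog/index.html'
resolved = urljoin(base, 'category.php?id=1')
```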

Remember to be careful when following links and crawling websites, to avoid overloading servers or breaking a site's terms of service. Always check the robots.txt file and respect the website's crawling policies.
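In Scrapy, much of this politeness is a matter of project settings. A minimal sketch for settings.py (the values are illustrative; tune them to the target site):

```python
# settings.py (fragment)
ROBOTSTXT_OBEY = True                # honor robots.txt rules
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server response times
```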
