How do I use CSS selectors in Scrapy?

Scrapy is a powerful web scraping framework that allows developers to write spiders in Python to crawl websites and extract data. While Scrapy supports both CSS and XPath selectors, in this article we'll focus on how to use CSS selectors in Scrapy.

CSS selectors are patterns originally designed to pick out HTML elements for styling. In Scrapy, the same patterns select the HTML elements we want to scrape.

Scrapy exposes CSS selectors through the css method, which is available on the response object and on every Selector. The method returns a SelectorList containing one Selector object for each element that matches the pattern. Because each Selector has its own css method, you can chain calls to dig into deeper levels of the HTML.

Here is an example of a Scrapy spider using CSS selectors:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # extract the title of the page
        title = response.css('title::text').get()
        print('Title: ', title)

        # extract all the links on the page
        for link in response.css('a::attr(href)').getall():
            print('Link: ', link)

In this example, the css method selects the title of the webpage and all the links on the page. The ::text pseudo-element selects the text nodes directly inside an element (not the text of its child elements), and the ::attr(name) pseudo-element selects the value of the named attribute. Both are Scrapy extensions rather than part of standard CSS.

For example, a::attr(href) will select the href attribute of all a elements (links), returning every link target exactly as it is written in the page. These values may be relative URLs.

Remember that:

  • getall() returns a list with all matches.
  • get() returns the first match. If no elements match the selector, it returns None.

You can also use CSS selectors within the Scrapy shell to test your selectors before incorporating them into your spider. To start the Scrapy shell, use the scrapy shell command followed by a URL:

scrapy shell 'http://example.com'

Then, you can call the response.css method to test your CSS selectors:

>>> response.css('title::text').get()
'Example Domain'

In a nutshell, CSS selectors in Scrapy let you pinpoint the exact content you want to scrape from a webpage. They are concise and readable, and css calls can be chained together to navigate complex HTML structures.
