Scrapy is a powerful web scraping framework that allows developers to write spiders in Python to crawl websites and extract data. While Scrapy supports both CSS and XPath selectors, in this article we'll focus on how to use CSS selectors in Scrapy.
CSS selectors are patterns used to select the elements you want to style. Here, we will use them to select the HTML elements we want to scrape.
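For instance, here are a few common patterns, written as they would appear in a call to Scrapy's response.css (the class names are purely illustrative):

response.css('p')                  # every <p> element
response.css('.price')             # any element with class="price"
response.css('div.quote > span')   # a <span> directly inside a <div class="quote">
response.css('a[href^="https"]')   # links whose href starts with "https"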
Scrapy provides the css() method for applying CSS selectors in your spider. This method returns a SelectorList of Selector objects, one for each element that matches the pattern, and you can chain CSS selectors together to dig into deeper levels of the HTML.
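For instance, a minimal sketch of chaining, assuming a page that lists products in hypothetical div.product blocks:

# select each product block, then dig into it with a second selector
for product in response.css('div.product'):
    name = product.css('h2::text').get()
    price = product.css('span.price::text').get()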
Here is an example of a Scrapy spider using CSS selectors:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # extract the title of the page
        title = response.css('title::text').get()
        print('Title: ', title)

        # extract all the links on the page
        for link in response.css('a::attr(href)').getall():
            print('Link: ', link)
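To try this spider without creating a full Scrapy project, you can save it as myspider.py and run it with the scrapy runspider myspider.py command.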
In this example, the css() method is used to select the title of the webpage and all the links on the page. The ::text pseudo-element selects the inner text of an element, and the ::attr(name) pseudo-element selects the value of one of its attributes. For example, a::attr(href) selects the href attribute of every a element (link), effectively extracting all the URLs on the page.
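The same pattern works for any attribute. For instance, assuming the page contains img tags, something like this would collect their source URLs:

# extract the source URL of every image on the page
for src in response.css('img::attr(src)').getall():
    print('Image: ', src)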
Remember that:
- getall() returns a list with all matches.
- get() returns the first match; if no elements match the selector, it returns None (see the example below).
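As a quick sketch, assuming a page containing <p>first</p><p>second</p> and no <h6> elements:

response.css('p::text').get()      # 'first'
response.css('p::text').getall()   # ['first', 'second']
response.css('h6::text').get()     # None, since nothing matches

You can also pass a fallback, as in get(default='n/a'), which is returned instead of None when nothing matches.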
You can also use CSS selectors within the Scrapy shell to test them before incorporating them into your spider. To start the Scrapy shell, use the scrapy shell command followed by a URL:
scrapy shell 'http://example.com'
Then, you can call the response.css method to test your CSS selectors:
>>> response.css('title::text').get()
'Example Domain'
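You can keep experimenting in the same session. For instance, listing the page's links might look like this (the output shown is illustrative; at the time of writing, example.com contains a single link):

>>> response.css('a::attr(href)').getall()
['https://www.iana.org/domains/example']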
In a nutshell, CSS selectors in Scrapy are a powerful tool that allows you to pinpoint the exact content you want to scrape from a webpage. They are easy to use, highly readable, and can be chained together to navigate complex HTML structures.