What is Scrapy?

Scrapy is an open-source, collaborative web crawling framework written in Python. It extracts data from web pages using selectors based on XPath or CSS expressions. Scrapy can also handle different types of requests and scale to large crawling projects.

Scrapy is not limited to scraping websites: it can also extract data from APIs or serve as a general-purpose web crawler.
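When the source is a JSON API rather than an HTML page, a callback works on the decoded JSON instead of selectors. The sketch below is only an illustration of that idea: the payload shape (a "quotes" list with nested "author" objects) and the helper name extract_quotes are hypothetical, and the sample string stands in for the body a spider would receive.

```python
import json


def extract_quotes(payload):
    """Turn one decoded API page into items shaped like a spider's output."""
    # The "quotes" key and the nested author object are assumed for this sketch.
    for quote in payload.get("quotes", []):
        yield {
            "text": quote["text"],
            "author": quote["author"]["name"],
            "tags": quote.get("tags", []),
        }


# Hypothetical response body, standing in for response.text inside a callback.
sample_body = (
    '{"has_next": false, "quotes": '
    '[{"text": "Hi", "author": {"name": "A. Writer"}, "tags": ["demo"]}]}'
)
items = list(extract_quotes(json.loads(sample_body)))
print(items)
```

Inside a real spider, the same loop would sit in the callback and yield each dict directly.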

Key Features of Scrapy

  • Scalability: Scrapy processes requests asynchronously, so it can keep many requests in flight at once instead of waiting for each one to finish.

  • Built-in Selectors: It has built-in support for selecting and extracting data using either XPath or CSS expressions.

  • Extensible: It provides many built-in extensions and middlewares for handling cookies and sessions, HTTP features such as compression, authentication, and caching, and more.

  • Robust: It can retry failed requests, and errors raised while handling one response are logged without stopping the rest of the crawl.

  • Portable: Scrapy is written in Python and runs on Linux, Windows, macOS, and BSD.
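Several of the features above are controlled through Scrapy settings. The sketch below collects a few of them in a plain dict. The setting names are real Scrapy settings, but the values and the name CUSTOM_SETTINGS are assumptions chosen for illustration and should be tuned for the target site.

```python
# Illustrative values only; none of these numbers come from the article.
CUSTOM_SETTINGS = {
    # Scalability: caps on how many requests run in parallel.
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    # Politeness: seconds to wait between requests to the same site.
    "DOWNLOAD_DELAY": 0.5,
    # Robustness: retry failed requests a limited number of times.
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    # Extensibility: the cookies middleware keeps session cookies.
    "COOKIES_ENABLED": True,
}
```

A dict like this can be assigned to a Spider's custom_settings class attribute, or the same names can be set as module-level variables in the project's settings.py.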

Example of Scrapy in Python

Let's say we want to extract quotes from http://quotes.toscrape.com. Here is an example of a Spider in Scrapy that extracts the quotes and follows the pagination links:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "next" link, if there is one, and parse that page the same way.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

The Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the Spider. It must be unique within a project.

  • start_urls: a list of URLs from which the Spider begins crawling.

  • parse(): a method that will be called to handle the response downloaded for each of the requests made.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
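The href taken from a "next" link is usually relative, and response.follow resolves it against the URL of the current response before creating the new request. The standard-library sketch below illustrates that resolution step on its own. It is not Scrapy's code: the href value is an assumption about the page, and it assumes the page declares no <base> tag.

```python
from urllib.parse import urljoin

# URL of the response currently being parsed.
current_page = "http://quotes.toscrape.com/page/1/"
# Assumed value of the li.next link's href on that page.
next_href = "/page/2/"

# Resolve the relative link against the current page URL.
next_url = urljoin(current_page, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```

Because response.follow accepts the relative value directly, the spider above never has to build the absolute URL itself.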

Note that Scrapy is only available in Python. For web scraping in JavaScript, you can use libraries such as Cheerio, Puppeteer, or jsdom.
