Scrapy is an open-source web crawling framework that allows you to write spiders to scrape web content and use it in your applications. Here's how you write a Scrapy spider:
To create a Scrapy project, navigate to your project directory and run the following command:
```shell
scrapy startproject myproject
```
This will create a new Scrapy project named "myproject". Now, navigate to the "spiders" directory inside the new project:
```shell
cd myproject/myproject/spiders
```
A Scrapy spider is a Python class that subclasses `scrapy.Spider`. Here is a basic example:
```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    # The selectors below match the markup of quotes.toscrape.com;
    # http://example.com has no div.quote elements, so we start there instead.
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        self.log('Visited %s' % response.url)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }
```
In this example, `name` identifies the spider; Scrapy uses it to start the crawl. `start_urls` is the list of URLs the spider begins crawling from, and `parse` is the callback method invoked with the response downloaded for each request made.
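To see what the CSS selectors in `parse` pull out, here is a stdlib-only sketch that mimics the extraction on a small sample fragment. The fragment's shape (a `div.quote` containing `span.text` and a `small` author) is an assumption modeled on quotes.toscrape.com's markup; in a real crawl, `response.css` does this work over the downloaded page:

```python
import xml.etree.ElementTree as ET

# A hand-written fragment shaped like the quotes site's markup (an assumption
# for illustration only -- not fetched from the real site).
html = """
<div>
  <div class="quote">
    <span class="text">To be or not to be.</span>
    <span>by <small>Shakespeare</small></span>
  </div>
</div>
"""

root = ET.fromstring(html)
# Mirrors the spider's loop: one dict per div.quote element.
items = [
    {
        'text': quote.find(".//span[@class='text']").text,
        'author': quote.find(".//small").text,
    }
    for quote in root.findall(".//div[@class='quote']")
]
print(items)  # [{'text': 'To be or not to be.', 'author': 'Shakespeare'}]
```

Each dict here corresponds to one item the spider would yield back to Scrapy's engine.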
To run the spider, use the `scrapy crawl` command with the spider's name, from inside the project directory:

```shell
scrapy crawl my_spider
```
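The dicts the spider yields can be saved with Scrapy's built-in feed exports. The `-o` option appends items to a file whose format is inferred from the extension (the filename `quotes.json` is just an example):

```shell
scrapy crawl my_spider -o quotes.json
```

In Scrapy 2.x, `-o` appends to an existing file, while `-O` overwrites it.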