Scrapy is a popular open-source Python framework for web scraping. An "Item" in Scrapy is a simple Python class that declares the data you want to scrape from a website. Each Item field corresponds to one data point you want to extract.
Here's how you can define an item in Scrapy:
- First, you need to import Item and Field from the scrapy.item module:
from scrapy.item import Item, Field
- Then, you can define your item class. For instance, if you're scraping a book store and you want to extract the book name and its price, you can define your item like this:
class BookItem(Item):
    name = Field()
    price = Field()
In the above code, BookItem is the name of our item, and name and price are the fields in this item. Each field is defined as a Field() object. You don't need to specify a data type for these fields: Scrapy items behave much like Python dictionaries, so a field can hold any type of data. Unlike a plain dictionary, though, an item only accepts the fields you declared, and assigning to an undeclared field raises a KeyError.
When you scrape the data, you'd typically create an instance of your item class, and then fill it with data. Here's an example:
def parse(self, response):
    for book in response.css('div.book'):
        item = BookItem()
        item['name'] = book.css('h1 ::text').get()
        item['price'] = book.css('p.price ::text').get()
        yield item
In this code, parse is a method of your Scrapy spider that processes the downloaded web page, extracts the data, and yields the items. The response parameter is an instance of TextResponse (for HTML pages, typically its subclass HtmlResponse) that holds the page content and provides helper methods for working with it.
The response.css method lets you use CSS selectors to select page elements, and the ::text pseudo-element lets you get the text inside these elements. The get() method returns the first match, or None if nothing matches.