How do I create a spider in Scrapy?
Creating a spider in Scrapy is the fundamental step to start web scraping with this powerful Python framework. A spider is a class that defines how a website should be scraped, including which URLs to start with, how to follow links, and how to extract data from pages.
What is a Scrapy Spider?
A Scrapy spider is a Python class that inherits from scrapy.Spider and defines the scraping logic for extracting data from websites. Each spider contains:
- A unique name identifier
- Starting URLs to begin scraping
- Rules for following links (optional)
- Methods to parse responses and extract data
Prerequisites
Before creating a spider, ensure you have Scrapy installed and a project set up:
# Install Scrapy
pip install scrapy
# Create a new Scrapy project
scrapy startproject myproject
cd myproject
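Running startproject generates a standard project layout. The exact files can vary slightly between Scrapy versions, but it typically looks like this:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory that will hold your spiders
            __init__.py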
Basic Spider Structure
Here's the minimal structure of a Scrapy spider:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data here
        pass
Creating Your First Spider
Method 1: Using Scrapy's genspider Command
The quickest way to create a spider is using Scrapy's built-in command:
# Generate a basic spider
scrapy genspider quotes quotes.toscrape.com
# Generate a spider with a specific template
scrapy genspider -t crawl quotes quotes.toscrape.com
This creates a spider file in the spiders/ directory with a basic structure already set up.
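The generated file looks roughly like this (the exact boilerplate depends on your Scrapy version); you then fill in the parse() method with your extraction logic:
# spiders/quotes.py (generated by scrapy genspider)
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        pass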
Method 2: Manual Creation
Create a new Python file in the spiders/ directory:
# spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Extract quotes
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Advanced Spider Features
Handling Different Response Types
import scrapy
import json

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    start_urls = ['https://api.example.com/data']

    def parse(self, response):
        # Handle JSON responses
        if response.headers.get('content-type', b'').startswith(b'application/json'):
            data = json.loads(response.text)
            for item in data['results']:
                yield self.parse_item(item)
        # Handle HTML responses
        else:
            for selector in response.css('div.item'):
                yield self.parse_html_item(selector)

    def parse_item(self, item_data):
        return {
            'id': item_data.get('id'),
            'name': item_data.get('name'),
            'description': item_data.get('description')
        }

    def parse_html_item(self, selector):
        return {
            'title': selector.css('h2::text').get(),
            'price': selector.css('.price::text').re_first(r'[\d.]+'),
            'availability': selector.css('.stock::text').get()
        }
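If you are on Scrapy 2.2 or newer, response.json() can replace the manual json.loads() call for JSON responses. A minimal sketch of the same parse logic:
def parse(self, response):
    # response.json() is available on TextResponse objects in Scrapy 2.2+
    data = response.json()
    for item in data.get('results', []):
        yield self.parse_item(item)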
Using Spider Arguments
Pass arguments to your spider for dynamic behavior:
import scrapy

class ParameterizedSpider(scrapy.Spider):
    name = 'parameterized'

    def __init__(self, category=None, max_pages=None, *args, **kwargs):
        super(ParameterizedSpider, self).__init__(*args, **kwargs)
        self.start_urls = [f'https://example.com/category/{category}']
        self.max_pages = int(max_pages) if max_pages else 10
        self.page_count = 0

    def parse(self, response):
        # Extract data
        for item in response.css('div.product'):
            yield {
                'name': item.css('h3::text').get(),
                'price': item.css('.price::text').get()
            }

        # Respect max_pages limit
        if self.page_count < self.max_pages:
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                self.page_count += 1
                yield response.follow(next_page, self.parse)
Run with arguments:
scrapy crawl parameterized -a category=electronics -a max_pages=5
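A custom __init__ is optional: Scrapy's default spider constructor copies every -a argument onto the spider as a string attribute, so a simpler sketch (the attribute name and URL here are illustrative) also works:
import scrapy

class SimpleArgsSpider(scrapy.Spider):
    name = 'simple_args'

    def start_requests(self):
        # -a category=electronics becomes self.category (always a string)
        category = getattr(self, 'category', 'all')
        yield scrapy.Request(
            f'https://example.com/category/{category}',
            callback=self.parse
        )

    def parse(self, response):
        pass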
Spider Templates
Scrapy provides several spider templates for different use cases:
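You can list the available templates with the genspider command itself:
# List available spider templates
scrapy genspider -l
# Available templates:
#   basic
#   crawl
#   csvfeed
#   xmlfeed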
Basic Spider (Default)
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        pass
CrawlSpider for Following Links
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Use a distinct class name so the imported CrawlSpider base class isn't shadowed
class ProductCrawlSpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_item'),
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'description': response.css('.description::text').get()
        }
Best Practices for Spider Creation
1. Proper Error Handling
def parse(self, response):
    try:
        for item in response.css('div.item'):
            data = {
                'title': item.css('h2::text').get(),
                'price': item.css('.price::text').re_first(r'[\d.]+')
            }
            # Validate required fields
            if data['title'] and data['price']:
                yield data
            else:
                self.logger.warning(f"Incomplete data: {data}")
    except Exception as e:
        self.logger.error(f"Error parsing response: {e}")
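You can also attach an errback to individual requests so that network-level failures (timeouts, DNS errors, non-2xx responses) are handled explicitly. A minimal sketch using Scrapy's standard Request/errback API, with a placeholder URL:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class RobustSpider(scrapy.Spider):
    name = 'robust'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/items',   # placeholder URL
            callback=self.parse,
            errback=self.handle_error
        )

    def parse(self, response):
        pass

    def handle_error(self, failure):
        # failure is a twisted.python.failure.Failure
        if failure.check(HttpError):
            self.logger.error(f"HTTP error on {failure.value.response.url}")
        else:
            self.logger.error(repr(failure))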
2. Implementing Delays and Rate Limiting
Configure these in settings.py:
# settings.py
DOWNLOAD_DELAY = 1 # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
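For adaptive rate limiting, Scrapy's built-in AutoThrottle extension adjusts the delay based on observed server latency. The values below are a reasonable starting point rather than tuned numbers:
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # maximum delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server
AUTOTHROTTLE_DEBUG = False             # set to True to log throttling stats per response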
3. User Agent Rotation
# settings.py
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
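Note that RandomUserAgentMiddleware comes from the third-party scrapy-user-agents package (pip install scrapy-user-agents). If you prefer to stick with Scrapy's built-ins, you can simply set a single, honest user agent instead:
# settings.py -- identify your crawler; the URL is a placeholder
USER_AGENT = 'mybot/1.0 (+https://example.com/bot-info)'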
Running Your Spider
Once created, run your spider with these commands:
# Basic run
scrapy crawl quotes
# Save output to file
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
# Run with custom settings
scrapy crawl quotes -s DOWNLOAD_DELAY=2
# Run with logging
scrapy crawl quotes -L INFO
Common Spider Patterns
Handling Forms and POST Requests
def parse(self, response):
    # Submit a form
    return scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'admin', 'password': 'secret'},
        callback=self.after_login
    )

def after_login(self, response):
    # Continue scraping after login
    yield response.follow('/protected-page', self.parse_protected)
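For endpoints that expect a raw POST body (for example JSON APIs) rather than an HTML form, you can build the request yourself. A sketch with a placeholder endpoint and payload:
import json
import scrapy

class ApiPostSpider(scrapy.Spider):
    name = 'api_post'

    def start_requests(self):
        payload = {'query': 'laptops', 'page': 1}      # illustrative payload
        yield scrapy.Request(
            url='https://api.example.com/search',      # placeholder endpoint
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse_api
        )

    def parse_api(self, response):
        for result in json.loads(response.text).get('results', []):
            yield result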
Managing Sessions and Cookies
def start_requests(self):
    # Start with login
    return [scrapy.Request('https://example.com/login',
                           callback=self.login)]

def login(self, response):
    # Perform login and maintain session
    return scrapy.FormRequest.from_response(
        response,
        formdata={'user': 'admin', 'pass': 'secret'},
        callback=self.after_login
    )
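Scrapy handles cookies automatically within a crawl, but if you need several independent sessions in one spider you can use the cookiejar meta key supported by the built-in cookies middleware. A minimal sketch (the login URL and form fields are placeholders):
def start_requests(self):
    # Each cookiejar id keeps its own independent cookie session
    for session_id in range(3):
        yield scrapy.Request(
            'https://example.com/login',
            meta={'cookiejar': session_id},
            callback=self.login,
            dont_filter=True    # allow the same URL to be requested once per session
        )

def login(self, response):
    # Pass the same cookiejar along so later requests reuse this session's cookies
    return scrapy.FormRequest.from_response(
        response,
        formdata={'user': 'admin', 'pass': 'secret'},
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.after_login
    )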
Integration with Other Tools
While Scrapy is excellent for systematic web scraping, you might also consider tools like Puppeteer for JavaScript-heavy sites. For complex scenarios involving dynamic content loading, you might want to explore how to handle authentication in Puppeteer for sites requiring login flows.
Similar concepts apply across tools when it comes to monitoring network requests during scraping: learning how to monitor network requests in Puppeteer can help you understand request patterns that you may want to replicate in Scrapy.
Troubleshooting Common Issues
Spider Not Found Error
Ensure your spider class:
- Inherits from scrapy.Spider
- Has a unique name attribute
- Is in the spiders/ directory
- Has valid Python syntax
Import Errors
Check that:
- Scrapy is properly installed
- Your project structure is correct
- All required modules are imported
No Data Extracted
Verify:
- Your CSS/XPath selectors are correct
- The target website structure hasn't changed
- Response status codes are successful (200)
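A quick way to debug selectors is Scrapy's interactive shell, which downloads a page and lets you try CSS/XPath expressions against the live response:
# Open an interactive shell against the target page
scrapy shell 'https://quotes.toscrape.com'
# Then experiment, for example:
#   response.status
#   response.css('div.quote span.text::text').get()
#   view(response)   # opens the downloaded page in your browser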
Conclusion
Creating a spider in Scrapy involves defining a class that inherits from scrapy.Spider, specifying starting URLs, and implementing parsing logic. With proper error handling, rate limiting, and adherence to best practices, you can build robust web scrapers that efficiently extract data from websites.
Start with simple spiders and gradually add complexity as needed. Remember to respect website terms of service and implement appropriate delays to avoid overwhelming target servers.