How do I create a new Scrapy project?
Creating a new Scrapy project is the first step in building a web scraping application. Scrapy is a Python framework for extracting data from websites efficiently and at scale. This guide walks you through setting up a new Scrapy project from scratch.
Prerequisites
Before creating a Scrapy project, ensure you have the following installed:
- Python 3.7 or higher (recent Scrapy releases require a newer minimum; check the installation guide for your Scrapy version)
- pip (Python package installer)
- Virtual environment (recommended)
Installing Scrapy
First, create a virtual environment and install Scrapy:
# Create a virtual environment
python -m venv scrapy_env
# Activate the virtual environment
# On Windows:
scrapy_env\Scripts\activate
# On macOS/Linux:
source scrapy_env/bin/activate
# Install Scrapy
pip install scrapy
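To confirm the installation, you can check the installed version from Python (the exact number will differ on your machine); the scrapy version command prints the same information.
import scrapy
print(scrapy.__version__)  # e.g. 2.11.2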
Creating a New Scrapy Project
Step 1: Generate the Project Structure
Use the scrapy startproject command to create a new project:
scrapy startproject myproject
This command creates a directory structure like this:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory for spiders
            __init__.py
Step 2: Navigate to Your Project Directory
cd myproject
Understanding the Project Structure
scrapy.cfg
The main configuration file that defines the project settings module and deployment settings.
[settings]
default = myproject.settings
[deploy]
#url = http://localhost:6800/
project = myproject
settings.py
Contains all project settings and configurations:
# Scrapy settings for myproject project
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure delays for requests
DOWNLOAD_DELAY = 1
# Randomize the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True
# Configure user agent
USER_AGENT = 'myproject (+http://www.yourdomain.com)'
# Configure pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,
}
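As an alternative (or complement) to a fixed DOWNLOAD_DELAY, Scrapy's built-in AutoThrottle extension adjusts delays dynamically based on how quickly the server responds. A minimal sketch, with illustrative values you would tune per site:
# Let Scrapy adapt the delay to the target server's latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0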
items.py
Defines the data structures for extracted items:
import scrapy
class MyprojectItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
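Declared fields give you dict-like access with a safety net: assigning to a field that was not declared raises a KeyError, which catches typos early. A quick illustration with made-up values:
item = MyprojectItem(title='Example product', price='9.99')
item['url'] = 'https://example.com/product/1'  # illustrative URL
# item['pirce'] = '9.99'  # misspelled field -> raises KeyError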
pipelines.py
Processes scraped items:
from itemadapter import ItemAdapter
class MyprojectPipeline:
    def process_item(self, item, spider):
        # Process the item
        adapter = ItemAdapter(item)
        # Clean the data
        if adapter.get('price'):
            adapter['price'] = adapter['price'].strip()
        return item
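A pipeline can also discard items that fail validation by raising DropItem, which stops them from reaching later pipelines. A minimal sketch of that pattern (the class name and the price check are illustrative):
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PriceValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Drop items that arrive without a price instead of storing them
        if not adapter.get('price'):
            raise DropItem(f"Missing price in item from {adapter.get('url')}")
        return item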
middlewares.py
Custom middleware for request/response processing:
from scrapy import signals
from itemadapter import is_item, ItemAdapter
class MyprojectSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
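A middleware only runs once it is enabled in settings.py; the number (0–1000) controls where it sits relative to Scrapy's built-in middlewares, for example:
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.MyprojectSpiderMiddleware': 543,
}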
Creating Your First Spider
Step 1: Generate a Spider
cd myproject
scrapy genspider example example.com
This creates a basic spider file in the spiders/ directory. The generated parse() method is empty; here it is filled in with example extraction logic for a quotes-style listing page:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Extract data from the response
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Step 2: Customize Your Spider
Here's a more comprehensive spider example:
import scrapy
from myproject.items import MyprojectItem
class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example-store.com']
    start_urls = ['https://example-store.com/products']

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 1,
    }

    def parse(self, response):
        # Extract product URLs
        product_urls = response.css('div.product a::attr(href)').getall()
        for url in product_urls:
            yield response.follow(url, self.parse_product)

        # Handle pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        item = MyprojectItem()
        item['title'] = response.css('h1.product-title::text').get()
        item['price'] = response.css('span.price::text').re_first(r'\$(.+)')
        item['description'] = response.css('div.description::text').get()
        item['url'] = response.url
        yield item
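If a detail page needs data that is only visible on the listing page (a category heading, a listing rank, and so on), you can pass it to the callback with cb_kwargs. A sketch with a hypothetical spider name and hypothetical selectors:
import scrapy

class CategoryAwareSpider(scrapy.Spider):
    name = 'products_with_category'
    start_urls = ['https://example-store.com/products']

    def parse(self, response):
        # Hypothetical selector: the category heading only appears on the listing page
        category = response.css('h1.category-title::text').get()
        for url in response.css('div.product a::attr(href)').getall():
            yield response.follow(url, self.parse_product, cb_kwargs={'category': category})

    def parse_product(self, response, category):
        yield {
            'title': response.css('h1.product-title::text').get(),
            'category': category,
            'url': response.url,
        }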
Running Your Spider
Basic Execution
# Run the spider
scrapy crawl example
# Save output to file
scrapy crawl example -o products.json
scrapy crawl example -o products.csv
scrapy crawl example -o products.xml
Advanced Options
# Set custom settings
scrapy crawl example -s DOWNLOAD_DELAY=3
# Override user agent
scrapy crawl example -s USER_AGENT='Custom Bot 1.0'
# Set the log level to DEBUG for verbose output
scrapy crawl example -L DEBUG
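You can also run spiders from a plain Python script instead of the scrapy command, which is handy for cron jobs or for embedding Scrapy in a larger application. A minimal sketch that reuses the project's settings:
# run.py -- placed next to scrapy.cfg so the project settings can be found
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('example')  # spider name, as used with `scrapy crawl`
process.start()           # blocks until the crawl finishes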
Project Configuration Best Practices
1. Environment-Specific Settings
Create separate settings files for different environments. Keep them in a settings/ package (a directory with an __init__.py) so the relative imports below work:
# settings/base.py
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
# settings/development.py
from .base import *
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS = 1
# settings/production.py
from .base import *
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8
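Scrapy decides which settings module to load from the SCRAPY_SETTINGS_MODULE environment variable (falling back to the [settings] section of scrapy.cfg), so switching environments means pointing that variable at the right module before the settings are read. A sketch, assuming the settings/ package layout above:
import os

# Select the production settings before Scrapy reads its configuration
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings.production'

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('DOWNLOAD_DELAY'))  # -> 1 with the production values above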
2. Custom Item Pipelines
# pipelines.py
import json
import sqlite3

from itemadapter import ItemAdapter

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
class DatabasePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('items.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS items (
                id INTEGER PRIMARY KEY,
                title TEXT,
                price TEXT,
                url TEXT
            )
        ''')

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute('''
            INSERT INTO items (title, price, url) VALUES (?, ?, ?)
        ''', (adapter['title'], adapter['price'], adapter['url']))
        self.connection.commit()
        return item
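Neither pipeline runs until it is enabled in settings.py; the integer values (0–1000) set the order in which items flow through them, lower numbers first:
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
    'myproject.pipelines.DatabasePipeline': 400,
}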
Advanced Project Features
Using Scrapy Shell for Development
The Scrapy shell is invaluable for testing selectors:
# Open shell with a URL
scrapy shell "https://example.com"
# Test CSS selectors
response.css('title::text').get()
response.css('div.content p::text').getall()
# Test XPath selectors
response.xpath('//title/text()').get()
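Inside the shell you can also re-fetch pages and inspect exactly what Scrapy downloaded, which is useful when a page looks different in your browser because of JavaScript (the URL below is illustrative):
# Fetch another page without restarting the shell
fetch('https://example.com/some-page')
# Open the downloaded response in your browser to see what Scrapy actually received
view(response)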
Handling JavaScript-Heavy Sites
For sites requiring JavaScript execution, consider integrating a headless browser. Scrapy excels at static content, but for single-page applications you may need to render pages first, for example through the scrapy-splash or scrapy-playwright integrations (or an external tool like Puppeteer).
Deployment Considerations
When preparing your Scrapy project for production:
- Configure logging properly (see the settings sketch after this list)
- Set appropriate delays and concurrent requests
- Implement robust error handling
- Use rotating user agents and proxies
- Respect robots.txt and rate limits
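A sketch of production-oriented settings covering several of the points above; every value is illustrative and should be adapted to the target site and your infrastructure:
# Logging
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# Politeness and throughput
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
ROBOTSTXT_OBEY = True

# Error handling
RETRY_ENABLED = True
RETRY_TIMES = 3
DOWNLOAD_TIMEOUT = 30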
Troubleshooting Common Issues
Import Errors
Ensure your project is properly structured and all __init__.py files are present.
Spider Not Found
Verify that the name you pass to scrapy crawl matches the name attribute in your spider class; scrapy list shows the spiders Scrapy can discover.
Permission Denied
Check robots.txt compliance and ensure ROBOTSTXT_OBEY is set appropriately.
Next Steps
After creating your Scrapy project:
- Study the target website's structure
- Design appropriate item schemas
- Implement data validation and cleaning
- Configure pipelines for data storage
- Test thoroughly with different scenarios
- Optimize for performance and reliability
Creating a Scrapy project is just the beginning of your web scraping journey. The framework's flexibility allows you to build sophisticated scraping solutions that can handle complex websites and large-scale data extraction tasks efficiently.