How do I create a new Scrapy project?
Creating a new Scrapy project is the first step in building a web scraping application. Scrapy is a Python framework for extracting data from websites efficiently and at scale. This guide walks you through setting up a new Scrapy project from scratch.
Prerequisites
Before creating a Scrapy project, ensure you have the following installed:
- Python 3.7 or higher (recent Scrapy releases require a newer minimum; check the installation guide for your Scrapy version)
- pip (Python package installer)
- Virtual environment (recommended)
Installing Scrapy
First, create a virtual environment and install Scrapy:
# Create a virtual environment
python -m venv scrapy_env
# Activate the virtual environment
# On Windows:
scrapy_env\Scripts\activate
# On macOS/Linux:
source scrapy_env/bin/activate
# Install Scrapy
pip install scrapy
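To confirm the installation, you can check the installed version from Python (the exact number will differ on your machine); the scrapy version command prints the same information.
import scrapy
print(scrapy.__version__)  # e.g. 2.11.2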
Creating a New Scrapy Project
Step 1: Generate the Project Structure
Use the scrapy startproject command to create a new project:
scrapy startproject myproject
This command creates a directory structure like this:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory for spiders
            __init__.py
Step 2: Navigate to Your Project Directory
cd myproject
Understanding the Project Structure
scrapy.cfg
The main configuration file that defines the project settings module and deployment settings.
[settings]
default = myproject.settings
[deploy]
#url = http://localhost:6800/
project = myproject
settings.py
Contains all project settings and configurations:
# Scrapy settings for myproject project
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure delays for requests
DOWNLOAD_DELAY = 1
# Randomize the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True
# Configure user agent
USER_AGENT = 'myproject (+http://www.yourdomain.com)'
# Configure pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,
}
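As an alternative (or complement) to a fixed DOWNLOAD_DELAY, Scrapy's built-in AutoThrottle extension adjusts delays dynamically based on how quickly the server responds. A minimal sketch, with illustrative values you would tune per site:
# Let Scrapy adapt the delay to the target server's latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0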
items.py
Defines the data structures for extracted items:
import scrapy
class MyprojectItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
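Declared fields give you dict-like access with a safety net: assigning to a field that was not declared raises a KeyError, which catches typos early. A quick illustration with made-up values:
item = MyprojectItem(title='Example product', price='9.99')
item['url'] = 'https://example.com/product/1'  # illustrative URL
# item['pirce'] = '9.99'  # misspelled field -> raises KeyError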
pipelines.py
Processes scraped items:
from itemadapter import ItemAdapter
class MyprojectPipeline:
    def process_item(self, item, spider):
        # Process the item
        adapter = ItemAdapter(item)
        # Clean the data
        if adapter.get('price'):
            adapter['price'] = adapter['price'].strip()
        return item
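A pipeline can also discard items that fail validation by raising DropItem, which stops them from reaching later pipelines. A minimal sketch of that pattern (the class name and the price check are illustrative):
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PriceValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Drop items that arrive without a price instead of storing them
        if not adapter.get('price'):
            raise DropItem(f"Missing price in item from {adapter.get('url')}")
        return item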
middlewares.py
Custom middleware for request/response processing:
from scrapy import signals
from itemadapter import is_item, ItemAdapter
class MyprojectSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
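A middleware only runs once it is enabled in settings.py; the number (0–1000) controls where it sits relative to Scrapy's built-in middlewares, for example:
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.MyprojectSpiderMiddleware': 543,
}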
Creating Your First Spider
Step 1: Generate a Spider
cd myproject
scrapy genspider example example.com
This creates a basic spider file in the spiders/ directory. The generated parse() method is empty; here it is filled in with example extraction logic for a quotes-style listing page:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Extract data from the response
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Step 2: Customize Your Spider
Here's a more comprehensive spider example:
import scrapy
from myproject.items import MyprojectItem
class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example-store.com']
    start_urls = ['https://example-store.com/products']

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 1,
    }

    def parse(self, response):
        # Extract product URLs
        product_urls = response.css('div.product a::attr(href)').getall()
        for url in product_urls:
            yield response.follow(url, self.parse_product)

        # Handle pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        item = MyprojectItem()
        item['title'] = response.css('h1.product-title::text').get()
        item['price'] = response.css('span.price::text').re_first(r'\$(.+)')
        item['description'] = response.css('div.description::text').get()
        item['url'] = response.url
        yield item
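If a detail page needs data that is only visible on the listing page (a category heading, a listing rank, and so on), you can pass it to the callback with cb_kwargs. A sketch with a hypothetical spider name and hypothetical selectors:
import scrapy

class CategoryAwareSpider(scrapy.Spider):
    name = 'products_with_category'
    start_urls = ['https://example-store.com/products']

    def parse(self, response):
        # Hypothetical selector: the category heading only appears on the listing page
        category = response.css('h1.category-title::text').get()
        for url in response.css('div.product a::attr(href)').getall():
            yield response.follow(url, self.parse_product, cb_kwargs={'category': category})

    def parse_product(self, response, category):
        yield {
            'title': response.css('h1.product-title::text').get(),
            'category': category,
            'url': response.url,
        }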
Running Your Spider
Basic Execution
# Run the spider
scrapy crawl example
# Save output to file
scrapy crawl example -o products.json
scrapy crawl example -o products.csv
scrapy crawl example -o products.xml
Advanced Options
# Set custom settings
scrapy crawl example -s DOWNLOAD_DELAY=3
# Override user agent
scrapy crawl example -s USER_AGENT='Custom Bot 1.0'
# Set the log level to DEBUG for verbose output
scrapy crawl example -L DEBUG
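You can also run spiders from a plain Python script instead of the scrapy command, which is handy for cron jobs or for embedding Scrapy in a larger application. A minimal sketch that reuses the project's settings:
# run.py -- placed next to scrapy.cfg so the project settings can be found
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('example')  # spider name, as used with `scrapy crawl`
process.start()           # blocks until the crawl finishes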
Project Configuration Best Practices
1. Environment-Specific Settings
Create separate settings files for different environments. Keep them in a settings/ package (a directory with an __init__.py) so the relative imports below work:
# settings/base.py
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
# settings/development.py
from .base import *
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS = 1
# settings/production.py
from .base import *
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8
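Scrapy decides which settings module to load from the SCRAPY_SETTINGS_MODULE environment variable (falling back to the [settings] section of scrapy.cfg), so switching environments means pointing that variable at the right module before the settings are read. A sketch, assuming the settings/ package layout above:
import os

# Select the production settings before Scrapy reads its configuration
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings.production'

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('DOWNLOAD_DELAY'))  # -> 1 with the production values above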
2. Custom Item Pipelines
# pipelines.py
import json
import sqlite3

from itemadapter import ItemAdapter

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
class DatabasePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('items.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS items (
                id INTEGER PRIMARY KEY,
                title TEXT,
                price TEXT,
                url TEXT
            )
        ''')

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute('''
            INSERT INTO items (title, price, url) VALUES (?, ?, ?)
        ''', (adapter['title'], adapter['price'], adapter['url']))
        self.connection.commit()
        return item
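Neither pipeline runs until it is enabled in settings.py; the integer values (0–1000) set the order in which items flow through them, lower numbers first:
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
    'myproject.pipelines.DatabasePipeline': 400,
}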
Advanced Project Features
Using Scrapy Shell for Development
The Scrapy shell is invaluable for testing selectors:
# Open shell with a URL
scrapy shell "https://example.com"
# Test CSS selectors
response.css('title::text').get()
response.css('div.content p::text').getall()
# Test XPath selectors
response.xpath('//title/text()').get()
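Inside the shell you can also re-fetch pages and inspect exactly what Scrapy downloaded, which is useful when a page looks different in your browser because of JavaScript (the URL below is illustrative):
# Fetch another page without restarting the shell
fetch('https://example.com/some-page')
# Open the downloaded response in your browser to see what Scrapy actually received
view(response)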
Handling JavaScript-Heavy Sites
For sites requiring JavaScript execution, consider integrating a headless browser. Scrapy excels at static content, but for single-page applications you may need to render pages first, for example through the scrapy-splash or scrapy-playwright integrations (or an external tool like Puppeteer).
Deployment Considerations
When preparing your Scrapy project for production:
- Configure logging properly (see the settings sketch after this list)
- Set appropriate delays and concurrent requests
- Implement robust error handling
- Use rotating user agents and proxies
- Respect robots.txt and rate limits
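A sketch of production-oriented settings covering several of the points above; every value is illustrative and should be adapted to the target site and your infrastructure:
# Logging
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# Politeness and throughput
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
ROBOTSTXT_OBEY = True

# Error handling
RETRY_ENABLED = True
RETRY_TIMES = 3
DOWNLOAD_TIMEOUT = 30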
Troubleshooting Common Issues
Import Errors
Ensure your project is properly structured and all __init__.py files are present.
Spider Not Found
Verify that the name you pass to scrapy crawl matches the name attribute in your spider class; scrapy list shows the spiders Scrapy can discover.
Permission Denied
Check robots.txt compliance and ensure ROBOTSTXT_OBEY is set appropriately.
Next Steps
After creating your Scrapy project:
- Study the target website's structure
- Design appropriate item schemas
- Implement data validation and cleaning
- Configure pipelines for data storage
- Test thoroughly with different scenarios
- Optimize for performance and reliability
Creating a Scrapy project is just the beginning of your web scraping journey. The framework's flexibility allows you to build sophisticated scraping solutions that can handle complex websites and large-scale data extraction tasks efficiently.