How do I create a spider in Scrapy?

Creating a spider in Scrapy is the fundamental step to start web scraping with this powerful Python framework. A spider is a class that defines how a website should be scraped, including which URLs to start with, how to follow links, and how to extract data from pages.

What is a Scrapy Spider?

A Scrapy spider is a Python class that inherits from scrapy.Spider and defines the scraping logic for extracting data from websites. Each spider contains:

  • A unique name identifier
  • Starting URLs to begin scraping
  • Rules for following links (optional)
  • Methods to parse responses and extract data

Prerequisites

Before creating a spider, ensure you have Scrapy installed and a project set up:

# Install Scrapy
pip install scrapy

# Create a new Scrapy project
scrapy startproject myproject
cd myproject
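
The startproject command generates a standard project layout; your spiders live in the inner spiders/ package:

myproject/
    scrapy.cfg            # deployment configuration
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # your spider modules go here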

Basic Spider Structure

Here's the minimal structure of a Scrapy spider:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data here
        pass

Creating Your First Spider

Method 1: Using Scrapy's genspider Command

The quickest way to create a spider is using Scrapy's built-in command:

# Generate a basic spider
scrapy genspider quotes quotes.toscrape.com

# Generate a spider with a specific template
scrapy genspider -t crawl quotes quotes.toscrape.com

This creates a spider file in the spiders/ directory with basic structure already set up.
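
You can confirm that Scrapy picked up the new spider by listing the spiders registered in the project:

# List all spider names in the current project
scrapy list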

Method 2: Manual Creation

Create a new Python file in the spiders/ directory:

# spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Extract quotes
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
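
As a side note, if you are on Scrapy 2.0 or newer, response.follow_all can replace the manual get()/None check when following links; a minimal sketch of the same pagination step:

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').get()}

    # Scrapy 2.0+: yields one request per link matched by the CSS query
    yield from response.follow_all(css='li.next a', callback=self.parse)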

Advanced Spider Features

Handling Different Response Types

import scrapy
import json

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    start_urls = ['https://api.example.com/data']

    def parse(self, response):
        # Handle JSON responses
        if response.headers.get('content-type', b'').startswith(b'application/json'):
            data = json.loads(response.text)
            for item in data['results']:
                yield self.parse_item(item)

        # Handle HTML responses
        else:
            for selector in response.css('div.item'):
                yield self.parse_html_item(selector)

    def parse_item(self, item_data):
        return {
            'id': item_data.get('id'),
            'name': item_data.get('name'),
            'description': item_data.get('description')
        }

    def parse_html_item(self, selector):
        return {
            'title': selector.css('h2::text').get(),
            'price': selector.css('.price::text').re_first(r'[\d.]+'),
            'availability': selector.css('.stock::text').get()
        }
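
If your project runs on Scrapy 2.2 or newer, response.json() can replace the manual json.loads call; a minimal sketch of the JSON branch under that assumption:

def parse(self, response):
    content_type = response.headers.get('Content-Type', b'')
    if b'application/json' in content_type:
        # TextResponse.json() (Scrapy 2.2+) parses and caches the JSON body
        for item in response.json()['results']:
            yield self.parse_item(item)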

Using Spider Arguments

Pass arguments to your spider for dynamic behavior:

class ParameterizedSpider(scrapy.Spider):
    name = 'parameterized'

    def __init__(self, category=None, max_pages=None, *args, **kwargs):
        super(ParameterizedSpider, self).__init__(*args, **kwargs)
        self.start_urls = [f'https://example.com/category/{category}']
        self.max_pages = int(max_pages) if max_pages else 10
        self.page_count = 0

    def parse(self, response):
        # Extract data
        for item in response.css('div.product'):
            yield {
                'name': item.css('h3::text').get(),
                'price': item.css('.price::text').get()
            }

        # Respect max_pages limit
        if self.page_count < self.max_pages:
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                self.page_count += 1
                yield response.follow(next_page, self.parse)

Run with arguments:

scrapy crawl parameterized -a category=electronics -a max_pages=5

Spider Templates

Scrapy provides several spider templates for different use cases:

Basic Spider (Default)

import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        pass

CrawlSpider for Following Links

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ProductCrawlSpider(CrawlSpider):
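    # Note: rule callbacks must not be named 'parse'; CrawlSpider uses parse() internally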
    name = 'crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_item'),
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'description': response.css('.description::text').get()
        }

Best Practices for Spider Creation

1. Proper Error Handling

def parse(self, response):
    try:
        for item in response.css('div.item'):
            data = {
                'title': item.css('h2::text').get(),
                'price': item.css('.price::text').re_first(r'[\d.]+')
            }

            # Validate required fields
            if data['title'] and data['price']:
                yield data
            else:
                self.logger.warning(f"Incomplete data: {data}")

    except Exception as e:
        self.logger.error(f"Error parsing response: {e}")
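
The try/except above only covers parsing errors; download-level failures (timeouts, DNS errors, HTTP error status codes) never reach parse at all. Scrapy's errback hook on Request handles those; a minimal sketch, assuming a placeholder URL:

import scrapy

class RobustSpider(scrapy.Spider):
    name = 'robust'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/items',   # placeholder URL
            callback=self.parse,
            errback=self.handle_error,     # called when the download itself fails
        )

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}

    def handle_error(self, failure):
        # failure is a twisted.python.failure.Failure describing what went wrong
        self.logger.error(repr(failure))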

2. Implementing Delays and Rate Limiting

Configure in settings.py:

# settings.py
DOWNLOAD_DELAY = 1  # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5 * and 1.5 * DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
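
Scrapy also ships the AutoThrottle extension, which adapts the delay to the server's response times instead of using a fixed value; the relevant settings look like this:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server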

3. User Agent Rotation

# settings.py
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
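
Note that RandomUserAgentMiddleware is not part of Scrapy itself; it comes from the third-party scrapy-user-agents package, which has to be installed separately and, by default, draws from its own pool of user-agent strings:

pip install scrapy-user-agents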

Running Your Spider

Once created, run your spider with these commands:

# Basic run
scrapy crawl quotes

# Save output to file
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml

# Run with custom settings
scrapy crawl quotes -s DOWNLOAD_DELAY=2

# Run with logging
scrapy crawl quotes -L INFO
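
One caveat: the lowercase -o flag appends to an existing output file, which can produce invalid JSON across repeated runs. On Scrapy 2.0 and newer, the uppercase -O flag overwrites the file instead:

# Overwrite the output file instead of appending (Scrapy 2.0+)
scrapy crawl quotes -O quotes.json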

Common Spider Patterns

Handling Forms and POST Requests

def parse(self, response):
    # Submit a form
    return scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'admin', 'password': 'secret'},
        callback=self.after_login
    )

def after_login(self, response):
    # Continue scraping after login
    yield response.follow('/protected-page', self.parse_protected)
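
FormRequest.from_response works when there is an HTML form to submit; for a raw POST to an API endpoint you can build the request yourself. A minimal sketch, assuming a hypothetical JSON search endpoint:

import json
import scrapy

class ApiPostSpider(scrapy.Spider):
    name = 'api_post'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com/api/search',   # hypothetical endpoint
            method='POST',
            body=json.dumps({'query': 'laptops', 'page': 1}),
            headers={'Content-Type': 'application/json'},
            callback=self.parse,
        )

    def parse(self, response):
        for result in json.loads(response.text).get('results', []):
            yield result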

Managing Sessions and Cookies

def start_requests(self):
    # Start with login
    return [scrapy.Request('https://example.com/login', 
                          callback=self.login)]

def login(self, response):
    # Perform login and maintain session
    return scrapy.FormRequest.from_response(
        response,
        formdata={'user': 'admin', 'pass': 'secret'},
        callback=self.after_login
    )
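
Scrapy keeps cookies automatically between requests through its built-in cookies middleware, so a logged-in session persists without extra code. If you need several independent sessions in the same spider, the documented cookiejar meta key keeps them separate; a minimal sketch with hypothetical credentials:

import scrapy

class MultiSessionSpider(scrapy.Spider):
    name = 'multi_session'

    def start_requests(self):
        # One isolated cookie session per account
        for i, user in enumerate(['alice', 'bob']):
            yield scrapy.Request(
                'https://example.com/login',
                meta={'cookiejar': i},
                callback=self.login,
                cb_kwargs={'user': user},
                dont_filter=True,
            )

    def login(self, response, user):
        # The cookiejar key must be re-passed on every follow-up request
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'user': user, 'pass': 'secret'},
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.after_login,
        )

    def after_login(self, response):
        yield {'session_url': response.url}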

Integration with Other Tools

While Scrapy excels at systematic, large-scale crawling, browser automation tools like Puppeteer are often a better fit for JavaScript-heavy sites. For scenarios involving dynamic content behind login walls, you might want to explore how to handle authentication in Puppeteer.

Similar concepts carry over between tools: monitoring network requests in Puppeteer, for example, can reveal the API calls and request patterns a site uses, which you can then replicate directly in Scrapy.

Troubleshooting Common Issues

Spider Not Found Error

Ensure your spider class:

  • Inherits from scrapy.Spider
  • Has a unique name attribute
  • Is in the spiders/ directory
  • Has valid Python syntax

Import Errors

Check that:

  • Scrapy is properly installed
  • Your project structure is correct
  • All required modules are imported

No Data Extracted

Verify:

  • Your CSS/XPath selectors are correct
  • The target website structure hasn't changed
  • Response status codes are successful (200)
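
When selectors return nothing, the quickest way to debug them is Scrapy's interactive shell, which fetches the page and lets you try CSS/XPath expressions against the live response:

# Open an interactive shell against the target page
scrapy shell 'https://quotes.toscrape.com'

# Then, inside the shell:
>>> response.status
>>> response.css('div.quote span.text::text').get()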

Conclusion

Creating a spider in Scrapy involves defining a class that inherits from scrapy.Spider, specifying starting URLs, and implementing parsing logic. With proper error handling, rate limiting, and following best practices, you can build robust web scrapers that efficiently extract data from websites.

Start with simple spiders and gradually add complexity as needed. Remember to respect website terms of service and implement appropriate delays to avoid overwhelming target servers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
