What are Scrapy selectors and how do I use them?

Scrapy selectors are powerful tools that allow you to extract data from HTML and XML documents using CSS selectors or XPath expressions. They serve as the primary mechanism for navigating and extracting content from web pages in Scrapy, providing a unified interface that works with both CSS and XPath syntax.

Understanding Scrapy Selectors

Scrapy selectors are wrapper objects around the parsel library that provide methods for extracting data from HTML/XML documents. They offer a consistent API regardless of whether you're using CSS selectors or XPath expressions, making them flexible and developer-friendly.
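
Because they are thin wrappers around parsel, you can also construct a selector directly from an HTML or XML string, which is handy for experimenting or unit-testing extraction logic outside a spider. A minimal sketch (the HTML snippet is invented for illustration):

from scrapy import Selector

html = '<html><body><h1 class="title">Hello</h1><a href="/about">About</a></body></html>'
sel = Selector(text=html)

# The same selector object accepts both CSS and XPath expressions
print(sel.css('h1.title::text').get())                  # 'Hello'
print(sel.xpath('//h1[@class="title"]/text()').get())   # 'Hello'
print(sel.css('a::attr(href)').get())                   # '/about'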

Core Selector Methods

Scrapy selectors provide several key methods and attributes for data extraction (a short demonstration follows the list):

  • css() - Select elements using CSS selectors
  • xpath() - Select elements using XPath expressions
  • get() - Extract the first matching element as a string
  • getall() - Extract all matching elements as a list of strings
  • attrib - Access element attributes
  • re() / re_first() - Apply regular expressions to extracted text, returning all matches or only the first one
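
The snippet below ties these calls together on a small inline HTML fragment (the markup and values are made up for illustration):

from scrapy import Selector

sel = Selector(text=(
    '<ul>'
    '<li class="item" data-id="1">Price: $9.99</li>'
    '<li class="item" data-id="2">Price: $14.50</li>'
    '</ul>'
))

items = sel.css('li.item')                            # css() returns a SelectorList
print(items.css('::text').get())                      # 'Price: $9.99' (first match)
print(items.css('::text').getall())                   # ['Price: $9.99', 'Price: $14.50']
print(items[0].attrib['data-id'])                     # '1'
print(items.css('::text').re(r'\d+\.\d+'))            # ['9.99', '14.50']
print(sel.xpath('//li[@data-id="2"]/text()').get())   # 'Price: $14.50'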

Basic Selector Usage

CSS Selectors

CSS selectors provide an intuitive way to select elements by tag name, class, ID, and attributes, using the same syntax you already know from stylesheets:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Select the text of all paragraph elements
        paragraphs = response.css('p::text').getall()

        # Select element by class
        title = response.css('.main-title::text').get()

        # Select element by ID
        header = response.css('#header h1::text').get()

        # Select with attribute selectors
        links = response.css('a[href*="example"]::attr(href)').getall()

        yield {
            'paragraphs': paragraphs,
            'title': title,
            'header': header,
            'links': links
        }

XPath Selectors

XPath provides more powerful selection capabilities, especially for complex document structures:

def parse(self, response):
    # Select text content
    title = response.xpath('//h1/text()').get()

    # Select with conditions
    price = response.xpath('//span[@class="price"]/text()').get()

    # Select following siblings
    description = response.xpath('//h2[text()="Description"]/following-sibling::p/text()').get()

    # Select with contains() function
    category = response.xpath('//div[contains(@class, "category")]/text()').get()

    # Complex selection with multiple conditions
    product_info = response.xpath(
        '//div[@class="product" and @data-available="true"]//span[@class="name"]/text()'
    ).getall()

    yield {
        'title': title,
        'price': price,
        'description': description,
        'category': category,
        'products': product_info
    }

Advanced Selector Techniques

Combining Selectors

You can chain selectors to narrow down your selection:

def parse(self, response):
    # Chain CSS selectors
    product_names = response.css('.product-list').css('.product-item').css('.name::text').getall()

    # Chain XPath selectors
    prices = response.xpath('//div[@class="products"]').xpath('.//span[@class="price"]/text()').getall()

    # Mix CSS and XPath
    descriptions = response.css('.product-item').xpath('.//p[@class="description"]/text()').getall()

Using Regular Expressions

Apply regular expressions to extracted text for further processing:

def parse(self, response):
    # Extract and clean phone numbers
    raw_phone = response.css('.contact-info::text').get()
    clean_phone = response.css('.contact-info::text').re_first(r'\d{3}-\d{3}-\d{4}')

    # Extract all email addresses from any text node under <body>
    emails = response.xpath('//body//text()').re(r'[\w\.-]+@[\w\.-]+\.\w+')

    # Extract numbers from price strings
    price_text = response.css('.price::text').get()  # "$19.99"
    price_number = response.css('.price::text').re_first(r'\d+\.\d+')  # "19.99"

Attribute Extraction

Extract element attributes using the attrib property or the ::attr() pseudo-element:

def parse(self, response):
    # Extract href attributes
    links = response.css('a::attr(href)').getall()

    # Extract image sources
    image_urls = response.css('img::attr(src)').getall()

    # Extract data attributes
    product_ids = response.css('.product::attr(data-id)').getall()

    # Using attrib property
    for link in response.css('a'):
        url = link.attrib['href']
        text = link.css('::text').get()
        yield {'url': url, 'text': text}

Working with Tables and Lists

Extracting Table Data

def parse(self, response):
    table_rows = response.css('table tr')

    for row in table_rows[1:]:  # Skip header row
        cells = row.css('td::text').getall()
        if len(cells) >= 3:
            yield {
                'name': cells[0],
                'price': cells[1],
                'availability': cells[2]
            }

Processing Lists

def parse(self, response):
    # Extract list items
    list_items = response.css('ul.features li::text').getall()

    # Extract nested list data
    categories = []
    for category in response.css('.category-list .category'):
        category_name = category.css('.category-name::text').get()
        subcategories = category.css('.subcategory::text').getall()
        categories.append({
            'name': category_name,
            'subcategories': subcategories
        })

    yield {'categories': categories}

Error Handling and Best Practices

Safe Data Extraction

Always handle cases where selectors might not find matching elements:

def parse(self, response):
    # Use get() with default values
    title = response.css('h1::text').get(default='No title')

    # Check if selector found elements
    price_selector = response.css('.price::text')
    if price_selector:
        price = price_selector.get()
    else:
        price = None

    # Use getall() safely
    tags = response.css('.tag::text').getall() or []

Performance Optimization

For better performance, especially when processing large documents:

def parse(self, response):
    # Cache frequently used selectors
    product_container = response.css('.product-container')

    for product in product_container.css('.product'):
        # Work within the cached selector context
        name = product.css('.name::text').get()
        price = product.css('.price::text').get()

        yield {'name': name, 'price': price}

Selector Testing and Debugging

Using Scrapy Shell

Test your selectors interactively using Scrapy shell:

scrapy shell "https://example.com"
# In the shell
>>> response.css('h1::text').get()
'Example Title'

>>> response.xpath('//div[@class="content"]//p/text()').getall()
['Paragraph 1', 'Paragraph 2', 'Paragraph 3']

>>> len(response.css('.product'))
25

Debugging Complex Selectors

def parse(self, response):
    # Debug by checking intermediate results
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    for i, product in enumerate(products):
        name = product.css('.name::text').get()
        if not name:
            self.logger.warning(f'No name found for product {i}')
            # Inspect the HTML structure
            self.logger.debug(f'Product HTML: {product.get()}')

Integration with Other Tools

While Scrapy selectors are powerful for parsing static HTML, JavaScript-heavy sites may require browser automation. In such cases, tools like Puppeteer or Playwright that render dynamic content can complement your Scrapy workflow.

When dealing with complex single-page applications, you might need to handle JavaScript-rendered content before applying Scrapy selectors to the resulting HTML.
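
As a rough sketch of that workflow, the example below uses Playwright (one possible headless-browser option; the URL and selector are placeholders) to render a page and then hands the resulting HTML to a Scrapy Selector for the usual extraction:

from playwright.sync_api import sync_playwright
from scrapy import Selector

def fetch_rendered_html(url):
    # Render the page in headless Chromium so JavaScript-generated content is present
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html

html = fetch_rendered_html('https://example.com')
sel = Selector(text=html)

# Apply regular CSS/XPath selectors to the fully rendered markup
titles = sel.css('h1::text').getall()

Plugins such as scrapy-playwright can integrate this rendering step into Scrapy's download cycle, so responses arrive already rendered and response.css()/response.xpath() work as usual.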

Common Patterns and Examples

E-commerce Product Scraping

def parse_product(self, response):
    yield {
        'name': response.css('h1.product-title::text').get(),
        'price': response.css('.price .amount::text').re_first(r'\d+\.\d+'),
        'rating': response.css('.rating::attr(data-rating)').get(),
        'availability': response.css('.stock-status::text').get(),
        'images': response.css('.product-images img::attr(src)').getall(),
        'features': response.css('.features li::text').getall(),
        'description': ' '.join(response.css('.description p::text').getall())
    }

News Article Extraction

def parse_article(self, response):
    # Extract article metadata
    published_date = response.css('time::attr(datetime)').get()
    author = response.css('.author-name::text').get()

    # Extract article content
    title = response.css('h1.article-title::text').get()
    paragraphs = response.css('.article-content p::text').getall()
    content = '\n'.join(paragraphs)

    # Extract related articles
    related_links = response.css('.related-articles a::attr(href)').getall()

    yield {
        'title': title,
        'author': author,
        'published_date': published_date,
        'content': content,
        'related_articles': related_links
    }

Best Practices for Production Use

Selector Robustness

Write selectors that are resilient to minor HTML changes:

def parse(self, response):
    # Multiple fallback selectors
    title = (response.css('h1.title::text').get() or 
             response.css('h1::text').get() or 
             response.css('.main-title::text').get() or
             'No title found')

    # Use multiple attributes for finding elements
    price = (response.css('[data-price]::attr(data-price)').get() or
             response.css('.price::text').re_first(r'\$?(\d+\.\d+)') or
             response.css('.cost::text').re_first(r'\$?(\d+\.\d+)'))

Data Validation

Always validate extracted data before yielding:

def parse(self, response):
    for product in response.css('.product'):
        name = product.css('.name::text').get()
        price = product.css('.price::text').re_first(r'\d+\.\d+')

        # Only yield if we have required data
        if name and price:
            try:
                price_float = float(price)
                yield {
                    'name': name.strip(),
                    'price': price_float,
                    'url': response.url
                }
            except ValueError:
                self.logger.warning(f'Invalid price format: {price}')

Conclusion

Scrapy selectors provide a robust and flexible way to extract data from web pages. By mastering both CSS selectors and XPath expressions, you can handle virtually any data extraction scenario. Remember to always test your selectors thoroughly, handle edge cases gracefully, and optimize for performance when processing large amounts of data.

The key to effective web scraping with Scrapy selectors is understanding the HTML structure of your target pages and choosing the most appropriate selection method for each piece of data you need to extract. With practice, you'll develop an intuition for which selector type works best in different situations.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
