How do I handle CAPTCHA in Scrapy?

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a common anti-bot mechanism that websites use to prevent automated scraping. While Scrapy is excellent for web scraping, handling CAPTCHAs requires additional strategies since they're specifically designed to block automated tools. This guide covers multiple approaches to overcome CAPTCHA challenges in your Scrapy projects.

Understanding CAPTCHA Types

Before implementing solutions, it's important to understand the different types of CAPTCHAs you might encounter:

  • Image-based CAPTCHAs: Text distorted in images
  • reCAPTCHA v2: "I'm not a robot" checkbox with image challenges
  • reCAPTCHA v3: Invisible scoring system
  • hCaptcha: Similar to reCAPTCHA but privacy-focused
  • Audio CAPTCHAs: Audio-based challenges
  • Math CAPTCHAs: Simple arithmetic problems
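The list above can be turned into a rough programmatic check. The sketch below guesses the CAPTCHA type from raw page HTML using common markers (widget class names and script hosts). These markers are conventions, not guarantees: sites can rename classes, and widgets injected by JavaScript may only appear in a rendered DOM.

```python
# A minimal sketch: classify the CAPTCHA type on a page by string markers.
# The marker strings are common conventions, not guarantees.

def detect_captcha_type(html: str) -> str:
    """Return a rough CAPTCHA-type label for an HTML document."""
    text = html.lower()
    if 'www.google.com/recaptcha' in text or 'g-recaptcha' in text:
        # reCAPTCHA v3 is typically loaded via api.js?render=<sitekey>
        if 'render=' in text and 'recaptcha/api.js' in text:
            return 'recaptcha_v3'
        return 'recaptcha_v2'
    if 'hcaptcha.com' in text or 'h-captcha' in text:
        return 'hcaptcha'
    if 'captcha' in text:
        return 'image_or_other_captcha'
    return 'none'
```

In a spider callback you would pass `response.text` to this helper and branch to the matching solving method.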

Method 1: CAPTCHA Solving Services Integration

The most practical approach for production environments is integrating third-party CAPTCHA solving services into your Scrapy spider.

Using 2captcha Service

First, install the required dependency:

pip install 2captcha-python

Here's a Scrapy spider that integrates 2captcha for solving image CAPTCHAs:

import scrapy
import base64
from twocaptcha import TwoCaptcha
from scrapy.http import FormRequest
import time

class CaptchaSpider(scrapy.Spider):
    name = 'captcha_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize 2captcha solver with your API key
        self.solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

    def start_requests(self):
        urls = ['https://example.com/login']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_login)

    def parse_login(self, response):
        # Extract CAPTCHA image
        captcha_img = response.css('img.captcha-image::attr(src)').get()

        if captcha_img:
            # Download CAPTCHA image
            captcha_url = response.urljoin(captcha_img)
            yield scrapy.Request(
                url=captcha_url,
                callback=self.solve_captcha,
                meta={'response': response}
            )

    def solve_captcha(self, response):
        original_response = response.meta['response']

        # Encode image to base64
        image_data = base64.b64encode(response.body).decode('utf-8')

        try:
            # Solve CAPTCHA
            result = self.solver.normal(image_data)
            captcha_solution = result['code']

            # Submit form with CAPTCHA solution
            return FormRequest.from_response(
                original_response,
                formdata={
                    'username': 'your_username',
                    'password': 'your_password',
                    'captcha': captcha_solution
                },
                callback=self.after_login
            )

        except Exception as e:
            self.logger.error(f"CAPTCHA solving failed: {e}")
            return None

    def after_login(self, response):
        # Process the page after successful login
        if "dashboard" in response.url or "welcome" in response.text.lower():
            self.logger.info("Login successful!")
            # Continue scraping protected content
            yield scrapy.Request(
                url='https://example.com/protected-data',
                callback=self.parse_data
            )
        else:
            self.logger.error("Login failed, CAPTCHA might be incorrect")

    def parse_data(self, response):
        # Extract your target data here
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'description': item.css('p::text').get()
            }

Handling reCAPTCHA v2

For reCAPTCHA v2 challenges, you'll need to extract the site key and use a specialized solving method:

import scrapy
from scrapy.http import FormRequest
from twocaptcha import TwoCaptcha

class RecaptchaSpider(scrapy.Spider):
    name = 'recaptcha_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

    def parse_recaptcha(self, response):
        # Extract reCAPTCHA site key
        site_key = response.css('[data-sitekey]::attr(data-sitekey)').get()
        page_url = response.url

        if site_key:
            try:
                # Solve reCAPTCHA v2
                result = self.solver.recaptcha(
                    sitekey=site_key,
                    url=page_url
                )

                recaptcha_response = result['code']

                # Submit form with reCAPTCHA response
                return FormRequest.from_response(
                    response,
                    formdata={
                        'g-recaptcha-response': recaptcha_response,
                        'other_field': 'value'
                    },
                    callback=self.after_recaptcha
                )

            except Exception as e:
                self.logger.error(f"reCAPTCHA solving failed: {e}")

        return None

    def after_recaptcha(self, response):
        # Process response after reCAPTCHA verification
        if response.status == 200:
            self.logger.info("reCAPTCHA verification successful")
            # Continue with data extraction

Method 2: Browser Automation with Selenium

For complex CAPTCHAs or when you need human-like interaction, integrate Selenium with Scrapy. While this approach is slower, it provides more flexibility for handling dynamic content and JavaScript-heavy pages.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from twocaptcha import TwoCaptcha

class SeleniumCaptchaSpider(scrapy.Spider):
    name = 'selenium_captcha_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Configure Chrome options
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')  # Remove for debugging
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')

        self.driver = webdriver.Chrome(options=chrome_options)

    def start_requests(self):
        # start_requests must yield Request objects, not items,
        # so the Selenium work happens in the callback
        yield scrapy.Request(
            url='https://example.com/captcha-page',
            callback=self.parse_with_selenium,
            dont_filter=True
        )

    def parse_with_selenium(self, response):
        # Use Selenium to load the page and execute its JavaScript
        self.driver.get(response.url)

        # Wait for CAPTCHA to load
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "captcha-container"))
        )

        # Take screenshot for manual inspection (optional)
        self.driver.save_screenshot('captcha_page.png')

        # Get page source and create a Scrapy response
        page_source = self.driver.page_source
        selenium_response = HtmlResponse(
            url=self.driver.current_url,
            body=page_source,
            encoding='utf-8'
        )

        # Process with CAPTCHA solving logic
        yield from self.solve_selenium_captcha(selenium_response)

    def solve_selenium_captcha(self, response):
        # Extract CAPTCHA image using Selenium
        try:
            captcha_element = self.driver.find_element(By.CSS_SELECTOR, 'img.captcha-image')
            captcha_src = captcha_element.get_attribute('src')

            # Save CAPTCHA image for solving
            captcha_element.screenshot('captcha.png')

            # Here you would integrate with a CAPTCHA solving service
            # For demonstration, we'll simulate manual input
            captcha_solution = self.get_captcha_solution('captcha.png')

            if captcha_solution:
                # Fill in the CAPTCHA solution
                captcha_input = self.driver.find_element(By.NAME, 'captcha')
                captcha_input.send_keys(captcha_solution)

                # Submit the form
                submit_button = self.driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]')
                submit_button.click()

                # Wait for page to load
                WebDriverWait(self.driver, 10).until(
                    EC.url_changes(self.driver.current_url)
                )

                # Create new response with updated page
                new_page_source = self.driver.page_source
                new_response = HtmlResponse(
                    url=self.driver.current_url,
                    body=new_page_source,
                    encoding='utf-8'
                )

                yield from self.parse_protected_content(new_response)

        except Exception as e:
            self.logger.error(f"Selenium CAPTCHA handling failed: {e}")

    def get_captcha_solution(self, image_path):
        # Integrate with a CAPTCHA solving service here;
        # 2captcha's normal() accepts a path to the saved image
        solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')
        try:
            result = solver.normal(image_path)
            return result['code']
        except Exception as e:
            self.logger.error(f"CAPTCHA solving failed: {e}")
            return None

    def parse_protected_content(self, response):
        # Extract data from the protected page
        for item in response.css('div.protected-item'):
            yield {
                'title': item.css('h3::text').get(),
                'content': item.css('p::text').get()
            }

    def closed(self, reason):
        # Clean up Selenium driver
        self.driver.quit()

Method 3: Manual CAPTCHA Solving with Pause Mechanism

For development or small-scale scraping, you can implement a manual solving mechanism:

import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class ManualCaptchaSpider(scrapy.Spider):
    name = 'manual_captcha_spider'

    def parse_captcha_page(self, response):
        # Check if CAPTCHA is present
        if response.css('img.captcha-image'):
            self.logger.info("CAPTCHA detected. Opening interactive shell...")

            # Open Scrapy shell for manual inspection
            inspect_response(response, self)

            # Pause execution to allow manual solving
            captcha_solution = input("Please solve the CAPTCHA and enter the solution: ")

            if captcha_solution:
                return FormRequest.from_response(
                    response,
                    formdata={
                        'captcha': captcha_solution,
                        'other_fields': 'values'
                    },
                    callback=self.after_captcha_solved
                )

        # Continue normal processing if no CAPTCHA
        return self.parse_normal_content(response)

    def after_captcha_solved(self, response):
        if "success" in response.text.lower():
            self.logger.info("CAPTCHA solved successfully!")
            # Continue with protected content scraping
        else:
            self.logger.error("CAPTCHA solution was incorrect")

Method 4: CAPTCHA Avoidance Strategies

Sometimes the best approach is to avoid CAPTCHAs altogether:

Using Scrapy Settings for Stealth

# settings.py
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True  # waits 0.5x to 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Rotate user agents
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

# Use session cookies
COOKIES_ENABLED = True

# Respect robots.txt (sometimes helps avoid detection)
ROBOTSTXT_OBEY = True
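Scrapy's built-in AutoThrottle extension complements the fixed delay above by adapting request pacing to the server's observed latency. A sketch of the relevant settings follows; the numbers are starting points to tune, not recommendations.

```python
# settings.py (continued) -- adaptive throttling via Scrapy's AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial download delay, seconds
AUTOTHROTTLE_MAX_DELAY = 60             # ceiling for high-latency responses
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per domain
AUTOTHROTTLE_DEBUG = False              # set True to log each throttle decision
```

Keeping target concurrency at 1.0 mirrors the conservative `CONCURRENT_REQUESTS_PER_DOMAIN = 1` setting above while letting delays grow automatically if the site slows down.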

Custom Middleware for CAPTCHA Detection

class CaptchaDetectionMiddleware:
    def process_response(self, request, response, spider):
        # Detect CAPTCHA presence
        captcha_indicators = [
            'captcha',
            'recaptcha',
            'hcaptcha',
            'verify you are human'
        ]

        response_text = response.text.lower()

        if any(indicator in response_text for indicator in captcha_indicators):
            spider.logger.warning(f"CAPTCHA detected on {response.url}")

            # You can implement different strategies here:
            # 1. Retry with different user agent
            # 2. Use proxy rotation
            # 3. Trigger CAPTCHA solving
            # 4. Skip this URL

            return self.handle_captcha_response(request, response, spider)

        return response

    def handle_captcha_response(self, request, response, spider):
        # Implement your CAPTCHA handling strategy; as a simple example,
        # re-schedule the request so it bypasses the duplicate filter
        spider.logger.info("Retrying request after CAPTCHA detection...")
        return request.replace(dont_filter=True)

Best Practices for CAPTCHA Handling

1. Cost-Effective CAPTCHA Solving

# Implement cost monitoring for CAPTCHA solving services
class CostAwareCaptchaSpider(scrapy.Spider):
    name = 'cost_aware_captcha_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')
        self.captcha_solve_count = 0
        self.max_captcha_solves = 100  # Set budget limit

    def solve_captcha_with_budget(self, image_data):
        if self.captcha_solve_count >= self.max_captcha_solves:
            self.logger.warning("CAPTCHA solving budget exceeded")
            return None

        self.captcha_solve_count += 1
        # Proceed with solving
        return self.solver.normal(image_data)
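To pick a sensible budget limit, a back-of-the-envelope projection helps. The helper below is a sketch; the $2.99-per-1000-solves rate is purely illustrative (providers price normal images, reCAPTCHA, and hCaptcha differently), so substitute your provider's current pricing.

```python
# Sketch: project spend for a solving service. The default rate is an
# illustrative placeholder, not a real quote -- check your provider.

def estimate_captcha_cost(solves: int, price_per_1000: float = 2.99) -> float:
    """Projected spend in dollars for a given number of solves."""
    return round(solves * price_per_1000 / 1000, 2)
```

For example, a budget cap of 100 solves at the illustrative rate costs well under a dollar, while a million solves a month is a four-figure line item worth monitoring.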

2. Retry Logic for Failed CAPTCHA Attempts

import time

class RetryableCaptchaSpider(scrapy.Spider):
    name = 'retryable_captcha_spider'

    def solve_captcha_with_retry(self, response, max_retries=3):
        for attempt in range(max_retries):
            try:
                result = self.solve_captcha(response)
                if result:
                    return result
            except Exception as e:
                self.logger.warning(f"CAPTCHA attempt {attempt + 1} failed: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff

        self.logger.error("All CAPTCHA solving attempts failed")
        return None

3. Proxy Rotation Integration

For websites that show CAPTCHAs based on IP reputation, combine CAPTCHA solving with proxy rotation:

pip install scrapy-rotating-proxies

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# File with one proxy URL per line
ROTATING_PROXY_LIST_PATH = 'proxy_list.txt'

Monitoring and Logging

Implement comprehensive logging to track CAPTCHA solving performance:

import logging

class CaptchaLogger:
    def __init__(self):
        self.captcha_stats = {
            'encountered': 0,
            'solved': 0,
            'failed': 0,
            'cost': 0.0
        }

    def log_captcha_encountered(self, url):
        self.captcha_stats['encountered'] += 1
        logging.info(f"CAPTCHA encountered on {url}")

    def log_captcha_solved(self, cost=0.001):
        self.captcha_stats['solved'] += 1
        self.captcha_stats['cost'] += cost
        logging.info(f"CAPTCHA solved. Total cost: ${self.captcha_stats['cost']:.3f}")

    def log_captcha_failed(self):
        self.captcha_stats['failed'] += 1
        logging.error("CAPTCHA solving failed")

    def print_summary(self):
        success_rate = (self.captcha_stats['solved'] / 
                       max(self.captcha_stats['encountered'], 1)) * 100

        print(f"""
        CAPTCHA Solving Summary:
        - Encountered: {self.captcha_stats['encountered']}
        - Solved: {self.captcha_stats['solved']}
        - Failed: {self.captcha_stats['failed']}
        - Success Rate: {success_rate:.1f}%
        - Total Cost: ${self.captcha_stats['cost']:.3f}
        """)

Conclusion

Handling CAPTCHAs in Scrapy requires a multi-faceted approach depending on your specific needs, budget, and scale. For production environments, third-party CAPTCHA solving services offer the most reliable solution, while browser automation provides the flexibility needed for complex, JavaScript-heavy scenarios and dynamic content.

Remember to always respect website terms of service and implement responsible scraping practices, including appropriate delays and request limits. Consider whether your scraping activities comply with legal requirements and the website's robots.txt file.

The key to successful CAPTCHA handling is combining multiple strategies: proper request spacing, user agent rotation, proxy usage, and when necessary, automated CAPTCHA solving services. Monitor your success rates and costs to optimize your approach over time.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
