How do I handle CAPTCHA in Scrapy?
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a common anti-bot mechanism that websites use to prevent automated scraping. While Scrapy is excellent for web scraping, handling CAPTCHAs requires additional strategies since they're specifically designed to block automated tools. This guide covers multiple approaches to overcome CAPTCHA challenges in your Scrapy projects.
Understanding CAPTCHA Types
Before implementing solutions, it's important to understand the different types of CAPTCHAs you might encounter:
- Image-based CAPTCHAs: Text distorted in images
- reCAPTCHA v2: "I'm not a robot" checkbox with image challenges
- reCAPTCHA v3: Invisible scoring system
- hCaptcha: Similar to reCAPTCHA but privacy-focused
- Audio CAPTCHAs: Audio-based challenges
- Math CAPTCHAs: Simple arithmetic problems
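As a rough illustration of telling these types apart, the sketch below guesses the CAPTCHA type from markers in a page's HTML. The function name `guess_captcha_type` and the choice of markers are assumptions made for illustration. `g-recaptcha` and `h-captcha` are the class names the standard widgets use, `grecaptcha.execute` is the call typically seen with reCAPTCHA v3 (invisible v2 setups can use it too), and `captcha-image` is simply the class used in this guide's examples. Real sites vary, so treat the result as a hint rather than a reliable classification.

```python
def guess_captcha_type(html: str) -> str:
    """Return a best-guess CAPTCHA type for a page, or 'none'."""
    text = html.lower()
    # hCaptcha widgets use the "h-captcha" class and load scripts from hcaptcha.com
    if 'h-captcha' in text or 'hcaptcha.com' in text:
        return 'hcaptcha'
    # Score-based reCAPTCHA is triggered from JavaScript instead of a visible widget
    if 'grecaptcha.execute' in text:
        return 'recaptcha_v3'
    # The reCAPTCHA v2 checkbox renders inside a "g-recaptcha" element
    if 'g-recaptcha' in text:
        return 'recaptcha_v2'
    # Hypothetical marker for a plain image CAPTCHA (class name from this guide)
    if 'captcha-image' in text:
        return 'image'
    return 'none'


print(guess_captcha_type('<div class="g-recaptcha" data-sitekey="abc"></div>'))
```

Checking hCaptcha first is a deliberate choice: its markers are the most specific, so they are the least likely to produce a false match.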
Method 1: CAPTCHA Solving Services Integration
The most practical approach for production environments is integrating third-party CAPTCHA solving services into your Scrapy spider.
Using 2captcha Service
First, install the required dependency:
pip install 2captcha-python
Here's a Scrapy spider that integrates 2captcha for solving image CAPTCHAs:
import base64

import scrapy
from scrapy.http import FormRequest
from twocaptcha import TwoCaptcha


class CaptchaSpider(scrapy.Spider):
    name = 'captcha_spider'

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding attributes
        super().__init__(*args, **kwargs)
        # Initialize 2captcha solver with your API key
        self.solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

    def start_requests(self):
        urls = ['https://example.com/login']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_login)

    def parse_login(self, response):
        # Extract CAPTCHA image
        captcha_img = response.css('img.captcha-image::attr(src)').get()
        if captcha_img:
            # Download CAPTCHA image, keeping the login page for the form submission
            captcha_url = response.urljoin(captcha_img)
            yield scrapy.Request(
                url=captcha_url,
                callback=self.solve_captcha,
                meta={'response': response}
            )

    def solve_captcha(self, response):
        original_response = response.meta['response']
        # Encode image to base64
        image_data = base64.b64encode(response.body).decode('utf-8')
        try:
            # Solve CAPTCHA (this call blocks until the service answers)
            result = self.solver.normal(image_data)
            captcha_solution = result['code']
            # Submit form with CAPTCHA solution
            return FormRequest.from_response(
                original_response,
                formdata={
                    'username': 'your_username',
                    'password': 'your_password',
                    'captcha': captcha_solution
                },
                callback=self.after_login
            )
        except Exception as e:
            self.logger.error(f"CAPTCHA solving failed: {e}")
            return None

    def after_login(self, response):
        # Process the page after successful login
        if "dashboard" in response.url or "welcome" in response.text.lower():
            self.logger.info("Login successful!")
            # Continue scraping protected content
            yield scrapy.Request(
                url='https://example.com/protected-data',
                callback=self.parse_data
            )
        else:
            self.logger.error("Login failed, CAPTCHA might be incorrect")

    def parse_data(self, response):
        # Extract your target data here
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'description': item.css('p::text').get()
            }
Handling reCAPTCHA v2
For reCAPTCHA v2 challenges, extract the site key from the page and pass it, along with the page URL, to the solving service's reCAPTCHA method:
import scrapy
from scrapy.http import FormRequest
from twocaptcha import TwoCaptcha


class RecaptchaSpider(scrapy.Spider):
    name = 'recaptcha_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

    def start_requests(self):
        # Placeholder URL: point this at the page that shows the reCAPTCHA
        yield scrapy.Request('https://example.com/login', callback=self.parse_recaptcha)

    def parse_recaptcha(self, response):
        # Extract reCAPTCHA site key
        site_key = response.css('[data-sitekey]::attr(data-sitekey)').get()
        page_url = response.url
        if site_key:
            try:
                # Solve reCAPTCHA v2
                result = self.solver.recaptcha(
                    sitekey=site_key,
                    url=page_url
                )
                recaptcha_response = result['code']
                # Submit form with reCAPTCHA response
                return FormRequest.from_response(
                    response,
                    formdata={
                        'g-recaptcha-response': recaptcha_response,
                        'other_field': 'value'
                    },
                    callback=self.after_recaptcha
                )
            except Exception as e:
                self.logger.error(f"reCAPTCHA solving failed: {e}")
        return None

    def after_recaptcha(self, response):
        # Process response after reCAPTCHA verification
        if response.status == 200:
            self.logger.info("reCAPTCHA verification successful")
            # Continue with data extraction
Method 2: Browser Automation with Selenium
For complex CAPTCHAs or when you need human-like interaction, integrate Selenium with Scrapy. While this approach is slower, it provides more flexibility for handling dynamic content and JavaScript-heavy pages.
import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from twocaptcha import TwoCaptcha


class SeleniumCaptchaSpider(scrapy.Spider):
    name = 'selenium_captcha_spider'
    start_urls = ['https://example.com/captcha-page']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Configure Chrome options
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')  # Remove for debugging
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome(options=chrome_options)

    def parse(self, response):
        # Load the page in Selenium from a callback: yielding items directly
        # from start_requests() is not supported on older Scrapy versions
        self.driver.get(response.url)
        # Wait for CAPTCHA to load
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "captcha-container"))
        )
        # Take screenshot for manual inspection (optional)
        self.driver.save_screenshot('captcha_page.png')
        # Get page source and create Scrapy response
        selenium_response = HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8'
        )
        # Process with CAPTCHA solving logic
        yield from self.solve_selenium_captcha(selenium_response)

    def solve_selenium_captcha(self, response):
        # Extract CAPTCHA image using Selenium
        try:
            captcha_element = self.driver.find_element(By.CSS_SELECTOR, 'img.captcha-image')
            # Save CAPTCHA image for solving
            captcha_element.screenshot('captcha.png')
            # Send the saved image to a CAPTCHA solving service
            captcha_solution = self.get_captcha_solution('captcha.png')
            if captcha_solution:
                # Fill in the CAPTCHA solution
                captcha_input = self.driver.find_element(By.NAME, 'captcha')
                captcha_input.send_keys(captcha_solution)
                # Remember the URL before submitting so the wait below can detect navigation
                old_url = self.driver.current_url
                # Submit the form
                submit_button = self.driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]')
                submit_button.click()
                # Wait for page to load
                WebDriverWait(self.driver, 10).until(EC.url_changes(old_url))
                # Create new response with updated page
                new_response = HtmlResponse(
                    url=self.driver.current_url,
                    body=self.driver.page_source,
                    encoding='utf-8'
                )
                yield from self.parse_protected_content(new_response)
        except Exception as e:
            self.logger.error(f"Selenium CAPTCHA handling failed: {e}")

    def get_captcha_solution(self, image_path):
        # Solve the saved image with 2captcha; swap in another service if needed
        solver = TwoCaptcha('YOUR_API_KEY')
        try:
            result = solver.normal(image_path)
            return result['code']
        except Exception as e:
            self.logger.error(f"CAPTCHA solving failed: {e}")
            return None

    def parse_protected_content(self, response):
        # Extract data from the protected page
        for item in response.css('div.protected-item'):
            yield {
                'title': item.css('h3::text').get(),
                'content': item.css('p::text').get()
            }

    def closed(self, reason):
        # Clean up Selenium driver
        self.driver.quit()
Method 3: Manual CAPTCHA Solving with Pause Mechanism
For development or small-scale scraping, you can implement a manual solving mechanism:
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class ManualCaptchaSpider(scrapy.Spider):
    name = 'manual_captcha_spider'

    def parse_captcha_page(self, response):
        # Check if CAPTCHA is present
        if response.css('img.captcha-image'):
            self.logger.info("CAPTCHA detected. Opening interactive shell...")
            # Open Scrapy shell for manual inspection
            inspect_response(response, self)
            # Pause execution to allow manual solving
            captcha_solution = input("Please solve the CAPTCHA and enter the solution: ")
            if captcha_solution:
                return FormRequest.from_response(
                    response,
                    formdata={
                        'captcha': captcha_solution,
                        'other_fields': 'values'
                    },
                    callback=self.after_captcha_solved
                )
        # Continue normal processing if no CAPTCHA
        return self.parse_normal_content(response)

    def parse_normal_content(self, response):
        # Placeholder: extract your target data here
        yield {'url': response.url}

    def after_captcha_solved(self, response):
        if "success" in response.text.lower():
            self.logger.info("CAPTCHA solved successfully!")
            # Continue with protected content scraping
        else:
            self.logger.error("CAPTCHA solution was incorrect")
Method 4: CAPTCHA Avoidance Strategies
Sometimes the best approach is to avoid CAPTCHAs altogether:
Using Scrapy Settings for Stealth
# settings.py
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True  # Boolean: wait between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# Rotate user agents (requires: pip install scrapy-user-agents)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
# Use session cookies
COOKIES_ENABLED = True
# Respect robots.txt (sometimes helps avoid detection)
ROBOTSTXT_OBEY = True
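To make the effect of the delay settings concrete: Scrapy documents RANDOMIZE_DOWNLOAD_DELAY as an on/off switch that, when enabled, draws each wait from 0.5 to 1.5 times DOWNLOAD_DELAY. The sketch below reproduces that calculation outside of Scrapy; `randomized_delay` is a name chosen for illustration and is not part of Scrapy's API.

```python
import random


def randomized_delay(download_delay: float) -> float:
    """Pick a wait time between 0.5x and 1.5x the base delay."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


# With DOWNLOAD_DELAY = 3, every wait falls between 1.5 and 4.5 seconds
samples = [randomized_delay(3) for _ in range(5)]
print([round(s, 2) for s in samples])
```

The jitter matters because perfectly regular request intervals are an easy signal for anti-bot systems to detect.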
Custom Middleware for CAPTCHA Detection
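from scrapy.http import TextResponse


class CaptchaDetectionMiddleware:
    MAX_CAPTCHA_RETRIES = 3  # Stop rescheduling after this many attempts

    def process_response(self, request, response, spider):
        # Binary responses (images, PDFs) have no text to search
        if not isinstance(response, TextResponse):
            return response
        # Detect CAPTCHA presence
        captcha_indicators = [
            'captcha',
            'recaptcha',
            'hcaptcha',
            'verify you are human'
        ]
        response_text = response.text.lower()
        if any(indicator in response_text for indicator in captcha_indicators):
            spider.logger.warning(f"CAPTCHA detected on {response.url}")
            # You can implement different strategies here:
            # 1. Retry with different user agent
            # 2. Use proxy rotation
            # 3. Trigger CAPTCHA solving
            # 4. Skip this URL
            return self.handle_captcha_response(request, response, spider)
        return response

    def handle_captcha_response(self, request, response, spider):
        # Implement your CAPTCHA handling strategy
        retries = request.meta.get('captcha_retries', 0)
        if retries >= self.MAX_CAPTCHA_RETRIES:
            spider.logger.error(f"Giving up on {request.url} after {retries} CAPTCHA retries")
            return response
        spider.logger.info("Implementing CAPTCHA bypass strategy...")
        # For example, reschedule the request. dont_filter=True keeps the duplicate
        # filter from dropping it; spacing still comes from DOWNLOAD_DELAY, since
        # Scrapy has no per-request delay meta key
        retry_request = request.replace(dont_filter=True)
        retry_request.meta['captcha_retries'] = retries + 1
        return retry_request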
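A middleware like this only runs once it is registered in the project settings. The fragment below is a minimal sketch of that step; the module path `myproject.middlewares` is a hypothetical location, and 543 is just a commonly used slot, not a required value.

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path: use wherever CaptchaDetectionMiddleware lives in your project
    'myproject.middlewares.CaptchaDetectionMiddleware': 543,
}
```

Responses pass through downloader middlewares from the highest number to the lowest, so a value below 590 (the default slot of Scrapy's HttpCompressionMiddleware) means the body has already been decompressed by the time the CAPTCHA check reads it.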
Best Practices for CAPTCHA Handling
1. Cost-Effective CAPTCHA Solving
import scrapy
from twocaptcha import TwoCaptcha


# Implement cost monitoring for CAPTCHA solving services
class CostAwareCaptchaSpider(scrapy.Spider):
    name = 'cost_aware_captcha_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')
        self.captcha_solve_count = 0
        self.max_captcha_solves = 100  # Set budget limit

    def solve_captcha_with_budget(self, image_data):
        if self.captcha_solve_count >= self.max_captcha_solves:
            self.logger.warning("CAPTCHA solving budget exceeded")
            return None
        self.captcha_solve_count += 1
        # Proceed with solving
        return self.solver.normal(image_data)
2. Retry Logic for Failed CAPTCHA Attempts
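import time

import scrapy


class RetryableCaptchaSpider(scrapy.Spider):
    name = 'retryable_captcha_spider'

    def solve_captcha_with_retry(self, response, max_retries=3):
        # Assumes the spider also defines solve_captcha(), as in the earlier examples
        for attempt in range(max_retries):
            try:
                result = self.solve_captcha(response)
                if result:
                    return result
            except Exception as e:
                self.logger.warning(f"CAPTCHA attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                # Exponential backoff; note that time.sleep() pauses the whole crawl
                time.sleep(2 ** attempt)
        self.logger.error("All CAPTCHA solving attempts failed")
        return None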
3. Proxy Rotation Integration
For websites that show CAPTCHAs based on IP reputation, combine CAPTCHA solving with proxy rotation:
pip install scrapy_proxies
# settings.py
# PROXY_LIST and PROXY_MODE are settings of the scrapy-proxies package
RETRY_TIMES = 10
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = 'proxy_list.txt'
PROXY_MODE = 0  # Rotate proxies for each request
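If you would rather not depend on a third-party package, proxy rotation can also be sketched as a small custom downloader middleware. Scrapy routes a request through whatever URL it finds in `request.meta['proxy']`, so the middleware only has to fill in that key. The class name `RoundRobinProxyMiddleware` and the `PROXY_URLS` setting are hypothetical names chosen for this example, and the sketch leaves out ban detection and proxy health checks.

```python
import itertools


class RoundRobinProxyMiddleware:
    """Assign proxies from a fixed list to outgoing requests in turn."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URLS is a hypothetical setting holding a list of proxy URLs
        return cls(crawler.settings.getlist('PROXY_URLS'))

    def process_request(self, request, spider):
        # Scrapy's built-in proxy support reads the proxy URL from this meta key
        request.meta['proxy'] = next(self._proxies)
        return None  # Let the request continue through the middleware chain
```

Register the class in DOWNLOADER_MIDDLEWARES with a priority below 750 so it runs before Scrapy's own HttpProxyMiddleware processes the request.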
Monitoring and Logging
Implement comprehensive logging to track CAPTCHA solving performance:
import logging


class CaptchaLogger:
    def __init__(self):
        self.captcha_stats = {
            'encountered': 0,
            'solved': 0,
            'failed': 0,
            'cost': 0.0
        }

    def log_captcha_encountered(self, url):
        self.captcha_stats['encountered'] += 1
        logging.info(f"CAPTCHA encountered on {url}")

    def log_captcha_solved(self, cost=0.001):
        self.captcha_stats['solved'] += 1
        self.captcha_stats['cost'] += cost
        logging.info(f"CAPTCHA solved. Total cost: ${self.captcha_stats['cost']:.3f}")

    def log_captcha_failed(self):
        self.captcha_stats['failed'] += 1
        logging.error("CAPTCHA solving failed")

    def print_summary(self):
        success_rate = (self.captcha_stats['solved'] /
                        max(self.captcha_stats['encountered'], 1)) * 100
        print(f"""
        CAPTCHA Solving Summary:
        - Encountered: {self.captcha_stats['encountered']}
        - Solved: {self.captcha_stats['solved']}
        - Failed: {self.captcha_stats['failed']}
        - Success Rate: {success_rate:.1f}%
        - Total Cost: ${self.captcha_stats['cost']:.3f}
        """)
Conclusion
Handling CAPTCHAs in Scrapy requires a multi-faceted approach depending on your specific needs, budget, and scale. For production environments, third-party CAPTCHA solving services offer the most reliable solution, while browser automation provides flexibility for complex scenarios. If the target pages also depend on JavaScript execution, a browser automation framework can handle both the dynamic content and the CAPTCHA interaction.
Remember to always respect website terms of service and implement responsible scraping practices, including appropriate delays and request limits. Consider whether your scraping activities comply with legal requirements and the website's robots.txt file.
The key to successful CAPTCHA handling is combining multiple strategies: proper request spacing, user agent rotation, proxy usage, and when necessary, automated CAPTCHA solving services. Monitor your success rates and costs to optimize your approach over time.