Getting banned while web scraping is a common challenge that can derail data collection projects. Scrapy provides multiple built-in and configurable strategies to minimize detection and avoid IP blocks. Here's a comprehensive guide to implementing anti-ban techniques in your Scrapy projects.
1. Respect robots.txt
The robots.txt file contains a site's crawling rules and should be followed as part of ethical scraping. Projects generated with scrapy startproject obey it by default.
# settings.py
ROBOTSTXT_OBEY = True  # enabled by default in projects created with startproject
To bypass robots.txt (use with caution):
ROBOTSTXT_OBEY = False
2. Configure Request Delays
Rapid-fire requests are the fastest way to trigger anti-bot measures. Implement strategic delays between requests:
# settings.py
DOWNLOAD_DELAY = 3  # 3 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5 * and 1.5 * DOWNLOAD_DELAY (on by default)
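These project-wide values can also be scoped to a single spider through its custom_settings attribute, which is handy when only one target needs the slower pace. A minimal sketch, where the spider name and URL are placeholders:
import scrapy

class SlowSpider(scrapy.Spider):
    name = 'slow_spider'                    # placeholder name
    start_urls = ['https://example.com/']   # placeholder URL
    # Per-spider overrides take precedence over the project-wide settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
    }

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)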
3. Enable AutoThrottle Extension
AutoThrottle automatically adjusts request speed based on server response times and load:
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True # Enable to see throttling stats
4. Rotate User Agents
Vary your User-Agent header to mimic different browsers and devices:
Static User-Agent
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
Dynamic User-Agent Rotation
# Install: pip install fake-useragent
from fake_useragent import UserAgent
class RotateUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random
        return None
Enable in settings:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}
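If you would rather avoid the extra dependency, the same rotation idea works with a hand-maintained list of User-Agent strings. A minimal sketch; the strings below are examples only and should be kept up to date:
# Alternative without fake-useragent
import random

class StaticUserAgentMiddleware:
    # Example strings only; maintain a realistic, current list in practice
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None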
5. Use Proxy Rotation
Rotate IP addresses to distribute requests across different sources:
Single Proxy
# In spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={'proxy': 'http://proxy-server:port'}
        )
Proxy Pool Middleware
import random
class ProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://proxy1:port',
            'http://proxy2:port',
            'http://proxy3:port'
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
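As with the User-Agent middleware, the proxy middleware only runs once it is registered in settings. The module path below assumes it lives in myproject/middlewares.py, and the priority value 410 is an arbitrary choice:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
}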
6. Handle Cookies and Sessions
Maintain session state to appear more like a regular user:
# settings.py
COOKIES_ENABLED = True
Custom cookie handling:
# In spider
def parse(self, response):
    # With COOKIES_ENABLED = True, Scrapy's CookiesMiddleware stores Set-Cookie
    # values and re-sends them automatically on follow-up requests.
    # Extra cookies can be seeded explicitly as a dict (names are illustrative):
    yield scrapy.Request(
        url=next_page,
        cookies={'currency': 'USD', 'language': 'en'},
        callback=self.parse_page
    )
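Scrapy can also keep several independent sessions in one crawl through the cookiejar meta key, which is useful when rotating identities. A short sketch, reusing the same placeholder names as the example above:
# In spider; each numbered cookiejar holds its own independent session cookies
def start_requests(self):
    for i, url in enumerate(self.start_urls):
        yield scrapy.Request(url, meta={'cookiejar': i})

def parse(self, response):
    # Carry the same jar forward so the session persists across requests
    yield scrapy.Request(
        url=next_page,
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_page
    )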
7. Configure Concurrent Requests
Limit concurrent requests to avoid overwhelming servers:
# settings.py
CONCURRENT_REQUESTS = 8             # Scrapy default: 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # Scrapy default: 8
8. Add Request Headers
Include additional headers to mimic browser behavior:
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}
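DEFAULT_REQUEST_HEADERS applies project-wide; individual requests can override or extend it, for example to send a plausible Referer. A small sketch, where the CSS selector and parse_product callback are placeholders:
# In spider
def parse(self, response):
    for href in response.css('a.product::attr(href)').getall():  # placeholder selector
        yield scrapy.Request(
            url=response.urljoin(href),
            headers={'Referer': response.url},  # send the current page as Referer
            callback=self.parse_product,
        )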
9. Implement Retry Logic
Handle failed requests gracefully:
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
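The settings above only cover responses that arrive with an error status. Ban or CAPTCHA pages are often served with HTTP 200, so they never reach the retry middleware; in recent Scrapy versions (2.5+) such responses can be re-queued from a callback with get_retry_request. A sketch, assuming a hypothetical 'captcha' marker in the page body:
from scrapy.downloadermiddlewares.retry import get_retry_request

# In spider
def parse(self, response):
    # Hypothetical ban check: the marker text depends on the target site
    if b'captcha' in response.body.lower():
        retry_request = get_retry_request(
            response.request, spider=self, reason='captcha page'
        )
        if retry_request:  # None once the retry limit is exhausted
            yield retry_request
        return
    # ...normal parsing continues here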
10. Monitor and Adapt
Use Scrapy's built-in stats to monitor your scraping performance:
# In spider
def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    self.logger.info("Requests: %s", stats.get('downloader/request_count', 0))
    self.logger.info("Responses: %s", stats.get('downloader/response_count', 0))
    self.logger.info("Items: %s", stats.get('item_scraped_count', 0))
Complete Example Configuration
# settings.py
BOT_NAME = 'mybot'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure delays
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Configure concurrency
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Enable retries
RETRY_ENABLED = True
RETRY_TIMES = 3
# User agent
USER_AGENT = 'Mozilla/5.0 (compatible; MyBot/1.0)'
# Headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
# Enable cookies
COOKIES_ENABLED = True
Best Practices
- Start conservatively: begin with longer delays and fewer concurrent requests
- Monitor response patterns: watch for CAPTCHAs, 429 errors, or unusual response times
- Respect the website: follow the terms of service and don't overload servers
- Use commercial solutions: for production systems, consider proxy services or web scraping APIs
- Test thoroughly: validate your anti-ban measures on a small scale first
By implementing these techniques systematically, you can significantly reduce the likelihood of getting banned while maintaining efficient data collection with Scrapy.