In Scrapy, you can throttle requests by using the DOWNLOAD_DELAY
and CONCURRENT_REQUESTS_PER_DOMAIN
settings. Throttling is important to avoid overwhelming the target server with too many requests at once.
DOWNLOAD_DELAY
DOWNLOAD_DELAY specifies the delay between consecutive requests to the same website. It is expressed in seconds (decimal values are allowed) and defaults to 0, i.e., no delay. Increasing it makes Scrapy wait longer between requests; note that with the default RANDOMIZE_DOWNLOAD_DELAY = True, the actual wait is a random value between 0.5x and 1.5x of DOWNLOAD_DELAY.
Here is how you can set it in your settings.py:
DOWNLOAD_DELAY = 0.5 # 500 ms of delay
This makes Scrapy wait roughly 0.5 seconds between consecutive requests to the same site (the exact value varies because of the randomization mentioned above).
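If you only need to throttle one spider rather than the whole project, the same setting can also be applied through the spider's custom_settings attribute. The following is a minimal sketch; the spider class, its name, and the start URL are illustrative assumptions:
import scrapy

class BooksSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate per-spider throttling
    name = "books"
    start_urls = ["https://example.com/books"]

    # overrides the project-wide value from settings.py for this spider only
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,  # wait ~500 ms between requests
    }

    def parse(self, response):
        # placeholder callback; yield items or follow links here
        yield {"url": response.url}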
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_DOMAIN specifies the maximum number of concurrent requests that will be performed to any single domain. It defaults to 8.
Here is how you can set it in your settings.py:
CONCURRENT_REQUESTS_PER_DOMAIN = 2
This will limit the number of concurrent requests to 2 for each domain.
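The two settings complement each other and are usually tuned together with the project-wide CONCURRENT_REQUESTS cap (which defaults to 16). A conservative settings.py fragment, with illustrative values, might look like this:
# settings.py -- a conservative throttling profile (values are illustrative)
DOWNLOAD_DELAY = 0.5                 # ~500 ms between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # at most 2 in-flight requests per domain
CONCURRENT_REQUESTS = 16             # global cap across all domains (Scrapy's default)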
AutoThrottle extension
In addition to these settings, Scrapy ships with an AutoThrottle extension that automatically adjusts the crawling speed based on the load of both the crawler and the website being crawled, so your spider stays respectful of the servers without you tuning delays by hand.
To enable the AutoThrottle extension, add the following lines to your settings.py:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # maximum delay when latency is high, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote server
These values make Scrapy start with a download delay of 5 seconds and let the delay grow up to 60 seconds if the server's response times worsen, while aiming for an average of 1.0 concurrent requests per remote server (AUTOTHROTTLE_TARGET_CONCURRENCY is a concurrency target, not a requests-per-second rate).
With these settings, Scrapy will automatically adjust the delay between requests based on the server's response times and load. This is a more advanced and flexible way to throttle requests, and it's recommended if you want to be as respectful as possible to the servers you are crawling.
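To watch how AutoThrottle adjusts the delay while a crawl is running, you can turn on its debug mode, which logs the throttling stats of every received response. The fragment below, with illustrative values, also shows that DOWNLOAD_DELAY still matters here, because AutoThrottle never sets the delay below it:
# settings.py -- optional extras on top of the AutoThrottle block above
AUTOTHROTTLE_DEBUG = True   # log the throttling stats of every received response
DOWNLOAD_DELAY = 0.5        # AutoThrottle will never go below this delay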