Scrapy provides built-in support for cookies. It uses Python's built-in http.cookiejar module to store and send cookies, and it automatically handles common tasks such as keeping a session alive across requests. Sometimes, however, you may want to manage cookies manually.
Here's how to handle cookies in Scrapy.
1. Default Behavior
By default, Scrapy handles cookies automatically. This behavior is controlled by the COOKIES_ENABLED setting, which is already True out of the box, so no configuration is needed:
COOKIES_ENABLED = True
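If you want to watch what the cookie middleware is doing, Scrapy also provides a COOKIES_DEBUG setting that logs every Cookie and Set-Cookie header. A minimal settings.py fragment:

```python
# settings.py
COOKIES_ENABLED = True   # default: the cookie middleware is active
COOKIES_DEBUG = True     # log cookies sent and received (off by default)
```

With COOKIES_DEBUG on, each request and response logs its cookie headers at DEBUG level, which is handy when diagnosing login or session problems.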
2. Manual Management
To manage cookies manually, you need to disable the default cookie middleware and attach cookies to your Request objects yourself.
First, disable the default cookie middleware by adding this line to your settings:
COOKIES_ENABLED = False
Note that with the middleware disabled, the cookies argument of Request is ignored (the middleware is what turns it into a header), so set the Cookie header directly:
def start_requests(self):
    yield scrapy.Request(url="http://example.com",
                         headers={"Cookie": "cookie_name=cookie_value"},
                         callback=self.parse_page)
In the code above, replace cookie_name and cookie_value with your specific cookie's name and value.
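When you build the Cookie header yourself, the name/value pairs must be serialized into the `name=value; name2=value2` format. A minimal helper sketch (the cookie names and values below are placeholders):

```python
def cookie_header(cookies):
    """Serialize a dict of cookies into a Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

header = cookie_header({"cookie_name": "cookie_value", "theme": "dark"})
print(header)  # cookie_name=cookie_value; theme=dark
```

You can then pass the result as headers={"Cookie": header} on the Request.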
3. Accessing Response Cookies
To access the cookies sent by the server in a response, use the response.headers attribute. Note that this is a dictionary-like object whose values are bytes, not str.
Here's how to access the Set-Cookie header:
def parse_page(self, response):
    raw_cookies = response.headers.getlist('Set-Cookie')
    for raw_cookie in raw_cookies:
        cookie = raw_cookie.decode('utf-8')
        print(cookie)
In the code above, raw_cookies is a list of bytes objects; decode each one to a str before working with it.
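Once decoded, a raw Set-Cookie string can be parsed with the standard library's http.cookies module rather than by hand. A sketch with a made-up header value:

```python
from http.cookies import SimpleCookie

# Example of a decoded Set-Cookie header value (illustrative).
raw = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(raw)

for name, morsel in cookie.items():
    # morsel.value is the cookie value; attributes like Path are keyed lowercase.
    print(name, morsel.value, morsel["path"])  # sessionid abc123 /
```

This gives you each cookie's name, value, and attributes (path, expires, and so on) as structured data.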
4. Using Session Cookies
When the cookie middleware is enabled, Scrapy already keeps session cookies between requests for you. If you instead want a request to send only the cookies you pass explicitly, without merging in the cookies Scrapy has stored for the session, set the dont_merge_cookies flag in the request's meta:
def start_requests(self):
    yield scrapy.Request(url="http://example.com",
                         cookies={"sessionid": "123"},
                         meta={"dont_merge_cookies": True},
                         callback=self.parse_page)
In the code above, only the sessionid cookie is sent, and any cookies set by the response are not stored in the session jar.
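The session behavior described above is, under the hood, what the http.cookiejar module mentioned at the top provides: cookies extracted from one response are replayed on later requests to the same site. A standalone sketch using only the standard library (the FakeResponse class and the sessionid value are illustrative):

```python
import email.message
import http.cookiejar
import urllib.request

class FakeResponse:
    """Minimal stand-in for an HTTP response: CookieJar only needs .info()."""
    def __init__(self, headers):
        self._headers = headers
    def info(self):
        return self._headers

# Pretend the server answered our first request with a Set-Cookie header.
headers = email.message.Message()
headers["Set-Cookie"] = "sessionid=123; Path=/"
first_request = urllib.request.Request("http://example.com/")

jar = http.cookiejar.CookieJar()
jar.extract_cookies(FakeResponse(headers), first_request)

# On the next request to the same site, the jar replays the stored cookie.
next_request = urllib.request.Request("http://example.com/page2")
jar.add_cookie_header(next_request)
print(next_request.get_header("Cookie"))  # sessionid=123
```

Scrapy's cookie middleware does essentially this for every request/response pair, which is why sessions persist without any extra code on your part.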
Remember that handling cookies manually requires a good understanding of how HTTP cookies work. Always consider how your changes affect the crawl, and respect the website's policies.