Can Nokogiri handle cookies or sessions while scraping?

Nokogiri is an HTML, XML, SAX, and Reader parser for Ruby. It is excellent for parsing and manipulating HTML/XML documents, but it does not make network requests, so it has no notion of cookies or sessions. Those belong to the HTTP client that sends and receives the web requests.
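
To make that concrete, here is a minimal sketch: Nokogiri only ever sees markup you already have in memory, so fetching that markup (and any cookie handling) has to happen somewhere else.

require 'nokogiri'

# Nokogiri parses a string (or IO) that you hand it -- no HTTP involved
html = '<html><body><h1>Hello</h1></body></html>'
doc  = Nokogiri::HTML(html)
puts doc.at_css('h1').text  # => "Hello"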

To handle cookies or sessions while scraping with Nokogiri, you will need to pair it with an HTTP client that supports cookies, such as Ruby's built-in Net::HTTP or third-party libraries like HTTParty or RestClient. Here's how you can manage cookies with Nokogiri and Net::HTTP:

require 'nokogiri'
require 'net/http'
require 'uri'

login_uri = URI('http://example.com/login')
http = Net::HTTP.new(login_uri.host, login_uri.port)
http.use_ssl = true if login_uri.scheme == 'https'

# Log in by POSTing the credentials so the server sets a session cookie
request = Net::HTTP::Post.new(login_uri.request_uri)
request.set_form_data('username' => 'user', 'password' => 'pass')
response = http.request(request)

# Save the cookie(s), keeping only the "name=value" part of each
# Set-Cookie header (attributes like Path or Expires should not be sent back)
cookie = response.get_fields('Set-Cookie').to_a.map { |c| c.split(';').first }.join('; ')

# Use the cookie for subsequent requests to the same host
page_uri = URI('http://example.com/protected_page')
request = Net::HTTP::Get.new(page_uri.request_uri)
request['Cookie'] = cookie
response = http.request(request)

# Now you can use Nokogiri to parse the HTML content
doc = Nokogiri::HTML(response.body)
# ... do something with the parsed document ...

In this example, we first perform a login to retrieve a session cookie. Then, we use that cookie in a subsequent request to access a protected page. After getting the response, we use Nokogiri to parse the HTML content.
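
The same flow works with the higher-level clients mentioned above. Here is a rough sketch using HTTParty, assuming the site sets a single session cookie on login; it keeps only the first name=value pair and sends it back by hand:

require 'nokogiri'
require 'httparty'

# Log in and capture the session cookie from the response headers
login = HTTParty.post('http://example.com/login',
                      body: { username: 'user', password: 'pass' })
cookie = login.headers['set-cookie'].to_s.split(';').first

# Send the cookie back on the next request, then parse the body with Nokogiri
page = HTTParty.get('http://example.com/protected_page',
                    headers: { 'Cookie' => cookie })
doc = Nokogiri::HTML(page.body)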

If you are doing web scraping, you might also consider a higher-level library like Mechanize, which is built on top of Nokogiri and Net::HTTP and provides a more convenient way to handle cookies, sessions, and forms:

require 'mechanize'

# Create a new Mechanize object
agent = Mechanize.new

# Mechanize handles cookies automatically
page = agent.get('http://example.com/login')

# Find the login form and fill it in (the field names here assume the
# form has inputs named "username" and "password")
login_form = page.form_with(action: '/login')
login_form.username = 'user'
login_form.password = 'pass'
page = agent.submit(login_form)

# Now you can navigate to other pages and scrape data
protected_page = agent.get('http://example.com/protected_page')
doc = protected_page.parser
# ... do something with the parsed document ...

In the above example, Mechanize takes care of cookies and sessions automatically, and you can submit forms and navigate pages in a way that's similar to a real web browser. It uses Nokogiri under the hood to parse and manipulate HTML documents, which makes it a powerful tool for web scraping tasks that involve session handling.
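
If you need the session to outlive a single run of the script, Mechanize's cookie jar can be written to disk and loaded again later. A small sketch, assuming a cookies.yml path of your choosing (session cookies are skipped by default, hence the session: true option):

require 'mechanize'

agent = Mechanize.new

# Reuse a previously saved session if one exists
agent.cookie_jar.load('cookies.yml') if File.exist?('cookies.yml')

# ... log in and scrape as shown above ...

# Persist the cookies (including session cookies) for the next run
agent.cookie_jar.save('cookies.yml', session: true)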
