MechanicalSoup is a Python library for automating interaction with websites. It provides a simple API for navigating pages, filling out forms, and scraping content. When you need to scrape data from a page that is behind authentication, you must first create a session that logs in to the website.
Here's a step-by-step guide on how to do this with MechanicalSoup:
- Install MechanicalSoup: If you haven't already installed MechanicalSoup, you can do so using pip:
pip install MechanicalSoup
- Create a Browser Instance: Start by creating a StatefulBrowser instance from MechanicalSoup, which will maintain the session (including cookies) across requests.
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
- Open the Login Page: Use the browser to open the login page.
login_url = 'https://example.com/login'
browser.open(login_url)
- Fill the Login Form: Find the login form on the page and fill in your credentials. You will need to inspect the HTML of the login page to find the name or id attributes of the form inputs.
browser.select_form('form[id="loginForm"]') # Use the appropriate selector for the login form
browser['username'] = 'your_username' # Replace with the correct form field names and your credentials
browser['password'] = 'your_password'
- Submit the Login Form: Submit the form to log in.
response = browser.submit_selected()
- Check Login Success: After submitting the form, check if the login was successful. This is often done by looking for a specific element on the page that indicates a successful login.
if "Welcome" in response.text:
    print("Login successful!")
else:
    print("Login failed!")
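A substring check like "Welcome" can be brittle (the word may appear on the login page too). A sturdier check is to look for an element that only appears when logged in, via get_current_page(). The 'span.account-name' selector below is hypothetical; the BeautifulSoup object stands in for what get_current_page() would return:

```python
from bs4 import BeautifulSoup

# get_current_page() returns a BeautifulSoup document; we simulate one here.
# The 'span.account-name' selector is hypothetical -- inspect your site's HTML.
page = BeautifulSoup('<span class="account-name">alice</span>', "html.parser")
logged_in = page.select_one("span.account-name") is not None
print(logged_in)  # True
```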
- Navigate to the Target Page: Once logged in, navigate to the page you want to scrape.
browser.open('https://example.com/protected-page')
- Scrape Data: Now that you are logged in and on the desired page, you can scrape data as you would from any other page.
page = browser.get_current_page()
data = page.select('div.content') # Use the appropriate selector to find the data you want to scrape
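Since page.select() returns a list of BeautifulSoup Tag objects, you typically loop over it and pull out text or attributes. A minimal sketch, using sample HTML in place of the real page:

```python
from bs4 import BeautifulSoup

# Sample HTML stands in for the page returned by browser.get_current_page()
html = '<div class="content">First item</div><div class="content">Second item</div>'
page = BeautifulSoup(html, "html.parser")

# select() returns a list of Tag objects; get_text() extracts their text
texts = [tag.get_text(strip=True) for tag in page.select("div.content")]
print(texts)  # ['First item', 'Second item']
```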
- Logout (Optional): If the website has a logout form, it's good practice to end your session by logging out. (If logout is a plain link rather than a form, browser.follow_link() can be used instead.)
browser.select_form('form[id="logoutForm"]')
browser.submit_selected()
Here's a complete example that puts all of these steps together:
import mechanicalsoup
# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()
# Open the login page
login_url = 'https://example.com/login'
browser.open(login_url)
# Fill in the login form
browser.select_form('form[id="loginForm"]')
browser['username'] = 'your_username'
browser['password'] = 'your_password'
# Submit the form
response = browser.submit_selected()
# Check for login success
if "Welcome" in response.text:
    print("Login successful!")

    # Navigate to the page behind authentication
    browser.open('https://example.com/protected-page')

    # Scrape data
    page = browser.get_current_page()
    data = page.select('div.content')

    # Process the scraped data
    # ...

    # Logout
    browser.select_form('form[id="logoutForm"]')
    browser.submit_selected()
else:
    print("Login failed!")
Remember that web scraping can violate the terms of service of some websites, so respect each site's rules and consider the legality of your actions. Always review the website's robots.txt file and terms of service before you begin scraping.
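Python's standard library can check robots.txt rules for you via urllib.robotparser. Normally you would call rp.set_url('https://example.com/robots.txt') followed by rp.read(); the sketch below parses a sample robots.txt directly to stay self-contained:

```python
from urllib.robotparser import RobotFileParser

# Parse a sample robots.txt in place of fetching the real one
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public-page"))    # True
print(rp.can_fetch("*", "https://example.com/private/page"))   # False
```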