In this post, we are going to look at how we can scrape data using Selenium, Chrome Web Driver, and Python. For background, scraping refers to going through a web page, controlling its inputs, extracting the relevant data, and saving it for later use. To do all this, we need to control the web browser and perform operations on the page just like a human would. So, we need sophisticated software for the job.
Selenium is what we are looking for. Selenium is a powerful tool that can be used to control web browsers. These web browsers can be running on either a local or a remote machine. The power of Selenium is that it can be integrated with all the major programming languages. We will be using Selenium with Python to control the web page. Controlling a web page gives us the power to automate software testing and to scrape web pages for data.
What you need to know
Basic Concepts of Selenium
Before we look at some actual implementations, let's cover the concepts we'll encounter. Understanding these concepts beforehand will help readers grasp web scraping using Selenium more easily. We are not restricted to any particular language when scraping data from web pages with Selenium. Here, we'll use a Python script to scrape the data.
The idea is that Selenium provides an API for every major language, and we install that API in our script/code. Then, we use the relevant driver to control the web browser. In other words, we'll automate the control of the web page opened in the Chrome browser using a Python script via Selenium!
In our post, we will be using the Chrome web driver. Chrome web drivers are freely available for all popular operating systems. You can download one for yourself by following this link.
Setting up Python Development Environment with Selenium
Once the web driver (the Chrome web driver in our case) is ready, we next need to install the Selenium API in our Python development environment. To do so, run the command below in your terminal:
pip install selenium
Interaction with the Web Pages
Once the Selenium WebDriver is ready and connected to the web browser, it is time to control the web page using this driver. We can control any web page, but in this tutorial, we will be loading our own web page and then extracting data from it using Selenium. To load Selenium in our Python script, run the code below:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

DRIVER_PATH = 'C:\\chromedriver_win32\\chromedriver'
PAGE_PATH = 'C:\\chromedriver_win32\\index.html'

driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(PAGE_PATH)
In the above code, first, we import the Selenium classes into our Python script. Next, we specify to Selenium where to find the Chrome driver and which web page to fetch. Finally, we instantiate the web page driver and fetch the page using driver.get().
Selecting Elements and Selenium Web Elements
Every web page consists of HTML elements. Some of these elements accept input, such as text boxes and text fields, while others are static elements used to display data. HTML provides several elements to cater to all input and output types. Each element can be uniquely identified using an ID. Selenium allows us to select and control these elements through the web driver. To control the web page, the first step is to gain access to its HTML elements in Selenium. We can select these HTML elements and get them as Selenium web elements; a Selenium web element is an object that represents an HTML element. While IDs are unique and the best way to identify elements on a web page, they are not the only way. We can also select web elements in a number of other ways, such as:
By Element ID
name_element = driver.find_element(By.ID, "name_id")
date_element = driver.find_element(By.ID, "date_id")
email_element = driver.find_element(By.ID, "email_id")
age_element = driver.find_element(By.ID, "age_id")
The id is unique across the whole web page. Selenium can search the web page for the specified id and return the element.
By Element Name
The name attribute is commonly used in HTML forms to identify form controls, such as input fields and buttons. Selenium can locate and interact with these elements by their names. This can be useful in cases where the element does not have an id attribute or if multiple elements on the page have the same name.
By XPath
Every web page is made up of a DOM, which is a tree-like structure. Using XPath, Selenium can traverse this tree to reach a web element and gain a reference to it.
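To see how an XPath expression walks the DOM tree without involving a browser at all, the sketch below uses Python's standard library (not Selenium) to parse a tiny, well-formed fragment and evaluate a relative XPath against it. The fragment itself is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed fragment standing in for a page's DOM
html = "<body><table><tr><td>Product A</td><td>10$</td></tr></table></body>"
root = ET.fromstring(html)

# ".//td" means: any <td> descendant of the current node
cells = root.findall(".//td")
print([cell.text for cell in cells])  # ['Product A', '10$']
```

Selenium's XPath support (provided by the browser) is far richer than ElementTree's, but the tree-traversal idea is the same.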
By HTML Tag
HTML elements are made up of tags. Selenium can pull references using the tag names such as `p`.
By Class and CSS Selectors
The CSS classes are used to style any web element. Selenium can look for the specified class name to find the element on the web page. CSS also provides a way to style elements using CSS selectors, and Selenium supports the selection of elements using these CSS selectors.
Taking Screenshots
Selenium can be used to capture screenshots of a web page while it is being controlled by our scripts. The ability to take screenshots is very important for a variety of reasons. For example, screenshots can capture errors that occurred on the screen during the automation process, or visually document the different phases of the automation. Automating the process of taking screenshots also saves a lot of time compared to manual effort.
Finally, Selenium web drivers also allow us to execute JavaScript directly on the page via execute_script(). So, suppose you need to set the value of an element with a given id; you can run the lines below:
driver.execute_script('document.getElementById("name_id").value="some sample name"')
driver.execute_script('document.getElementById("date_id").value="2022-10-06"')
driver.execute_script('document.getElementById("email_id").value="firstname.lastname@example.org"')
driver.execute_script('document.getElementById("age_id").value="10"')
Likewise, we can also find the reference of any button on the web page and check for its properties. Depending upon the state of the element, we could add the programming logic:
submit_button = driver.find_element(By.ID, "submit_id")
print(submit_button.is_enabled())
Putting it all together
Now that we have seen the basics of the Selenium web driver, I will put all we have learned above into a simple example of how Selenium works. Later, we will also scrape data from an HTML page using a fully functional example.
Getting Started with Selenium
Imagine I have an HTML page named index.html. I have prepared the Python script below, which goes through this HTML page, gains references to web elements, modifies the text in these elements, and finally takes a screenshot.
<html>
<head>
    <title> My Sample Page for Selenium </title>
</head>
<body>
    <div>
        <p>Name: <input type="text" id="name_id"/></p>
        <p>Date of Birth: <input type="date" id="date_id"/></p>
        <p>Email: <input type="email" id="email_id"/></p>
        <p>Age: <input type="text" id="age_id"/></p>
    </div>
    <div>
        <input type="button" id="submit_id" value="Submit"/>
    </div>
</body>
</html>
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# We can pass the desired behaviour of Chrome in the form of Options
options = Options()

# Headless = False means that the Chrome browser window would be shown
# Otherwise the Chrome browser window would not appear
options.headless = False

# Specify where Selenium should find the Chrome web driver
DRIVER_PATH = 'C:\\WebScraping\\chromedriver'
PAGE_PATH = 'C:\\WebScraping\\index.html'

# Open the browser with the options provided
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

# Load the web page
# This web page could be any valid local or remote web page
# In our case we have a simple web page
driver.get(PAGE_PATH)

# Gain references to the HTML elements
name_element = driver.find_element(By.ID, "name_id")
date_element = driver.find_element(By.ID, "date_id")
email_element = driver.find_element(By.ID, "email_id")
age_element = driver.find_element(By.ID, "age_id")

# Set the values of the elements
driver.execute_script('document.getElementById("name_id").value="some sample name"')
driver.execute_script('document.getElementById("date_id").value="2022-10-06"')
driver.execute_script('document.getElementById("email_id").value="email@example.com"')
driver.execute_script('document.getElementById("age_id").value="10"')

# Gain reference to the submit button
submit_button = driver.find_element(By.ID, "submit_id")

# Check if the submit button is disabled or not
print(submit_button.is_enabled())

# For the sake of completeness, take a screenshot of the page and save it on the machine
driver.save_screenshot(filename="C:\\WebScraping\\img.png")

# Close the browser and release the associated resources
driver.quit()
Scraping Data using Selenium
<html>
<head>
    <title> Web Scraping for Selenium </title>
</head>
<body>
    <table>
        <tr>
            <td>Product</td>
            <td>Sale</td>
            <td>Profit (%)</td>
        </tr>
        <tr>
            <td>Product A</td>
            <td>10$</td>
            <td>2.5%</td>
        </tr>
        <tr>
            <td>Product B</td>
            <td>20$</td>
            <td>2.5%</td>
        </tr>
        <tr>
            <td>Product C</td>
            <td>5$</td>
            <td>.5%</td>
        </tr>
        <tr>
            <td>Product D</td>
            <td>20$</td>
            <td>12.5%</td>
        </tr>
    </table>
</body>
</html>
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# We can pass the desired behavior of Chrome in the form of Options
options = Options()

# Headless = True means that no Chrome browser window would appear
options.headless = True  # True because I don't want to see the browser window here

# Specify where Selenium should find the Chrome web driver
DRIVER_PATH = 'C:\\WebScrapping\\chromedriver'
PAGE_PATH = 'C:\\WebScrapping\\Data Scrap.html'

# Open the browser with the options provided
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

# Load the web page
# This web page could be any valid local or remote web page
# In our case we have a simple web page
driver.get(PAGE_PATH)

# Locate the table element
table_element = driver.find_element(By.XPATH, "/html/body/table")

# Find all the rows in the table
# Notice we use find_elements here and not find_element
tr_elements = table_element.find_elements(By.XPATH, ".//tr")

# List to hold the table data
table_data = []

# Iterate over all the rows and then fetch their cells
for row in tr_elements:
    row_cols = row.find_elements(By.XPATH, './/td')
    row_data = []
    for col in row_cols:
        # Extract the text
        col_val = col.get_attribute('innerHTML')
        row_data.append(col_val)
    table_data.append(row_data)

# Print the information in the table structure
for row in table_data:
    for col in row:
        print(col, end="\t\t\t")
    print()

driver.quit()
In the above script, we find the table element using XPath. Once the table element has been located, we find all the rows within it. Finally, we iterate over these rows and get the text of every column inside each row using the innerHTML property. Running the script prints the table's contents, one row per line with the columns separated by tabs.
In this post, we looked at the different features Selenium brings with it. Knowing a tool that can control a web page for you is extremely handy. Using this power, we can automate all our software tests. This will not just save testing effort but will also make software testing more reliable and extensive.
Another very powerful application of Selenium is scraping data. Because you can control any web page, you can extract important user data that is present on web pages in the form of tables. Python is an extremely popular scripting language among data scientists and machine learning enthusiasts because of its rich machine learning packages. Combining Python with Selenium can be a great way to excel in a data science career.
There is a lot more to Selenium. Make sure you refer to the official documentation of Selenium to leverage all it has to offer.