Web Scraping with Python and Selenium

Posted by Vlad Mishkin | March 12, 2023 | Tags: Programming | Python |

In this post, we are going to look at how to scrape data using Selenium, the Chrome web driver, and Python. For background, scraping refers to going through a web page, controlling its inputs, extracting the relevant data, and saving it for later use. To do all this, we need to control the web browser and perform operations on the page just like a human would. So, we need software that can do this for us.

Selenium is exactly that. Selenium is a powerful tool that can be used to control web browsers, running either on the local machine or on a remote one. The power of Selenium is that it integrates with all the major programming languages. We will be using Selenium with Python to control the web page. Controlling a web page gives us the power to automate software testing and to scrape web pages for data.

What you need to know

To get the most out of this post, ensure you know the basics of programming and how web applications work, including Javascript. Since we are using Python for the sake of this tutorial, knowing the fundamentals of Python is also beneficial.

Basic Concepts of Selenium

Before we look at some actual implementations, let's cover the concepts we'll encounter. Understanding these concepts beforehand will help readers grasp web scraping using Selenium more easily. We are not restricted to any particular language when scraping data from web pages with Selenium. Here, we'll use a Python script to scrape the data.

The idea is that Selenium provides an API for every major language, and we install that API in our script/code. Then, we use the relevant driver to control the web browser. In other words, we'll automate the control of the web page opened in the Chrome browser using a Python script via Selenium!

Web Driver

A web driver is a program that lets external software control a web browser. Browsers execute pages built from HTML and Javascript; web drivers expose an API through which tools such as Selenium can drive them. Hence, Selenium controls web browsers via web drivers, from whichever programming language you choose.

In this post, we will be using the Chrome web driver. Chrome web drivers are freely available for all popular operating systems; you can download the one matching your Chrome version from the official ChromeDriver downloads page.

Setting up Python Development Environment with Selenium

Once the web driver (the Chrome web driver in our case) is ready, we next need to install the Selenium API in our Python development environment. To do so, run the command below in a terminal:

pip install selenium

Interacting with Web Pages

Once the Selenium WebDriver is ready and connected to the web browser, it is time to control the web page using this driver. We can control any web page, but in this tutorial, we will be loading our own web page and then extracting data from it using Selenium. To load Selenium in our Python script, run the code below:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

DRIVER_PATH = 'C:\\chromedriver_win32\\chromedriver'
PAGE_PATH = 'C:\\chromedriver_win32\\index.html'

driver = webdriver.Chrome(service=Service(DRIVER_PATH))

driver.get(PAGE_PATH)

In the above code, we first import the required Selenium classes into our Python script. Next, we tell Selenium where to find the Chrome driver and which web page to fetch. Finally, we instantiate the web driver and fetch the page using driver.get().

Selecting Elements and Selenium Web Elements

Every web page consists of HTML elements. Some elements accept input, such as text boxes and date fields, while others simply display data; HTML provides elements for all common input and output types. Each element can be uniquely identified by its id attribute. To control a web page, the first step is to gain access to its HTML elements, and Selenium lets us select them through the web driver. A selected element is returned as a Selenium web element: an object that represents the underlying HTML element. For example, we can select elements by their IDs:

name_element = driver.find_element(By.ID, "name_id")
date_element = driver.find_element(By.ID, "date_id")
email_element = driver.find_element(By.ID, "email_id")
age_element = driver.find_element(By.ID, "age_id")

While IDs are unique and the most reliable way to identify elements on a web page, they are not the only way. Selenium also supports a number of other selection strategies:

By id

The id is unique across the whole web page. Selenium can search the web page for the specified id and return the element.

By Element Name

The name attribute is commonly used in HTML forms to identify form controls, such as input fields and buttons. Selenium can locate and interact with these elements by their names. This can be useful in cases where the element does not have an id attribute or if multiple elements on the page have the same name.

By XPath

Every web page is represented by a DOM, a tree-like structure of elements. Using an XPath expression, Selenium can traverse this tree to reach a web element and gain a reference to it.

By HTML Tag

HTML elements are defined by tags. Selenium can obtain element references by tag name, such as `p` for paragraphs.

By Class and CSS Selectors

CSS classes are used to style web elements, and Selenium can look up elements on the page by a specified class name. CSS also defines selectors for targeting elements, and Selenium supports selecting elements with these CSS selectors as well.

Screenshots

Selenium can be used to capture screenshots of a web page while it is being controlled and executed by the web scripts. The ability to take screenshots is very important for a variety of reasons. For example, it can be used to capture errors that occurred on the screen during the automation process. Additionally, it can be used to visually communicate the different phases of the automation process. Automating the process of taking screenshots saves a lot of time compared to manual efforts.

driver.save_screenshot(filename="img.png")

Javascript? No Problem

Web browsers are equipped to execute Javascript. These scripts add functionality to web pages, making them more interactive. Just like screenshots, Selenium allows developers to execute Javascript through the web driver. For instance, if at any point you wish to show an alert on the screen, you can achieve it easily by running the code below.
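As a minimal sketch (the alert text is arbitrary, and the helper name is our own), showing an alert via execute_script() looks like this:

```python
# An arbitrary piece of Javascript; any valid script could be passed.
ALERT_JS = 'alert("Hello from Selenium!")'

def show_alert(driver):
    # Runs the script in the context of the page loaded in the driver
    driver.execute_script(ALERT_JS)

# The alert can later be dismissed from Python with:
# driver.switch_to.alert.accept()
```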

Executing Javascript can take time. Selenium gives you, the developer, the choice of synchronous or asynchronous execution: execute_script() returns when the script finishes, while execute_async_script() lets the script perform asynchronous work in the page and signal completion through a callback, so a long-running script does not have to block inside the page.
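A sketch of the asynchronous mode: with execute_async_script(), Selenium injects a callback as the script's last argument and waits until the page calls it (the 500 ms delay and helper name below are made up for illustration):

```python
# Selenium passes a callback as the script's final argument;
# calling it signals that the asynchronous work has finished.
ASYNC_JS = """
var done = arguments[arguments.length - 1];
setTimeout(function () { done("finished"); }, 500);
"""

def run_async(driver):
    # Blocks until the script invokes done(), or the script timeout elapses
    return driver.execute_async_script(ASYNC_JS)

# The default script timeout can be raised if the work takes longer:
# driver.set_script_timeout(10)
```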

Similarly, pages have Javascript that is executed when the page is loaded by the browser. Once that Javascript runs, it can manipulate the DOM, so elements (and their IDs) might be added or removed.

Finally, Selenium web drivers also allow returning values from an executed script into Python variables. For now, let's use execute_script() to set the values of the form elements by their IDs:

driver.execute_script('document.getElementById("name_id").value="some sample name"')

driver.execute_script('document.getElementById("date_id").value="2022-10-06"')

driver.execute_script('document.getElementById("email_id").value="xyz@abc.com"')

driver.execute_script('document.getElementById("age_id").value="10"')
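To go the other way, a script that begins with return hands its result back to Python as the return value of execute_script(). A sketch using the same field IDs as above (the helper name is our own):

```python
# Read a value back out of the page; extra arguments to execute_script()
# are exposed to the script as arguments[0], arguments[1], and so on.
def read_value(driver, element_id):
    return driver.execute_script(
        'return document.getElementById(arguments[0]).value', element_id)

# e.g. read_value(driver, "name_id") returns the current value
# of the name field after the assignments above have run.
```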

Likewise, we can find a reference to any button on the web page and check its properties. Depending on the state of the element, we can then add our programming logic:

submit_button = driver.find_element(By.ID, "submit_id")
print(submit_button.is_enabled())

Putting it all together

Now that we have seen the basics of the Selenium web driver, let's put everything we have learned together in a simple example of how Selenium works. Later, we will also scrape data from an HTML page in a fully functional example.

Getting Started with Selenium

Imagine I have an HTML page named index.html. I have prepared the Python script below, which loads this HTML page, gains references to its web elements, modifies the text in these elements, and finally takes a screenshot.

index.html

<html>
  <head>
    <title>My Sample Page for Selenium</title>
  </head>
  <body>
    <div>
      <p>Name: <input type="text" id="name_id"/></p>
      <p>Date of Birth: <input type="date" id="date_id"/></p>
      <p>Email: <input type="email" id="email_id"/></p>
      <p>Age: <input type="text" id="age_id"/></p>
    </div>
    <div>
      <input type="button" id="submit_id" value="Submit"/>
    </div>
  </body>
</html>

script.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# We can pass the desired behaviour of Chrome in the form of Options
options = Options()

# By default the Chrome browser window is shown;
# uncomment the next line to hide it (headless mode)
# options.add_argument("--headless=new")

# Specify where Selenium should find the Chrome web driver
DRIVER_PATH = 'C:\\WebScraping\\chromedriver'
PAGE_PATH = 'C:\\WebScraping\\index.html'

# Open the browser with the options provided
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

# Load the Web page
# This web page could be any valid local or remote web page
# In our case we have a simple web page
driver.get(PAGE_PATH)

# Gain references to the HTML elements
name_element = driver.find_element(By.ID, "name_id")
date_element = driver.find_element(By.ID, "date_id")
email_element = driver.find_element(By.ID, "email_id")
age_element = driver.find_element(By.ID, "age_id")

# Set the values of the elements
driver.execute_script('document.getElementById("name_id").value="some sample name"')
driver.execute_script('document.getElementById("date_id").value="2022-10-06"')
driver.execute_script('document.getElementById("email_id").value="xyz@abc.com"')
driver.execute_script('document.getElementById("age_id").value="10"')

# Gain reference to the submit button
submit_button = driver.find_element(By.ID, "submit_id")

# Check if the submit button is disabled or not
print(submit_button.is_enabled())

# For the sake of completeness, take a screenshot of the page and save it to disk
driver.save_screenshot(filename="C:\\WebScraping\\img.png")

# Close the browser and release the associated resources
driver.quit()

Scraping Data using Selenium

Now, let's scrape the data out of a simple HTML table. The page below is the one we will be loading:

index.html

<html>
  <head>
    <title>Web Scraping for Selenium</title>
  </head>
  <body>
    <table>
      <tr>
        <td>Product</td>
        <td>Sale</td>
        <td>Profit (%)</td>
      </tr>
      <tr>
        <td>Product A</td>
        <td>10$</td>
        <td>2.5%</td>
      </tr>
      <tr>
        <td>Product B</td>
        <td>20$</td>
        <td>2.5%</td>
      </tr>
      <tr>
        <td>Product C</td>
        <td>5$</td>
        <td>.5%</td>
      </tr>
      <tr>
        <td>Product D</td>
        <td>20$</td>
        <td>12.5%</td>
      </tr>
    </table>
  </body>
</html>

script.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# We can pass the desired behavior of Chrome in the form of Options
options = Options()

# Run Chrome in headless mode: we don't need to see the browser window here
options.add_argument("--headless=new")

# Specify where Selenium should find the Chrome web driver
DRIVER_PATH = 'C:\\WebScraping\\chromedriver'
PAGE_PATH = 'C:\\WebScraping\\index.html'

# Open the browser with the options provided
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

# Load the Web page
# This web page could be any valid local or remote web page
# In our case we have a simple web page
driver.get(PAGE_PATH)

# Locate the table element
table_element = driver.find_element(By.XPATH, "/html/body/table")

# Find all the rows within the table element
# Notice we use find_elements here and not find_element
tr_elements = table_element.find_elements(By.XPATH, ".//tr")

# Array to hold the table structure
table_data = []

# Iterate over all the rows and then fetch their data
for row in tr_elements:
    col_elements = row.find_elements(By.XPATH, './/td')
    row_data = []

    for col in col_elements:
        # Extract the cell contents
        col_val = col.get_attribute('innerHTML')
        row_data.append(col_val)

    table_data.append(row_data)

# Printing the information in the table structure
for row in table_data:
    for col in row:
        print(col, end="\t\t\t")
    print()

driver.quit()

In the above script, we find the table element using XPath. Once the table element has been located, we search within it to collect all of its rows. Finally, we iterate over these rows and read the text of every column in each row through the innerHTML attribute. Below is the output we get when we run this script:

Product      Sale    Profit (%)
Product A    10$     2.5%
Product B    20$     2.5%
Product C    5$      .5%
Product D    20$     12.5%

Conclusion

In this post, we looked at the different features Selenium offers. Knowing a tool that can control a web page for you is extremely handy. Using this power, we can automate our software tests, which not only saves testing effort but also makes testing more reliable and extensive.

Another very powerful application of Selenium is scraping data. Because you can control any web page, you can extract important data that is presented on web pages, for example in the form of tables. Python is extremely popular with data scientists and machine learning enthusiasts because of its rich ecosystem of machine learning packages, and combining it with Selenium can be a great way to excel in a data science career.

There is a lot more to Selenium. Make sure you refer to the official documentation of Selenium to leverage all it has to offer.
