Python is an incredibly flexible language used in a myriad of different applications. It is commonly used in data science, machine learning, and web development. This flexibility is why Python has become one of the most widely used programming languages in the world. You can also use Python to build web scrapers that will pull data from the web, which can be helpful in many different tasks that require real-time data.
In this article, we're going to take a closer look at the web scraping utilities of Python. We'll explore the different use cases and look at a few of the most popular libraries to help make the process easier. Let's dive in.
What Is Web Scraping?
In a nutshell, web scraping is the process of sending an HTTP request to a website to retrieve data. The applications for web scraping are incredibly varied. For example, maybe you want to scrape a weather website to gather data to use in your own weather application, or perhaps you want to create your own rudimentary stock ticker by scraping current stock prices from a website.
Regardless of your use case, one of the easiest ways to create web scrapers is by using Python along with a number of extremely useful libraries. We'll take a look at some of these libraries with examples of how they can be used.
It should be noted here that we'll be using Python 3 for all of our examples, as Python 2 is deprecated. The process of installing Python 3 on your machine depends on your operating system, and you can find the downloads and installation instructions on the official Python website.
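If you already have Python installed, you can confirm that you're running Python 3 by entering the following command in your terminal (depending on your setup, the command may simply be 'python'):
python3 --version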
Requests Library for HTTP Requests
The first step in any web scraping project is to send an HTTP request to a server. In most applications, this will be a GET request. While it is possible to use Python's standard HTTP libraries, the amount of code they require is far greater than necessary. The Requests library simplifies the process and has become an essential tool for almost any web scraping project.
The data that the Requests library returns can be difficult to read, but it is the simplest way to receive the information you need in your web scraping project. Additional libraries can help with taking that data and putting it in a more human-readable and usable format.
To use the Requests library, you'll first need to install it by opening your terminal and entering the following command:
pip3 install requests
With Requests installed, you can now use it to make HTTP requests. For example, let's imagine you've got your eye on a Sony Playstation 5, which, at the time of this writing, is difficult to find in stock at retail stores. Instead, you want to keep tabs on eBay listings for the gaming console.
With the Requests library and a few lines of Python code, you can easily scrape eBay's search results for the price of its PS5 listings.
First, we'll enter the Python console by opening the terminal and typing 'python3'. Within the Python console, enter the following code:
import requests
response = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_nkw=playstation+5&_sacat=0&_sop=10")
print(response.text)
These three lines of code will:
1. Import the Requests library
2. Send a GET request to the URL of the target website, which in this case is the results page of a search for "Playstation 5"
3. Display the text response
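Before moving on, it can be worth confirming that the request actually succeeded. Requests exposes the HTTP status code on the response object; a quick check, reusing the response from above, might look like this:
# A status code of 200 means the request succeeded
print(response.status_code)
# Preview just the first 500 characters of the raw response text
print(response.text[:500])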
As you'll notice, the text is highly unstructured and difficult to read. That's because when Requests retrieves data, the response comes back as a raw string rather than a structured HTML tree. Still, the Requests library does its job well, quickly and efficiently fetching the data we are looking for.
However, we still need additional libraries to make this data look nicer and to extract the HTML elements that we are interested in.
To do this, we will bring another library called Beautiful Soup into our workflow.
Beautiful Soup is Tasty
Beautiful Soup is a popular library for extracting data from parsed HTML. It works very well for this purpose, turning the information scraped from the web into a much more useful and easy-to-read format.
There are two key pieces that Beautiful Soup does not provide on its own: sending HTTP requests and parsing HTML. For this reason, one of the most common web scraping workflows is to use the Requests library to send the HTTP request, the html.parser module from the Python Standard Library to parse the response, and Beautiful Soup to query and extract the relevant HTML data. These three tools combined can provide data that is both relevant and easily readable.
Beautiful Soup can be installed as part of the bs4 package via pip as follows:
pip3 install bs4
Now, going back to our quest to find a Playstation 5, we can use the following code in the Python console:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.ebay.com/sch/i.html?_from=R40&_nkw=playstation+5&_sacat=0&_sop=10')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)
You'll notice that the resulting output will simply print the title element of the webpage, which is "<title>Playstation 5: Search Result | eBay</title>". This shows that the scraper is working, but it isn't yet providing us with any useful information. To gather some data we can use, we can inspect the website's HTML in the web browser's developer tools. By doing this, we will see that each search result has its price contained in a span with a class called "s-item__price."
Armed with this information, we can now extract the price of each of the search results on the first page of eBay.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.ebay.com/sch/i.html?_from=R40&_nkw=playstation+5&_sacat=0&_sop=10')
soup = BeautifulSoup(response.text, 'html.parser')
prices = soup.find_all("span", {"class": "s-item__price"})
for price in prices:
    print(price.text)
Running this code will:
- Import the Requests library and Beautiful Soup.
- Use Requests to make an HTTP GET request to eBay's search results page for "Playstation 5".
- Create a 'soup' variable and store in it the response text, parsed by the html.parser module.
- Create a 'prices' variable and store within it each HTML span element with the class 's-item__price'.
- Loop through each of the price elements and print their text, which is simply the price of that item.
When used in conjunction with Requests and the html.parser module, Beautiful Soup is a very easy-to-use and powerful library for web scraping. We were able to target a specific web page (the eBay results page for our specified search term) and extract the price of each listing.
This is a basic example of what Beautiful Soup can do. If you were to get more in-depth, you could also use additional libraries to save these prices to a CSV file if you wanted to track Playstation 5 prices daily, for example. You could even run web scrapers for multiple e-commerce sites, such as eBay and Amazon, to compare Playstation 5 pricing and availability.
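As a minimal sketch of that CSV idea, the following reuses the 'prices' variable from the scraper above along with the csv module from the Python Standard Library; the filename 'ps5_prices.csv' is just an arbitrary choice for illustration:
import csv

# Write each scraped price to its own row in a CSV file
with open("ps5_prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["price"])  # header row
    for price in prices:
        writer.writerow([price.text])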
lxml: Another Parser
lxml is another popular library for Python web scraping projects. At its core, lxml is a parser for both HTML and XML. This gives it a bit more flexibility than the html.parser module from the Python Standard Library, and it works well alongside both Beautiful Soup and the Requests library. You don't have to use Beautiful Soup to get usable data with lxml, though, as we will show in our lxml example.
Like the previous libraries, lxml can be installed by using pip from within your terminal.
pip3 install lxml
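One quick aside: lxml can also serve as the parser behind Beautiful Soup. Going back to the earlier eBay example, you could swap it in simply by changing the parser name; a minimal sketch, assuming lxml is installed:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.ebay.com/sch/i.html?_from=R40&_nkw=playstation+5&_sacat=0&_sop=10')
# Pass 'lxml' instead of 'html.parser' so Beautiful Soup uses the lxml parser
soup = BeautifulSoup(response.text, 'lxml')
print(soup.title)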
Now that lxml is installed, we're going to use both Requests and lxml to scrape job postings from indeed.com. Specifically, we're going to scrape the search results for the search term 'Python Developer' with the location 'United States'. How will we do this?
First, we need to go into our Python console and take care of our imports:
import requests
from lxml import html
Note that along with Requests, we are only importing the html module from lxml. We could import the full lxml library and reference the html module in our code, but this would just lead to more verbose syntax.
Next, we need to go to indeed.com and search for 'Python Developer' in the location 'United States.' After the results populate, copy the URL of the results page. If you want the most recent job postings, you'll also want to click on "date" in the "Sort by" field before copying the URL. Now we’ll save this URL as a variable for ease of use and make the HTTP request:
url = 'https://www.indeed.com/jobs?q=python%20developer&l=United%20States&sort=date&vjk=8d0888f0b826848b'
response = requests.get(url)
With our HTTP request sent and response received, we need to go into the browser and inspect the HTML to find the elements that we will be extracting with our web scraper. When you do this, you will see that each job posting is listed in a table with the class "resultContent." Nested within this element, we will find an anchor tag that holds the information we are scraping for: the job titles. We can select each of these elements by using the 'jcs-JobTitle' class that is attributed to each of them, as we can also see by viewing the attributes of the anchor tag.
With this information, we can now continue building out our web scraper using lxml to parse the HTML. The code below represents our full web scraper:
import requests
from lxml import html
url = 'https://www.indeed.com/jobs?q=python%20developer&l=United%20States&sort=date&vjk=8d0888f0b826848b'
response = requests.get(url)
data = response.content
source = html.fromstring(data)
path = "//a[@class='jcs-JobTitle']"
listings = source.xpath(path)
for listing in listings:
    print(listing.text_content())
There's a lot going on here, but some of this code should look familiar from the previous examples. Our web scraper:
- Imports the required libraries and modules, including the html module from lxml.
- Makes an HTTP request using Requests.
- Stores the response content in a 'data' variable, then parses that data into an HTML element tree stored in the 'source' variable.
- Defines the xpath for all anchor elements with the 'jcs-JobTitle' class selector and stores it in the 'path' variable.
- Stores each instance of the targeted elements in a 'listings' variable.
- Loops through each listing to print the text_content, which is the job title.
This provides us with a simple piece of data (a job title), but it can, of course, be expanded to include other bits of data. For example, maybe you want to extract the URL from each listing to display alongside each job title, or you want to extract the salary estimate for listings in which it is available. The possibilities are limited only by the types of data that are contained on the web page.
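As a rough sketch of the URL idea, each anchor element returned by the XPath query carries an 'href' attribute that can be read with lxml's get() method. The href values on the results page are typically relative, so urljoin from the Python Standard Library is used here to turn them into absolute links; this extends the 'listings' loop from the scraper above:
from urllib.parse import urljoin

for listing in listings:
    # Read the relative link from the anchor's href attribute
    href = listing.get('href')
    # Combine it with the site's base URL to get an absolute link
    full_url = urljoin('https://www.indeed.com', href)
    print(listing.text_content(), full_url)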
It's true that lxml does have a steeper learning curve than html.parser. However, it is arguably more flexible, especially with its ability to also parse XML data. For very simple web scrapers, html.parser should do the trick. If you need some extra bells and whistles or need to scrape XML data, you may want to look at lxml.
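To illustrate that last point, here is a minimal, self-contained sketch of parsing XML with lxml's etree module; the XML snippet is made up purely for demonstration:
from lxml import etree

# A small, made-up XML document for demonstration
xml_data = "<jobs><job title='Python Developer'/><job title='Data Engineer'/></jobs>"

root = etree.fromstring(xml_data)
for job in root.findall('job'):
    print(job.get('title'))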
Libraries Make Python Web Scraping Accessible
Python is a powerful programming language for web scraping. Aside from the language's relative ease of use, the available libraries do a fantastic job of making web scraping a breeze. Even though they aren't built into the Python Standard Library (except for the html.parser module), they are well maintained and have excellent documentation, so you can get up and running quickly.
To recap, Requests is perfect for sending quick HTTP requests, and Beautiful Soup is a powerful and elegant way of extracting data. For parsing, lxml and the html.parser module both do well, though lxml is a little more robust. With lxml's extra features, though, comes a little more complexity and a steeper learning curve.