Beautiful Soup: Build a Web Scraper with Python

Posted by Vlad Mishkin | December 11, 2023 | Tags: Programming | Python

For every web scraping application, gathering HTML from the source is the first step. Once the HTML is gathered, different techniques can be applied to extract the required information. There are countless ways to build a web scraping application: you can use paid APIs or off-the-shelf software, but in this article we will focus on building our own web scraper using Python and cURL. Before starting, we recommend that you have a basic knowledge of Python and have it installed on your machine.

Why do we need web scrapers?

Web scrapers have multiple industry applications. Some are listed below.

  • Market Research: High-quality data extracted from websites is very helpful for analyzing market trends and customer behavior.
  • Email Marketing: Web scraping is widely used to extract email addresses for marketing purposes.
  • News Monitoring: News-based web scrapers extract the latest news reports for companies that rely heavily on them.
  • Price Monitoring: Companies use web scraping to extract product data and pricing so they can set proper pricing strategies and maximize sales.

What is cURL?

cURL is an essential command-line tool used to make HTTP requests to web servers. It comes without any GUI, but its lightweight, fast, easy-to-use interface has made it extremely popular among programmers. Below is a very basic cURL command. Open your terminal and run the following:

curl https://sampleapis.com/

This command makes a GET request to the URL and prints the returned HTML content in the terminal. If you see the page's HTML printed out, congratulations: you have successfully made your first GET request using cURL.
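
A plain GET is just the beginning. cURL supports many options; as a quick illustration using standard cURL flags, the following command follows redirects with -L, sets a User-Agent header with -A (the value here is just an example), and writes the response to page.html with -o instead of printing it:

curl -L -A "my-scraper/1.0" -o page.html https://sampleapis.com/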

To put it simply, cURL works like a traditional browser: it can fetch not only HTML but anything else the server returns. With this, you have laid the foundation for the web scraper we are about to create.

Using cURL in Python

As we have seen, getting the HTML is not a complex process, but the data handed to us is scrambled, noisy, and unclear. To extract the right information through data manipulation, we will use Python: it is an excellent, beginner-friendly programming language with extensive support for data manipulation techniques, and its huge community is helpful if you get stuck at some point.

Until now, we have used cURL on the command line, but to scrape the data we will have to use cURL in our code. Python's standard library (the os module, or the more modern subprocess module) allows us to run terminal commands. In this article, however, we will use an external library named PycURL, a cURL interface that exposes all of cURL's capabilities in a more readable, convenient way.
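
Just for illustration before we move on, here is a minimal sketch of that shell-out approach, invoking the curl binary through the standard subprocess module (we will not use this in the scraper below):

import subprocess

# Run the curl binary as a child process and capture what it prints to stdout
result = subprocess.run(
    ["curl", "-s", "https://sampleapis.com/"],  # -s silences the progress meter
    capture_output=True,
    text=True,
)
print(result.stdout)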

We assume that you already have Python installed. PycURL can be installed by running the following command in the terminal.

pip install pycurl 

Parsing the HTML

To parse the fetched HTML, there is a variety of libraries, but the most popular package in the Python community is BeautifulSoup. This package makes web scraping a breeze. To learn more about it, head over to its documentation.

Use the following command to install BeautifulSoup.

pip install beautifulsoup4

Getting started

Let us create a project directory and a virtual environment inside it. Use the following command inside the project folder to create the virtual environment.

python3 -m venv env

With this, we get an env folder, which has to be activated before we install our dependencies.

Activate the virtual environment with this command.

source env/bin/activate

Now that we have created and activated the virtual environment, let us install all the required packages.

pip install pycurl beautifulsoup4 certifi

Let us create a new file named scraper.py and add the following code to it.

import pycurl
import certifi
from io import BytesIO
from bs4 import BeautifulSoup

In the lines above, we have imported all the packages that will be used to build our web scraper.

For the sake of better structure, we will divide the work into two main parts: data fetching and data parsing. The fetching will be done with PycURL, and for the parsing we will use BeautifulSoup.

Data fetching with cURL

We have a sample URL from which we will fetch the HTML. Paste the following code into the scraper.py file.

TARGET_URL = 'https://httpbin.org/forms/post'

# Using cURL and Python to gather data from a server via PycURL
buffer = BytesIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, TARGET_URL)
curl.setopt(curl.WRITEDATA, buffer)
curl.setopt(curl.CAINFO, certifi.where())
curl.perform()
curl.close()

# Using BytesIO to retrieve the scraped data
body = buffer.getvalue()

# Saving the output and printing it in terminal
data = body.decode('iso-8859-1')
print(data)

Let us explain what is happening here. The first line defines the target URL to fetch the data from. We then create a buffer that will hold the data returned by the server. After initializing the PycURL handle, we set the options necessary for the communication between client and server: the URL, the buffer to write into, and the certificate bundle. certifi provides the root certificates used to verify the SSL certificates of the target domain.

When all of the options are set, curl.perform() makes the request, and curl.close() closes the connection.
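
PycURL exposes many more libcurl options through setopt, all of which must be set before curl.perform(). As a sketch, here are a few options that often come in handy when scraping (not required for this example; the User-Agent value is just a placeholder):

# Optional settings, shown for illustration; all are standard libcurl options
curl.setopt(curl.FOLLOWLOCATION, True)         # follow HTTP redirects
curl.setopt(curl.USERAGENT, 'my-scraper/1.0')  # identify the client to the server
curl.setopt(curl.CONNECTTIMEOUT, 10)           # seconds to wait for the connection
curl.setopt(curl.TIMEOUT, 30)                  # seconds allowed for the whole transfer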

Finally, we read the data that was stored in the buffer and print it to the console. The output after this step will look as follows.

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
  <!-- Example form from HTML5 spec http://www.w3.org/TR/html5/forms.html#writing-a-form's-user-interface -->
  <form method="post" action="/post">
   <p><label>Customer name: <input name="custname"></label></p>
   <p><label>Telephone: <input type=tel name="custtel"></label></p>
   <p><label>E-mail address: <input type=email name="custemail"></label></p>
   <fieldset>
    <legend> Pizza Size </legend>
    <p><label> <input type=radio name=size value="small"> Small </label></p>
    <p><label> <input type=radio name=size value="medium"> Medium </label></p>
    <p><label> <input type=radio name=size value="large"> Large </label></p>
   </fieldset>
   <fieldset>
    <legend> Pizza Toppings </legend>
    <p><label> <input type=checkbox name="topping" value="bacon"> Bacon </label></p>
    <p><label> <input type=checkbox name="topping" value="cheese"> Extra Cheese </label></p>
    <p><label> <input type=checkbox name="topping" value="onion"> Onion </label></p>
    <p><label> <input type=checkbox name="topping" value="mushroom"> Mushroom </label></p>
   </fieldset>
   <p><label>Preferred delivery time: <input type=time min="11:00" max="21:00" step="900" name="delivery"></label></p>
   <p><label>Delivery instructions: <textarea name="comments"></textarea></label></p>
   <p><button>Submit order</button></p>
  </form>
  </body>
</html>

We have successfully made a cURL request and retrieved the HTML content for the next step.
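
One detail worth noting: the code above decodes the body as iso-8859-1 unconditionally. A more careful sketch reads the charset from the Content-Type response header instead, falling back to iso-8859-1 when none is advertised. Note that curl.getinfo() must be called after curl.perform() but before curl.close():

# Read the Content-Type response header, e.g. "text/html; charset=utf-8"
content_type = curl.getinfo(pycurl.CONTENT_TYPE) or ''
encoding = 'iso-8859-1'  # fallback when no charset is advertised
if 'charset=' in content_type:
    encoding = content_type.split('charset=')[-1].strip()
data = buffer.getvalue().decode(encoding)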

Data parsing with BeautifulSoup

Raw data is meaningless on its own; to extract value from it, we parse the information. The main goal of a web scraper is to extract the right information from the HTML. Let us say we want to extract all the text inside the <p> tags of the HTML. We can do this by adding the following code to our scraper.py.

soup = BeautifulSoup(data, 'html.parser')
paragraphs = soup.find_all("p")
for p in paragraphs:
   print(p)

BeautifulSoup takes the decoded data, and in the second argument we define the parser type. Since we are parsing HTML data, we give it html.parser. Next, we use the find_all function to get all the p tags. The output here looks as follows.
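
<p><label>Customer name: <input name="custname"/></label></p>
<p><label>Telephone: <input type="tel" name="custtel"/></label></p>
<p><label>E-mail address: <input type="email" name="custemail"/></label></p>
<p><label> <input type="radio" name="size" value="small"/> Small </label></p>
<p><label> <input type="radio" name="size" value="medium"/> Medium </label></p>
<p><label> <input type="radio" name="size" value="large"/> Large </label></p>
<p><label> <input type="checkbox" name="topping" value="bacon"/> Bacon </label></p>
<p><label> <input type="checkbox" name="topping" value="cheese"/> Extra Cheese </label></p>
<p><label> <input type="checkbox" name="topping" value="onion"/> Onion </label></p>
<p><label> <input type="checkbox" name="topping" value="mushroom"/> Mushroom </label></p>
<p><label>Preferred delivery time: <input type="time" min="11:00" max="21:00" step="900" name="delivery"/></label></p>
<p><label>Delivery instructions: <textarea name="comments"></textarea></label></p>
<p><button>Submit order</button></p>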

It is up to our creative power to extract whatever we want. Since in this article we want the text inside the p tags, change the above code as follows.

soup = BeautifulSoup(data, 'html.parser')
paragraphs = soup.find_all("p")
for p in paragraphs:
   print(p.text)

The output will be as follows.
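
Customer name:
Telephone:
E-mail address:
Small
Medium
Large
Bacon
Extra Cheese
Onion
Mushroom
Preferred delivery time:
Delivery instructions:
Submit order

Nothing restricts us to p tags, of course. As one more sketch using the same parsed soup, we could collect the name attribute of every input field in the form:

# Collect the "name" attribute of every <input> field in the form
inputs = soup.find_all("input")
print([i.get("name") for i in inputs])
# ['custname', 'custtel', 'custemail', 'size', 'size', 'size',
#  'topping', 'topping', 'topping', 'topping', 'delivery']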

Putting it all together

In the end, our scraper.py file looks as follows.

import pycurl
import certifi
from io import BytesIO
from bs4 import BeautifulSoup
# Setting global variables
TARGET_URL = 'https://httpbin.org/forms/post'

# Using cURL and Python to gather data from a server via PycURL
buffer = BytesIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, TARGET_URL)
curl.setopt(curl.WRITEDATA, buffer)
curl.setopt(curl.CAINFO, certifi.where())
curl.perform()
curl.close()

# Using BytesIO to retrieve the scraped data
body = buffer.getvalue()

# Saving the output and printing it in terminal
data = body.decode('iso-8859-1')

soup = BeautifulSoup(data, 'html.parser')
paragraphs = soup.find_all("p")
for p in paragraphs:
   print(p.text)

Once you have extracted the data, you can store it or send it on to your backend for further processing; the possibilities are endless!
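
For instance, here is a minimal sketch of one such next step, writing the extracted paragraph text to a CSV file with Python's standard csv module (the file name paragraphs.csv is just an example):

import csv

# Write each paragraph's text as one row in a CSV file
with open('paragraphs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text'])  # header row
    for p in paragraphs:
        writer.writerow([p.text.strip()])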

Conclusion

In this article, we have seen how easy it is to get started and create your own web scraper. With PycURL and BeautifulSoup, you can build advanced web scrapers for your needs without unnecessary complications.
