Web Scraping with JavaScript

Posted by Vlad Mishkin | February 5, 2023 | Tags: Programming | JavaScript |

Web scraping allows you to extract data from websites and web applications, and JavaScript is well suited to the task. It has become one of the most popular and widely used languages, and it is especially powerful when paired with NodeJS, an asynchronous, event-driven JavaScript runtime designed for building scalable network applications.

It is simple to add libraries to your JavaScript project using NodeJS. These libraries add features and functionality that are not available through vanilla JavaScript. In this article, we'll be examining the potential uses of the most popular web scraping libraries that are currently available for JavaScript.

Below, I have categorized the most popular JavaScript web scraping libraries and noted where each may be useful for your particular software project.

Which Web Scraping option is right for you?

Web scraping tools generally fall into three categories in terms of how they process and interact with HTML content.

  1. HTML source code - using tools like Cheerio to parse and query the raw HTML returned by the server.
  2. Headless browsers - Puppeteer, Selenium, and similar tools. More on this later.
  3. Building the DOM - using a library such as JSDom to construct the DOM from a string of HTML.

Let's look at each of these in more detail.

HTML Source code

This is the simplest approach, but it can only be used if you are sure that all of the data you are targeting is contained within the HTML source code. To check whether the data you want is in the source code, you can right-click on any webpage in your browser and choose "Inspect" or "View Page Source". This will show you the HTML source code. We'll be looking at a JS library called Cheerio that handles this scenario.

Headless Browsers

In many cases, you can't get the information you need from the raw HTML code, because the DOM is manipulated by JavaScript that executes in the background. Are you wondering what the DOM is?

According to W3 Schools: The HTML DOM is a standard object model and programming interface for HTML. It defines:

  • The HTML elements as objects
  • The properties of all HTML elements
  • The methods to access all HTML elements
  • The events for all HTML elements

In other words: The HTML DOM is a standard for how to get, change, add, or delete HTML elements.
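For example, here are a few of those operations as you might run them in a browser console (a hypothetical illustration, not tied to any library we cover later):

// Get an existing element
const heading = document.querySelector('h1');

// Change it
heading.textContent = 'Updated headline';

// Add a new element
const note = document.createElement('p');
note.textContent = 'Added at runtime';
document.body.appendChild(note);

// Delete an element
heading.remove();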

An example would be Single Page Applications (SPAs). The HTML pages for SPAs typically contain very little information, with JavaScript populating different parts of the HTML document at runtime.

To get the data from a SPA, you would typically need to employ the use of headless browsers. A headless browser is the same as your standard browser (Chrome, Firefox, Safari), except it has no user interface. It runs in the background and allows you to programmatically interact with different elements on the page, like clicking buttons, entering keystrokes, and more.

The most popular choices for web scraping this way are Puppeteer, Selenium, and Nightmare. We'll be taking a look at all of these libraries later.

Building the DOM

So why not use a headless browser for all scenarios? The answer is speed and the amount of computing power required. You are essentially simulating a full browser, which can be overkill when all you need to do is build the DOM.

There is a NodeJS library called JSDom, which will parse the HTML you pass it, just like a browser does. However, it isn't a browser but a tool for building a DOM from a given HTML source code, while also executing the JavaScript code within that HTML.

Thanks to this abstraction, JSDom runs faster than a headless browser. So if it's faster, why not use it instead of headless browsers all the time?

JSDom admits its shortcomings in its own documentation: People often have trouble with asynchronous script loading when using JSDom. Many pages load scripts asynchronously, but there is no way to tell when they're done doing so and thus when it's a good time to run your code and inspect the resulting DOM structure. This is a fundamental limitation.

Okay, we've covered the different categories of available web scrapers. Let's look at some of the most popular libraries in each of these categories.

The JavaScript web scraping libraries we'll be looking at are:

  • Puppeteer
  • Selenium
  • Nightmare
  • Axios & Cheerio
  • JSDom

Before we dive into the libraries themselves, let's make sure you have Node.js installed properly by following these steps.

Node.js Installation

If you don't have Node downloaded, download Node.js and npm, and check that it has been successfully installed by running the following commands in the terminal.

  • node -v (verifies that Node.js is installed)
  • npm -v (verifies that node package manager is installed)

Once you have installed Node.js, you will get access to npm, the inbuilt package manager, which will be used to install the libraries. Let's move on to our first JS library, Puppeteer.

Puppeteer

Puppeteer is a Node.js library maintained by Chrome's development team at Google. Puppeteer provides a high-level API to control headless Chrome or Chromium or to interact with the DevTools protocol.

Google designed Puppeteer to provide a simple yet powerful interface in Node.js for automating tasks and performing various actions using the Chromium browser engine. It runs headless by default, but it can be configured to run full Chrome or Chromium.

The API built by the Puppeteer team uses the DevTools Protocol to take control of a web browser and perform different tasks such as scrolling, clicking, and navigation.

Most actions that you can do manually in the browser can also be done using Puppeteer, making it a fantastic library for web scraping. Furthermore, these actions can be automated so you can save precious time and focus on critical tasks.

Let's go through a quick example to show you how to set up and perform a basic action using Puppeteer.

Run this command in your terminal to add Puppeteer to your project:

npm install puppeteer --save

Import the Puppeteer library into your script like so:

const puppeteer = require('puppeteer');

We'll write a function that navigates to a web page in the browser. This can be achieved in several simple lines of code:

async function performScrape(url){
    // Launch a headless browser and open a new tab
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to the URL, then shut the browser down
    await page.goto(url);
    await browser.close();
}

Finally, call your method and provide the URL to navigate to. Feel free to experiment and enter a URL of your own!

performScrape('https://edition.cnn.com/');
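Navigation alone isn't scraping, of course. As a rough sketch of how you might extract data (the scrapeLinks function name is ours, and the link selector is just an example), you can use page.$$eval to run code inside the page and collect every link it contains:

// (puppeteer is already required at the top of the script)
async function scrapeLinks(url){
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    // $$eval runs the callback in the page context against all matching elements
    const links = await page.$$eval('a', anchors => anchors.map(a => a.href));
    console.log(links);

    await browser.close();
}

scrapeLinks('https://edition.cnn.com/');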

That wraps up Puppeteer. As you can see, it was quite simple to get up and running and perform basic actions very quickly.

Let's move on to another popular library and headless browser option that's also used for automation and web scraping, Selenium!

Selenium

Selenium is a web browser automation tool. Primarily, it is for automating web applications for testing purposes, but it also has web scraping capabilities. Its versatility is one of the main reasons for Selenium's popularity. Selenium allows you to open a browser of your choice and perform tasks as an actual user would, such as:

  • Clicking buttons
  • Entering information in forms
  • Searching for specific information on the web pages

Let's run through a brief code example using Selenium.

First, add Selenium to your project by running the following command in the terminal:

npm install selenium-webdriver --save

Create a .js file and import Selenium into your project by writing the following line of code:

const { Builder } = require('selenium-webdriver');

Let's create a function that opens a particular URL in the Chrome browser.

async function performSeleniumScrape(url){
    let browser = await new Builder().forBrowser("chrome").build();
    await browser.get(url);
    await browser.quit();
}

You can call this method like so:

performSeleniumScrape('https://edition.cnn.com/');
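To actually pull data out of the page, Selenium provides By locators and findElements. Below is a minimal, self-contained sketch you could drop into its own file; it assumes Chrome and a matching chromedriver are installed, and the scrapeLinkText function name is just an example:

const { Builder, By } = require('selenium-webdriver');

async function scrapeLinkText(url){
    let browser = await new Builder().forBrowser("chrome").build();
    try {
        await browser.get(url);
        // findElements returns every element matching the CSS selector
        const anchors = await browser.findElements(By.css('a'));
        for (const anchor of anchors) {
            console.log(await anchor.getText());
        }
    } finally {
        // Always close the browser, even if scraping throws
        await browser.quit();
    }
}

scrapeLinkText('https://edition.cnn.com/');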

That concludes our explanation of Selenium. We'll take a look at one more headless browser that's popular with JavaScript users, Nightmare.

Nightmare

Nightmare is a high-level browser automation library, or as it's more commonly known, a headless browser. It is similar in functionality to both Puppeteer and Selenium. Let's go through a code example to demonstrate its use.

Install Nightmare by running the following command in the terminal:

npm install --save nightmare

Enter the following code in a file called webscraper.js. First import Nightmare using this line of code:

const Nightmare = require('nightmare');

We'll write code that creates a Nightmare instance, goes to the CNN website, and clicks the menu dropdown button (the #menuButton selector is an example; swap in a selector that exists on the page you're targeting):

const nightmare = Nightmare();

nightmare
  .goto('https://edition.cnn.com/')
  .click('#menuButton')
  .end()
  .then(() => console.log('Done'))
  .catch(error => {
    console.error('Error:', error);
  });

Run this by entering this command in your terminal:

node webscraper.js

You can see that Nightmare has a different syntax than the other headless browsers we've seen so far.
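Nightmare can also extract data from a page through its evaluate() method, which runs a function inside the page and resolves with whatever that function returns. Here's a minimal, self-contained sketch that collects every link on the CNN home page:

const Nightmare = require('nightmare');
const nightmare = Nightmare();

nightmare
  .goto('https://edition.cnn.com/')
  // evaluate() runs this function in the page and resolves with its return value
  .evaluate(() => Array.from(document.querySelectorAll('a')).map(a => a.href))
  .end()
  .then(links => console.log(links))
  .catch(error => console.error('Error:', error));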

Let's take a look at Axios & Cheerio, a different set of JS libraries we can use for web scraping.

Axios & Cheerio

This option is different from the previous three, as it is not a headless browser. Cheerio is a tool for parsing HTML and XML in Node.js, and it's very popular, with over 25k stars on GitHub.

It is fast, flexible, and easy to use. Cheerio implements a subset of core jQuery, so if you're already familiar with jQuery syntax, you will have no issues understanding and using Cheerio. According to the documentation, Cheerio parses markup and provides an API for manipulating the resulting data structure but does not interpret the result the way a web browser does.

The major difference between Cheerio and a web browser is that Cheerio does not have a user interface, load CSS, load external resources, or execute JavaScript. It simply parses markup and provides an API for manipulating the resulting data structure. This is why it is so much faster than using a headless browser.

If you want to perform web scraping with Cheerio, you need to fetch the markup using a package like Axios. Axios is a promise-based HTTP client for Node.js and the browser. Axios can retrieve the HTML, then Cheerio takes over and processes it.

Let's demonstrate how these two packages work together using an example.

In a new directory, run this command to create a new Node.js app:

npm init -y

Install your dependencies by running this command in the terminal:

npm install express axios cheerio

Install nodemon as a dev dependency (for development purposes only); it restarts our Node app automatically when files change:

npm install nodemon --save-dev

Create a file called webscraper.js and enter the following code:

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://edition.cnn.com/';

axios.get(url)
    .then((response) => {
        let $ = cheerio.load(response.data);
        $('a').each(function (i, e) {
            let link = $(e).attr('href');
            console.log(link);
        })
    }).catch(function (e) {
        console.log(e);
    });

Let's break this down. First, we import Axios and Cheerio into our script using this code:

const axios = require('axios');
const cheerio = require('cheerio');

This line is where Axios gets, or fetches, the HTML from the URL we have provided:

axios.get(url);

If the request is successful, we process the response. First, we have Cheerio load the response from our Axios request:

let $ = cheerio.load(response.data);

Then using Cheerio we iterate through every link that's present at the URL and print that link to the console:

$('a').each(function (i, e) {
    let link = $(e).attr('href');
    console.log(link);
})
In your own project, you may want to do more with the links here, such as store them in a file or open a certain link in a browser.
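For example, here is a sketch of that first idea: it collects the hrefs into an array and writes them to a file using Node's built-in fs module (the links.txt filename is arbitrary):

const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://edition.cnn.com/';

axios.get(url)
    .then((response) => {
        let $ = cheerio.load(response.data);
        const links = [];
        $('a').each(function (i, e) {
            const href = $(e).attr('href');
            if (href) links.push(href);
        });
        // Write one link per line to links.txt
        fs.writeFileSync('links.txt', links.join('\n'));
        console.log(`Saved ${links.length} links to links.txt`);
    }).catch(function (e) {
        console.log(e);
    });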

That concludes our look at Axios and Cheerio. Let's look at our final option, JSDom.

JSDom

You might remember from earlier in the article that JSDom falls into a different category than the other JavaScript web scraping tools we've looked at so far. With headless browsers like Puppeteer, Selenium, and Nightmare, you are essentially simulating a full browser, which can be overkill when all you need to do is build the DOM.

This is where JSDom shines: it parses the HTML you pass it, just as a browser does. However, it isn't a browser but a tool for building a DOM from a given HTML source code, while also executing the JavaScript code within that HTML.

Let's work through a code example that will highlight how JSDom differs from the other web scrapers we've seen thus far.

Install JSDom, together with the got HTTP client used in the example below, by running this command in your project terminal. Recent versions of got are published as ES modules only, so this require-based example needs an older release such as got@11:

npm install --save jsdom got@11

Import the necessary libraries using this code:

const got = require('got');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

This code uses JSDom to iterate through the links present on the CNN home page and print them out to the console. It is similar in design to our Axios and Cheerio example:

const url = 'https://edition.cnn.com/';

got(url).then(response => {
    const dom = new JSDOM(response.body);

    dom.window.document.querySelectorAll('a').forEach(link => {
        console.log(link.href);
    });
}).catch(err => {
    console.log(err);
});
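One detail worth knowing: by default, JSDom does not execute the scripts inside the HTML you give it. If you need that behaviour, you can opt in with the runScripts option, as in this minimal sketch (only do this with HTML you trust):

const { JSDOM } = require('jsdom');

const html = `<body><script>
  const p = document.createElement('p');
  p.textContent = 'Hello from a page script';
  document.body.appendChild(p);
</script></body>`;

// runScripts: "dangerously" tells JSDom to execute <script> tags in the HTML
const dom = new JSDOM(html, { runScripts: "dangerously" });
console.log(dom.window.document.querySelector('p').textContent);
// Prints: Hello from a page script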

Summary

We've reached the end of this article. We've looked at a lot of different libraries that can be used for the purposes of web scraping in JavaScript. Let's review what we've learned:

  • HTTP clients such as Axios and Got are used to send HTTP requests to a server and receive a response. This response is the HTML we can process and crawl through using our web scraping libraries.
  • Cheerio implements a subset of jQuery and can be run server-side for web crawling, but it does not execute JavaScript code.
  • JSDom builds a standards-compliant DOM from an HTML string and allows you to perform DOM manipulations on it.
  • Puppeteer, Selenium, and Nightmare are headless browsers that allow you to programmatically manipulate web applications as if a real user was interacting with them using a browser.

As we can see from this list, there are plenty of different ways to scrape data from the web, each suited to different needs and levels of complexity. Whether you're a beginner looking for a simple way to get started with web scraping, or an experienced user with specific needs, these tools will help you find your footing and take steps towards accomplishing your goals. Good luck, and have fun building your next web scraping project!
