Web Scraping with PHP

Posted by Vlad Mishkin | February 5, 2023 | Tags: Programming | PHP |

PHP is one of the most popular programming languages in the world. It is an open-source scripting language that runs on most web servers. PHP is used to create websites and web applications, as well as command-line tools and other server-side software.

Web scraping allows for the extraction of data from websites and web applications. While there are many ways to scrape data from the web, PHP has become one of the most popular languages to do it with. There are many web scraping libraries available for PHP today, so let's look at the ones most popular with developers.

PHP is a robust and flexible programming language, which makes it a fantastic choice for web scraping. Web scraping libraries make the job much more accessible, and many of them support exporting data to formats like CSV and XML.

Goutte

Goutte is a feature-rich PHP web scraper that is easy to use, even for beginners. It extracts information from HTML pages with a powerful, robust feature set that doesn't come with a big learning curve. It has an understandable object-oriented design and supports selecting elements with CSS selectors. Note that it does not execute JavaScript, so it only sees the static HTML a server returns.

Goutte was developed by Fabien Potencier, the creator of the Symfony framework, and it is built on top of Symfony components such as BrowserKit and DomCrawler.

The advantages of using Goutte for web scraping are:

  • Community: Goutte has a large community who are willing to help out if you run into problems.
  • Documentation: Goutte is well supported by a large volume of useful documentation.
  • Speed: Websites can be scraped very quickly with Goutte because it uses HTTP/1.1 persistent connections: the client makes a single connection to the server and reuses it for subsequent requests, which reduces overhead and increases performance.
  • Stability: Reusing persistent connections also means there's less chance of encountering connection errors than with an HTTP/1.0 client or non-persistent HTTP/1.1 connections.

Goutte is a popular PHP library for web scraping for these reasons. However, it does have some limitations that might make it challenging for some projects:

  • Can't handle dynamic, JavaScript-rendered content, since it doesn't execute JavaScript.
  • Some programming ability and familiarity with PHP are required.

Goutte PHP Web Scraper Example

Let's run through an example of navigating to a page using Goutte and PHP and clicking the link visible on the page.

Before we get started, we'll need to install Composer globally. Composer is a tool for dependency management in PHP. It allows you to declare the libraries your project depends on, and it will manage them for you.
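For reference, Composer records those declared dependencies in a composer.json file at the project root. After requiring Goutte, yours might look something like this (the version constraint shown is illustrative):

```json
{
    "require": {
        "fabpot/goutte": "^4.0"
    }
}
```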

Make sure Goutte is installed by executing this command in your terminal:

composer require fabpot/goutte

Create a new PHP script for your project. You can call this script whatever you want. Open this new script and add the following 3 lines of code to initialize Goutte:

require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();

Now that Goutte is initialized, add the lines below to the end of the file to fetch a URL using the $client->request() method and then click the link on the page.

$url = "http://example.com/";
$crawler = $client->request('GET', $url);

// Click on the "More information..." link
$link = $crawler->selectLink('More information...')->link();
$crawler = $client->click($link);

There you have it! In a few lines of code, we were able to navigate to a web page and click the link. Let's move on to another popular library for web scraping in PHP, Simple HTML DOM.

Simple HTML DOM

As we've mentioned before, web scraping is a process that involves extracting data from a website. It is commonly used to get information about websites, find contact details of people and organizations, or extract data from the Internet. Let's look at how Simple HTML DOM can be used for web scraping in PHP.

Simple HTML DOM Parser handles any HTML document, even ones that are considered invalid by the HTML specification. The advantage of using Simple HTML DOM for web scraping in PHP is that it is easy to learn and implement: it ships as a single PHP file with no external dependencies, and its jQuery-style selectors often need less code than PHP's built-in DOM extension.

The advantages of Simple HTML DOM for web scraping in PHP are:

  • Simplicity: You can accomplish most simple tasks in a few lines of code
  • Speed: Parsing and querying a typical page is fast, so you can work through many HTML pages in one run without encountering performance issues.
  • Lightweight: You don't need to install any extra software or extensions, just PHP and the single library file, making it a lightweight solution for web scraping.

These are all great reasons to use this library for web scraping purposes. However, there are some potential disadvantages that need to be mentioned.

The disadvantages and potential limitations of using this library are as follows:

  • Memory usage: The whole document is parsed into an object tree in memory, which can be a problem for very large pages.
  • Rigid: Simple HTML DOM is not as flexible as other methods, and it can be difficult to use if you're not familiar with HTML.

Let's demonstrate the use of this library with a practical example.

Simple HTML DOM Web Scraping Example

To get started, download the latest version of Simple HTML DOM from its SourceForge project page. This will download the library in its entirety.

Copy and paste the simple_html_dom.php file into your active project.

Create a new PHP file with whatever name you want, for example, scrape.php:

Open this new file and add the Simple HTML DOM library to your project using the following code:

include('simple_html_dom.php');

Now add this code which is responsible for navigating to the website itself:

$html = file_get_html('http://example.com/');

This function, file_get_html(), is part of the Simple HTML DOM library. It retrieves all of the HTML from the URL we provide and returns a DOM object, which is stored in the variable $html.

Add this line of code to find the first H1 heading on the page:

// This finds the first H1 heading on the page. Our page has only one H1 heading, so this works perfectly
echo $html->find('h1', 0)->plaintext;

Now open this new file on your web server and you'll see the heading from http://example.com/: "Example Domain".
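If you'd like to see the same first-heading lookup without any third-party library, PHP's built-in DOM extension can do it too. Here is a self-contained sketch that uses an inline HTML string instead of a live page:

```php
// Self-contained sketch: find the first <h1> with PHP's built-in DOM extension,
// roughly equivalent to $html->find('h1', 0)->plaintext in Simple HTML DOM.
$htmlString = '<html><body><h1>Example Domain</h1><p>Some text.</p></body></html>';

// Suppress warnings about markup the parser considers invalid
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);

// item(0) returns the first matching element in document order
$firstH1 = $doc->getElementsByTagName('h1')->item(0);
echo $firstH1->textContent . "\n";
```

The built-in extension is stricter and more verbose than Simple HTML DOM, but it requires no extra files.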

That concludes our look at Simple HTML DOM. Let's look at a different library we can use for web scraping in PHP, PHPScraper!

PHPScraper

PHPScraper is a tool that allows you to scrape data from websites using PHP scripts. PHPScraper makes it easy to extract data from web pages with very little code: you point it at a page and read off the pieces of data you want.

With PHPScraper, you can extract common pieces of information from any page through simple properties such as title, h1, links, and images, without writing selector queries by hand.

Whether you're scraping data from a website, or simply want to crawl a website and collect all of its links, you can use PHPScraper to accomplish this. It's a great tool for both web scraping and crawling.

There are a lot of benefits to using PHPScraper:

  • Ease of use: If you can write a few lines of PHP code, then you can use this tool.

  • Simple installation: A single composer require command adds it to your project; there is nothing else to set up.

  • Performance: There is no built-in limit on how many pages PHPScraper can process in one run.

The disadvantages are pretty minor compared with the benefits:

  • Might not suit every use case: It's not as flexible as some other libraries, so extracting data from an API response or parsing through complicated HTML pages can be awkward.
  • No JavaScript support: Like the other libraries covered here, it only works with the static HTML a server returns.

Let's look at an example of PHPScraper in action so we can gain a greater understanding of this library.

PHPScraper Example

Like the previous library we looked at, Goutte, we need to install Composer before using PHPScraper. Make sure Composer is installed before proceeding further.

Next, we'll add the PHPScraper library to our project. The library is usually installed using Composer:

composer require spekulatius/phpscraper

After the installation is completed, the package will be picked up by the Composer autoloader. If you are using vanilla PHP without a framework, you will need to include the autoloader in your script:

require 'vendor/autoload.php';

Let's write a PHP script that navigates to a webpage and counts the number of links present on the page.

Initialize the scraper and assign it a variable like so:

$web = new \spekulatius\phpscraper();

Now tell our scraper to navigate to our example website:

$web->go('http://example.com/');

Now add this code. It grabs all links that are present on the page, and prints out the total count of them, along with each link.

// Print the number of links.
echo "This page contains " . count($web->links) . " links.\n\n";

// Loop through the links
foreach ($web->links as $link) {
    echo " - " . $link . "\n";
}

/**
* This code will print out:
*
* This page contains 1 link.
*
* - https://www.iana.org/domains/example
*/

There we have it! Our example navigates to a web page and prints out all of the links present on that page. This concludes our examination of PHPScraper. Let's move on to the next option for web scraping in PHP, cURL.

cURL for Web Scraping in PHP

cURL is a command-line tool and library for making HTTP requests. It's great for web scraping because bindings exist for most programming languages, and it's easy to install on most operating systems.

It's easy to use and comes with PHP by default, so there's no need to install anything extra if you already have PHP installed. It also supports HTTPS, which is necessary for most web scrapes.

The biggest downside is that cURL is low-level: it only fetches the raw response, so you have to manage cookies, headers, and session state yourself through options such as CURLOPT_COOKIEJAR, and pair it with a separate parser such as DOMDocument to actually extract data.

Let's highlight some of the most notable advantages of using cURL for web scraping:

  • Feature-rich: The cURL library has a lot of features that make it easy to implement in your code.
  • Speed: It's very fast, which is important when you're scraping large amounts of data.
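As one example of that feature-richness, the extension's cookie-jar options let a handle store and resend session cookies automatically. This is a sketch: the URL is a placeholder, and no request is actually sent until curl_exec() is called.

```php
// Sketch: configure a cURL handle to persist cookies across requests.
$cookieJar = sys_get_temp_dir() . '/scraper_cookies.txt';

$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Write cookies received from the server to this file when the handle closes...
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
// ...and read cookies from the same file when sending requests.
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
```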

cURL PHP Web Scraping Example

PHP's cURL extension is a binding for a library called libcurl. This library allows you to connect to servers using many different protocols, including HTTP, HTTPS, and more. Once you're connected, you can start to scrape the information you need. Let's start with our example.

Create a new PHP script and input the following code:

// Initialize cURL
$ch = curl_init();

// URL to scrape
curl_setopt($ch, CURLOPT_URL, 'https://www.geeksforgeeks.org/matlab-data-types/');

// Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Execute the request and store the response
$output = curl_exec($ch);

// Close the cURL handle
curl_close($ch);

Setting CURLOPT_RETURNTRANSFER to true makes curl_exec() return the page as a string rather than outputting it directly. The code so far grabs the raw HTML we want to scrape from the web page.

You can view this data using an echo with the variable we stored the data in:

echo $output;

Now let's put this data into a DOM document that we can access and scrape:

// Suppress warnings caused by malformed HTML on real-world pages
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($output);

At this stage, we have our data in an HTML document structure inside a variable. Let's write some code that will print out every link present on the HTML page:

$tags = $dom->getElementsByTagName('a');

for ($i = 0; $i < $tags->length; $i++) {
    $link = $tags->item($i);
    // A DOMElement can't be echoed directly, so print its href attribute
    echo " - " . $link->getAttribute('href') . "\n";
}

This code will print out every link found on your scraped page. This concludes our look at using cURL for web scraping purposes. Let's look at our next, and final library for PHP web scraping.

Guzzle

Guzzle is a PHP library that allows you to connect to many different web services and servers. It boasts many features like support for cookies and file uploads. One of the most common uses for Guzzle is web scraping.

Web scraping with Guzzle is quite simple: create a client and start making requests from your script. By default, Guzzle uses PHP's cURL extension under the hood as its HTTP handler.

Guzzle is a lightweight PHP HTTP client library, and it's one of the best ways to do web scraping in PHP.

Here are some of the major advantages of using Guzzle for web scraping in PHP:

  • Speed: Guzzle is highly optimized for speed and performance.
  • Simplicity: It has a simple API that makes it easy to learn, even for beginners.
  • Security: It supports TLS/SSL encryption, which is important for security when scraping sensitive information.

Here are some of the possible limitations you may encounter when using Guzzle for web scraping:

  • Not a parser: Guzzle is only an HTTP client, so you have to pair it with something like DOMDocument and DOMXPath to extract data from the responses it fetches.
  • Heavier footprint: It pulls in several Composer dependencies, which can be overkill for a small one-off script.

Let's look at an example of using Guzzle for web scraping in PHP.

Guzzle PHP Web Scraper Example

As with other libraries, we need to make sure Composer is set up correctly before we can continue with our example. With that out of the way, let's continue with our example.

To initialize Guzzle, DOMDocument, and DOMXPath, add the following code to the guzzle_requests.php file:

require 'vendor/autoload.php';

$httpClient = new \GuzzleHttp\Client();
$response = $httpClient->get('https://example.com/');
$htmlString = (string) $response->getBody();

//add this line to suppress any warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);

The above code snippet loads the web page into a string, parses that string into a DOMDocument, and wraps it in a DOMXPath object assigned to the $xpath variable.

Now we want to target the H1 heading on the page.

$titles = $xpath->evaluate('/html/body/div/h1');
$extractedTitles = [];
foreach ($titles as $title) {
    $extractedTitles[] = $title->textContent . PHP_EOL;
    echo $title->textContent . PHP_EOL;
}

We use the foreach loop to extract the text contents and echo them to the terminal.
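To see the XPath step in isolation, you can run the same extraction against an inline HTML string, with no network request involved. A self-contained sketch:

```php
// Self-contained sketch: the same DOMXPath extraction as above,
// run against an inline HTML string instead of a fetched page.
$htmlString = '<html><body><div><h1>Example Domain</h1></div></body></html>';

// Suppress warnings about markup the parser considers invalid
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);

$extractedTitles = [];
foreach ($xpath->evaluate('/html/body/div/h1') as $title) {
    $extractedTitles[] = $title->textContent;
    echo $title->textContent . PHP_EOL;
}
```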

At this step, you may choose to do something with your extracted data, maybe assign the data to an array variable, write to a file, or store it in a database. That concludes our final example!
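As a quick illustration of the write-to-a-file option, here is a sketch using PHP's built-in fputcsv(), assuming an array of scraped values like the $extractedTitles array above:

```php
// Sketch: persist scraped values to a CSV file with PHP's built-in fputcsv().
// The $extractedTitles array stands in for data gathered by the scraper above.
$extractedTitles = ['Example Domain'];

$fp = fopen('titles.csv', 'w');
foreach ($extractedTitles as $title) {
    // Write one row per scraped value (delimiter, enclosure, escape passed explicitly)
    fputcsv($fp, [$title], ',', '"', '\\');
}
fclose($fp);
```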

Summary

This is just a taste of the many wonderful libraries available for web scraping in PHP. Once you have a grasp on the basics and some practice, feel free to experiment with each of the libraries we have mentioned to see which might be the best fit for your project.

In the end, it will depend on your needs. Whichever one you choose, web scraping in PHP is easy if you take the time to learn about the appropriate library for your project.
