Simple HTML DOM is a PHP library for parsing and manipulating HTML documents. It loads HTML from a file, a URL, or a string into a DOM tree, so you can extract data from it or modify it. Elements are located with CSS-style selectors similar to those in jQuery.
Simple HTML DOM is particularly useful for web scraping because it can handle poorly formatted HTML, which is common on the web, and provides a simple interface for accessing the DOM elements.
Here's a basic example of how you might use Simple HTML DOM to scrape data from a webpage:
// First, include the Simple HTML DOM library
include('simple_html_dom.php');
// Create a DOM object from a URL
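$html = file_get_html('http://example.com/');
// file_get_html returns false if the page could not be loaded
if (!$html) {
    exit('Could not load the page');
}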
// Find all images on a web page
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
// Find all links with the class 'external'
foreach ($html->find('a.external') as $element) {
    echo $element->href . '<br>';
}
// Find the div with an id of 'main'
$mainDiv = $html->find('div#main', 0);
if ($mainDiv) {
    echo $mainDiv->innertext;
}
// Clean up memory
$html->clear();
unset($html);
In this example, file_get_html is a function provided by Simple HTML DOM that reads the webpage into a DOM object. You can then use the find method to search for elements using CSS selectors. Called with only a selector, find returns an array of all matching elements. Passing a zero-based index as the second argument, as in find('div#main', 0), returns that single element instead, or null if nothing matches.
Please note that Simple HTML DOM is not a built-in PHP extension or part of the PHP core. It is a third-party library that you need to include in your project. Also, since it is a PHP library, it can't be used with JavaScript or other programming languages. However, there are equivalent libraries and tools for other languages. For instance, in Python, you might use libraries like Beautiful Soup or lxml for similar purposes.
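To give a rough idea of what the same extraction looks like in Python, here is a minimal sketch. Beautiful Soup and lxml both have to be installed separately, so this sketch swaps in the standard library's html.parser module as a dependency-free stand-in. It is not the API of either of those libraries. The class name ScrapeParser and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class ScrapeParser(HTMLParser):
    """Collects image sources and the hrefs of links with class 'external'."""

    def __init__(self):
        super().__init__()
        self.image_sources = []
        self.external_links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.image_sources.append(attributes["src"])
        elif tag == "a":
            # An element can carry several classes, so split before testing
            classes = (attributes.get("class") or "").split()
            if "external" in classes and attributes.get("href"):
                self.external_links.append(attributes["href"])


# Hypothetical sample markup; the <p> tag is deliberately left unclosed
SAMPLE_HTML = """
<div id="main">
  <p>Intro <img src="/logo.png">
  <a class="external nav" href="https://example.org/">Out</a>
  <a href="/about">About</a>
</div>
"""

parser = ScrapeParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.image_sources)   # ['/logo.png']
print(parser.external_links)  # ['https://example.org/']
```

Unlike Simple HTML DOM, this event-based parser builds no DOM tree and has no selector engine, so the matching logic is written by hand. For anything beyond a quick extraction, a library with selector support is likely the more comfortable choice.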
For JavaScript, especially in a Node.js environment, you can use a library like cheerio, which offers a jQuery-like syntax for querying and manipulating DOM elements on the server side.
Here's an example of using cheerio in a Node.js script:
const cheerio = require('cheerio');
const axios = require('axios');
async function scrapeData() {
    try {
        // Fetch the HTML with a GET request
        const { data } = await axios.get('http://example.com/');
        // Load the HTML string into cheerio
        const $ = cheerio.load(data);
        // Find all images and log their `src` attribute
        $('img').each((index, element) => {
            console.log($(element).attr('src'));
        });
        // Find all links with the class 'external' and log their `href` attribute
        $('a.external').each((index, element) => {
            console.log($(element).attr('href'));
        });
        // Find the div with an id of 'main' and log its inner HTML
        const mainDiv = $('div#main').html();
        console.log(mainDiv);
    } catch (error) {
        console.error(error);
    }
}
scrapeData();
In this JavaScript example, axios performs the HTTP GET request, and cheerio parses and queries the HTML much as you would with jQuery on the client side.