What are the methods available in Goutte for DOM manipulation?

Goutte is a screen-scraping and web-crawling library for PHP. It provides an API to crawl websites and extract data from HTML/XML responses. Goutte is essentially a thin wrapper around Symfony's BrowserKit, DomCrawler, and CssSelector components plus an HTTP client (Guzzle in older releases, Symfony HttpClient since Goutte 4).

The Symfony DomCrawler component, which Goutte uses for DOM manipulation, is mainly a tool for selecting, traversing, and reading the nodes of a crawled page. In the examples, $crawler is the Crawler object returned by a Goutte request. Here are some of the key methods available:

  1. filter($selector): Filters the DOM elements by a CSS selector and returns a new Crawler instance containing the matching elements.
     $crawler->filter('.post-title')->each(function ($node) {
         echo $node->text();
     });
  2. filterXPath($xpath): Similar to filter(), but takes an XPath expression instead of a CSS selector.
     $crawler->filterXPath('//h1')->each(function ($node) {
         echo $node->text();
     });
  3. eq($position): Selects the element at the given zero-based position in the list of matched elements.
     $title = $crawler->filter('.post-title')->eq(0)->text();
  4. first(): Selects the first element in the list of matched elements.
     $firstTitle = $crawler->filter('.post-title')->first()->text();
  5. last(): Selects the last element in the list of matched elements.
     $lastTitle = $crawler->filter('.post-title')->last()->text();
  6. siblings(): Returns all sibling elements of the current node, excluding the node itself.
     $siblings = $crawler->filter('.selected')->siblings();
  7. nextAll(): Selects all following siblings.
     $followingSiblings = $crawler->filter('.selected')->nextAll();
  8. previousAll(): Selects all preceding siblings.
     $precedingSiblings = $crawler->filter('.selected')->previousAll();
  9. parents(): Returns all ancestors of the current node, not only the direct parent. Since Symfony 5.3 this method is deprecated in favor of ancestors().
     $parents = $crawler->filter('.selected')->parents();
  10. children(): Returns the direct child elements of the current node.
     $children = $crawler->filter('#parent')->children();
  11. text(): Returns the text content of the first node in the list of matched elements. It throws an InvalidArgumentException if nothing matched, unless a default value is passed.
     $text = $crawler->filter('.post-title')->text();
  12. html(): Returns the inner HTML of the first node in the list of matched elements, with the same behavior as text() when nothing matched.
     $html = $crawler->filter('.post-content')->html();
  13. attr($attribute): Returns the value of an attribute of the first node in the list of matched elements.
     $href = $crawler->filter('a')->attr('href');
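
The methods above can be tried without any network access by building a Crawler from a string. The following is a minimal sketch, assuming the Composer packages symfony/dom-crawler and symfony/css-selector are installed and autoloaded from vendor/autoload.php. The markup is a made-up sample. With Goutte itself, $crawler would instead come from the client's request() call.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector were installed
// with Composer, so vendor/autoload.php exists next to this script.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup, not taken from any real site.
$html = <<<'HTML'
<html><body>
  <ul id="parent">
    <li class="post-title">First</li>
    <li class="post-title selected">Second</li>
    <li class="post-title"><a href="/third">Third</a></li>
  </ul>
</body></html>
HTML;

$crawler = new Crawler($html);
$titles  = $crawler->filter('.post-title');

// Positional selection within the matched list.
$first  = $titles->first()->text();   // "First"
$second = $titles->eq(1)->text();     // "Second" (eq() is zero-based)
$last   = $titles->last()->text();    // "Third"

// Traversal relative to the element with class "selected".
$selected     = $crawler->filter('.selected');
$afterCount   = $selected->nextAll()->count();      // 1
$beforeCount  = $selected->previousAll()->count();  // 1
$siblingCount = $selected->siblings()->count();     // 2

// Direct children and attribute access.
$childCount = $crawler->filter('#parent')->children()->count(); // 3
$href       = $crawler->filter('#parent a')->attr('href');      // "/third"

echo "$first, $second, $last\n";
echo "siblings: $siblingCount, children: $childCount, href: $href\n";
```

Only element nodes are counted by the traversal methods, so the whitespace between the list items does not affect the results.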

Because most of these methods return a new Crawler instance, they can be chained to build complex DOM queries. Goutte makes it easy to crawl web pages and extract information using a simple and expressive API.
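
As an illustration of chaining, each() passes every matched node to the callback as its own Crawler and returns the callback results as an array, so a nested filter() can collect structured data. This is a sketch under the same assumptions as any standalone DomCrawler script (Composer autoloader available, symfony/css-selector installed). The markup and the .does-not-exist selector are hypothetical.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup.
$html = <<<'HTML'
<div class="post"><h2 class="post-title"><a href="/a">Alpha</a></h2></div>
<div class="post"><h2 class="post-title"><a href="/b">Beta</a></h2></div>
HTML;

$crawler = new Crawler($html);

// each() returns an array of whatever the callback returns.
$posts = $crawler->filter('.post')->each(function (Crawler $post) {
    // filter() on a sub-crawler searches only inside that node.
    $link = $post->filter('.post-title a');

    return [
        'title' => $link->text(),
        'url'   => $link->attr('href'),
    ];
});

// text() throws on an empty node list, so check count() first.
$missing  = $crawler->filter('.does-not-exist');
$fallback = $missing->count() > 0 ? $missing->text() : null;

print_r($posts);
```

Checking count() before calling text(), html(), or attr() avoids an exception when a selector matches nothing, which is common when a site's markup changes.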

Please note that Goutte is for server-side scraping using PHP, and the above code examples are for PHP only. If you're looking for similar functionality in JavaScript, you might want to look into Node.js libraries such as Cheerio, which parses HTML with a jQuery-like API, or Puppeteer, which drives a headless Chrome browser and can work with pages that render their content with JavaScript.
