Can Cheerio handle XML documents as well as HTML?

Yes. Cheerio can parse and manipulate XML documents as well as HTML. Cheerio is a fast, flexible, and lean implementation of core jQuery designed for the server. It is most often used to parse, manipulate, and render HTML pages, but once XML parsing is enabled through its options, the same jQuery-style selectors and methods work on XML documents.

Here's an example of how you can use Cheerio to parse and manipulate an XML document in Node.js:

const cheerio = require('cheerio');

// Sample XML data
const xmlData = `
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
   </book>
</catalog>
`.trim(); // an XML declaration should be the very first thing in the document

// Parse the string as XML. `xmlMode: true` selects the XML parser;
// recent Cheerio releases also accept the shorthand `{ xml: true }`.
const $ = cheerio.load(xmlData, {
  xmlMode: true
});

// Query the XML the same way you would with jQuery
$('book').each((index, element) => {
  const book = $(element); // wrap the node once and reuse it
  const id = book.attr('id');
  const author = book.find('author').text();
  const title = book.find('title').text();
  const price = book.find('price').text();

  console.log(`Book ID: ${id}`);
  console.log(`Author: ${author}`);
  console.log(`Title: ${title}`);
  console.log(`Price: ${price}`);
  console.log(); // blank line between books
});

// Add a new element to XML
const newBook = `
<book id="bk103">
   <author>Example Author</author>
   <title>Example Book Title</title>
   <genre>Non-fiction</genre>
   <price>29.99</price>
</book>
`;

$('catalog').append(newBook);

// Serialize the modified XML back to a string
const modifiedXML = $.xml();

console.log(modifiedXML);

In the example above:

  1. We load the xmlData string into Cheerio with the xmlMode option set to true. This tells Cheerio to parse the document as XML.
  2. We use jQuery-like selectors and methods to iterate over the <book> elements, extract text content, and log it.
  3. We append a new <book> element to the <catalog> element.
  4. We serialize the modified XML back into a string using $.xml().

When working with XML, enable xmlMode so that Cheerio parses the input with an XML parser rather than the HTML parser. In XML mode, tag and attribute names keep their original case, self-closing tags such as <item/> are treated as empty elements, and the document is not wrapped in <html>, <head>, and <body> elements.

Cheerio provides a convenient and familiar interface for web scraping and XML manipulation, making it a popular choice for many Node.js developers.
