How do I save changes made to an HTML document with Html Agility Pack?

The Html Agility Pack (HAP) is a .NET library used to manipulate HTML documents. It is particularly useful for web scraping, as it allows you to navigate and edit the HTML of a web page. If you've made changes to an HTML document using HAP, you might want to save those changes to a file.

Below is a step-by-step guide on how to save changes made to an HTML document with the Html Agility Pack:

Step 1: Install Html Agility Pack

If you haven't already, install the Html Agility Pack from NuGet. From the .NET CLI you can run dotnet add package HtmlAgilityPack; in Visual Studio, run the following command in the Package Manager Console:

Install-Package HtmlAgilityPack

Step 2: Load the HTML Document

Next, load the HTML document into an HtmlDocument object. You can load it from a file, from a string, or from a web response (for example through HAP's HtmlWeb class).

var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(filePath); // Load the HTML file
// Or load from a string: htmlDoc.LoadHtml(htmlString);

Step 3: Make Changes to the HTML Document

Make whatever changes you need to the HTML document using the Html Agility Pack API.

var nodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        // Modify the href attribute
        node.SetAttributeValue("href", "http://newurl.com");
    }
}
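Attribute edits are only one kind of change. As a rough sketch of a few other common edits (removing a node, appending a new one, replacing inner markup), the snippet below works on a made-up fragment; the markup, the id "main", and the class "ad" are assumptions for illustration and do not come from any real page.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical input fragment, used only for illustration.
var doc = new HtmlDocument();
doc.LoadHtml("<div id=\"main\"><p class=\"ad\">Ad</p><p>Text</p></div>");

var main = doc.DocumentNode.SelectSingleNode("//div[@id='main']");

// Remove a node from the tree.
var ad = main.SelectSingleNode("./p[@class='ad']");
ad.Remove();

// Build a new node from a markup string and append it.
main.AppendChild(HtmlNode.CreateNode("<p>Added</p>"));

// Replace the contents of the first remaining paragraph.
main.SelectSingleNode("./p").InnerHtml = "Updated";

// Inspect the result before saving it.
var result = doc.DocumentNode.OuterHtml;
Console.WriteLine(result);
```

SelectSingleNode returns null when nothing matches, so real code should check the result before calling Remove or setting InnerHtml.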

Step 4: Save the Changes

After you've made changes to the HtmlDocument object, you can save it back to a file or a stream.

htmlDoc.Save(filePath); // Save the changes to the same file
// Or save to a new file: htmlDoc.Save(newFilePath);

If you want to save the document to a stream, like a MemoryStream, you can do the following:

using (var stream = new MemoryStream())
{
    htmlDoc.Save(stream);
    // Save leaves the stream positioned at the end of the written data;
    // set stream.Position = 0 before reading from it
}
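If the goal is to get the modified markup back as a string, or to control the encoding, a small in-memory round trip may be easier to follow than the file-based version. The sketch below assumes a made-up one-link document; the Save overloads that take a TextWriter and a Stream plus Encoding are the ones being illustrated, and all variable names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

// Hypothetical input markup, used only for illustration.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("<html><body><a href=\"http://oldurl.com\">Link</a></body></html>");

// Same kind of modification as in Step 3.
var link = htmlDoc.DocumentNode.SelectSingleNode("//a[@href]");
link.SetAttributeValue("href", "http://newurl.com");

// Save to a string through a TextWriter.
string savedHtml;
using (var writer = new StringWriter())
{
    htmlDoc.Save(writer);
    savedHtml = writer.ToString();
}

// Save to a stream with an explicit encoding, then rewind before reading.
string fromStream;
using (var stream = new MemoryStream())
{
    htmlDoc.Save(stream, Encoding.UTF8);
    stream.Position = 0;
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        fromStream = reader.ReadToEnd();
    }
}

Console.WriteLine(savedHtml);
```

The same Encoding argument can be passed when saving to a file, as in htmlDoc.Save(newFilePath, Encoding.UTF8), which is worth doing when the source page declares no encoding or a different one than you want on disk.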

Step 5: (Optional) Formatting the Output

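The Html Agility Pack writes the document back with the whitespace it parsed, so minified input is saved as a single line of text. HAP does not provide a built-in option to indent the output. If your markup is well-formed XML, one workaround is to pass it through XDocument, which is part of .NET itself (namespace System.Xml.Linq) and needs no extra package.

// Requires: using System.Xml.Linq;
// Parse throws an XmlException if the markup is not well-formed XML
var xDocument = XDocument.Parse(htmlDoc.DocumentNode.OuterHtml);
xDocument.Save(newFilePath); // Save indents the output by default

Keep in mind that XDocument treats the content as XML, not HTML. XDocument.Parse throws an XmlException on markup that is not well-formed XML (for example an unclosed <br> tag or an undeclared entity such as &nbsp;), and Save writes an XML declaration at the top of the file. Use this method only if you're sure that your HTML is well-formed and can be treated as XML.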

Conclusion

With the Html Agility Pack, you can load, manipulate, and save HTML documents in a few lines: install the package, load the document, make your changes, and call Save to write the result to a file or a stream. HAP does not indent its output, so if you need prettified markup you will have to pass the result through another tool, such as XDocument for XML-compatible documents.
