Can Html Agility Pack convert HTML documents to plain text?

Yes. Html Agility Pack is a .NET library for parsing HTML, including malformed or "real-world" markup, so it can be used in C# to convert HTML documents to plain text and to extract data from HTML pages.

To convert an HTML document to plain text, you would typically load the HTML into an HtmlDocument object, then traverse the document and extract the text nodes. Here's an example of how you might do this:

using HtmlAgilityPack;
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var html = @"
            <html>
            <body>
                <h1>Welcome to My Homepage</h1>
                <p>This is a paragraph with <a href='https://example.com'>a link</a>.</p>
                <div>Some more text here in a div.</div>
            </body>
            </html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        string plainText = ConvertToPlainText(htmlDoc.DocumentNode);
        Console.WriteLine(plainText);
    }

    static string ConvertToPlainText(HtmlNode node)
    {
        StringBuilder sb = new StringBuilder();
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            if (subnode.NodeType == HtmlNodeType.Text)
            {
                // Append the text of the current node to the StringBuilder
                sb.Append(subnode.InnerText);
            }
            else if (subnode.NodeType == HtmlNodeType.Element)
            {
                // Recursively convert the child nodes to plain text
                sb.Append(ConvertToPlainText(subnode));
            }
        }
        return sb.ToString();
    }
}

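This code uses a recursive function, ConvertToPlainText, to traverse all nodes in the HTML document and concatenate the InnerText of each text node. For an element, the InnerText property returns the concatenated text of the node and all its descendants; for a text node, it returns that node's own text. Because the indentation and line breaks inside the HTML string are themselves text nodes, they appear in the output unchanged.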

This example does not exclude scripts, styles, or other elements that contain no human-readable text, so handle them explicitly, for example by skipping them during traversal. You could also modify the function to add spaces or newlines where needed so the text stays readable after the HTML tags are stripped.
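One way to skip non-readable elements is sketched below. The class name VisibleTextExtractor and the list of skipped tags are assumptions made for illustration; they are not part of Html Agility Pack. The sketch also decodes entities with HtmlEntity.DeEntitize, on the assumption that you want "&amp;" to appear as "&" in the output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of Html Agility Pack.
static class VisibleTextExtractor
{
    // Assumed, non-exhaustive list of elements whose contents are not readable text.
    static readonly HashSet<string> SkippedElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    public static string Extract(HtmlNode root)
    {
        var sb = new StringBuilder();
        AppendVisibleText(root, sb);
        return sb.ToString();
    }

    static void AppendVisibleText(HtmlNode node, StringBuilder sb)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Decode entities such as &amp; before appending the text.
                sb.Append(HtmlEntity.DeEntitize(child.InnerText));
            }
            else if (child.NodeType == HtmlNodeType.Element
                     && !SkippedElements.Contains(child.Name))
            {
                // Recurse only into elements that are not on the skip list.
                AppendVisibleText(child, sb);
            }
            // Comment nodes and skipped elements contribute nothing.
        }
    }
}
```

To try it, load a document as in the example above and call VisibleTextExtractor.Extract(htmlDoc.DocumentNode) in place of ConvertToPlainText.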

Keep in mind that the text obtained this way may not be perfectly formatted since some HTML tags affect the layout and appearance of text (like <br>, <p>, or block elements), and extra handling would be needed to replicate those effects in plain text.
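As a rough sketch of that extra handling, the variant below starts a new line for br and around a handful of block-level elements, and collapses runs of whitespace inside text nodes. The class name LayoutAwareText and the block-element list are assumptions chosen for illustration, and the sketch does not filter scripts or styles.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of Html Agility Pack.
static class LayoutAwareText
{
    // Assumed, non-exhaustive list of elements that start and end a line.
    static readonly HashSet<string> BlockElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "blockquote", "pre" };

    public static string Convert(HtmlNode root)
    {
        var sb = new StringBuilder();
        Append(root, sb);
        return sb.ToString().Trim();
    }

    static void Append(HtmlNode node, StringBuilder sb)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Collapse whitespace runs (including source indentation) to one space.
                string text = Regex.Replace(child.InnerText, @"\s+", " ");
                if (sb.Length == 0 || sb[sb.Length - 1] == '\n')
                    text = text.TrimStart(); // no leading space at the start of a line
                sb.Append(text);
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                if (child.Name.Equals("br", StringComparison.OrdinalIgnoreCase))
                {
                    sb.Append('\n'); // explicit line break
                }
                else if (BlockElements.Contains(child.Name))
                {
                    EnsureNewline(sb); // block content starts on its own line
                    Append(child, sb);
                    EnsureNewline(sb); // and ends the line afterwards
                }
                else
                {
                    Append(child, sb); // inline element: keep text on the same line
                }
            }
        }
    }

    static void EnsureNewline(StringBuilder sb)
    {
        // Drop trailing spaces, then end the current line if it has content.
        while (sb.Length > 0 && sb[sb.Length - 1] == ' ') sb.Length--;
        if (sb.Length > 0 && sb[sb.Length - 1] != '\n') sb.Append('\n');
    }
}
```

Calling LayoutAwareText.Convert(htmlDoc.DocumentNode) on the sample page above should put the heading, the paragraph, and the div on separate lines. A fuller converter would also need rules for lists, tables, and preformatted text.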
