Yes, you can modify an HTML document with lxml after parsing it. The lxml library provides mechanisms for parsing HTML and XML documents and for creating, removing, and altering elements, which makes it well suited to manipulating HTML documents.
Here's an example of how you might modify an HTML document using lxml in Python:
from lxml import html
# Sample HTML content
html_content = """
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome to the Sample Page</h1>
<p>This is a sample paragraph.</p>
</body>
</html>
"""
# Parse the HTML
tree = html.fromstring(html_content)
# Modify the title
title = tree.find('.//title')
title.text = 'Updated Page Title'
# Add a new element
new_paragraph = html.Element("p")
new_paragraph.text = "This is a new paragraph."
tree.body.append(new_paragraph)
# Remove an element
h1 = tree.find('.//h1')
tree.body.remove(h1)
# Print the modified HTML
print(html.tostring(tree, pretty_print=True).decode('utf-8'))
In this example, we:
1. Parse the HTML into a tree structure.
2. Change the text of the <title> element.
3. Create a new <p> element and append it to the <body>.
4. Remove the <h1> element from the <body>.
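One caveat about the removal step: tree.body.remove(h1) only works when the <h1> is a direct child of <body>, and removing an element also removes any text that follows it (its tail). The sketch below compares two alternatives on a made-up snippet: removing through getparent(), and lxml.html's drop_tree(), which keeps the trailing text. The SNIPPET markup and the two helper function names are illustrative assumptions, not part of the example above.

```python
from lxml import html

# Hypothetical markup: the <h1> is nested in a <div> and followed by loose text.
SNIPPET = (
    "<html><body><div><h1>Heading</h1>trailing text<p>Body</p></div></body></html>"
)


def remove_with_parent(markup):
    """Remove the <h1> through its parent; the trailing (tail) text goes with it."""
    doc = html.fromstring(markup)
    h1 = doc.find('.//h1')
    h1.getparent().remove(h1)
    return doc.find('.//div').text_content()


def remove_with_drop_tree(markup):
    """Remove the <h1> with drop_tree(); the trailing text stays in the parent."""
    doc = html.fromstring(markup)
    doc.find('.//h1').drop_tree()
    return doc.find('.//div').text_content()


removed = remove_with_parent(SNIPPET)
dropped = remove_with_drop_tree(SNIPPET)
print(repr(removed))
print(repr(dropped))
```

Which one fits depends on whether the text after the element belongs to it or to the surrounding content.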
After modifying the tree, we can serialize it back to a string using html.tostring. It returns bytes by default, which is why the example decodes the result as UTF-8. The pretty_print=True argument formats the output to be more readable.
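As a small sketch of the serialization options, html.tostring can also return a str directly when called with encoding="unicode", which avoids the separate decode step. The one-paragraph document used here is a made-up stand-in for the tree above.

```python
from lxml import html

# Hypothetical minimal document, used only to show the two return types.
doc = html.fromstring("<html><body><p>Hello</p></body></html>")

# Default call: the result is a bytes object.
as_bytes = html.tostring(doc)

# encoding="unicode" asks lxml for a str instead of bytes.
as_text = html.tostring(doc, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)
```

Either form works for writing the result to a file, as long as the file is opened in the matching mode (binary for bytes, text for str).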
It's also possible to use XPath or CSS selectors to find elements within the HTML document more easily with lxml. Here's a quick example using XPath:
# Find all paragraphs using XPath and add a class attribute
paragraphs = tree.xpath('//p')
for p in paragraphs:
    p.set('class', 'new-class')
And using CSS Selectors:
# Find all paragraphs using CSS Selectors and add a class attribute
# Note: lxml.cssselect requires the separate cssselect package (pip install cssselect)
from lxml.cssselect import CSSSelector
selector = CSSSelector('p')
paragraphs = selector(tree)
for p in paragraphs:
    p.set('class', 'new-class')
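Note that p.set('class', ...) replaces whatever class attribute an element already has. If existing classes should be kept, one option is the classes property that lxml.html elements expose, which behaves like a set. This is a minimal sketch on a made-up document; the "intro" class is an assumption added for illustration.

```python
from lxml import html

# Hypothetical document: the first paragraph already carries a class.
doc = html.fromstring(
    '<html><body><p class="intro">One</p><p>Two</p></body></html>'
)

for p in doc.xpath('//p'):
    # Adds the class without discarding classes that are already present.
    p.classes.add('new-class')

result = [p.get('class') for p in doc.xpath('//p')]
print(result)
```

With set(), the first paragraph would have lost its "intro" class; with classes.add() both values are kept.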
These examples demonstrate the flexibility of lxml in modifying parsed HTML documents. Remember that after making changes, you can always convert the lxml tree back into a string or byte representation to output the modified HTML.