Yes, lxml can parse XML documents as well as HTML. It is a Pythonic binding for the C libraries libxml2 and libxslt, which are widely used for parsing and transforming XML and HTML data. lxml provides a simple API for parsing, creating, and modifying XML and HTML documents.
Here's how you can use lxml to parse XML and HTML documents:
Parsing XML with lxml
To parse an XML document with lxml, you can use the etree module from the lxml package. Here's an example of how to parse an XML file:
from lxml import etree
# Parse the XML file
xml_file = 'example.xml'
tree = etree.parse(xml_file)
# Get the root element
root = tree.getroot()
# Print the tag of the root element
print(root.tag)
# Iterate over the children of the root element
for child in root:
    print(child.tag, child.attrib)
If you have an XML string, you can parse it using etree.fromstring:
xml_data = '''<?xml version="1.0"?>
<root>
<child name="child1"/>
<child name="child2"/>
</root>'''
root = etree.fromstring(xml_data)
# Use the root just like you would with a parsed file
print(root.tag)
for child in root:
    print(child.tag, child.attrib)
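Two details are worth knowing when parsing from memory. etree.fromstring rejects a str that carries an encoding declaration (it raises ValueError), so such input should be passed as bytes, and malformed XML raises etree.XMLSyntaxError. The following is a minimal sketch of both cases; the sample documents are made up for illustration.

```python
from lxml import etree

# Bytes input may carry an encoding declaration; a str with one raises ValueError.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><root><child name="child1"/></root>'
root = etree.fromstring(xml_bytes)
child_names = [child.get("name") for child in root]
print(child_names)

# Malformed input (the <child> tag is never closed) raises XMLSyntaxError.
broken_xml = b"<root><child></root>"
try:
    etree.fromstring(broken_xml)
    parsed_ok = True
except etree.XMLSyntaxError as exc:
    parsed_ok = False
    print("Could not parse:", exc)
```

Catching the exception lets a script skip or log a bad document instead of crashing.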
Parsing HTML with lxml
lxml also provides tools for parsing HTML, which can be particularly useful for web scraping. The lxml.html module is designed to deal with the quirks of real-world HTML. Here's an example of using lxml to parse an HTML document:
from lxml import html
# Parse the HTML file
html_file = 'example.html'
tree = html.parse(html_file)
# Get the root element
root = tree.getroot()
# Print the title of the HTML document
print(root.find(".//title").text)
# Find all 'a' elements and print their href attribute
for element in root.findall(".//a"):
    print(element.get('href'))
For HTML content in a string, use html.fromstring:
html_data = '''
<!DOCTYPE html>
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Hello, World!</h1>
<a href="http://example.com">Example Link</a>
</body>
</html>'''
root = html.fromstring(html_data)
# Use the root to navigate the HTML structure
print(root.find(".//title").text)
for element in root.findall(".//a"):
    print(element.get('href'))
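To illustrate what "quirks of real-world HTML" can mean in practice, here is a small sketch that feeds lxml.html a fragment with unclosed tags. The markup is invented for this example; the parser is expected to repair it into a usable tree rather than raise an error.

```python
from lxml import html

# Neither <p> nor <a> is closed; the HTML parser repairs the structure.
broken_html = '<div><p>First<p>Second <a href="http://example.com">link</div>'
root = html.fromstring(broken_html)

# XPath works on the repaired tree just as it does on well-formed input.
hrefs = root.xpath("//a/@href")
paragraphs = root.findall(".//p")

print(hrefs)
print(len(paragraphs))
```

The same input would be rejected by the XML parser in etree, which is why lxml.html is the better fit for scraped pages.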
Installing lxml
If you don't have lxml installed, you can install it using pip:
pip install lxml
Keep in mind that lxml relies on C extensions. Prebuilt wheels are published for the common platforms, so pip usually needs nothing extra. If pip has to build lxml from source instead, make sure the system dependencies are installed (like libxml2-dev and libxslt-dev on Debian-based systems).
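A quick way to confirm that the installation works is to import the package and print the version tuples that lxml.etree exposes. This is only a sanity check, not part of the parsing workflow.

```python
from lxml import etree

# Version of lxml itself and of the C libraries it is running against.
lxml_version = etree.LXML_VERSION
libxml_version = etree.LIBXML_VERSION
libxslt_version = etree.LIBXSLT_VERSION

print("lxml:", lxml_version)
print("libxml2:", libxml_version)
print("libxslt:", libxslt_version)
```

If the import fails, the install did not complete and the build dependencies are the first thing to check.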
Conclusion
lxml is a versatile library that is capable of parsing both XML and HTML documents. Its API is easy to use and provides access to sophisticated XPath and XSLT tools, making it a popular choice for a wide range of parsing tasks in Python.
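As a rough illustration of the XPath and XSLT support mentioned here, the sketch below queries the sample document used earlier and then runs it through a tiny stylesheet. The stylesheet is written for this example only; it joins the name attributes into a single line of text.

```python
from lxml import etree

doc = etree.fromstring('<root><child name="child1"/><child name="child2"/></root>')

# XPath: collect the name attribute of every child element.
names = doc.xpath("//child/@name")
print(names)

# XSLT: an illustrative stylesheet that emits each name followed by ';'.
stylesheet = etree.fromstring('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="root/child">
      <xsl:value-of select="@name"/>
      <xsl:text>;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(stylesheet)
result_text = str(transform(doc))
print(result_text)
```

etree.XSLT compiles the stylesheet once, so the resulting transform object can be reused across many documents.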