Can lxml parse XML documents as well as HTML?

Yes, lxml parses both XML and HTML. It is a Pythonic binding for the C libraries libxml2 and libxslt, which are widely used for processing XML and HTML data. lxml exposes a largely ElementTree-compatible API for parsing, creating, and modifying XML and HTML documents.

Here's how you can use lxml to parse XML and HTML documents:

Parsing XML with `lxml`

To parse an XML document with lxml, you can use the etree module from the lxml package. Here's an example of how to parse an XML file:

from lxml import etree

# Parse the XML file
xml_file = 'example.xml'
tree = etree.parse(xml_file)

# Get the root element
root = tree.getroot()

# Print the tag of the root element
print(root.tag)

# Iterate over the children of the root element
for child in root:
    print(child.tag, child.attrib)

If you have an XML string, you can parse it using etree.fromstring:

from lxml import etree

xml_data = '''<?xml version="1.0"?>
<root>
    <child name="child1"/>
    <child name="child2"/>
</root>'''

root = etree.fromstring(xml_data)

# Use the root just like you would with a parsed file
print(root.tag)
for child in root:
    print(child.tag, child.attrib)
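
One detail worth knowing when parsing from a string: the example above works because its XML declaration names no encoding. As far as I know, lxml refuses a Python `str` whose declaration does include an `encoding` attribute and expects `bytes` in that case. The sketch below illustrates this with a made-up document; the variable names are illustrative only, and the exact behavior may vary between lxml versions, so the error is captured rather than assumed.

```python
from lxml import etree

# Bytes input: the parser reads the encoding from the XML declaration.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><root><child name="child1"/></root>'
root_from_bytes = etree.fromstring(xml_bytes)
print(root_from_bytes.tag)

# The same document as a str. lxml versions I am aware of raise ValueError
# here because the declaration names an encoding; capture it instead of
# relying on it.
xml_text = xml_bytes.decode("utf-8")
try:
    etree.fromstring(xml_text)
    str_error = None
except ValueError as exc:
    str_error = exc

print("str input rejected:", str_error is not None)
```

If you only have a `str`, encoding it back to bytes with the declared encoding (for example `xml_text.encode("utf-8")`) before parsing is one way around this.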

Parsing HTML with `lxml`

lxml also provides tools for parsing HTML, which can be particularly useful for web scraping. The lxml.html module is designed to deal with the quirks of real-world HTML. Here's an example of using lxml to parse an HTML document:

from lxml import html

# Parse the HTML file
html_file = 'example.html'
tree = html.parse(html_file)

# Get the root element
root = tree.getroot()

# Print the title of the HTML document
print(root.find(".//title").text)

# Find all 'a' elements and print their href attribute
for element in root.findall(".//a"):
    print(element.get('href'))

For HTML content in a string, use html.fromstring:

from lxml import html

html_data = '''
<!DOCTYPE html>
<html>
<head>
    <title>Example Page</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <a href="http://example.com">Example Link</a>
</body>
</html>'''

root = html.fromstring(html_data)

# Use the root to navigate the HTML structure
print(root.find(".//title").text)
for element in root.findall(".//a"):
    print(element.get('href'))
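
To make the "quirks of real-world HTML" point concrete, here is a small sketch that feeds the same deliberately malformed markup to both parsers. The markup is invented for illustration. The HTML parser is expected to repair the unclosed tags, while the strict XML parser should reject the input.

```python
from lxml import etree, html

# Deliberately malformed: neither <p> nor <a> is ever closed.
broken = "<div><p>First <a href='/a'>link<p>Second <a href='/b'>link</div>"

# The HTML parser recovers and builds a usable tree.
fragment = html.fromstring(broken)
hrefs = [a.get("href") for a in fragment.iter("a")]
print(hrefs)

# The XML parser treats the same text as a well-formedness error.
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False
print("parsed as XML:", xml_ok)
```

This is why `lxml.html` is usually the better fit for scraped pages, and `lxml.etree` for documents that are supposed to be well-formed XML.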

Installing `lxml`

If you don't have lxml installed, you can install it using pip:

pip install lxml

Keep in mind that lxml is built on C extensions. On common platforms pip normally installs a prebuilt binary wheel, so no compiler or extra libraries are needed. If pip has to build lxml from source instead, make sure the system dependencies are installed (like libxml2-dev and libxslt-dev on Debian-based systems).
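
A quick way to confirm that the installation works is to import the package and print the version information it exposes. This is a minimal check, not an exhaustive one; the version tuples printed will depend on your environment.

```python
from lxml import etree

# Version of the lxml package itself.
print("lxml:", etree.LXML_VERSION)

# Versions of the underlying C libraries that lxml is running against.
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```

If the import fails, the installation did not complete, and the build-dependency note above is the first thing to check.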

Conclusion

lxml is a versatile library that is capable of parsing both XML and HTML documents. Its API is easy to use and provides access to sophisticated XPath and XSLT tools, making it a popular choice for a wide range of parsing tasks in Python.
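
As a brief illustration of the XPath support mentioned above, the following sketch queries the small XML document used earlier. The variable name `wanted` is an arbitrary choice for this example; lxml lets you pass XPath variables as keyword arguments to `xpath()`.

```python
from lxml import etree

root = etree.fromstring(
    '<root><child name="child1"/><child name="child2"/></root>'
)

# Select every "name" attribute of the <child> elements under <root>.
names = root.xpath("/root/child/@name")
print(names)

# Select elements by attribute value, passing the value as an XPath variable
# rather than formatting it into the expression string.
matches = root.xpath("//child[@name=$wanted]", wanted="child2")
print(matches[0].get("name"))
```

Passing values as variables avoids quoting problems when the value comes from user input.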
