How do I parse XML responses using Requests?

When working with APIs and web scraping, you'll often encounter XML responses that need to be parsed and processed. The Python Requests library makes it easy to retrieve XML data, but you'll need additional libraries to parse the XML content effectively. This guide covers multiple approaches to parsing XML responses using popular Python XML parsing libraries.

Understanding XML Response Handling

The Requests library returns XML payloads as raw bytes (response.content) or decoded text (response.text); either way, the content still needs to be parsed with a dedicated XML library. Python offers several excellent options, each with its own strengths (a quick content check follows the list below):

  • xml.etree.ElementTree: Built-in Python library, lightweight and fast
  • lxml: High-performance library with XPath support
  • BeautifulSoup: User-friendly library that handles malformed XML gracefully
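
Before reaching for a parser, it helps to confirm that the response is actually XML. Below is a minimal sketch, assuming a placeholder endpoint, that checks the Content-Type header and passes raw bytes to the parser:

import requests

response = requests.get('https://api.example.com/data.xml')

# Servers typically label XML payloads as application/xml, text/xml,
# or a vendor type ending in +xml (e.g. application/rss+xml)
content_type = response.headers.get('Content-Type', '')
if 'xml' in content_type:
    # Prefer response.content (bytes) over response.text: the parser
    # can then honor the encoding declared in the XML prolog instead
    # of Requests' own encoding guess
    xml_bytes = response.content
    print(f"Received {len(xml_bytes)} bytes of XML")
else:
    print(f"Unexpected content type: {content_type}")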

Method 1: Using xml.etree.ElementTree (Built-in)

ElementTree is Python's built-in XML parsing library and requires no additional installation. It's well suited to well-formed XML documents and offers good performance.

import requests
import xml.etree.ElementTree as ET

# Make request to XML endpoint
response = requests.get('https://api.example.com/data.xml')

# Check if request was successful
if response.status_code == 200:
    # Parse XML content
    root = ET.fromstring(response.content)

    # Access elements by tag name
    for item in root.findall('item'):
        title = item.find('title').text
        description = item.find('description').text
        print(f"Title: {title}")
        print(f"Description: {description}")
else:
    print(f"Request failed with status code: {response.status_code}")

Working with XML Namespaces

Many XML documents use namespaces, which require special handling:

import requests
import xml.etree.ElementTree as ET

response = requests.get('https://api.example.com/atom-feed.xml')

if response.status_code == 200:
    root = ET.fromstring(response.content)

    # Define namespace map
    namespaces = {
        'atom': 'http://www.w3.org/2005/Atom',
        'media': 'http://search.yahoo.com/mrss/'
    }

    # Find elements with namespaces
    for entry in root.findall('.//atom:entry', namespaces):
        title = entry.find('atom:title', namespaces).text
        link = entry.find('atom:link', namespaces).get('href')
        print(f"Title: {title}, Link: {link}")

Method 2: Using lxml for Advanced XML Processing

The lxml library offers superior performance and advanced features like XPath queries. Install it using:

pip install lxml

Basic lxml Usage

import requests
from lxml import etree

# Fetch XML data
response = requests.get('https://api.example.com/products.xml')

if response.status_code == 200:
    # Parse with lxml
    root = etree.fromstring(response.content)

    # Use XPath for powerful element selection
    products = root.xpath('//product[@category="electronics"]')

    for product in products:
        name = product.xpath('./name/text()')[0]
        price = product.xpath('./price/text()')[0]
        print(f"Product: {name}, Price: ${price}")

Advanced XPath Queries with lxml

import requests
from lxml import etree

response = requests.get('https://api.example.com/inventory.xml')

if response.status_code == 200:
    root = etree.fromstring(response.content)

    # Complex XPath queries
    # Find products with price greater than 100
    expensive_items = root.xpath('//product[price > 100]')

    # Find the second product in each category
    second_products = root.xpath('//category/product[2]')

    # Find products whose <name> contains specific text
    laptops = root.xpath('//product[contains(name, "laptop")]')

    for item in expensive_items:
        name = item.xpath('./name/text()')[0]
        price = float(item.xpath('./price/text()')[0])
        print(f"Expensive item: {name} - ${price}")

Method 3: Using BeautifulSoup for Robust XML Parsing

BeautifulSoup excels at handling malformed or inconsistent XML and provides an intuitive API. Install it together with the lxml parser, which BeautifulSoup's 'xml' mode uses under the hood:

pip install beautifulsoup4 lxml

Basic BeautifulSoup XML Parsing

import requests
from bs4 import BeautifulSoup

# Fetch XML content
response = requests.get('https://api.example.com/feed.xml')

if response.status_code == 200:
    # Parse with BeautifulSoup using xml parser
    soup = BeautifulSoup(response.content, 'xml')

    # Find elements using CSS selectors or tag names
    items = soup.find_all('item')

    for item in items:
        # Guard against missing tags before extracting text
        title_tag = item.find('title')
        title = title_tag.get_text() if title_tag else 'No title'
        date_tag = item.find('pubDate')
        pub_date = date_tag.get_text() if date_tag else 'No date'
        print(f"Title: {title}")
        print(f"Published: {pub_date}")

Handling Malformed XML with BeautifulSoup

import requests
from bs4 import BeautifulSoup

response = requests.get('https://api.example.com/messy-data.xml')

if response.status_code == 200:
    # BeautifulSoup can handle malformed XML gracefully
    soup = BeautifulSoup(response.content, 'xml')

    # Use CSS selectors for flexible element selection
    products = soup.select('product[status="active"]')

    for product in products:
        # Safe text extraction with fallbacks
        name = product.find('name')
        name_text = name.get_text().strip() if name else 'Unknown'

        price = product.find('price')
        price_text = price.get_text().strip() if price else '0'

        print(f"Product: {name_text}, Price: {price_text}")

Error Handling and Best Practices

Always implement proper error handling when parsing XML responses:

import requests
import xml.etree.ElementTree as ET
from requests.exceptions import RequestException, Timeout
from xml.etree.ElementTree import ParseError

def parse_xml_safely(url, timeout=30):
    try:
        # Make request with timeout
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # Raises exception for bad status codes

        # Attempt to parse XML
        root = ET.fromstring(response.content)
        return root

    except RequestException as e:
        print(f"Request error: {e}")
        return None
    except ParseError as e:
        print(f"XML parsing error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Usage
xml_root = parse_xml_safely('https://api.example.com/data.xml')
if xml_root is not None:
    # Process XML data, guarding against missing <title> elements
    for item in xml_root.findall('.//item'):
        print(item.findtext('title', default='No title'))
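
Beyond catching exceptions, production code often benefits from automatic retries on transient failures. The sketch below uses Requests' standard HTTPAdapter and urllib3's Retry; the specific policy values are illustrative:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry GETs up to 3 times on connection errors and 5xx responses,
# with exponential backoff between attempts
retries = Retry(total=3, backoff_factor=0.5,
                status_forcelist=(500, 502, 503, 504))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('https://api.example.com/data.xml', timeout=30)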

Working with Large XML Files

For large XML files, consider using iterative parsing to manage memory usage:

import requests
from lxml import etree

def parse_large_xml(url):
    # Stream the response so the full document is never held in memory
    response = requests.get(url, stream=True)

    if response.status_code == 200:
        # Let urllib3 decompress gzip/deflate transparently so lxml
        # receives plain XML bytes from the raw stream
        response.raw.decode_content = True

        # iterparse builds the tree incrementally and fires an event
        # each time a <product> element has been fully parsed
        for event, elem in etree.iterparse(response.raw, events=('end',),
                                           tag='product'):
            # Process the individual product
            name = elem.findtext('name')
            price = elem.findtext('price')
            print(f"Product: {name}, Price: {price}")

            # Clear the processed element and drop earlier siblings
            # so memory use stays bounded regardless of file size
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

# Process large XML file efficiently
parse_large_xml('https://api.example.com/large-catalog.xml')

Converting XML to JSON

Sometimes you need to convert XML responses to JSON format for easier manipulation:

import requests
import xml.etree.ElementTree as ET
import json

def xml_to_dict(element):
    """Convert XML element to dictionary"""
    result = {}

    # Add attributes
    if element.attrib:
        result['@attributes'] = element.attrib

    # Add text content
    if element.text and element.text.strip():
        if len(element) == 0 and not element.attrib:
            # A leaf element with no attributes collapses to its text
            return element.text.strip()
        result['text'] = element.text.strip()

    # Add child elements
    for child in element:
        child_data = xml_to_dict(child)
        if child.tag in result:
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(child_data)
        else:
            result[child.tag] = child_data

    return result

# Fetch and convert XML to JSON
response = requests.get('https://api.example.com/data.xml')
if response.status_code == 200:
    root = ET.fromstring(response.content)
    data_dict = xml_to_dict(root)
    json_data = json.dumps(data_dict, indent=2)
    print(json_data)
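
If you would rather not maintain a hand-rolled converter, the third-party xmltodict package implements the same idea (install it with pip install xmltodict):

import json
import requests
import xmltodict

response = requests.get('https://api.example.com/data.xml')
if response.status_code == 200:
    # xmltodict.parse returns a nested dict; attributes are prefixed
    # with '@' and text content is stored under '#text'
    data_dict = xmltodict.parse(response.content)
    print(json.dumps(data_dict, indent=2))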

Choosing the Right XML Parser

  • Use ElementTree for simple, well-formed XML documents and when you want to avoid external dependencies
  • Use lxml when you need high performance, XPath support, or advanced XML features
  • Use BeautifulSoup when dealing with malformed XML or when you prefer a more intuitive API

When building web scrapers that need to handle dynamic content loaded after the initial page render, you may need to combine XML parsing with browser automation tools. Similarly, if you're working with APIs that return mixed content types, dispatching on the Content-Type header keeps your extraction code robust, as sketched below.
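
A minimal sketch of such a dispatcher, assuming the API signals the payload type through its Content-Type header:

import requests
import xml.etree.ElementTree as ET

def fetch_and_parse(url):
    """Parse a response as JSON or XML based on its Content-Type."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    content_type = response.headers.get('Content-Type', '')
    if 'json' in content_type:
        return response.json()
    if 'xml' in content_type:
        return ET.fromstring(response.content)
    raise ValueError(f"Unsupported content type: {content_type}")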

Conclusion

Parsing XML responses with the Requests library is straightforward when you choose the right XML parsing library for your needs. ElementTree works well for simple cases, lxml provides powerful features and performance, while BeautifulSoup offers the most forgiving approach for irregular XML. Always implement proper error handling and consider memory usage when working with large XML documents.

Remember to respect rate limits and terms of service when scraping XML data from web APIs, and consider caching parsed results to improve performance in production applications.
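
As a starting point for the caching suggestion above, here is a simple in-process cache keyed by URL; it is only appropriate for data that does not change while the program runs:

import requests
import xml.etree.ElementTree as ET

_cache = {}

def get_xml(url):
    # Return the previously parsed tree if this URL was fetched before
    if url not in _cache:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        _cache[url] = ET.fromstring(response.content)
    return _cache[url]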

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
