How do you handle API responses with different content types?

When working with APIs, you'll encounter various response content types beyond the standard JSON format. Modern APIs can return XML, HTML, plain text, binary data, and even mixed content types. Understanding how to properly handle these different formats is crucial for building robust web scraping and API integration applications.

Understanding Content Types

The HTTP Content-Type header indicates the media type of the response body. Common content types include:

  • application/json - JSON data
  • application/xml or text/xml - XML documents
  • text/html - HTML content
  • text/plain - Plain text
  • application/octet-stream - Binary data
  • image/jpeg, image/png - Image files
  • application/pdf - PDF documents
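In practice the header value often carries parameters after the media type, such as `text/html; charset=UTF-8`. As a sketch, the standard library's email parser (which implements the same header grammar HTTP uses) can split the media type from its parameters:

```python
from email.message import Message

def parse_content_type(header_value):
    # The email header parser splits "media/type; key=value" pairs reliably
    msg = Message()
    msg['content-type'] = header_value
    params = dict(msg.get_params()[1:])  # first entry is the media type itself
    return msg.get_content_type(), params

media_type, params = parse_content_type('text/html; charset=UTF-8')
# media_type is 'text/html'; params contains the charset parameter
```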

Detecting Content Types

Before processing a response, you should check its content type to determine the appropriate parsing strategy.

Python Example

import requests
import json
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

def handle_api_response(url):
    response = requests.get(url, timeout=30)  # requests has no default timeout
    content_type = response.headers.get('content-type', '').lower()

    if 'application/json' in content_type:
        return handle_json_response(response)
    elif 'application/xml' in content_type or 'text/xml' in content_type:
        return handle_xml_response(response)
    elif 'text/html' in content_type:
        return handle_html_response(response)
    elif 'text/plain' in content_type:
        return handle_text_response(response)
    elif 'application/octet-stream' in content_type:
        return handle_binary_response(response)
    else:
        return handle_unknown_response(response)

def handle_json_response(response):
    try:
        return response.json()
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON: {e}")
        return None

def handle_xml_response(response):
    try:
        root = ET.fromstring(response.content)
        return root
    except ET.ParseError as e:
        print(f"Failed to parse XML: {e}")
        return None

def handle_html_response(response):
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

def handle_text_response(response):
    return response.text

def handle_binary_response(response):
    return response.content

def handle_unknown_response(response):
    print(f"Unknown content type: {response.headers.get('content-type')}")
    return response.content

JavaScript Example

async function handleApiResponse(url) {
    try {
        const response = await fetch(url);
        const contentType = response.headers.get('content-type')?.toLowerCase() || '';

        if (contentType.includes('application/json')) {
            return await handleJsonResponse(response);
        } else if (contentType.includes('application/xml') || contentType.includes('text/xml')) {
            return await handleXmlResponse(response);
        } else if (contentType.includes('text/html')) {
            return await handleHtmlResponse(response);
        } else if (contentType.includes('text/plain')) {
            return await handleTextResponse(response);
        } else if (contentType.includes('application/octet-stream')) {
            return await handleBinaryResponse(response);
        } else {
            return await handleUnknownResponse(response);
        }
    } catch (error) {
        console.error('Request failed:', error);
        return null;
    }
}

async function handleJsonResponse(response) {
    try {
        return await response.json();
    } catch (error) {
        console.error('Failed to parse JSON:', error);
        return null;
    }
}

async function handleXmlResponse(response) {
    try {
        const text = await response.text();
        const parser = new DOMParser();
        return parser.parseFromString(text, 'text/xml');
    } catch (error) {
        console.error('Failed to parse XML:', error);
        return null;
    }
}

async function handleHtmlResponse(response) {
    try {
        const text = await response.text();
        const parser = new DOMParser();
        return parser.parseFromString(text, 'text/html');
    } catch (error) {
        console.error('Failed to parse HTML:', error);
        return null;
    }
}

async function handleTextResponse(response) {
    return await response.text();
}

async function handleBinaryResponse(response) {
    return await response.arrayBuffer();
}

async function handleUnknownResponse(response) {
    console.warn('Unknown content type:', response.headers.get('content-type'));
    return await response.blob();
}

Handling Specific Content Types

JSON Responses

JSON is the most common API response format. Always include error handling for malformed JSON:

# Python
import re

def safe_json_parse(response):
    try:
        return response.json()
    except json.JSONDecodeError:
        # Fallback: strip a JSONP wrapper such as callback({...});
        text = response.text.strip()
        match = re.match(r'^[\w$.]*\((.*)\);?\s*$', text, re.DOTALL)
        if match:
            text = match.group(1)
        return json.loads(text)

// JavaScript
async function safeJsonParse(response) {
    // Read the body once up front: a fetch response stream can only be
    // consumed a single time, so calling .text() after a failed .json()
    // would throw "body already read"
    const text = (await response.text()).trim();
    try {
        return JSON.parse(text);
    } catch (error) {
        // Fallback: strip a JSONP wrapper such as callback({...});
        const match = text.match(/^[\w$.]*\((.*)\);?\s*$/s);
        if (match) {
            return JSON.parse(match[1]);
        }
        throw error;
    }
}

XML Responses

XML parsing requires different libraries and approaches:

# Python with xml.etree.ElementTree
import xml.etree.ElementTree as ET

def parse_xml_response(response):
    try:
        root = ET.fromstring(response.content)

        # Extract data from XML
        data = {}
        for child in root:
            data[child.tag] = child.text

        return data
    except ET.ParseError as e:
        print(f"XML parsing error: {e}")
        return None

# Python with lxml for more advanced parsing
from lxml import etree

def parse_xml_with_lxml(response):
    try:
        root = etree.fromstring(response.content)

        # Use XPath to extract specific elements
        titles = root.xpath('//title/text()')
        return {'titles': titles}
    except etree.XMLSyntaxError as e:
        print(f"XML syntax error: {e}")
        return None

HTML Responses

When an API returns HTML, you'll need to parse it and extract the relevant data. This comes up often when handling AJAX requests with Puppeteer or processing scraped web pages:

# Python with BeautifulSoup
from bs4 import BeautifulSoup

def parse_html_response(response):
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract specific elements
    title_tag = soup.find('title')
    data = {
        'title': title_tag.text if title_tag else None,
        'links': [a['href'] for a in soup.find_all('a', href=True)],
        'images': [img['src'] for img in soup.find_all('img', src=True)]
    }

    return data

Binary Data Handling

For file downloads, images, or other binary content:

# Python
def download_binary_file(url, filename):
    response = requests.get(url, stream=True)
    response.raise_for_status()

    # Only write the file when the server actually returned an image
    if response.headers.get('content-type', '').startswith('image/'):
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        return True
    return False

# Check file size before downloading
def safe_binary_download(url, max_size_mb=10):
    # Note: some servers omit Content-Length, in which case this check passes
    response = requests.head(url, allow_redirects=True)
    content_length = int(response.headers.get('content-length', 0))

    if content_length > max_size_mb * 1024 * 1024:
        raise ValueError(f"File too large: {content_length} bytes")

    return requests.get(url)
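When you do save an unknown binary payload, the media type can suggest a sensible file extension. A small sketch using the standard library's mimetypes module, falling back to a generic .bin:

```python
import mimetypes

def extension_for(content_type):
    # Strip any parameters ("; charset=...") before looking up the media type
    media_type = content_type.split(';')[0].strip()
    return mimetypes.guess_extension(media_type) or '.bin'

# extension_for('image/png') returns '.png'
```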

Advanced Content Type Handling

Content Negotiation

Some APIs support content negotiation, allowing you to request specific formats:

# Request JSON explicitly
headers = {'Accept': 'application/json'}
response = requests.get(url, headers=headers)

# Request XML
headers = {'Accept': 'application/xml'}
response = requests.get(url, headers=headers)

# Request multiple formats with preference
headers = {'Accept': 'application/json, application/xml;q=0.8, text/html;q=0.6'}
response = requests.get(url, headers=headers)
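The q= weights in that last Accept header express preference order: application/json at the implicit default of q=1.0 is preferred, then XML, then HTML. As a sketch, this is roughly how a parser might rank those entries:

```python
def parse_accept(header):
    # Split an Accept header into (media_type, q) pairs, highest preference first
    entries = []
    for item in header.split(','):
        fields = item.strip().split(';')
        media_type = fields[0].strip()
        q = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            param = param.strip()
            if param.startswith('q='):
                q = float(param[2:])
        entries.append((media_type, q))
    return sorted(entries, key=lambda pair: pair[1], reverse=True)

ranked = parse_accept('application/json, application/xml;q=0.8, text/html;q=0.6')
# ranked[0] is ('application/json', 1.0)
```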

Handling Charset Encoding

Content type headers often include charset information:

import chardet

def get_text_with_encoding(response):
    content_type = response.headers.get('content-type', '')

    # Check if charset is specified, e.g. "text/html; charset=UTF-8"
    if 'charset=' in content_type:
        charset = content_type.split('charset=')[1].split(';')[0].strip().strip('"')
        return response.content.decode(charset, errors='replace')

    # Auto-detect encoding
    detected = chardet.detect(response.content)
    encoding = detected['encoding'] or 'utf-8'

    return response.content.decode(encoding, errors='replace')

Error Handling Best Practices

Implement comprehensive error handling for different content types:

class ContentTypeHandler:
    def __init__(self):
        self.handlers = {
            'application/json': self._handle_json,
            'application/xml': self._handle_xml,
            'text/xml': self._handle_xml,
            'text/html': self._handle_html,
            'text/plain': self._handle_text,
        }

    def process_response(self, response):
        content_type = response.headers.get('content-type', '').split(';')[0].strip().lower()

        handler = self.handlers.get(content_type, self._handle_default)

        try:
            return handler(response)
        except Exception as e:
            return {
                'error': f'Failed to process {content_type}: {str(e)}',
                'content_type': content_type,
                'status_code': response.status_code
            }

    def _handle_json(self, response):
        return response.json()

    def _handle_xml(self, response):
        import xml.etree.ElementTree as ET
        return ET.fromstring(response.content)

    def _handle_html(self, response):
        from bs4 import BeautifulSoup
        return BeautifulSoup(response.content, 'html.parser')

    def _handle_text(self, response):
        return response.text

    def _handle_default(self, response):
        return {
            'content': response.content,
            'encoding': response.encoding,
            'content_type': response.headers.get('content-type')
        }
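One benefit of this table-driven design is that new formats can be registered without touching the dispatch logic. As a sketch, a text/csv handler could be added like this (the registration line assumes a ContentTypeHandler instance named `handler`, which is hypothetical here):

```python
import csv
import io

def handle_csv(response):
    # Parse a text/csv body into a list of rows
    return list(csv.reader(io.StringIO(response.text)))

# Hypothetical registration on an existing ContentTypeHandler instance:
# handler.handlers['text/csv'] = handle_csv
```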

Testing Different Content Types

When building applications that handle multiple content types, create comprehensive tests:

import unittest
from unittest.mock import Mock, patch

class TestContentTypeHandling(unittest.TestCase):
    @patch('requests.get')
    def test_json_response(self, mock_get):
        mock_response = Mock()
        mock_response.headers = {'content-type': 'application/json'}
        mock_response.json.return_value = {'key': 'value'}
        mock_get.return_value = mock_response

        result = handle_api_response('https://api.example.com/data')
        self.assertEqual(result['key'], 'value')

    @patch('requests.get')
    def test_xml_response(self, mock_get):
        mock_response = Mock()
        mock_response.headers = {'content-type': 'application/xml'}
        mock_response.content = b'<root><item>test</item></root>'
        mock_get.return_value = mock_response

        result = handle_api_response('https://api.example.com/data')
        self.assertIsNotNone(result)

    @patch('requests.get')
    def test_unknown_content_type(self, mock_get):
        mock_response = Mock()
        mock_response.headers = {'content-type': 'application/unknown'}
        mock_response.content = b'unknown data'
        mock_get.return_value = mock_response

        result = handle_api_response('https://api.example.com/data')
        self.assertEqual(result, b'unknown data')

Console Commands and Tools

Use command-line tools to test API responses:

# Check content type with curl
curl -I https://api.example.com/data

# Request specific content type
curl -H "Accept: application/json" https://api.example.com/data
curl -H "Accept: application/xml" https://api.example.com/data

# Save binary response to file
curl -o image.jpg https://api.example.com/image

# Display response headers and content type
curl -v https://api.example.com/data 2>&1 | grep -i content-type

Conclusion

Handling different API response content types requires a flexible approach that can adapt to various formats while maintaining robust error handling. When working with complex web applications, you might also need to consider monitoring network requests in Puppeteer to understand the full picture of API interactions. By implementing proper content type detection, format-specific parsing, and comprehensive error handling, you can build applications that reliably process diverse API responses and provide a better user experience.

Remember to always validate content types, implement fallback mechanisms for parsing errors, and test your handlers with various response formats to ensure reliability across different API endpoints and scenarios.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
