Table of contents

How do you extract images and their attributes using Cheerio?

Cheerio is a powerful server-side implementation of jQuery designed for Node.js applications. One of the most common web scraping tasks is extracting images and their associated attributes from HTML documents. This comprehensive guide will show you how to effectively extract images using Cheerio, covering various scenarios and best practices.

Getting Started with Cheerio

Before diving into image extraction, ensure you have Cheerio installed in your Node.js project:

npm install cheerio
npm install axios  # For fetching HTML content

Here's a basic setup for loading HTML content with Cheerio:

const cheerio = require('cheerio');
const axios = require('axios');

async function loadPage(url) {
  try {
    const response = await axios.get(url);
    return cheerio.load(response.data);
  } catch (error) {
    console.error('Error fetching page:', error);
    return null;
  }
}

Basic Image Extraction

Extracting All Images

The simplest way to extract all images from a webpage is to select all <img> elements:

const $ = cheerio.load(html);

// Extract all image elements
$('img').each((index, element) => {
  const $img = $(element);
  const src = $img.attr('src');
  const alt = $img.attr('alt');

  console.log(`Image ${index + 1}:`);
  console.log(`Source: ${src}`);
  console.log(`Alt text: ${alt}`);
});

Extracting Specific Image Attributes

You can extract various attributes from image elements:

function extractImageAttributes($) {
  const images = [];

  $('img').each((index, element) => {
    const $img = $(element);

    const imageData = {
      src: $img.attr('src'),
      alt: $img.attr('alt') || '',
      title: $img.attr('title') || '',
      width: $img.attr('width'),
      height: $img.attr('height'),
      className: $img.attr('class') || '',
      id: $img.attr('id') || '',
      'data-src': $img.attr('data-src'), // For lazy-loaded images
      srcset: $img.attr('srcset') || ''
    };

    images.push(imageData);
  });

  return images;
}

// Usage
const $ = cheerio.load(html);
const allImages = extractImageAttributes($);
console.log(allImages);

Advanced Image Extraction Techniques

Handling Lazy-Loaded Images

Many modern websites use lazy loading for images. These images often store the actual source in data-src or similar attributes:

function extractLazyLoadedImages($) {
  const images = [];

  $('img').each((index, element) => {
    const $img = $(element);

    // Check for various lazy loading patterns
    const src = $img.attr('src') || 
                $img.attr('data-src') || 
                $img.attr('data-lazy-src') || 
                $img.attr('data-original');

    if (src) {
      images.push({
        src: src,
        alt: $img.attr('alt') || '',
        isLazyLoaded: !!($img.attr('data-src') || $img.attr('data-lazy-src'))
      });
    }
  });

  return images;
}

Extracting Images from Specific Containers

Often, you'll want to extract images from specific sections of a webpage:

function extractImagesFromContainer($, containerSelector) {
  const images = [];

  $(containerSelector).find('img').each((index, element) => {
    const $img = $(element);

    images.push({
      src: $img.attr('src'),
      alt: $img.attr('alt') || '',
      parentContainer: containerSelector
    });
  });

  return images;
}

// Examples
const $ = cheerio.load(html);
const galleryImages = extractImagesFromContainer($, '.gallery');
const articleImages = extractImagesFromContainer($, 'article');
const headerImages = extractImagesFromContainer($, 'header');

Handling Background Images

CSS background images require a different approach since they're defined in style attributes:

function extractBackgroundImages($) {
  const backgroundImages = [];

  $('[style*="background-image"]').each((index, element) => {
    const $element = $(element);
    const style = $element.attr('style');

    // Extract URL from background-image CSS property
    const urlMatch = style.match(/background-image:\s*url\(['"]?([^'"]+)['"]?\)/);

    if (urlMatch) {
      backgroundImages.push({
        src: urlMatch[1],
        element: element.tagName,
        className: $element.attr('class') || '',
        id: $element.attr('id') || ''
      });
    }
  });

  return backgroundImages;
}

Working with Responsive Images

Extracting Srcset Information

Modern websites use srcset for responsive images:

function extractResponsiveImages($) {
  const responsiveImages = [];

  $('img[srcset]').each((index, element) => {
    const $img = $(element);
    const srcset = $img.attr('srcset');

    // Parse srcset into individual sources
    const sources = srcset.split(',').map(source => {
      const parts = source.trim().split(' ');
      return {
        url: parts[0],
        descriptor: parts[1] || '1x'
      };
    });

    responsiveImages.push({
      src: $img.attr('src'),
      alt: $img.attr('alt') || '',
      srcset: sources,
      sizes: $img.attr('sizes') || ''
    });
  });

  return responsiveImages;
}

Handling Picture Elements

For more complex responsive scenarios using <picture> elements:

function extractPictureElements($) {
  const pictures = [];

  $('picture').each((index, element) => {
    const $picture = $(element);
    const sources = [];

    // Extract source elements
    $picture.find('source').each((i, sourceElement) => {
      const $source = $(sourceElement);
      sources.push({
        srcset: $source.attr('srcset'),
        media: $source.attr('media'),
        type: $source.attr('type'),
        sizes: $source.attr('sizes')
      });
    });

    // Extract the fallback img element
    const $img = $picture.find('img');

    pictures.push({
      sources: sources,
      fallback: {
        src: $img.attr('src'),
        alt: $img.attr('alt') || ''
      }
    });
  });

  return pictures;
}

Filtering and Validating Images

Filtering by File Type

function filterImagesByType($, allowedTypes = ['jpg', 'jpeg', 'png', 'gif', 'webp']) {
  const filteredImages = [];

  $('img').each((index, element) => {
    const $img = $(element);
    const src = $img.attr('src');

    if (src) {
      const extension = src.split('.').pop().toLowerCase().split('?')[0];

      if (allowedTypes.includes(extension)) {
        filteredImages.push({
          src: src,
          alt: $img.attr('alt') || '',
          type: extension
        });
      }
    }
  });

  return filteredImages;
}

Filtering by Size Attributes

function filterImagesBySize($, minWidth = 0, minHeight = 0) {
  const sizedImages = [];

  $('img').each((index, element) => {
    const $img = $(element);
    const width = parseInt($img.attr('width')) || 0;
    const height = parseInt($img.attr('height')) || 0;

    if (width >= minWidth && height >= minHeight) {
      sizedImages.push({
        src: $img.attr('src'),
        alt: $img.attr('alt') || '',
        width: width,
        height: height
      });
    }
  });

  return sizedImages;
}

Complete Example: Comprehensive Image Extractor

Here's a complete example that combines all the techniques:

const cheerio = require('cheerio');
const axios = require('axios');
const url = require('url');

class ImageExtractor {
  constructor(baseUrl) {
    this.baseUrl = baseUrl;
  }

  async extractFromUrl(targetUrl) {
    try {
      const response = await axios.get(targetUrl);
      const $ = cheerio.load(response.data);

      return this.extractAllImages($);
    } catch (error) {
      console.error('Error extracting images:', error);
      return [];
    }
  }

  extractAllImages($) {
    const results = {
      regularImages: this.extractRegularImages($),
      lazyLoadedImages: this.extractLazyLoadedImages($),
      backgroundImages: this.extractBackgroundImages($),
      responsiveImages: this.extractResponsiveImages($),
      pictureElements: this.extractPictureElements($)
    };

    return results;
  }

  extractRegularImages($) {
    const images = [];

    $('img').each((index, element) => {
      const $img = $(element);
      const src = $img.attr('src');

      if (src) {
        images.push({
          src: this.resolveUrl(src),
          alt: $img.attr('alt') || '',
          title: $img.attr('title') || '',
          width: $img.attr('width'),
          height: $img.attr('height'),
          className: $img.attr('class') || '',
          id: $img.attr('id') || ''
        });
      }
    });

    return images;
  }

  resolveUrl(relativePath) {
    if (relativePath.startsWith('http')) {
      return relativePath;
    }
    return url.resolve(this.baseUrl, relativePath);
  }

  // ... other extraction methods
}

// Usage
async function main() {
  const extractor = new ImageExtractor('https://example.com');
  const images = await extractor.extractFromUrl('https://example.com/gallery');

  console.log('Extracted images:', JSON.stringify(images, null, 2));
}

Best Practices and Tips

1. Handle Relative URLs

Always convert relative image URLs to absolute URLs for proper access:

const url = require('url');

function resolveImageUrl(src, baseUrl) {
  if (src.startsWith('http')) {
    return src;
  }
  return url.resolve(baseUrl, src);
}

2. Error Handling

Implement robust error handling for missing attributes:

function safeGetAttribute($element, attribute, defaultValue = '') {
  try {
    return $element.attr(attribute) || defaultValue;
  } catch (error) {
    console.warn(`Error getting attribute ${attribute}:`, error);
    return defaultValue;
  }
}

3. Performance Considerations

For large pages with many images, consider using streaming or pagination:

function extractImagesInBatches($, batchSize = 50) {
  const allImages = $('img');
  const batches = [];

  for (let i = 0; i < allImages.length; i += batchSize) {
    const batch = allImages.slice(i, i + batchSize);
    batches.push(batch.map((index, element) => {
      const $img = $(element);
      return {
        src: $img.attr('src'),
        alt: $img.attr('alt') || ''
      };
    }).get());
  }

  return batches;
}

When to Use Alternatives

While Cheerio is excellent for static HTML content, consider these alternatives for specific scenarios:

Conclusion

Cheerio provides a powerful and efficient way to extract images and their attributes from HTML documents. By understanding the various extraction techniques and implementing proper error handling, you can build robust image scraping solutions. Remember to respect website terms of service and implement appropriate rate limiting when scraping images at scale.

The techniques covered in this guide will help you handle most image extraction scenarios, from simple static images to complex responsive picture elements. Practice with different websites to become proficient with these patterns and adapt them to your specific use cases.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

Try in request builder

Related Questions

Get Started Now

WebScraping.AI provides rotating proxies, Chromium rendering and built-in HTML parser for web scraping
Icon