Can I Use Headless Chromium to Generate PDFs from Web Pages?

Yes. Headless Chromium can convert web pages into PDF documents through its built-in print pipeline. This is useful for creating reports, invoices, documentation, and archival copies of web content. The generated PDF preserves CSS styling, JavaScript-rendered content, and responsive layouts.

How PDF Generation Works in Headless Chromium

Headless Chromium renders the page with the same engine as the desktop browser, then uses Chrome's built-in PDF printing functionality to generate the document. Because the page goes through the print pipeline, print media styles apply by default, so the output closely matches the browser's print preview, including:

  • CSS styles and layouts
  • Web fonts
  • Images and graphics
  • JavaScript-generated content
  • Responsive design elements
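
If the PDF should match the on-screen layout rather than the print layout, Puppeteer's page.emulateMediaType('screen') can be called before printing. The following is a minimal sketch: the helper name pdfWithScreenStyles is hypothetical, and page is assumed to be a Puppeteer page that has already loaded its content.

```javascript
// Hypothetical helper: render the PDF with screen styles instead of print styles.
// `page` is assumed to be a Puppeteer Page that has already navigated.
async function pdfWithScreenStyles(page, pdfOptions = {}) {
  // page.pdf() uses the 'print' media type unless another one is emulated first.
  await page.emulateMediaType('screen');
  // Backgrounds are enabled by default here; callers can override via pdfOptions.
  return page.pdf({ printBackground: true, ...pdfOptions });
}

// Usage (inside an async function, after page.goto):
// const pdf = await pdfWithScreenStyles(page, { format: 'A4' });
```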

PDF Generation with Puppeteer (Node.js)

Puppeteer is a widely used Node.js library for controlling Headless Chromium. Here's how to generate PDFs:

Basic PDF Generation

const puppeteer = require('puppeteer');

async function generatePDF() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com', {
    waitUntil: 'networkidle0'
  });

  const pdf = await page.pdf({
    path: 'example.pdf',
    format: 'A4',
    printBackground: true
  });

  await browser.close();
  return pdf;
}

generatePDF().catch(console.error);

Advanced PDF Configuration

const puppeteer = require('puppeteer');

async function generateAdvancedPDF() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set viewport for consistent rendering
  await page.setViewport({ width: 1200, height: 800 });

  await page.goto('https://example.com');

  // Wait for dynamic content to load
  await page.waitForSelector('.dynamic-content');

  const pdf = await page.pdf({
    path: 'advanced-example.pdf',
    format: 'A4',
    printBackground: true,
    margin: {
      top: '20mm',
      right: '20mm',
      bottom: '20mm',
      left: '20mm'
    },
    displayHeaderFooter: true,
    headerTemplate: '<div style="font-size:10px; width:100%; text-align:center;">Header Content</div>',
    footerTemplate: '<div style="font-size:10px; width:100%; text-align:center;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>'
  });

  await browser.close();
  return pdf;
}

Handling Dynamic Content

When working with pages that load content dynamically, you need to wait for the content to fully render before generating the PDF. The related guide "How to handle AJAX requests using Puppeteer" provides detailed guidance on managing dynamic content.

async function generatePDFWithDynamicContent() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://spa-example.com');

  // Wait for both readiness signals; each call rejects if its timeout elapses
  await page.waitForSelector('.content-loaded', { timeout: 10000 });
  await page.waitForFunction(() => window.dataLoaded === true, { timeout: 10000 });

  const pdf = await page.pdf({
    path: 'dynamic-content.pdf',
    format: 'A4',
    printBackground: true
  });

  await browser.close();
}
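
If a readiness signal might never arrive, a bounded wait can be expressed with Promise.race so that PDF generation proceeds after a deadline instead of failing. This is a sketch only: withDeadline is a hypothetical helper name, not a Puppeteer API, and the 2000 ms value in the usage note is an arbitrary assumption.

```javascript
// Hypothetical helper: resolve with the promise's value, or with `fallback`
// once `ms` milliseconds have elapsed, whichever happens first.
function withDeadline(promise, ms, fallback = null) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage (inside an async function):
// const ready = await withDeadline(page.waitForSelector('.content-loaded'), 2000, false);
// if (ready === false) console.warn('Content marker not seen; printing anyway');
```

If the wrapped promise rejects before the deadline, the rejection still propagates to the caller, so genuine errors are not hidden.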

PDF Generation with Pyppeteer (Python)

Pyppeteer is an unofficial Python port of Puppeteer, offering similar functionality:

import asyncio
from pyppeteer import launch

async def generate_pdf():
    browser = await launch()
    page = await browser.newPage()

    await page.goto('https://example.com')
    await page.waitForSelector('body')

    await page.pdf({
        'path': 'example.pdf',
        'format': 'A4',
        'printBackground': True,
        'margin': {
            'top': '20mm',
            'right': '20mm',
            'bottom': '20mm',
            'left': '20mm'
        }
    })

    await browser.close()

asyncio.run(generate_pdf())

Python with Custom CSS for Print

import asyncio
from pyppeteer import launch

async def generate_pdf_with_custom_css():
    browser = await launch()
    page = await browser.newPage()

    await page.goto('https://example.com')

    # Add print-specific CSS after navigation; styles injected before goto()
    # are discarded when the new document loads
    await page.addStyleTag({
        'content': '''
        @media print {
            .no-print { display: none !important; }
            .page-break { page-break-before: always; }
            body { font-size: 12pt; }
        }
        '''
    })

    pdf = await page.pdf({
        'path': 'styled-example.pdf',
        'format': 'A4',
        'printBackground': True
    })

    await browser.close()
    return pdf

asyncio.run(generate_pdf_with_custom_css())

Command Line PDF Generation

You can also generate PDFs directly using Chrome or Chromium from the command line:

# Basic PDF generation
google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://example.com

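# Without the default header and footer, allowing 5 seconds of virtual time for scripts
# (recent Chrome versions name the header flag --no-pdf-header-footer)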
google-chrome --headless --disable-gpu \
  --print-to-pdf=output.pdf \
  --print-to-pdf-no-header \
  --virtual-time-budget=5000 \
  https://example.com

# Generate PDF with specific viewport
chromium --headless --disable-gpu \
  --window-size=1200,800 \
  --print-to-pdf=output.pdf \
  https://example.com
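
The same commands can be driven from a Node.js script. The sketch below only assembles flags that appear in the shell examples above; buildChromeArgs and printToPdf are hypothetical helper names, and the binary name google-chrome is an assumption that varies by platform.

```javascript
const { execFile } = require('child_process');

// Build the argument list for a headless print-to-PDF run.
function buildChromeArgs({ url, output, windowSize, virtualTimeBudgetMs }) {
  const args = ['--headless', '--disable-gpu', `--print-to-pdf=${output}`];
  if (windowSize) {
    args.push(`--window-size=${windowSize.width},${windowSize.height}`);
  }
  if (virtualTimeBudgetMs) {
    args.push(`--virtual-time-budget=${virtualTimeBudgetMs}`);
  }
  args.push(url); // the URL goes last, as in the shell examples
  return args;
}

// Run the browser binary and resolve with the output path on success.
function printToPdf(options, binary = 'google-chrome') {
  return new Promise((resolve, reject) => {
    execFile(binary, buildChromeArgs(options), (error) => {
      if (error) reject(error);
      else resolve(options.output);
    });
  });
}

// Usage:
// printToPdf({ url: 'https://example.com', output: 'output.pdf' }).catch(console.error);
```

Passing the arguments as an array through execFile avoids shell quoting problems when URLs contain characters such as & or ?.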

PDF Configuration Options

Page Format and Size

// Standard paper sizes
await page.pdf({ format: 'A4' });      // 210mm x 297mm
await page.pdf({ format: 'A3' });      // 297mm x 420mm
await page.pdf({ format: 'Letter' });  // 8.5in x 11in
await page.pdf({ format: 'Legal' });   // 8.5in x 14in

// Custom dimensions
await page.pdf({
  width: '210mm',
  height: '297mm'
});

Margins and Layout

await page.pdf({
  margin: {
    top: '20mm',
    right: '15mm',
    bottom: '20mm',
    left: '15mm'
  },
  landscape: false,  // Portrait orientation
  printBackground: true
});
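
Margins reduce the area left for page content, which matters when sizing tables or charts for print. As a worked example, the sketch below computes the content box for the paper sizes listed earlier. The PAPER_MM table and contentAreaMm are hypothetical helpers, portrait orientation is assumed, and margins are assumed to be given in millimetres.

```javascript
// Paper sizes in millimetres (Letter and Legal converted from inches at 25.4 mm per inch).
const PAPER_MM = {
  A4: { width: 210, height: 297 },
  A3: { width: 297, height: 420 },
  Letter: { width: 215.9, height: 279.4 },
  Legal: { width: 215.9, height: 355.6 }
};

// Subtract the margins from the paper size to get the usable content box.
function contentAreaMm(format, margin) {
  const paper = PAPER_MM[format];
  if (!paper) throw new Error(`Unknown format: ${format}`);
  return {
    width: paper.width - margin.left - margin.right,
    height: paper.height - margin.top - margin.bottom
  };
}

console.log(contentAreaMm('A4', { top: 20, right: 15, bottom: 20, left: 15 }));
// → { width: 180, height: 257 }
```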

Headers and Footers

await page.pdf({
  displayHeaderFooter: true,
  headerTemplate: `
    <div style="font-size:10px; width:100%; text-align:center; margin-top:5mm;">
      <span class="title"></span>
    </div>
  `,
  footerTemplate: `
    <div style="font-size:10px; width:100%; text-align:center; margin-bottom:5mm;">
      Page <span class="pageNumber"></span> of <span class="totalPages"></span>
    </div>
  `,
  margin: { top: '30mm', bottom: '30mm' }
});
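
Chromium fills in elements carrying the classes pageNumber, totalPages, title, date, and url when it prints the templates. When a template also includes dynamic text such as a customer name, that text should be escaped before it is interpolated. A small sketch, where escapeHtml and buildFooterTemplate are hypothetical helpers:

```javascript
// Escape text before embedding it in a header or footer template.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build a footer with a custom label plus page counters.
function buildFooterTemplate(label) {
  return `
    <div style="font-size:10px; width:100%; text-align:center;">
      ${escapeHtml(label)} | Page <span class="pageNumber"></span> of <span class="totalPages"></span>
    </div>
  `;
}

// Usage (inside an async function):
// await page.pdf({
//   displayHeaderFooter: true,
//   footerTemplate: buildFooterTemplate('Invoice for A&B <Ltd>'),
//   margin: { top: '30mm', bottom: '30mm' }
// });
```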

Best Practices for PDF Generation

1. Wait for Content to Load

Always ensure dynamic content has fully loaded before generating the PDF. The related guide "How to handle timeouts in Puppeteer" offers strategies for managing loading times effectively.

// Wait for network to be idle
await page.goto(url, { waitUntil: 'networkidle0' });

// Wait for specific elements
await page.waitForSelector('.main-content');

// Wait for custom conditions
await page.waitForFunction(() => document.querySelector('.loading') === null);

2. Optimize CSS for Print

@media print {
  /* Hide unnecessary elements */
  .no-print, nav, footer, .sidebar {
    display: none !important;
  }

  /* Control page breaks */
  .page-break {
    page-break-before: always;
  }

  /* Optimize text size */
  body {
    font-size: 12pt;
    line-height: 1.4;
  }

  /* Ensure backgrounds print */
  * {
    -webkit-print-color-adjust: exact !important;
    print-color-adjust: exact !important;
  }
}

3. Handle Large Documents

For large documents, consider memory management and processing time:

async function generateLargePDF(url) {
  const browser = await puppeteer.launch({
    // Disabling the sandbox is sometimes required in containers; avoid it for untrusted pages
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const page = await browser.newPage();

  // Increase timeout for large pages
  page.setDefaultTimeout(60000);

  await page.goto(url, { waitUntil: 'networkidle2' });

  const pdf = await page.pdf({
    format: 'A4',
    printBackground: true,
    preferCSSPageSize: true
  });

  await browser.close();
  return pdf;
}

4. Error Handling and Debugging

async function generatePDFWithErrorHandling(url) {
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Enable console logging for debugging
    page.on('console', msg => console.log('PAGE LOG:', msg.text()));
    page.on('pageerror', err => console.log('PAGE ERROR:', err.message));

    await page.goto(url);

    const pdf = await page.pdf({
      format: 'A4',
      printBackground: true
    });

    return pdf;
  } catch (error) {
    console.error('PDF generation failed:', error);
    throw error;
  } finally {
    if (browser) {
      await browser.close();
    }
  }
}
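
Transient failures such as a slow upstream server or a navigation timeout are often worth retrying before giving up. The following sketch adds a fixed-delay retry around any PDF-producing async function. withRetries is a hypothetical helper, and the default retry count and delay are arbitrary assumptions.

```javascript
// Hypothetical helper: retry an async task with a fixed delay between attempts.
async function withRetries(task, { retries = 2, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      // Pause before the next attempt, but not after the final one.
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage (inside an async function); generatePdf stands for any PDF-producing async function:
// const pdf = await withRetries(() => generatePdf(url), { retries: 3, delayMs: 2000 });
```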

Performance Considerations

Memory Management

// Reuse one browser instance; open and close a separate page for each URL
async function generateMultiplePDFs(urls) {
  const browser = await puppeteer.launch();
  const results = [];

  for (const url of urls) {
    const page = await browser.newPage();
    try {
      await page.goto(url);
      const pdf = await page.pdf({ format: 'A4' });
      results.push(pdf);
    } finally {
      await page.close(); // Important: close each page
    }
  }

  await browser.close();
  return results;
}

Concurrent PDF Generation

For processing multiple URLs simultaneously, the related guide "How to run multiple pages in parallel with Puppeteer" provides detailed strategies for concurrent operations.

async function generatePDFsConcurrently(urls) {
  const browser = await puppeteer.launch();

  const promises = urls.map(async (url) => {
    const page = await browser.newPage();
    try {
      await page.goto(url);
      return await page.pdf({ format: 'A4' });
    } finally {
      await page.close();
    }
  });

  const results = await Promise.all(promises);
  await browser.close();
  return results;
}
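
Opening one page per URL all at once can exhaust memory when the list is long, so it may help to cap the number of pages in flight. A minimal sketch of a bounded worker pool follows. mapWithConcurrency is a hypothetical helper written in plain JavaScript, and the limit of 4 in the usage note is an arbitrary assumption.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` tasks in flight.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Usage (inside an async function, with `browser` already launched):
// const pdfs = await mapWithConcurrency(urls, 4, async (url) => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url);
//     return await page.pdf({ format: 'A4' });
//   } finally {
//     await page.close();
//   }
// });
```

If any worker rejects, the returned promise rejects with that error, so callers that want partial results should catch errors inside the worker.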

Common Issues and Solutions

1. Missing Fonts

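// Disabling font hinting makes glyph rendering more consistent across platforms.
// It does not install fonts: missing typefaces must exist on the host or load as web fonts.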
const browser = await puppeteer.launch({
  args: ['--font-render-hinting=none']
});
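
When the page relies on web fonts, it can also help to wait until the browser reports that font loading has finished before printing. The sketch below uses the standard document.fonts.ready promise from the CSS Font Loading API. waitForFonts is a hypothetical helper, and page is assumed to be a Puppeteer page.

```javascript
// Hypothetical helper: wait until the page reports that its fonts have loaded.
// `page` is assumed to be a Puppeteer Page that has already navigated.
async function waitForFonts(page) {
  // Resolve to a plain boolean so the value returned from the page is serializable.
  return page.evaluate(() => document.fonts.ready.then(() => true));
}

// Usage (inside an async function, after page.goto):
// await waitForFonts(page);
// const pdf = await page.pdf({ format: 'A4' });
```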

2. Images Not Rendering

// Wait for images to load
await page.evaluate(() => {
  return Promise.all(Array.from(document.images)
    .filter(img => !img.complete)
    .map(img => new Promise(resolve => {
      img.onload = img.onerror = resolve;
    })));
});

3. CSS Not Applied

// Ensure CSS is fully loaded
await page.waitForFunction(() => {
  const sheets = Array.from(document.styleSheets);
  return sheets.every(sheet => {
    try {
      return sheet.cssRules.length > 0;
    } catch (e) {
      return true;
    }
  });
});

Conclusion

Headless Chromium provides powerful and flexible PDF generation capabilities that can handle complex web pages with dynamic content, custom styling, and responsive layouts. Whether you're using Puppeteer with Node.js, Pyppeteer with Python, or command-line tools, the key to successful PDF generation lies in properly waiting for content to load, optimizing CSS for print media, and implementing robust error handling.

With the right configuration options, Headless Chromium's rendering engine produces professional-quality PDFs that accurately represent your web content.
