Can I access the HTML source of a page using Playwright?

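Yes. Playwright's page.content() method returns the page's full HTML, including the doctype, serialized from the current DOM. Because the snapshot is taken after JavaScript has run, it reflects dynamically rendered content rather than the raw response sent by the server, which makes it well suited to scraping dynamic pages.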

JavaScript

Use the page.content() method to retrieve the full HTML source:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');
  const htmlContent = await page.content();

  console.log(htmlContent);

  await browser.close();
})();

With Error Handling

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  try {
    await page.goto('https://example.com', { waitUntil: 'networkidle' });
    const htmlContent = await page.content();

    console.log(`HTML length: ${htmlContent.length} characters`);
    console.log(htmlContent);
  } catch (error) {
    console.error('Error fetching HTML:', error);
  } finally {
    await browser.close();
  }
})();

Python

Synchronous API

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.goto('https://example.com')
    html_content = page.content()

    print(html_content)
    browser.close()
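
A common next step is to write the rendered HTML to disk rather than print it. The following is a minimal sketch, assuming `page` is a sync-API page that has already navigated somewhere; the helper name `save_html` and the output path are illustrative and not part of Playwright.

```python
from pathlib import Path


def save_html(page, path):
    """Write the page's current HTML to `path` and return its length in characters.

    `page` is assumed to be a Playwright sync-API Page (any object with a
    `content()` method that returns a str will work). `save_html` is a
    hypothetical helper, not a Playwright API.
    """
    html = page.content()  # serialized DOM at the moment of the call
    Path(path).write_text(html, encoding="utf-8")
    return len(html)
```

Inside the `with sync_playwright() as p:` block, calling `save_html(page, 'example.html')` after `page.goto(...)` would store the snapshot next to the script.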

Asynchronous API

import asyncio
from playwright.async_api import async_playwright

async def get_html():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        await page.goto('https://example.com')
        html_content = await page.content()

        print(html_content)
        await browser.close()

asyncio.run(get_html())
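
One reason to prefer the asynchronous API is fetching several pages concurrently. The sketch below assumes `browser` is an async-API browser launched as above; `fetch_html` and `fetch_all` are hypothetical helper names, and the approach (one page per URL, combined with `asyncio.gather`) is only one of several reasonable designs.

```python
import asyncio


async def fetch_html(browser, url):
    """Open a fresh page, navigate to `url`, and return its rendered HTML."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.content()
    finally:
        await page.close()  # release the tab even if navigation fails


async def fetch_all(browser, urls):
    """Fetch several URLs concurrently; results keep the order of `urls`."""
    return await asyncio.gather(*(fetch_html(browser, url) for url in urls))
```

Within `get_html()`, something like `pages = await fetch_all(browser, ['https://example.com'])` would return a list of HTML strings, one per URL.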

Advanced Usage

Getting HTML After Specific Interactions

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Click a button that loads content dynamically
  await page.click('#load-more-button');

  // Wait for new content to load
  await page.waitForSelector('.dynamic-content');

  // Get HTML after interaction
  const htmlContent = await page.content();

  console.log(htmlContent);
  await browser.close();
})();
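
The same click, wait, and capture sequence can be sketched in Python with the synchronous API. This assumes `page` is an already-open sync-API page; `html_after_click` and its default timeout are illustrative choices, and the selectors are the same placeholders used in the JavaScript example.

```python
def html_after_click(page, click_selector, wait_selector, timeout_ms=10_000):
    """Click an element, wait for new content to appear, then return the HTML.

    `page` is assumed to be a Playwright sync-API Page. The helper name and
    the 10-second default timeout are assumptions made for this sketch.
    """
    page.click(click_selector)  # trigger the dynamic load
    page.wait_for_selector(wait_selector, timeout=timeout_ms)  # block until it renders
    return page.content()  # snapshot taken only after the wait succeeds
```

For the page above, the call would be `html_after_click(page, '#load-more-button', '.dynamic-content')`.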

Getting HTML of Specific Elements

// Get innerHTML of a specific element
const elementHTML = await page.innerHTML('#content-div');

// Get outerHTML of a specific element (includes the element's own tag)
const outerHTML = await page.locator('#content-div').evaluate(el => el.outerHTML);
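
In Python, the difference between inner and outer HTML can be sketched as follows. The assumption is that `page` is a sync-API page and `selector` matches exactly one element; `element_html` is a hypothetical helper. Outer HTML is read by evaluating the DOM `outerHTML` property on the element in the browser.

```python
def element_html(page, selector):
    """Return (inner_html, outer_html) for the element matched by `selector`.

    `page` is assumed to be a Playwright sync-API Page. `element_html` is an
    illustrative helper, not a Playwright API.
    """
    locator = page.locator(selector)
    inner = locator.inner_html()  # markup inside the element
    outer = locator.evaluate("el => el.outerHTML")  # element's own tag plus its contents
    return inner, outer
```

For example, `inner, outer = element_html(page, '#content-div')` would give the children's markup in `inner` and the full `<div id="content-div">...</div>` markup in `outer`.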

Important Notes

  • JavaScript Execution: page.content() returns HTML after JavaScript has been executed, including dynamically loaded content
  • Timing: The method captures the DOM state at the moment it's called
  • Complete Source: Returns the full document HTML, including <html>, <head>, and <body> tags
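  • Network Activity: For pages that keep fetching data after the load event, waitUntil: 'networkidle' can help, but Playwright's documentation discourages relying on it; waiting for a specific selector with page.waitForSelector() is usually more reliable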
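
The timing and network notes above can be combined into one small helper. This is a sketch under stated assumptions: `page` is a sync-API page, `rendered_html` is a hypothetical name, and `ready_selector` stands for whatever element signals that the content you care about has rendered. In the Python API the navigation option is spelled `wait_until`.

```python
# Values accepted by Playwright's `wait_until` navigation option.
WAIT_STATES = {"commit", "domcontentloaded", "load", "networkidle"}


def rendered_html(page, url, ready_selector=None, wait_until="load"):
    """Navigate to `url`, optionally wait for `ready_selector`, return the HTML.

    `page` is assumed to be a Playwright sync-API Page; the helper itself is
    illustrative and not part of Playwright.
    """
    if wait_until not in WAIT_STATES:
        raise ValueError(f"unsupported wait_until value: {wait_until!r}")
    page.goto(url, wait_until=wait_until)
    if ready_selector is not None:
        # Waiting for a concrete element is usually a firmer readiness signal
        # than waiting for the network to go quiet.
        page.wait_for_selector(ready_selector)
    return page.content()  # captures the DOM as it is right now
```

Calling `rendered_html(page, 'https://example.com', ready_selector='.dynamic-content')` waits for the element, while `rendered_html(page, 'https://example.com', wait_until='networkidle')` falls back to network quiescence.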

Common Use Cases

  1. Web Scraping: Extract data from JavaScript-heavy websites
  2. Testing: Verify HTML structure after user interactions
  3. Content Analysis: Analyze fully rendered page content
  4. SEO Auditing: Check final HTML output for search engines
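
For the scraping and content-analysis cases, the string returned by page.content() can be handed to any HTML parser. The sketch below uses only Python's standard library to pull out the page title and link targets; `PageSummary` and `summarize` are hypothetical names, and a dedicated parsing library may be a better fit for real projects.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text and every <a href="..."> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a usable href
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, links) for an HTML string such as page.content() output."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

With a Playwright page in hand, `title, links = summarize(page.content())` would analyze the fully rendered document rather than the raw server response.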
