How to run multiple pages in parallel with Puppeteer?

Running multiple pages in parallel with Puppeteer can be achieved by managing multiple browser instances or pages simultaneously. Here's how we can do it:

Python

In Python, we can use asyncio together with Pyppeteer, an unofficial Python port of the Puppeteer JavaScript library. Here is a simple example:

import asyncio
from pyppeteer import launch

async def get_page_content(url):
    # Launch a browser, fetch the rendered HTML, then clean up.
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()
    await browser.close()
    return content

async def main():
    urls = ['http://example.com', 'http://example2.com', 'http://example3.com']
    # Schedule one task per URL, then run them concurrently and
    # collect the results in the same order as the input URLs.
    tasks = [asyncio.ensure_future(get_page_content(url)) for url in urls]
    return await asyncio.gather(*tasks)

# Run the asyncio event loop
pages_content = asyncio.run(main())

In the example above, the asynchronous function get_page_content launches a browser, navigates to a URL, retrieves the page content, and closes the browser. In main, we create one task per URL and use asyncio.gather to run them concurrently and return their results. Note that each task launches a separate browser instance, which keeps the tasks fully independent but is relatively heavy on resources.

JavaScript

In JavaScript, you can use Promise.all to run multiple pages in parallel with Puppeteer. Here's an example:

const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch();
    const page1 = await browser.newPage();
    const page2 = await browser.newPage();

    // Navigate both pages concurrently; Promise.all resolves
    // once both navigations have completed.
    await Promise.all([
        page1.goto('http://example.com'),
        page2.goto('http://example2.com'),
    ]);

    // Do something with the pages...
    // ...

    await browser.close();
}

run().catch(console.error);

In the example above, Promise.all is used to run the goto commands in parallel, so Puppeteer navigates to 'http://example.com' and 'http://example2.com' concurrently. Unlike the Python example, both pages share a single browser instance, which is lighter on system resources.
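
The same pattern works for any Promise-returning Puppeteer call. As a small sketch of the "Do something with the pages" step inside run() above, you could extract both pages' HTML concurrently as well:

    // Inside run(), after both navigations have completed:
    const [html1, html2] = await Promise.all([
        page1.content(),
        page2.content(),
    ]);
    console.log(html1.length, html2.length);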

Please note that while this approach does allow for running operations in parallel, it may not be suitable for a very large number of pages, as each open page consumes system resources.

For scraping a large number of pages, consider using a task queue or a pool of browser instances/pages to ensure that you're not overloading the system.
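
Puppeteer has no built-in pool, but a minimal one is easy to sketch. The example below is illustrative rather than a production implementation: scrapeWithPool and poolSize are made-up names, and error handling is omitted. A fixed number of pages share one browser and pull URLs from a queue until it is empty:

const puppeteer = require('puppeteer');

async function scrapeWithPool(urls, poolSize = 3) {
    const browser = await puppeteer.launch();
    const queue = [...urls];
    const results = [];

    // Each worker owns one page and keeps pulling URLs until the
    // queue is empty. Because JavaScript is single-threaded, the
    // length check and shift() together run without interleaving.
    const worker = async () => {
        const page = await browser.newPage();
        while (queue.length > 0) {
            const url = queue.shift();
            await page.goto(url);
            results.push({ url, content: await page.content() });
        }
        await page.close();
    };

    // Start poolSize workers and wait for all of them to drain the queue.
    await Promise.all(Array.from({ length: poolSize }, worker));
    await browser.close();
    return results;
}

scrapeWithPool(['http://example.com', 'http://example2.com', 'http://example3.com'])
    .then(results => console.log(`Scraped ${results.length} pages`))
    .catch(console.error);

This caps resource usage at poolSize open pages no matter how many URLs you feed in. For heavier workloads, libraries such as puppeteer-cluster build on the same idea and add error handling and retries.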
