Running multiple pages in parallel with Puppeteer can be achieved by managing multiple browser instances or pages simultaneously. Here's how we can do it:
Python
In Python, we can use asyncio together with Pyppeteer, an unofficial Python port of the Puppeteer JavaScript library. Here is a simple example:
import asyncio
from pyppeteer import launch

async def get_page_content(url):
    # Launch a browser, fetch the page, grab its HTML, then clean up
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()
    await browser.close()
    return content

async def main():
    urls = ['http://example.com', 'http://example2.com', 'http://example3.com']
    tasks = []
    for url in urls:
        tasks.append(asyncio.ensure_future(get_page_content(url)))
    # Run all scraping tasks concurrently and collect the results
    pages_content = await asyncio.gather(*tasks)
    return pages_content

# Run the asyncio event loop
asyncio.get_event_loop().run_until_complete(main())
In the above example, we create an asynchronous function get_page_content which launches a browser, navigates to a URL, retrieves the page content, and then closes the browser. In the main function, we create a task for each URL we want to scrape and use asyncio.gather to run those tasks concurrently. Note that each task launches its own browser instance, which keeps the tasks fully independent but is relatively heavy on resources.
JavaScript
In JavaScript, you can use Promise.all to run multiple pages in parallel with Puppeteer. Here's an example:
const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();

  // Open two tabs in the same browser instance
  const page1 = await browser.newPage();
  const page2 = await browser.newPage();

  // Navigate both pages in parallel
  await Promise.all([
    page1.goto('http://example.com'),
    page2.goto('http://example2.com'),
  ]);

  // Do something with the pages...
  // ...

  await browser.close();
}

run().catch(console.error);
In the example above, Promise.all is used to run the goto calls in parallel, so Puppeteer navigates to 'http://example.com' and 'http://example2.com' at the same time within a single browser instance.
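The same pattern extends from two hard-coded pages to a whole list of URLs. The following is just a sketch (scrapeTitles and the URL list are placeholders, not part of Puppeteer's API): it opens one tab per URL in a single browser, navigates them all in parallel with Promise.all, and collects each page's title.

const puppeteer = require('puppeteer');

async function scrapeTitles(urls) {
  const browser = await puppeteer.launch();

  // One tab per URL, navigated and scraped in parallel
  const titles = await Promise.all(urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const title = await page.title();
    await page.close();
    return title;
  }));

  await browser.close();
  return titles;
}

// Example usage with placeholder URLs
scrapeTitles(['http://example.com', 'http://example2.com', 'http://example3.com'])
  .then(console.log)
  .catch(console.error);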
Please note that while this approach does allow running operations in parallel, it may not be suitable for a very large number of pages, as each open page consumes memory and CPU. For scraping a large number of pages, consider using a task queue or a pool of browser instances/pages so that only a bounded number of pages are open at once; a sketch of that idea follows below.
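As a rough illustration of that idea, here is one possible sketch of a bounded worker pool (scrapeAll, CONCURRENCY, and the URLs are placeholders, not an established API): a single browser stays open, a fixed number of worker loops pull URLs from a shared queue, and at most CONCURRENCY tabs are ever open at the same time.

const puppeteer = require('puppeteer');

const CONCURRENCY = 3; // maximum number of tabs open at once

async function scrapeAll(urls) {
  const browser = await puppeteer.launch();
  const queue = [...urls];  // shared work queue
  const results = [];

  // Each worker takes the next URL, scrapes it in its own tab, then repeats
  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      const page = await browser.newPage();
      try {
        await page.goto(url);
        results.push({ url, title: await page.title() });
      } finally {
        await page.close();
      }
    }
  }

  // Start a bounded pool of workers and wait until the queue is drained
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));

  await browser.close();
  return results;
}

scrapeAll(['http://example.com', 'http://example2.com', 'http://example3.com'])
  .then(console.log)
  .catch(console.error);

Libraries such as puppeteer-cluster package up this kind of pooling if you would rather not maintain it yourself.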