puppeteer: loop through a CSV file and take a screenshot for each row?

Problem description

I want to loop through a CSV file and use puppeteer to take a screenshot of a URL for each row in the file.

I have the following code, which works, but each request waits for the previous one to finish, so the run takes a very long time:

const csv = require('csv-parser');
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();

    // Open a page, navigate to the ad's URL, and save a screenshot of the body.
    const getFile = async function (rowId, path) {
        const page = await browser.newPage();
        await page.setViewport({ width: 1000, height: 1500, deviceScaleFactor: 1 });
        const url = 'https://www.facebook.com/ads/library/?id=' + rowId;
        await page.goto(url, { waitUntil: 'networkidle2' });
        await page.waitFor(3000);
        const body = await page.$('body');
        await body.screenshot({
            path: path
        });
        await page.close();
    };

    const fname = 'ids.csv';
    const csvPipe = fs.createReadStream(fname).pipe(csv());
    csvPipe.on('data', async (row) => {
        const id = row.ad_id;
        console.log(id);
        const path = './images/' + id + '.png';
        // Pause the stream so only one screenshot runs at a time.
        csvPipe.pause();
        await getFile(id, path);
        csvPipe.resume();
    }).on('end', () => {
        console.log('CSV file successfully processed');
    });
})();

How can I make the requests run in parallel to speed this up?

If I remove the pause() and resume() lines, I get this error every time the function runs:

(node:18610) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 14)
(node:18610) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'screenshot' of null
    at getFile (/Users/me/Dropbox/Projects/scrape/index.js:29:12)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
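
Without pause()/resume(), every 'data' event fires its handler immediately, so dozens of pages open and navigate at once; when one of those navigations fails, page.$('body') returns null and the rejection inside the async handler is never caught, which produces the warning above. For reference, here is a minimal dependency-free sketch that instead bounds concurrency with a small worker pool; the pool size of 4, the try/catch logging, and using page.screenshot() rather than the body element handle are assumptions, not part of the original post:

const csv = require('csv-parser');
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();

    // Collect all rows first, then let a fixed number of workers drain the list.
    const rows = [];
    await new Promise((resolve, reject) => {
        fs.createReadStream('ids.csv')
            .pipe(csv())
            .on('data', (row) => rows.push(row))
            .on('end', resolve)
            .on('error', reject);
    });

    const POOL_SIZE = 4; // assumed limit; tune to what your machine can handle
    let next = 0;

    const worker = async () => {
        // Each worker repeatedly claims the next row until none remain.
        while (next < rows.length) {
            const id = rows[next++].ad_id;
            const page = await browser.newPage();
            try {
                await page.setViewport({ width: 1000, height: 1500, deviceScaleFactor: 1 });
                await page.goto('https://www.facebook.com/ads/library/?id=' + id,
                    { waitUntil: 'networkidle2' });
                await page.waitFor(3000);
                await page.screenshot({ path: './images/' + id + '.png' });
            } catch (err) {
                // Surface failures instead of letting the promise reject unhandled.
                console.error('Failed for id ' + id + ':', err.message);
            } finally {
                await page.close();
            }
        }
    };

    await Promise.all(Array.from({ length: POOL_SIZE }, worker));
    await browser.close();
    console.log('CSV file successfully processed');
})();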

Tags: javascript, node.js, async-await, puppeteer

Solution


If you are open to using another library, you could give puppeteer-cluster a try (disclaimer: I'm the author). It solves exactly this problem.

You queue your jobs and let the library take care of the concurrency:

// (inside an async function; fs and csv are required as in the question)
const { Cluster } = require('puppeteer-cluster');

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE, // you could also use something different (see docs)
    maxConcurrency: 4, // how many pages in parallel your system can handle
});

// setup your task
await cluster.task(async ({ page, data: { id, path } }) => {
    const url = 'https://www.facebook.com/ads/library/?id=' + id;
    await page.goto(url);
    // ... remaining code (take the screenshot, save it to `path`, etc.)
});

// just read everything at once and queue all jobs
let fname = 'ids.csv';
fs.createReadStream(fname).pipe(csv()).on('data',
    (row) => cluster.queue({ id: row.ad_id, path: './images/' + row.ad_id + '.png' })
);

// wait until all jobs are done and close the cluster
await cluster.idle();
await cluster.close();

This code sets up a cluster with 4 workers (4 browser pages) and works through the queued jobs ({ id: ..., path: ... }).
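
For completeness, here is one way the pieces might fit together end to end. The viewport and screenshot call are carried over from the question; the taskerror handler and waiting for the CSV stream to finish before calling idle() are additions based on puppeteer-cluster's documented API, not part of the original answer:

const { Cluster } = require('puppeteer-cluster');
const csv = require('csv-parser');
const fs = require('fs');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 4,
    });

    // Errors thrown inside the task are caught by the cluster and emitted here,
    // so one failed row does not crash the whole run.
    cluster.on('taskerror', (err, data) => {
        console.error('Error for id ' + data.id + ': ' + err.message);
    });

    await cluster.task(async ({ page, data: { id, path } }) => {
        await page.setViewport({ width: 1000, height: 1500, deviceScaleFactor: 1 });
        await page.goto('https://www.facebook.com/ads/library/?id=' + id,
            { waitUntil: 'networkidle2' });
        await page.screenshot({ path });
    });

    // Queue one job per CSV row, then wait for the stream to finish
    // before asking the cluster to drain.
    const stream = fs.createReadStream('ids.csv').pipe(csv());
    stream.on('data', (row) =>
        cluster.queue({ id: row.ad_id, path: './images/' + row.ad_id + '.png' })
    );
    await new Promise((resolve, reject) =>
        stream.on('end', resolve).on('error', reject)
    );

    await cluster.idle();  // wait until all queued jobs are done
    await cluster.close(); // shut down all pages and the browser
})();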

