javascript - puppeteer: loop over a CSV file and take a screenshot for each row?
Question
I want to loop over a CSV file and use puppeteer to take a screenshot of a URL for each row in the file.
I have the following code, which works fine, but each request waits for the previous one to finish, so it takes a long time to run:
const csv = require('csv-parser');
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  const getFile = async function (rowId, path) {
    const page = await browser.newPage();
    await page.setViewport({ width: 1000, height: 1500, deviceScaleFactor: 1 });
    const url = 'https://www.facebook.com/ads/library/?id=' + rowId;
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.waitFor(3000); // deprecated in newer puppeteer; use page.waitForTimeout
    const body = await page.$('body');
    await body.screenshot({ path: path });
    await page.close();
  };

  const fname = 'ids.csv';
  const csvPipe = fs.createReadStream(fname).pipe(csv());

  csvPipe.on('data', async (row) => {
    const id = row.ad_id;
    console.log(id);
    const path = './images/' + id + '.png';
    csvPipe.pause();          // stop reading until this row is done
    await getFile(id, path);
    csvPipe.resume();
  }).on('end', () => {
    console.log('CSV file successfully processed');
  });
})();
How can I make the requests run in parallel to speed this up?
If I remove the pause() and resume() lines, then I get this error every time the function runs:
(node:18610) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 14)
(node:18610) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'screenshot' of null
at getFile (/Users/me/Dropbox/Projects/scrape/index.js:29:12)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:189:7)
Solution
If you can use another library, you might give puppeteer-cluster a try (disclaimer: I'm the author). It solves exactly this problem.
You queue the jobs and let the library handle the concurrency:
const { Cluster } = require('puppeteer-cluster');

const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_PAGE, // you could also use something different (see docs)
  maxConcurrency: 4, // how many pages in parallel your system can handle
});

// setup your task
await cluster.task(async ({ page, data: { id, path } }) => {
  const url = 'https://www.facebook.com/ads/library/?id=' + id;
  await page.goto(url);
  // ... remaining code
});

// just read everything at once and queue all jobs
let fname = 'ids.csv';
fs.createReadStream(fname).pipe(csv()).on('data',
  (row) => cluster.queue({ id: row.ad_id, path: './images/' + row.ad_id + '.png' })
);

// wait until all jobs are done and close the cluster
await cluster.idle();
await cluster.close();
This code sets up a cluster with four workers (four browser pages) and processes the queued jobs ({ id: ..., path: ... }).
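If pulling in another dependency is not an option, a similar effect can be had by buffering the rows as they stream in and then draining them with a small worker pool once the 'end' event fires. The runPool helper below is an illustrative sketch, not part of puppeteer or csv-parser:

```javascript
// Run `worker` over `items` with at most `limit` invocations in flight.
// Results come back in input order.
async function runPool(items, limit, worker) {
  const results = [];
  let next = 0;
  const runner = async () => {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await worker(items[i]);
    }
  };
  // Start `limit` runners; each pulls a new item whenever it finishes one.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, runner)
  );
  return results;
}
```

With the original code, the rows would be collected into an array inside the 'data' handler and then, inside the 'end' handler, passed to something like runPool(rows, 4, (row) => getFile(row.ad_id, './images/' + row.ad_id + '.png')).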