Creating a continuously growing list with promise-pool and puppeteer

Problem description

I need to build a scraper with puppeteer, but I am having trouble adding items to the queue while it is being processed.

What I have so far

const PromisePool = require("@supercharge/promise-pool");
const puppeteer = require("puppeteer");

const domain = process.argv[2];

let list = [];
list[0] = domain;

const run = async () => {
  const { results, errors } = await PromisePool.for(list)
    .withConcurrency(2)
    .process(async (webpage) => {
      links = [];

      const getData = async () => {
        return await page.evaluate(async () => {
          return await new Promise((resolve) => {
            resolve(Array.from(document.querySelectorAll("a")).map((anchor) => [anchor.href]));
          });
        });
      };

      links = await getData();

      for (var link in links) {
        var new_url = String(links[link]);
        new_url = new_url.split("#")[0];
        console.log("new url: " + new_url);
        if (new_url.includes(domain)) {
          if (new_url in list) {
            console.log("Url already exists: " + new_url);
            continue;
          }

          list[new_url] = new_url;
        } else {
          console.log("Url is external: " + new_url);
        }
      }
      browser.close();
    });
};

const mainFunction = async () => {
  const result = await run();
  return result;
};

(async () => {
  console.log(await mainFunction());
  console.log(list);
})();

The problem is in this part:

links = [];

const getData = async () => {
  return await page.evaluate(async () => {
    return await new Promise((resolve) => {
      resolve(Array.from(document.querySelectorAll("a")).map((anchor) => [anchor.href]));
    });
  });
};

links = await getData();

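page.evaluate is asynchronous and, as far as I can tell, the script does not wait for it to return, so the list of links is never updated for the next PromisePool task.

I need a way to wait for the response to come back and only then continue with the rest of the script.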

Tags: javascript, promise, puppeteer

Solution

You can use page.$$eval with a single await.

page.$$eval(selector, pageFunction[, ...args])

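This is basically what you are trying to achieve, because the $$eval method "runs Array.from(document.querySelectorAll(selector)) within the page [context] and passes it as the first argument to pageFunction." (docs)

For example: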

const links = await page.$$eval('a', anchors => anchors.map(el => el.href));
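To show how that one-liner might fit into the crawl from the question, here is a minimal sketch, not a definitive implementation. The helper names normalizeUrl, extractLinks and crawl are hypothetical, as is the caller-supplied getLinks(url) callback. The sketch assumes that a Set is an acceptable replacement for the list[new_url] = new_url bookkeeping: in JavaScript that assignment creates a named property on the array rather than a new element, so code that iterates the array never visits it. Newly found URLs are processed in waves, each wave with a fixed number of workers. The same wave could presumably be handed to PromisePool.for(frontier).withConcurrency(2).process(...) instead of the hand-rolled workers.

```javascript
// Sketch only: normalizeUrl, extractLinks and crawl are hypothetical helper names.

// Drop the #fragment so that /a and /a#top count as the same page.
function normalizeUrl(url) {
  return String(url).split("#")[0];
}

// Collect de-duplicated links containing `domain` from a Puppeteer page
// (or from any object that exposes a compatible $$eval method).
async function extractLinks(page, domain) {
  const hrefs = await page.$$eval("a", (anchors) => anchors.map((el) => el.href));
  const unique = new Set(hrefs.map(normalizeUrl).filter((u) => u.includes(domain)));
  return [...unique];
}

// Crawl in waves: process the current frontier with at most `concurrency`
// workers, gather unseen links into the next frontier, repeat until it is empty.
// `getLinks(url)` is supplied by the caller and resolves to an array of URLs.
async function crawl(startUrl, getLinks, concurrency = 2) {
  const seen = new Set([normalizeUrl(startUrl)]);
  let frontier = [...seen];
  const errors = [];

  while (frontier.length > 0) {
    const pending = [...frontier];
    const next = [];

    const worker = async () => {
      while (pending.length > 0) {
        const url = pending.shift();
        try {
          for (const link of await getLinks(url)) {
            const clean = normalizeUrl(link);
            if (!seen.has(clean)) {
              seen.add(clean);
              next.push(clean);
            }
          }
        } catch (error) {
          // Record the failure and keep crawling the remaining URLs.
          errors.push({ url, error });
        }
      }
    };

    await Promise.all(Array.from({ length: concurrency }, () => worker()));
    frontier = next;
  }

  return { urls: [...seen], errors };
}
```

With Puppeteer, getLinks could open a page with browser.newPage(), navigate with page.goto(url), return extractLinks(page, domain), and close the page afterwards. Keeping the browser-specific part behind a callback also makes the queue logic easy to try out with a stub.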
