
Problem Description

I'm very new to Puppeteer. I started yesterday, and I'm trying to make a program that flips through a URL, incrementing a stored player ID one at a time, and saves each player's stats using neDB. There are thousands of links to flip through, and I found that if I use a for loop my computer basically crashes because 1,000 Chromium instances all try to open at the same time. Is there a better or correct way to do this? Any advice would be appreciated.

const puppeteer = require('puppeteer');
const Datastore = require('nedb');

const database = new Datastore('database.db');
database.loadDatabase();

async function scrapeProduct(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  let attributes = [];

  //Getting player's name
  const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
  const txt = await name.getProperty('innerText');
  const playerName = await txt.jsonValue();
  attributes.push(playerName);

  //Getting all 12 individual stats of the player
  for(let i = 1; i < 13; i++){
    let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr['+i+']/td[2]';
    const [e1] = await page.$x(vLink);
    const val = await e1.getProperty('innerText');
    const skillVal = await val.jsonValue();
    attributes.push(skillVal);
  }

  //creating a player object to store the data how i want (i know this is probably ugly code and could be done in a much better way)
  let player = {
    Name: attributes[0],
    Athleticism: attributes[1],
    Speed: attributes[2],
    Durability: attributes[3],
    Work_Ethic: attributes[4],  
    Stamina: attributes[5], 
    Strength: attributes[6],    
    Blocking: attributes[7],
    Tackling: attributes[8],    
    Hands: attributes[9],   
    Game_Instinct: attributes[10],
    Elusiveness: attributes[11],    
    Technique: attributes[12],
  };

  database.insert(player);
  await browser.close();
}

//For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
for(let i = 0; i <= 1000; i++){
  let link = 'https://url.com/?id='+i+'&section=Ratings';
  scrapeProduct(link);
  console.log("Player #" + i + " scraped");
}

Tags: javascript, web-scraping, optimization, puppeteer, puppeteer-cluster

Solution


If you think the speed problem is reopening/closing the browser on every run, move the browser to the global scope and initialize it to null. Then create an init function like this:

let browser = null;

async function init(){
  if(!browser)
    browser = await puppeteer.launch()
}

Then allow a page to be passed into your scrapeProduct function: async function scrapeProduct(url) becomes async function scrapeProduct(url, page). Replace await browser.close() with await page.close(). Now your loop will look like this:
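A sketch of the reworked function under those changes, condensing the twelve stat fields from the question into an array of names (the XPaths are the ones from the question; `database` is assumed to be the nedb Datastore in the outer scope):

```javascript
// Sketch: scrapeProduct now receives an already-open page (tab) and
// closes only that tab when done, instead of launching and closing a
// whole browser per URL. Assumes `database` is in scope, as in the question.
async function scrapeProduct(url, page) {
  await page.goto(url);

  // Getting the player's name
  const [nameEl] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
  const player = {
    Name: await (await nameEl.getProperty('innerText')).jsonValue(),
  };

  // The 12 individual stats, in table-row order
  const stats = ['Athleticism', 'Speed', 'Durability', 'Work_Ethic',
                 'Stamina', 'Strength', 'Blocking', 'Tackling',
                 'Hands', 'Game_Instinct', 'Elusiveness', 'Technique'];
  for (let i = 0; i < stats.length; i++) {
    const [cell] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr[' + (i + 1) + ']/td[2]');
    player[stats[i]] = await (await cell.getProperty('innerText')).jsonValue();
  }

  database.insert(player);
  await page.close(); // close the tab, not the whole browser
}
```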

//For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
await init();
for(let i = 0; i <= 1000; i++){
  let link = 'https://url.com/?id='+i+'&section=Ratings';
  let page = await browser.newPage()
  scrapeProduct(link,page);
  console.log("Player #" + i + " scraped");
}
await browser.close()

If you want to limit how many pages the browser has open at the same time, you can create a function to do that:

async function getTotalPages(){
  const allPages = await browser.pages()
  return allPages.length
}
async function newPage(){
  const MAX_PAGES = 5
  await new Promise(resolve=>{
    // check once a second to check on pages open
    const interval = setInterval(async ()=>{
      let totalPages = await getTotalPages()
      if(totalPages < MAX_PAGES){
        clearInterval(interval)
        resolve()
      }
    },1000)
  })
  return await browser.newPage()
}

If you do that, then in your loop you would replace let page = await browser.newPage() with let page = await newPage().
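An alternative to polling browser.pages() on a timer is a small promise pool that caps how many tasks run at once. A sketch, independent of Puppeteer (`runWithLimit` is a made-up helper name; `worker` stands in for something like `link => scrapeProduct(link, page)`):

```javascript
// Sketch: run `worker(url)` for every url, but keep at most `limit`
// calls in flight at once. Each runner claims the next index
// synchronously, so no index is processed twice.
async function runWithLimit(urls, limit, worker) {
  const results = [];
  let next = 0;
  async function runner() {
    while (next < urls.length) {
      const i = next++; // claim an index before awaiting
      results[i] = await worker(urls[i]);
    }
  }
  const runners = [];
  for (let k = 0; k < Math.min(limit, urls.length); k++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

With something like this, the 1,000-link loop could become `await runWithLimit(links, 5, async (link) => { const page = await browser.newPage(); await scrapeProduct(link, page); })`, and, unlike the original loop, every scrapeProduct call is actually awaited.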

