Initializing the Puppeteer browser outside the scraping function

Question

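I'm very new to Puppeteer (I started today). I have some code that works the way I want, except for one issue that I think makes it extremely inefficient. I have a function that steps through potentially thousands of URLs with incremental IDs to pull each player's name, position, and stats, and then inserts that data into a NeDB database. Here is my code: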

const puppeteer = require('puppeteer');
const Datastore = require('nedb');
const database = new Datastore('database.db');
database.loadDatabase();

async function scrapeProduct(url, id){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  let attributes = [];

  const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
  const txt = await name.getProperty('innerText');
  const playerName = await txt.jsonValue();
  attributes.push(playerName);

  //Make sure that there is a legitimate player profile before trying to pull a bunch of 'undefined' information.
  if(playerName){
    const [role] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_position"]');
    const roleTxt = await role.getProperty('innerText');
    const playerRole = await roleTxt.jsonValue();
    attributes.push(playerRole);

    //Loop through the 12 attributes and pull their values.
    for(let i = 1; i < 13; i++){
      let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr['+i+']/td[2]';
      const [e1] = await page.$x(vLink);
      const val = await e1.getProperty('innerText');
      const skillVal = await val.jsonValue();
      attributes.push(skillVal);
    }

    //Create a player profile to be pushed into the database. (I realize this is very wordy and ugly code)
    let player = {
      Name: attributes[0],
      Role: attributes[1],
      Athleticism: attributes[2],
      Speed: attributes[3],
      Durability: attributes[4],
      Work_Ethic: attributes[5],
      Stamina: attributes[6],
      Strength: attributes[7],
      Blocking: attributes[8],
      Tackling: attributes[9],
      Hands: attributes[10],
      Game_Instinct: attributes[11],
      Elusiveness: attributes[12],
      Technique: attributes[13],
      _id: id,
    };

    database.insert(player);
    console.log('player #' + id + " scraped.");
    await browser.close();
  } else {
    console.log("Blank profile");
    await browser.close();
  }
}

//Make sure each URL is scraped before moving on to the next one. (I removed the real URL because it's unreasonably long and not important for this part.)
(async () => {
  for(let i = 0; i <= 1000; i++){
    let link = 'https://url.com/Ratings.aspx?rid='+i+'&section=Ratings';
    await scrapeProduct(link, i);
  }
})();

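I think what makes this so inefficient is that every time scrapeProduct() is called, I launch a new browser and create a new page. Instead, I believe it would be more efficient to create one browser and one page and just change the page URL with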

await page.goto(url)

I believe that to do what I'm trying to accomplish here, I need to move:

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

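outside of my scrapeProduct() function, but I can't seem to get this to work. Whenever I try, I get an error inside my function saying that page is not defined. I'm very new to Puppeteer (started today), and I would appreciate any guidance on how to accomplish this. Thank you very much!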

TL;DR

How do I create one browser instance and one page instance that a function can use repeatedly, changing only the await page.goto(url) call?

Tags: javascript, node.js, web-scraping, puppeteer, screen-scraping

Answer


To do this, you just need to separate the browser from your requests, for example in a class:

const puppeteer = require('puppeteer');

class PuppeteerScraper {
  async launch(options = {}) {
    this.browser = await puppeteer.launch(options);
    // you could reuse the page instance if it was defined here
  }

  /**
   * Pass the address and the function that will scrape your data,
   * in order to keep the page inside this object
   */
  async goto(url, callback) {
    const page = await this.browser.newPage();
    await page.goto(url);

    /**evaluate its content */
    await callback(page);
    await page.close();
  }

  async close() {
    await this.browser.close();
  }
}

And, to use it:

/**
 * scrape function, takes the page instance as its parameter
 */
async function evaluate_page(page) {
  const titles = await page.$$eval('.col-xs-6 .star-rating ~ h3 a', (items) => {
    const text_titles = [];
    for (const item of items) {
      if (item && item.textContent) {
        text_titles.push(item.textContent);
      }
    }
    return text_titles;
  });
  console.log('titles', titles);
}

(async () => {
  const scraper = new PuppeteerScraper();
  await scraper.launch({ headless: false });

  for (let i = 1; i <= 6; i++) {
    let link = `https://books.toscrape.com/catalogue/page-${i}.html`;
    await scraper.goto(link, evaluate_page);
  }
  // wait for the browser to shut down before the script exits
  await scraper.close();
})();
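
For the specific case in the question (one browser, one page, many URLs), a minimal sketch might look like the following. The names textOf, scrapePlayer, scrapeAll, main, and the onPlayer callback are illustrative and do not come from the original code. As a simplification, it reads only the name and role, and it uses CSS ID selectors through page.$ in place of the question's XPath calls. The page is created once and passed in as a parameter, which should avoid the "page is not defined" error that typically appears when page is declared in a different function's scope.

```javascript
// Sketch only: helper names below are hypothetical, not part of Puppeteer or the question's code.

// Returns the innerText of the first element matching a CSS selector, or null if none exists.
async function textOf(page, selector) {
  const handle = await page.$(selector);
  if (!handle) return null;
  const prop = await handle.getProperty('innerText');
  return prop.jsonValue();
}

// Navigates the shared page to one profile and returns a player object, or null for a blank profile.
async function scrapePlayer(page, url, id) {
  await page.goto(url);
  const name = await textOf(page, '#ctl00_ctl00_ctl00_Main_Main_name');
  if (!name) return null;
  const role = await textOf(page, '#ctl00_ctl00_ctl00_Main_Main_position');
  return { _id: id, Name: name, Role: role };
}

// Opens a single page on the given browser and reuses it for every id in the range.
async function scrapeAll(browser, firstId, lastId, onPlayer) {
  const page = await browser.newPage();
  try {
    for (let id = firstId; id <= lastId; id++) {
      // Placeholder URL pattern taken from the question.
      const url = 'https://url.com/Ratings.aspx?rid=' + id + '&section=Ratings';
      const player = await scrapePlayer(page, url, id);
      if (player) await onPlayer(player);
    }
  } finally {
    await page.close();
  }
}

// Launches one browser for the whole run and always closes it at the end.
async function main() {
  // Required here so the helpers above can be exercised without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    await scrapeAll(browser, 0, 1000, async (player) => {
      console.log('player #' + player._id + ' scraped.');
    });
  } finally {
    await browser.close();
  }
}
```

Calling main().catch(console.error) would start the run. In the question's setup, the onPlayer callback is where database.insert(player) would go.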

If you want something more sophisticated, you could take a look at how the Apify project does it.

