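javascript - Properly looping through multiple links
Problem description
I'm very new to Puppeteer. I started yesterday, and I'm trying to write a program that walks through a URL that stores player IDs incrementally, one after another, and saves each player's stats with neDB. There are thousands of links to go through, and I've found that if I use a for loop my computer basically crashes, because 1,000 Chromium instances try to open all at once. Is there a better or proper way to do this? Any advice would be appreciated.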
const puppeteer = require('puppeteer');
const Datastore = require('nedb');
const database = new Datastore('database.db');
database.loadDatabase();
async function scrapeProduct(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  let attributes = [];
  //Getting player's name
  const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
  const txt = await name.getProperty('innerText');
  const playerName = await txt.jsonValue();
  attributes.push(playerName);
  //Getting all 12 individual stats of the player
  for(let i = 1; i < 13; i++){
    let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr['+i+']/td[2]';
    const [e1] = await page.$x(vLink);
    const val = await e1.getProperty('innerText');
    const skillVal = await val.jsonValue();
    attributes.push(skillVal);
  }
  //creating a player object to store the data how i want (i know this is probably ugly code and could be done in a much better way)
  let player = {
    Name: attributes[0],
    Athleticism: attributes[1],
    Speed: attributes[2],
    Durability: attributes[3],
    Work_Ethic: attributes[4],
    Stamina: attributes[5],
    Strength: attributes[6],
    Blocking: attributes[7],
    Tackling: attributes[8],
    Hands: attributes[9],
    Game_Instinct: attributes[10],
    Elusiveness: attributes[11],
    Technique: attributes[12],
  };
  database.insert(player);
  await browser.close();
}
//For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
for(let i = 0; i <= 1000; i++){
  let link = 'https://url.com/?id='+i+'&section=Ratings';
  scrapeProduct(link);
  console.log("Player #" + i + " scraped");
}
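Solution
If you think the slowdown comes from reopening and closing the browser on every run, move browser to the global scope and initialize it to null. Then create an init function like this: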
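// Shared browser, declared in the global scope as described above
let browser = null;
async function init(){
  // Launch only once; later calls reuse the existing browser
  if(!browser)
    browser = await puppeteer.launch()
}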
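The idea behind init() is a lazily created, shared browser: launch once, reuse everywhere. Below is a minimal sketch of that pattern, assuming a hypothetical fakeLaunch() as a stand-in for puppeteer.launch() so that it runs without Chromium; getBrowser and initDemo are likewise illustrative names, not part of the answer. It caches the launch promise rather than the resolved browser, so callers that arrive at the same time still share a single launch.

```javascript
// Hypothetical stand-in for puppeteer.launch(); counts how often it is called.
let launchCount = 0;
async function fakeLaunch() {
  launchCount += 1;
  // Simulate the time a real browser takes to start
  await new Promise(resolve => setTimeout(resolve, 20));
  return { id: launchCount };
}

// Cache the launch promise, not the browser, so concurrent callers share one launch.
let browserPromise = null;
function getBrowser() {
  if (!browserPromise) {
    browserPromise = fakeLaunch();
  }
  return browserPromise;
}

async function initDemo() {
  // Three callers ask for the browser at the same time
  const [a, b, c] = await Promise.all([getBrowser(), getBrowser(), getBrowser()]);
  return { sameInstance: a === b && b === c, launchCount };
}

initDemo().then(result => console.log(result));
```

In the real scraper, fakeLaunch() would be replaced by puppeteer.launch().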
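Allow the page to be passed in to your scrapeProduct function: async function scrapeProduct(url) becomes async function scrapeProduct(url,page), and the puppeteer.launch() and browser.newPage() calls inside it are no longer needed, since the browser and the page now come from outside. Replace await browser.close() with await page.close(). Your loop will now look like this: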
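// Loop through the player links. url.com stands in for the real URL, which is very long and not important.
// Wrapped in an async function because top-level await is not available in a CommonJS script.
async function main(){
  await init();
  const tasks = [];
  for(let i = 0; i <= 1000; i++){
    let link = 'https://url.com/?id='+i+'&section=Ratings';
    let page = await browser.newPage()
    // Not awaited, so the scrapes overlap; keep the promise so it can be waited on below
    tasks.push(scrapeProduct(link,page).then(() => console.log("Player #" + i + " scraped")));
  }
  // Wait for every scrape to finish before closing the shared browser
  await Promise.all(tasks);
  await browser.close()
}
main();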
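Keep in mind that scrapeProduct is an async function: calling it without await starts the scrape and moves straight on to the next iteration, which is why a plain for loop ends up with everything running at once. The sketch below illustrates the difference, assuming a hypothetical fakeScrape() stand-in that only tracks how many calls overlap; allAtOnce, oneAtATime and makeTracker are illustrative names. If one page at a time is fast enough, awaiting each call is the simplest throttle.

```javascript
// Hypothetical stand-in for scrapeProduct(); tracks how many calls overlap.
function makeTracker() {
  const stats = { active: 0, maxActive: 0 };
  async function fakeScrape() {
    stats.active += 1;
    stats.maxActive = Math.max(stats.maxActive, stats.active);
    await new Promise(resolve => setTimeout(resolve, 10)); // pretend to load a page
    stats.active -= 1;
  }
  return { stats, fakeScrape };
}

// Un-awaited calls: every scrape starts before any of them finishes.
async function allAtOnce(count) {
  const { stats, fakeScrape } = makeTracker();
  const tasks = [];
  for (let i = 0; i < count; i++) {
    tasks.push(fakeScrape());
  }
  await Promise.all(tasks); // still wait for all of them before cleaning up
  return stats.maxActive;
}

// Awaited calls: the loop pauses until each scrape is done.
async function oneAtATime(count) {
  const { stats, fakeScrape } = makeTracker();
  for (let i = 0; i < count; i++) {
    await fakeScrape();
  }
  return stats.maxActive;
}

async function compare() {
  console.log('all at once, peak concurrency:', await allAtOnce(5));
  console.log('one at a time, peak concurrency:', await oneAtATime(5));
}
compare();
```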
If you want to limit how many pages the browser has open at the same time, you can write a function that does that:
async function getTotalPages(){
  // Note: browser.pages() typically also counts the blank tab opened at launch
  const allPages = await browser.pages()
  return allPages.length
}
async function newPage(){
  const MAX_PAGES = 5
  await new Promise(resolve=>{
    // Poll once a second until fewer than MAX_PAGES pages are open
    // (the first check also runs only after one second)
    const interval = setInterval(async ()=>{
      let totalPages = await getTotalPages()
      if(totalPages < MAX_PAGES){
        clearInterval(interval)
        resolve()
      }
    },1000)
  })
  return await browser.newPage()
}
If you do this, replace let page = await browser.newPage() in your loop with let page = await newPage().
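Polling works, but every new page waits at least one second for the first check. As an alternative sketch, not part of the answer above, a fixed number of workers can pull links from a shared list, so a new scrape starts as soon as an earlier one finishes. runWithLimit, poolDemo and fakeScrape are hypothetical names; fakeScrape stands in for opening a page and calling scrapeProduct(link, page).

```javascript
// Run worker(item) for every item, with at most `limit` calls in flight.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner keeps pulling items until none are left
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

async function poolDemo() {
  let active = 0;
  let peak = 0;
  // Hypothetical stand-in for the real scrape; records peak concurrency.
  async function fakeScrape(id) {
    active += 1;
    peak = Math.max(peak, active);
    await new Promise(resolve => setTimeout(resolve, 10));
    active -= 1;
    return 'player ' + id;
  }

  const ids = Array.from({ length: 12 }, (_, i) => i);
  const results = await runWithLimit(ids, 5, fakeScrape);
  return { count: results.length, first: results[0], peak };
}

poolDemo().then(result => console.log(result));
```

Here the limit of 5 plays the same role as MAX_PAGES. In the real scraper the worker would build the link from the id, open a page, scrape it, and close the page before returning.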