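node.js - Passing arguments to the function given to Puppeteer's page.evaluate for dynamic web scraping
Question
I want to pass arguments to the function given to page.evaluate() so that I can scrape dynamically, but nothing has worked for me. Can anyone help? I am trying to use arguments with Puppeteer's page.evaluate to scrape a large number of sites, starting with pharmavida. I want to pass each site's home URL as a parameter, extract every section from the page, and then extract the data from each section. The problem is that I cannot get arguments into the function that runs inside page.evaluate, which I need in order to walk through each page's sections dynamically. I also tried declaring a variable with let outside page.evaluate and passing the selector class of the sections' parent element to querySelectorAll(), but it says the variable is not defined, except when I put it in as a string rather than as a parameter.
Example: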
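const data = await page.evaluate(function (params) {
  // Runs in the browser context; params is the serialized second argument.
  const myData = document.querySelectorAll(params.firstElementClass)
  return {
    // Return serializable values rather than DOM nodes.
    data: Array.from(myData, (el) => el.textContent)
  }
}, params)
console.warn(data) // good data return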
But nothing I have tried works for me. I want to build a dynamic web scraper for multiple sites and sections:
// sites.js
const FarmaVidaHome = 'https://drogueriasfarmavida.com'
const FarmaTodoHome = 'https://www.farmatodo.com.ve'
const CruzVerde = 'https://www.cruzverde.com.co'
const LaBotica = 'https://www.tudrogueriavirtual.com/?v=9293'

module.exports = {
  sites: [
    {
      homeUrl: FarmaVidaHome,
      navigationType: 'navbar',
      fatherSectionClass: '.nav-top-link',
      data: {
        productCardClass: '.product-type-simple',
        paginationClass: '.woocommerce-pagination',
        idClass: '.image-fade_in_back a',
        product_nameClass: '.product-title',
        imageClass: '.attachment-woocommerce_thumbnail',
        categoryClass: '.product-cat',
        priceClass: '.woocommerce-Price-amount'
      }
    }
  ]
}
// index.js (entry point)
const puppeteer = require('puppeteer')
const { sites } = require('./sites')
const { exploringPages } = require('./src/navigation/index')

const startScraping = async (datas) => {
  console.warn('THIS IS THE SITES-->', datas)
  const dataAgruped = []
  for (let i = 0; i < datas.length; i++) {
    const pageItem = datas[i]
    const response = await exploringPages(pageItem)
    // Collect the scraped result rather than the input config.
    dataAgruped.push(response)
  }
  return dataAgruped
}

// Report a failed run instead of leaving the promise rejection unhandled.
startScraping(sites).catch((err) => console.error(err))
// src/navigation/index.js
const puppeteer = require('puppeteer')

const exploringPages = async (thePage) => {
  console.warn('QUE VIENE AQUIII-->', thePage)
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    //await page.type('#selector', 'lo que quieres buscaar')
    await page.goto(thePage.homeUrl)
    // The second argument is serialized and handed to the callback in the browser.
    const dataNavigation = await page.evaluate(({ thisItem }) => {
      // This log shows up in the browser console, not in the Node.js terminal.
      console.warn('PAGE thisItem EN IVALUATE-->', thisItem)
      const $sections = document.querySelectorAll(thisItem.fatherSectionClass)
      const data = []
      $sections.forEach(($section) => {
        data.push({
          path: $section.getAttribute('href'),
          // data: thisItem.data
        })
      })
      return { sections: data }
    }, { thisItem: thePage })
    console.warn('this is the sections--->', dataNavigation)
    // await exploringSections(dataNavigation.sections)
    return dataNavigation
  } finally {
    // Close the browser even when navigation or evaluation throws.
    await browser.close()
  }
}

module.exports = {
  exploringPages
}
Terminal output:
(node:22759) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded
at /Users/devios/Downloads/work/tests/node_modules/puppeteer/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
(Use `node --trace-warnings ...` to show where the warning was created)
(node:22759) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:22759) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
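The log above points at two separate things: page.goto() did not finish within the default 30000 ms, and nothing caught the resulting rejection. A minimal sketch of one way to handle both, assuming a hypothetical helper named safeGoto (not part of the question's code) and an arbitrarily chosen 60000 ms limit:

```javascript
// Hypothetical helper, not from the question: navigate and report a failure
// instead of letting the rejection go unhandled.
async function safeGoto(page, url, timeoutMs = 60000) {
  try {
    // 'domcontentloaded' resolves earlier than the default 'load' event,
    // which can help on pages with slow third-party resources.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    return true;
  } catch (err) {
    console.warn(`Navigation to ${url} failed: ${err.message}`);
    return false;
  }
}
```

Inside exploringPages this could be used as `if (!(await safeGoto(page, thePage.homeUrl))) return null`, so that one slow site does not abort the whole run. Whether 'domcontentloaded' is early enough for these particular sites is an assumption that would need checking.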
Solution
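Not an authoritative answer, but a minimal sketch of the usual pattern: the callback given to page.evaluate() runs in the browser and cannot see Node.js variables, so every value it needs has to be passed as an extra argument after the function, and those values must be serializable. The helper name collectSectionPaths is hypothetical; the selector is assumed to be something like the fatherSectionClass from sites.js.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page, `selector` a CSS selector string.
async function collectSectionPaths(page, selector) {
  // `selector` is listed after the callback, so Puppeteer serializes it and
  // hands it to the callback as `sel` inside the browser context.
  return page.evaluate((sel) => {
    const links = document.querySelectorAll(sel);
    // Return plain strings; DOM nodes do not survive the trip back to Node.js.
    return Array.from(links, (a) => a.getAttribute('href'));
  }, selector);
}
```

With the config from the question this would be called as `const paths = await collectSectionPaths(page, thePage.fatherSectionClass)`. Passing a whole config object works the same way as long as it contains only plain data (strings, numbers, arrays, nested objects).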