Passing parameters to Puppeteer's page.evaluate function for dynamic web scraping

Problem description

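I want to pass parameters to the function given to page.evaluate() so that I can scrape dynamically, but nothing has worked for me. Can anyone help? I am trying to use Puppeteer's page.evaluate with a parameterized function to scrape a large number of pages, starting with FarmaVida. I want to pass in each site's home URL, extract every section from the page, and then extract the data from each section, but I have not been able to pass parameters to the function inside page.evaluate, which I need in order to scrape each page's sections dynamically. I also tried declaring a variable with let outside page.evaluate and passing the selector class of the sections' parent element to querySelectorAll(), but it says the variable is not defined. It does work when I hard-code the selector as a string instead of passing it as a parameter.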

Example:

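const data = await page.evaluate(function (params) {
  // Runs in the browser context; `params` is the second argument given to page.evaluate
  // DOM nodes cannot be returned to Node, so return their text instead
  const myData = Array.from(document.querySelectorAll(params.firstElementClass), (el) => el.textContent)
  return {
    data: myData
  }
}, params)
console.warn(data) // desired result: the data is returned correctly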
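
A minimal sketch of the argument-passing pattern described above, assuming `page` is a Puppeteer Page and `params` is a plain, JSON-serializable object. The helper names `collectText` and `readElements` are hypothetical and not part of the original code. The callback runs inside the browser, so it can only see what is passed as extra arguments to page.evaluate; variables declared with let in the Node script are not visible there.

```javascript
// Runs inside the browser page: `document` exists there, Node variables do not.
function collectText(selector) {
  // Return plain strings, because DOM nodes cannot be serialized back to Node.
  return Array.from(document.querySelectorAll(selector), (el) => el.textContent.trim())
}

// Hypothetical helper: `page` is a Puppeteer Page, `params` a plain object.
async function readElements(page, params) {
  // Everything after the function is serialized and passed to it as arguments.
  const data = await page.evaluate(collectText, params.firstElementClass)
  return { data }
}
```

It could be called as `await readElements(page, { firstElementClass: '.nav-top-link' })` once the page has finished loading.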

But nothing I have tried works. I want to build a dynamic web scraper that handles multiple sites and sections:

const FarmaVidaHome = 'https://drogueriasfarmavida.com'
const FarmaTodoHome = 'https://www.farmatodo.com.ve'
const CruzVerde = 'https://www.cruzverde.com.co'
const LaBotica = 'https://www.tudrogueriavirtual.com/?v=9293'

module.exports = {
  sites: [
    {
      homeUrl: FarmaVidaHome,
      navigationType: 'navbar',
      fatherSectionClass: '.nav-top-link',
      // CSS selectors used to extract product data
      data: {
        productCardClass: '.product-type-simple',
        paginationClass: '.woocommerce-pagination',
        idClass: '.image-fade_in_back a',
        product_nameClass: '.product-title',
        imageClass: '.attachment-woocommerce_thumbnail',
        categoryClass: '.product-cat',
        priceClass: '.woocommerce-Price-amount'
      }
    }
  ]
}
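
As an illustration of how the `data` selectors in this configuration might be consumed, here is a hedged sketch of a function meant to run inside page.evaluate on a product listing page. The name `extractProducts` is hypothetical, and reading the id from the link's href and the image from its src attribute are assumptions, since the original does not say where those values live.

```javascript
// Intended to run in the browser via page.evaluate(extractProducts, site.data).
function extractProducts(selectors) {
  // Text of the first match inside `root`, or null when nothing matches.
  const text = (root, selector) => {
    const el = root.querySelector(selector)
    return el ? el.textContent.trim() : null
  }
  return Array.from(document.querySelectorAll(selectors.productCardClass), (card) => {
    const link = card.querySelector(selectors.idClass)
    const image = card.querySelector(selectors.imageClass)
    return {
      id: link ? link.getAttribute('href') : null, // assumption: the link href identifies the product
      product_name: text(card, selectors.product_nameClass),
      image: image ? image.getAttribute('src') : null, // assumption: the image URL is in src
      category: text(card, selectors.categoryClass),
      price: text(card, selectors.priceClass)
    }
  })
}
```

Because the function returns only plain objects and strings, the result can be sent back from the browser to the Node process.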


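// Entry point: runs the scraper over every configured site
const { sites } = require('./sites')
const { exploringPages } = require('./src/navigation/index')

const startScraping = async (datas) => {
  console.warn('THIS IS THE SITES-->', datas)
  const dataAgruped = []
  for (let i = 0; i < datas.length; i++) {
    const pageItem = datas[i]
    // Collect the scraped result rather than the input configuration
    const response = await exploringPages(pageItem)
    dataAgruped.push(response)
  }
  return dataAgruped
}

// Catch rejections so a failed navigation is not reported as an unhandled promise rejection
startScraping(sites).catch((error) => console.error('Scraping failed:', error))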






// src/navigation/index.js
const puppeteer = require('puppeteer')

const exploringPages = async (thePage) => {
  console.warn('WHAT ARRIVES HERE-->', thePage)
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    // await page.type('#selector', 'what you want to search for')
    await page.goto(thePage.homeUrl)
    // The second argument is serialized and handed to the callback in the browser context
    const dataNavigation = await page.evaluate((thisItem) => {
      // This log appears in the page's console, not in the Node terminal
      console.warn('PAGE thisItem IN EVALUATE-->', thisItem)
      const $sections = document.querySelectorAll(thisItem.fatherSectionClass)
      const data = []
      $sections.forEach(($section) => {
        data.push({
          path: $section.getAttribute('href')
          // data: thisItem.data
        })
      })
      return { sections: data }
    }, thePage)
    console.warn('this is the sections--->', dataNavigation)
    // await exploringSections(dataNavigation.sections)
    return dataNavigation
  } finally {
    // Always release the browser, even when navigation times out
    await browser.close()
  }
}

module.exports = {
    exploringPages
}
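
The href values collected into `sections` may be relative, empty, or repeated, so they likely need normalizing before each section can be visited. The sketch below shows one way to do that with Node's built-in URL class. The helper name `toAbsoluteUrls` is hypothetical, and skipping fragment-only links is an assumption about what counts as a section.

```javascript
// Turns the { path } entries returned from page.evaluate into unique absolute URLs.
function toAbsoluteUrls(homeUrl, sections) {
  const seen = new Set()
  const urls = []
  for (const { path } of sections) {
    // Skip missing hrefs and in-page anchors such as "#".
    if (!path || path.startsWith('#')) continue
    // Resolves relative paths against the site's home URL; absolute URLs pass through.
    const url = new URL(path, homeUrl).href
    if (!seen.has(url)) {
      seen.add(url)
      urls.push(url)
    }
  }
  return urls
}
```

The resulting list could then be fed to a follow-up step such as the commented-out `exploringSections` call.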

Terminal output:

(node:22759) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded
    at /Users/devios/Downloads/work/tests/node_modules/puppeteer/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
(Use `node --trace-warnings ...` to show where the warning was created)
(node:22759) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:22759) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
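
The log above reports a navigation timeout, which suggests that page.goto gave up before page.evaluate ever ran. As a hedged sketch, navigation could be wrapped so that a slow site is logged instead of causing an unhandled rejection. The helper name `safeGoto` and the 60-second default are assumptions. `waitUntil: 'domcontentloaded'` and `timeout` are options accepted by Puppeteer's page.goto, but whether they resolve this particular timeout depends on the site.

```javascript
// Hypothetical wrapper: `page` is a Puppeteer Page.
async function safeGoto(page, url, timeoutMs = 60000) {
  try {
    // Resolve once the DOM is parsed instead of waiting for every resource to load.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs })
    return true
  } catch (err) {
    // Report the failure and let the caller decide whether to skip or retry.
    console.warn(`Navigation to ${url} failed: ${err.message}`)
    return false
  }
}
```

A caller could skip the site when `safeGoto` returns false and continue with the next entry in `sites`.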

Tags: node.js, puppeteer, screen-scraping

Solution

