Puppeteer scraping attempts always end with an undefined value

Problem description

This is simple code that should work, but it doesn't.

const puppeteer = require ('puppeteer');

async function scrapeProduct(url) {
const browser = await puppeteer.launch({ headless:false });
const page = await browser.newPage();
await page.setExtraHTTPHeaders({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
});
await page.goto(url)

const [el] = await page.$x('/html/body/main/div[1]/div/div/div[2]/h1');
const txt = await el.getProperty('txt')
const srcText = await txt.jsonValue()

console.log(srcText)
}
scrapeProduct('https://getbootstrap.com/')

//Same result on other urls as well.

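I also tried using querySelector instead of XPath. That works in some cases and logs the first value of the node as expected, but querySelectorAll on the same element then returns "undefined" again. I have looked everywhere and cannot find a solution.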

Tags: node.js, web-scraping, puppeteer

Solution


This is how I do it: wait for the element to appear, then read its textContent inside the page.

const puppeteer = require("puppeteer");

async function scrapeProduct(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({
    "user-agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
  });
  await page.goto(url);

  // wait until the element matched by the XPath appears in the page
  await page.waitForXPath("/html/body/main/div[1]/div/div/div[2]/h1");

  // evaluate the XPath expression (returns an array of ElementHandle)
  const headings = await page.$x("/html/body/main/div[1]/div/div/div[2]/h1");

  // read the textContent of the first match inside the page context
  const textContent = await page.evaluate((el) => el.textContent, headings[0]);

  console.log(textContent);

  // close the browser so the Node.js process can exit
  await browser.close();
}
scrapeProduct('https://getbootstrap.com/')
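
The undefined value in the question most likely comes from el.getProperty('txt'): DOM elements have no property named txt, so the returned handle resolves to undefined. As a rough sketch of a smaller fix that keeps the ElementHandle approach, the text can be read from the real textContent property instead. The helper name readText is hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper (not a Puppeteer API): read the text of an ElementHandle
// through a DOM property that actually exists.
async function readText(elementHandle) {
  if (!elementHandle) {
    // The query matched nothing, so there is no handle to read from.
    return null;
  }
  // 'textContent' is a real DOM property; 'txt' is not, which yields undefined.
  const prop = await elementHandle.getProperty('textContent');
  const text = await prop.jsonValue();
  return typeof text === 'string' ? text.trim() : null;
}
```

Inside scrapeProduct this could be used as `const [el] = await page.$x(...); console.log(await readText(el));`. Returning null for a missing handle is a design choice of this sketch, so that an XPath with no match does not throw on an undefined element.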

If this helps, please upvote my answer!

