How to get every innerHTML when only $eval gives a result ($$eval returns undefined)

Problem description

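There is a table, and I am trying to extract three pieces of information from each row. Once that is done, the script should scroll to the bottom of the page, click "Load more", scrape the new data, and so on until there is no "Load more" button left.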

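To extract all the data from the table I used $$eval, but that results in undefined. If I use $eval instead I do get data, but only from the first row of the table. Why does $$eval return undefined, and if I cannot use it, how do I loop over the table to get all the values with $eval?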

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false }); // default is true
  const page = await browser.newPage();
  await page.goto('someexamplesite.com', {
    waitUntil: 'domcontentloaded',
  });

  const ExerciseName = await page.$$eval(
    '.ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExHeading > a',
    (e) => e.innerText
  );

  const muscleTargeted = await page.$$eval(
    ' .ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExResult-muscleTargeted > a',
    (e) => e.innerText
  );

  const equipmentType = await page.$$eval(
    '.ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExResult-equipmentType > a',
    (e) => e.innerText
  );

  //click on load more
  await page.waitForSelector(
    '#js-ex-content > #js-ex-category-body > .ExCategory-results > .ExLoadMore > .bb-flat-btn'
  );

  console.log({ ExerciseName, muscleTargeted, equipmentType });

  await browser.close();
})().catch((e) => {
  console.error(e);
});

The markup I am trying to scrape:

<div class="ExCategory-results">
    <div class="ExCategory-resultsLoadIndicator" id="js-ex-finder-load-indicator">
      <div class="ExCategory-resultsLoadIndicatorBox">
        <div class="ExCategory-resultsLoadIndicatorSpinner bb-spinner-btn__spinner"></div>
      </div>
    </div>
        
          <div class="ExResult-row  flexo-container flexo-between" itemscope="" itemtype="http://schema.org/ExerciseAction">
            <div class="ExResult-cell ">
                <!-- using male photos -->
                <img class="ExImg ExResult-img  ls-is-cached lazyloaded" width="70" height="70" onerror="if (window._E_) _E_(this)" alt=" thumbnail image" src="https://www.websites.com/exercises/exerciseImages/sequences/742/Male/m/742_1.jpg" data-src="https://www.websites.com/exercises/exerciseImages/sequences/742/Male/m/742_1.jpg" itemprop="image">
            </div>
            <div class="ExResult-cell ExResult-cell--nameEtc">
              <h3 class="ExHeading ExResult-resultsHeading">
                <a href="/exercises/rickshaw-carry" itemprop="name">
                  Rickshaw Carry
                </a>
              </h3>
              <div class="ExResult-details ExResult-muscleTargeted">
                Muscle Targeted:
                <a href="/exercises/muscle/forearms">
                  Forearms
                </a>
              </div>
              <div class="ExResult-details ExResult-equipmentType">
                Equipment Type:
                <a href="/exercises/equipment/other">
                  Other
                </a>
              </div>
            </div>
            <div class="ExResult-cell ExResult-cell--rating">
              <div class="ExRating">
                <div class="ExRating-badge">
                  9.6
                </div>
                <div class="ExRating-description ExRating-description--Average">
                  Average
                </div>
              </div>
            </div>
          </div>        
        
          <div class="ExResult-row  flexo-container flexo-between" itemscope="" itemtype="http://schema.org/ExerciseAction">
            <div class="ExResult-cell ">
                <!-- using male photos -->
                <img class="ExImg ExResult-img  ls-is-cached lazyloaded" width="70" height="70" onerror="if (window._E_) _E_(this)" alt=" thumbnail image" src="https://www.websites.com/images/2020/xdb/cropped/xdb-50m-single-leg-leg-press-m1-square-600x600.jpg" data-src="https://www.websites.com/images/2020/xdb/cropped/xdb-50m-single-leg-leg-press-m1-square-600x600.jpg" itemprop="image">
            </div>
            <div class="ExResult-cell ExResult-cell--nameEtc">
              <h3 class="ExHeading ExResult-resultsHeading">
                <a href="/exercises/single-leg-press" itemprop="name">
                  Single-Leg Press
                </a>
              </h3>
              <div class="ExResult-details ExResult-muscleTargeted">
                Muscle Targeted:
                <a href="/exercises/muscle/quadriceps">
                  Quadriceps
                </a>
              </div>
              <div class="ExResult-details ExResult-equipmentType">
                Equipment Type:
                <a href="/exercises/equipment/machine">
                  Machine
                </a>
              </div>
            </div>
            <div class="ExResult-cell ExResult-cell--rating">
              <div class="ExRating">
                <div class="ExRating-badge">
                  9.6
                </div>
                <div class="ExRating-description ExRating-description--Average">
                  Average
                </div>
              </div>
            </div>
          </div>        

Tags: javascript, web-scraping, puppeteer

Solution


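The page.$$eval method runs Array.from(document.querySelectorAll(selector)) under the hood, so what your callback receives is an array. You cannot apply (e) => e.innerText directly to an array (even if its length is 1): an array has no innerText property, so you get undefined. You have to either iterate over it or pick the right element by index (e.g. e[0].innerText).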
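To see why the callback returns undefined, here is a minimal sketch in plain Node.js. Plain objects stand in for DOM elements (an assumption for illustration only; no browser or Puppeteer is involved), and the variable names are made up for this example.

```javascript
// Plain objects stand in for DOM elements here; only `innerText` matters.
const matched = [{ innerText: 'Rickshaw Carry' }];

// What the question's callback does: it reads a property the array does not have.
const wrong = ((e) => e.innerText)(matched);

// Two working alternatives: index into the array, or map over it.
const first = ((e) => e[0].innerText)(matched);
const all = ((els) => els.map((el) => el.innerText))(matched);

console.log(wrong); // undefined
console.log(first); // Rickshaw Carry
console.log(all);   // [ 'Rickshaw Carry' ]
```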

You can use Array.map to iterate over the matched elements and collect the innerText of each one into an array.

const exerciseName = await page.$$eval(
    '.ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExHeading > a',
    elements => elements.map(el => el.innerText)
  )

Output:

[ 'Rickshaw Carry' ]
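The same map idea can probably be extended to whole rows, so that all three values come back from one $$eval call. The sketch below is an assumption built on the markup in the question, not something taken from the original answer: extractRows and textOf are hypothetical names, and the callback expects the '.ExCategory-results > .ExResult-row' elements as input.

```javascript
// Callback meant to be passed to page.$$eval. It runs inside the page, so it
// must not rely on variables from the surrounding Node.js scope.
function extractRows(rows) {
  // Return the trimmed text of the first match inside a row, or null if absent.
  const textOf = (row, selector) => {
    const el = row.querySelector(selector);
    return el ? el.innerText.trim() : null;
  };
  return rows.map((row) => ({
    exerciseName: textOf(row, '.ExHeading > a'),
    muscleTargeted: textOf(row, '.ExResult-muscleTargeted > a'),
    equipmentType: textOf(row, '.ExResult-equipmentType > a'),
  }));
}

// Intended use inside the async function from the question (not run here):
// const exercises = await page.$$eval('.ExCategory-results > .ExResult-row', extractRows);
```

Each row is queried relative to itself, so no row index is needed, and a row with a missing field yields null for that field instead of throwing.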

Edit:

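You can iterate over the rows by (1) counting the elements that share the row class name and then using a loop with an index (a regular for loop is the easiest):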

const rowsCounts = await page.$$eval('.ExCategory-results > .ExResult-row', rows => rows.length)

Then (2) iterate over the children .ExResult-row:nth-child(n) ... and collect the innerTexts into an array (exerciseNames):

const exerciseNames = []
for (let i = 1; i < rowsCounts + 1; i++) { // you might need i = 2
  const exerciseName = await page.$eval(
    `.ExCategory-results > .ExResult-row:nth-child(${i}) > .ExResult-cell > .ExHeading > a`,
    el => el.innerText)
  exerciseNames.push(exerciseName)
}

Output:

[
  'Rickshaw Carry',
  'Single-Leg Press',
  'Landmine twist',
  'Weighted pull-up',
  'T-Bar Row with Handle',
  'Palms-down wrist curl over bench'
]

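Note: the loop should start from 1, not 0, because there is no "nth-child(0)". In the markup you posted, the first child of .ExCategory-results is the load indicator rather than a row, so you might need to start at 2. If you do, the upper bound probably has to move by one as well, otherwise the last row is skipped; :nth-child counts every sibling element, not only the ones matching .ExResult-row.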
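If guessing the starting index feels fragile, one option is to ask the page which child positions actually hold rows. This is only a sketch under assumptions: rowPositions is a hypothetical helper, it assumes a single .ExCategory-results container on the page, and it expects all direct children of that container as input.

```javascript
// Callback meant for page.$$eval('.ExCategory-results > *', rowPositions).
// It returns the 1-based :nth-child position of every direct child that is a row.
function rowPositions(children) {
  return children
    .map((el, index) => (el.classList.contains('ExResult-row') ? index + 1 : null))
    .filter((position) => position !== null);
}

// Intended use inside the async function (not run here):
// const positions = await page.$$eval('.ExCategory-results > *', rowPositions);
// for (const i of positions) {
//   const name = await page.$eval(
//     `.ExCategory-results > .ExResult-row:nth-child(${i}) > .ExResult-cell > .ExHeading > a`,
//     (el) => el.innerText
//   );
// }
```

Looping over the returned positions, rather than over 1..rowsCounts, should keep the selectors valid even when non-row elements such as the load indicator sit among the rows.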

