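javascript - How to get all innerHTML when only $eval gives a result ($$eval returns undefined)
Question
I have a table and I am trying to extract three pieces of information from each row. Once that is done, the script should scroll to the bottom of the page, click "Load more", scrape the new data, and repeat until there is no "Load more" button left.
To extract all the data from the table I used $$eval, but it returns undefined. If I use $eval instead I do get data, but only from the first row of the table. Why does $$eval return undefined, and if I cannot use it, how can I iterate over the table with $eval to get all the values?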
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: false }); // default is true
const page = await browser.newPage();
await page.goto('someexamplesite.com', {
waitUntil: 'domcontentloaded',
});
const ExerciseName = await page.$$eval(
'.ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExHeading > a',
(e) => e.innerText
);
const muscleTargeted = await page.$$eval(
' .ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExResult-muscleTargeted > a',
(e) => e.innerText
);
const equipmentType = await page.$$eval(
'.ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExResult-equipmentType > a',
(e) => e.innerText
);
//click on load more
await page.waitForSelector(
'#js-ex-content > #js-ex-category-body > .ExCategory-results > .ExLoadMore > .bb-flat-btn'
);
console.log({ ExerciseName, muscleTargeted, equipmentType });
await browser.close();
})().catch((e) => {
console.error(e);
});
The HTML I am trying to scrape
<div class="ExCategory-results">
<div class="ExCategory-resultsLoadIndicator" id="js-ex-finder-load-indicator">
<div class="ExCategory-resultsLoadIndicatorBox">
<div class="ExCategory-resultsLoadIndicatorSpinner bb-spinner-btn__spinner"></div>
</div>
</div>
<div class="ExResult-row flexo-container flexo-between" itemscope="" itemtype="http://schema.org/ExerciseAction">
<div class="ExResult-cell ">
<!-- using male photos -->
<img class="ExImg ExResult-img ls-is-cached lazyloaded" width="70" height="70" onerror="if (window._E_) _E_(this)" alt=" thumbnail image" src="https://www.websites.com/exercises/exerciseImages/sequences/742/Male/m/742_1.jpg" data-src="https://www.websites.com/exercises/exerciseImages/sequences/742/Male/m/742_1.jpg" itemprop="image">
</div>
<div class="ExResult-cell ExResult-cell--nameEtc">
<h3 class="ExHeading ExResult-resultsHeading">
<a href="/exercises/rickshaw-carry" itemprop="name">
Rickshaw Carry
</a>
</h3>
<div class="ExResult-details ExResult-muscleTargeted">
Muscle Targeted:
<a href="/exercises/muscle/forearms">
Forearms
</a>
</div>
<div class="ExResult-details ExResult-equipmentType">
Equipment Type:
<a href="/exercises/equipment/other">
Other
</a>
</div>
</div>
<div class="ExResult-cell ExResult-cell--rating">
<div class="ExRating">
<div class="ExRating-badge">
9.6
</div>
<div class="ExRating-description ExRating-description--Average">
Average
</div>
</div>
</div>
</div>
<div class="ExResult-row flexo-container flexo-between" itemscope="" itemtype="http://schema.org/ExerciseAction">
<div class="ExResult-cell ">
<!-- using male photos -->
<img class="ExImg ExResult-img ls-is-cached lazyloaded" width="70" height="70" onerror="if (window._E_) _E_(this)" alt=" thumbnail image" src="https://www.websites.com/images/2020/xdb/cropped/xdb-50m-single-leg-leg-press-m1-square-600x600.jpg" data-src="https://www.websites.com/images/2020/xdb/cropped/xdb-50m-single-leg-leg-press-m1-square-600x600.jpg" itemprop="image">
</div>
<div class="ExResult-cell ExResult-cell--nameEtc">
<h3 class="ExHeading ExResult-resultsHeading">
<a href="/exercises/single-leg-press" itemprop="name">
Single-Leg Press
</a>
</h3>
<div class="ExResult-details ExResult-muscleTargeted">
Muscle Targeted:
<a href="/exercises/muscle/quadriceps">
Quadriceps
</a>
</div>
<div class="ExResult-details ExResult-equipmentType">
Equipment Type:
<a href="/exercises/equipment/machine">
Machine
</a>
</div>
</div>
<div class="ExResult-cell ExResult-cell--rating">
<div class="ExRating">
<div class="ExRating-badge">
9.6
</div>
<div class="ExRating-description ExRating-description--Average">
Average
</div>
</div>
</div>
</div>
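Solution
The page.$$eval method runs Array.from(document.querySelectorAll(selector)) under the hood, so what your callback receives is an array. You cannot apply (e) => e.innerText directly to an array (even if its length is 1): an array has no innerText property, so you get undefined. You have to either iterate over it or pick the right element by index (e.g. e[0].innerText).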
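The difference can be reproduced without a browser. The following is a minimal sketch in which plain objects stand in for DOM elements (the `matched` array is made up for illustration; it only mimics what the `$$eval` callback would receive):

```javascript
// Plain objects stand in for DOM elements; only innerText matters here.
const matched = [{ innerText: 'Rickshaw Carry' }];

// What the question's callback does: read innerText off the array itself.
const wrong = ((e) => e.innerText)(matched);

// Two working alternatives: index into the array, or map over it.
const first = ((e) => e[0].innerText)(matched);
const all = ((els) => els.map((el) => el.innerText))(matched);

console.log(wrong); // undefined
console.log(first); // Rickshaw Carry
console.log(all);   // [ 'Rickshaw Carry' ]
```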
You can use Array.map to iterate over the matched elements and collect the innerText of each element into an array.
const exerciseName = await page.$$eval(
'.ExCategory-results > .ExResult-row:nth-child(2) > .ExResult-cell > .ExHeading > a',
elements => elements.map(el => el.innerText)
)
Output:
[ 'Rickshaw Carry' ]
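The same map idea can be widened from one link to whole rows. Below is a minimal sketch, assuming every .ExResult-row has the markup posted in the question; `extractRows` and `scrapeRows` are hypothetical helper names, and `scrapeRows` expects a Puppeteer `page` that has already navigated to the listing. Selecting the rows without :nth-child(2) and querying inside each row returns all three fields in a single call:

```javascript
// Runs inside the browser via page.$$eval, so it must be self-contained
// (Puppeteer serializes the function and cannot see outer Node variables).
function extractRows(rows) {
  // Trimmed text of the first match inside a row, or null if nothing matches.
  const text = (row, selector) => {
    const el = row.querySelector(selector);
    return el ? el.innerText.trim() : null;
  };
  return rows.map((row) => ({
    exerciseName: text(row, '.ExHeading > a'),
    muscleTargeted: text(row, '.ExResult-muscleTargeted > a'),
    equipmentType: text(row, '.ExResult-equipmentType > a'),
  }));
}

// Hypothetical wrapper: one round trip to the browser for the whole table.
async function scrapeRows(page) {
  return page.$$eval('.ExCategory-results > .ExResult-row', extractRows);
}
```

Calling `await scrapeRows(page)` should then resolve to an array with one object per row. The trim() is there because the link text in the posted HTML is surrounded by whitespace and line breaks.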
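Edit:
You can iterate over the rows with an indexed loop (a regular for loop is the easiest to use) by (1) counting the elements that share the same class name: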
const rowsCounts = await page.$$eval('.ExCategory-results > .ExResult-row', rows => rows.length)
Then (2) iterate over the children .ExResult-row:nth-child(n) ... and collect their innerText values into an array (exerciseNames):
const exerciseNames = []
for (let i = 1; i < rowsCounts + 1; i++) { // you might need i = 2
const exerciseName = await page.$eval(
`.ExCategory-results > .ExResult-row:nth-child(${i}) > .ExResult-cell > .ExHeading > a`,
el => el.innerText)
exerciseNames.push(exerciseName)
}
Output:
[
'Rickshaw Carry',
'Single-Leg Press',
'Landmine twist',
'Weighted pull-up',
'T-Bar Row with Handle',
'Palms-down wrist curl over bench'
]
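Note: in this case the loop should start from 1, not 0, because there is no "nth-child(0)". In your example the first child is not a row either (in the posted HTML, the first child of .ExCategory-results is the load indicator div), so you may need to start at 2. If the rows really do begin at the second child, the upper bound would have to shift by one as well (i < rowsCounts + 2) so that the last row is still included. Keep in mind that page.$eval throws when its selector matches nothing.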
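The question also mentions clicking "Load more" until the button is gone, which the loop above does not cover. A rough sketch of that part follows, assuming the button is removed from the DOM once everything is loaded and that each click appends new .ExResult-row elements. `loadAllRows`, `maxClicks` and the 10 second timeout are hypothetical choices for illustration; the button selector is the one from the question.

```javascript
// Hypothetical helper: click "Load more" until the button disappears.
// `page` is a Puppeteer Page; maxClicks is a safety cap, not part of the site.
async function loadAllRows(page, maxClicks = 50) {
  const rowSelector = '.ExCategory-results > .ExResult-row';
  const buttonSelector =
    '#js-ex-content > #js-ex-category-body > .ExCategory-results > .ExLoadMore > .bb-flat-btn';

  for (let clicks = 0; clicks < maxClicks; clicks++) {
    const button = await page.$(buttonSelector); // null when no button is left
    if (!button) return clicks;

    const before = await page.$$eval(rowSelector, (rows) => rows.length);
    await button.click();

    try {
      // Wait until the click has added at least one new row.
      await page.waitForFunction(
        (selector, count) => document.querySelectorAll(selector).length > count,
        { timeout: 10000 },
        rowSelector,
        before
      );
    } catch (err) {
      return clicks + 1; // nothing new arrived in time; assume the list is complete
    }
  }
  return maxClicks;
}
```

Run `await loadAllRows(page)` before collecting the rows, so that a single pass over the table sees everything. If the site hides the button instead of removing it, the click would fail and the check would need to test visibility instead.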