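javascript - How do I get an element's selector from a web page that contains multiple HTML documents?
Problem description
I'm trying to scrape information from a web page with Puppeteer, but I can't find the selectors I need. I think that's because the page contains multiple HTML documents, and I can't work out how to reach the data I need.
Here is the code: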
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www.arrivia.com/careers/job-openings/');
  await page.waitForSelector('.job-search-result');

  // Collect the link of every job listed on the page
  const data = await page.evaluate(() => {
    const elements = document.querySelectorAll('.job-search-result .job-btn-container a');
    const vacancies = [];
    for (const element of elements) {
      vacancies.push(element.href);
    }
    return vacancies;
  });
  console.log(data.length);

  const vacancies = [];
  for (let i = 0; i <= 2; i++) {
    const urljob = data[i];
    await page.goto(urljob);
    // This is one of the selectors that I can't find; from here on I get an error
    await page.waitForSelector('.app-title');
    const job = await page.evaluate((urljob) => {
      const job = {};
      job.title = document.querySelector('.app-title').innerText;
      job.location = document.querySelector('.location').innerText;
      job.url = urljob;
      return job;
    }, urljob); // pass urljob into the browser context
    vacancies.push(job);
  }
  console.log(vacancies);
  // await page.screenshot({ path: 'xx1.jpg' });
  await browser.close();
})();
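Solution
Iframes are not always the easiest thing to deal with in Puppeteer. One way to work around the problem is to visit the iframe's URL directly instead of the page that hosts the iframe. It is also faster: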
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
  const page = await browser.newPage();
  await page.goto("https://www.arrivia.com/careers/job-openings/", {
    waitUntil: "domcontentloaded",
  });

  const jobUrls = await page.$$eval(".job-search-result .job-btn-container a",
    els => els.map(el => el.href));

  const vacancies = [];
  // Capped at 10 jobs for testing; use jobUrls.length to scrape every job
  const limit = Math.min(10, jobUrls.length);
  for (let i = 0; i < limit; i++) {
    const url = jobUrls[i];
    const jobId = /job_id=(\d+)/.exec(url)[1]; // Extract the ID from the link
    await page.goto(
      `https://boards.greenhouse.io/embed/job_app?token=${jobId}`, // Go to iframe URL
      { waitUntil: "domcontentloaded" }
    );
    vacancies.push({
      title: await page.$eval(".app-title", el => el.innerText),
      location: await page.$eval(".location", el => el.innerText),
      url,
    });
  }

  console.log(vacancies);
  await browser.close();
})();
Output:
[
  {
    title: 'Director of Account Management',
    location: 'Scottsdale, AZ',
    url: 'https://www.arrivia.com/careers/job/?job_id=2529695'
  },
  {
    title: "Site Admin and Director's Assistant",
    location: 'Albufeira, Portugal',
    url: 'https://www.arrivia.com/careers/job/?job_id=2540303'
  },
  ...
]
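The mapping from a job link to its iframe URL can be pulled out into a small function, which makes it easy to check without launching a browser. The sketch below is only an illustration: `toEmbedUrl` is a hypothetical helper name, and it assumes every job link carries its ID in a `job_id` query parameter, as the URLs in the output do. Unlike the inline `exec(url)[1]`, it returns `null` instead of throwing when a link has no ID.

```javascript
// Hypothetical helper: turn a job link into the Greenhouse embed URL.
// Assumption: the ID sits in a `job_id=<digits>` query parameter.
function toEmbedUrl(jobUrl) {
  const match = /job_id=(\d+)/.exec(jobUrl);
  if (!match) return null; // link without a job ID, e.g. the listing page itself
  return `https://boards.greenhouse.io/embed/job_app?token=${match[1]}`;
}

// Example links: the first has a job ID, the second does not
const links = [
  'https://www.arrivia.com/careers/job/?job_id=2529695',
  'https://www.arrivia.com/careers/job-openings/',
];
for (const link of links) {
  console.log(link, '->', toEmbedUrl(link));
}
```

Inside the scraping loop, a `null` result could be used to skip a link rather than abort the whole run.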