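javascript - Using Puppeteer, how would you scrape titles and images from a website and put them in the same object, so that each image is associated with its title?
Problem description
I can get the image srcs and the titles in separate variables with this code: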
const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";
  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();
  // The options object must be passed as the second argument to goto()
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });
  let data = await page.evaluate(() => {
    // gives us an array of all gallery image srcs on the page
    var image = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((image) => image.src);
    // gives us an array of all h3 titles on the page
    var title = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    // drop empty headings and the comment-form heading
    let forDeletion = ["", "Leave a Comment:"];
    title = title.filter((item) => !forDeletion.includes(item));
    return {
      image,
      title,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");
  await browser.close();
})();
This produces a result like this:
{
  data: {
    image: [Array of image srcs],
    title: [Array of title text]
  }
}
But I need them as an array of objects, each holding a title and its corresponding image src, like this:
{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    // ...and so on
  ]
}
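The problem I'm having is that the website doesn't wrap each image and title in its own div. They are all in one container div: the titles are in h3 tags with no class names, and the imgs are in p tags and sometimes in h3 tags as well. The website I'm trying to scrape:
https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures
I'm trying to scrape the Funko Pop Yu-Gi-Oh! Figures Gallery section, which lists the name of each Funko Pop with its image below it.
Any pointers on this?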
Solution
Once you have the individual arrays in the data object, you can build the array you want like this:
const data = {
  image: ["image1 src", "image2 src", "image3 src", "image4 src"],
  title: ["title1", "title2", "title3", "title4"]
};
// Pair the entries by index; this assumes both arrays have the same length and order
const data_new = [];
for (let i = 0; i < data.image.length; i++) {
  data_new.push({ 'image': data.image[i], 'title': data.title[i] });
}
This should give you:
data_new = [
{
"image": "image1 src",
"title": "title1"
},
{
"image": "image2 src",
"title": "title2"
},
{
"image": "image3 src",
"title": "title3"
},
{
"image": "image4 src",
"title": "title4"
}
]
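Pairing by index only works if the two arrays line up one to one. If the page has extra headings or images, the pairs can drift out of step. As a rough sketch of an alternative, the elements could be walked in document order, attaching to each title the first image that follows it. The helper name `pairTitlesWithImages` and the `{ tag, text, src }` record shape below are hypothetical and used only for illustration.

```javascript
// Hypothetical helper: pairs each title with the first image that follows it.
// `nodes` is a flat list of simplified element records in document order,
// e.g. { tag: "H3", text: "..." } or { tag: "IMG", src: "..." }.
function pairTitlesWithImages(nodes, ignoredTitles = ["", "Leave a Comment:"]) {
  const items = [];
  let currentTitle = null;
  for (const node of nodes) {
    if (node.tag === "H3") {
      const text = (node.text || "").trim();
      // Headings on the ignore list reset the state, so images after them are skipped
      currentTitle = ignoredTitles.includes(text) ? null : text;
    } else if (node.tag === "IMG" && currentTitle !== null) {
      items.push({ title: currentTitle, image: node.src });
      currentTitle = null; // keep only the first image after each title
    }
  }
  return items;
}

// Made-up sample data standing in for what the page might contain
const sample = [
  { tag: "IMG", src: "logo.png" },
  { tag: "H3", text: "title1" },
  { tag: "IMG", src: "image1 src" },
  { tag: "H3", text: "title2" },
  { tag: "IMG", src: "image2 src" },
  { tag: "IMG", src: "image2 alt src" },
  { tag: "H3", text: "Leave a Comment:" },
  { tag: "IMG", src: "avatar.png" },
];

console.log(pairTitlesWithImages(sample));

// In Puppeteer the flat list could be collected roughly like this
// (the "h3, img" selector is an assumption and may need narrowing to the gallery container):
// const nodes = await page.evaluate(() =>
//   Array.from(document.querySelectorAll("h3, img")).map((el) => ({
//     tag: el.tagName,
//     text: el.innerText,
//     src: el.src,
//   }))
// );
```

Because `querySelectorAll` returns elements in document order, an h3 is visited before any img nested inside it, so this approach should cover both the "img in a p tag" and "img in an h3 tag" layouts described in the question. It still assumes that no unrelated image sits between a title and its own image.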