Using Puppeteer, how would you scrape titles and images from a website and put them in the same object, so that each image is associated with its title?

Question

I can get the image srcs and the titles into separate variables with this code:

const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";

  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();

  // The options object must be the second argument to goto()
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });

  let data = await page.evaluate(() => {
    // Collect the src of every image in the gallery
    var image = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((image) => image.src);

    // Gives us an array of all h3 titles on the page
    var title = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    // Drop empty headings and the comment-form heading
    let forDeletion = ["", "Leave a Comment:"];
    title = title.filter((item) => !forDeletion.includes(item));

    return {
      image,
      title,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");

  await browser.close();
})();

which produces a result like this:

{
  data: {
    image: [Array of image srcs],
    title: [Array of title text]
  }
}

But I need them as an array of objects, each holding a title and its corresponding image src, like this:

{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    // ...and so on
  ]
}


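The problem I'm running into is that the site doesn't wrap each image and title in its own div; they all sit in one container div. The titles are h3 tags with no class name, and the imgs are inside p tags, or sometimes inside h3 tags. The site I'm trying to scrape: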

https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures

I'm trying to scrape the Funko Pop Yu-Gi-Oh! Figures Gallery section, which lists the name of each Funko Pop with its image below it.

Any pointers on this?

Tags: javascript, web-scraping, puppeteer

Answer


Once you have the individual arrays in the data object, you can build the array you want like this:

const data = {
    image: ["image1 src", "image2 src", "image3 src", "image4 src"],
    title: ["title1", "title2", "title3", "title4"]
};

// Pair each image with the title at the same index
const data_new = [];
for (let i = 0; i < data.image.length; i++) {
  data_new.push({ image: data.image[i], title: data.title[i] });
}

This should give you:

data_new = [
    {
        "image": "image1 src",
        "title": "title1"
    },
    {
        "image": "image2 src",
        "title": "title2"
    },
    {
        "image": "image3 src",
        "title": "title3"
    },
    {
        "image": "image4 src",
        "title": "title4"
    }
]
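
Pairing by index only works if the two arrays line up one-to-one, which is worth checking given that the page mixes h3 and p tags. As a minimal sketch of the same idea using `map`, with a guard for mismatched lengths (the helper name `zipTitlesAndImages` is hypothetical and not part of the original answer):

```javascript
// Hypothetical helper: pairs titles and images that share the same index.
// Assumes `data` has the { image: [...], title: [...] } shape from the question.
function zipTitlesAndImages(data) {
  const { title, image } = data;

  // Index pairing is only meaningful when both arrays have the same length.
  if (title.length !== image.length) {
    throw new Error(
      `Length mismatch: ${title.length} titles vs ${image.length} images`
    );
  }

  // Build one { title, image } object per index.
  return title.map((text, i) => ({ title: text, image: image[i] }));
}

// Sample data standing in for the scraped arrays.
const sample = {
  image: ["image1 src", "image2 src"],
  title: ["title1", "title2"],
};

console.log(zipTitlesAndImages(sample));
```

If the guard throws on real data, the title filter or the image selector is probably matching elements outside the gallery, and the selectors would need narrowing before the pairs can be trusted.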
