Scraper (puppeteer) does not map over my array - JavaScript / React

Problem description

I wrote a web scraper with puppeteer. It scrapes job listings from a job portal. I can scrape the title, position, and image.

The array my scraper creates looks like this:

[{
    "id": "2018-12-03T14:12:03Z",
    "position": "Frontend Entwickler React (w/m)",
    "company": "Muster AG",
    "image": "https://www.stepstone.de/upload_de/logo/blabla.gif",
    "date": "2018-12-03T14:12:03Z",
    "href": "https://www.stepstone.de/stellenangebote--Frontend-Entwickler"
  }] 

Here is the code of my scraper.js:

const fs = require('fs')
const path = require('path')
const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(
    'https://www.stepstone.de/5/ergebnisliste.html?stf=freeText&ns=1&qs=%5B%7B%22id%22%3A%22231794%22%2C%22description%22%3A%22Frontend-Entwickler%2Fin%22%2C%22type%22%3A%22jd%22%7D%2C%7B%22id%22%3A%22300000115%22%2C%22description%22%3A%22Deutschland%22%2C%22type%22%3A%22geocity%22%7D%5D&companyID=0&cityID=300000115&sourceOfTheSearchField=homepagemex%3Ageneral&searchOrigin=Homepage_top-search&ke=Frontend-Entwickler%2Fin&ws=Deutschland&ra=30'
  )

  const stepstone = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.job-element'), card => {
      const id = card.querySelector('time').getAttribute('datetime')
      const href = card
        .querySelector('.job-element__body > a')
        .getAttribute('href')
      const position = card
        .querySelector('.job-element__body__title')
        .textContent.trim()
        .replace(/^(.{45}[^\s]*).*/, '$1')
      const company = card
        .querySelector('.job-element__body__company')
        .textContent.trim()
        .replace(/^(.{20}[^\s]*).*/, '$1')
      const image_element = card.querySelector('.job-element__logo img')
      const image = image_element.dataset.src
        ? `https://www.stepstone.de${image_element.dataset.src}`
        : image_element.src
      const date = card.querySelector('time').getAttribute('datetime')

      return {
        id,
        position,
        company,
        image,
        date,
        href
      }
    })
  })

  fs.writeFile(
    path.join(__dirname, 'src/stepstone.json'),
    JSON.stringify(stepstone),
    err => {
      if (err) {
        console.error(err)
      } else {
        console.log('Great, it worked!')
      }
    }
  )

  await browser.close()
})()

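My approach: after scraping the title, position, and so on, I also want to include the job details. So I tell my scraper to go to the href link of every job item in the array that stores this information.

From that link it should scrape the element with the job-details class, just like above. So I tried to map over the array above and tell the scraper to scrape that item from each href link, like this: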

stepstone.map(async stone => {
        const page = await browser.newPage()
        await page.goto(stone.href)
        const details = await page.evaluate(() => {
          return document.querySelector('card__body')
        })
        return {
          ...stone,
          details
        }
      })

My problem: however, the JSON file is not updated with a "details" key (which should hold the information from 'card__body').

Any suggestions? Thanks!

Tags: javascript, node.js, web-scraping, puppeteer

Solution
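No answer is recorded here, so what follows is only a sketch based on the code in the question. Three things in the map snippet look likely to keep "details" out of the JSON file. First, map with an async callback returns a new array of promises; it does not change stepstone, and the snippet neither awaits nor stores that result. Second, 'card__body' has no leading dot, so it is read as a tag name rather than a class. Third, page.evaluate can only hand back serializable values, so returning the DOM element itself does not yield usable data. The helper below, addDetails, is a hypothetical name, and the assumption that the details sit in an element with class card__body is taken from the question rather than checked against the site.

```javascript
// Sketch only: `addDetails` is a hypothetical helper, not part of puppeteer.
// `browser` is the object returned by puppeteer.launch(); `jobs` is the
// `stepstone` array built by the first page.evaluate() call.
async function addDetails(browser, jobs) {
  const results = []
  // Visit the detail pages one at a time; for...of lets each await finish
  // before the next page is opened.
  for (const job of jobs) {
    const page = await browser.newPage()
    try {
      await page.goto(job.href)
      // Assumption: the details live in an element with class "card__body".
      // Note the leading dot, and that text is returned instead of the DOM
      // node, because page.evaluate() only passes back serializable values.
      const details = await page.evaluate(() => {
        const el = document.querySelector('.card__body')
        return el ? el.textContent.trim() : null
      })
      results.push({ ...job, details })
    } finally {
      // Close each tab so open pages do not pile up.
      await page.close()
    }
  }
  return results
}
```

In scraper.js this could be called as const withDetails = await addDetails(browser, stepstone) after the first page.evaluate, with JSON.stringify(withDetails) written to the file in place of stepstone and browser.close() called afterwards. If the pages should load in parallel instead, await Promise.all(stepstone.map(...)) is the usual alternative, at the cost of opening one tab per job at the same time.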

