Cheerio selector cannot retrieve child elements

Problem description

I believe this is a bug. I am trying to write a simple web scraper using request and cheerio.

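How I have tried to solve it:

  1. Yes, I tried other ways of writing the selector.
  2. Yes, I looked through other Stack Overflow questions.
  3. Yes, I opened an issue on the cheerio GitHub repository: https://github.com/cheeriojs/cheerio/issues/1252
  4. Yes, I am a professional web developer, and this is not my first time working with Node.js.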

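Update: as someone pointed out, the problem is that the DOM nodes I need are created only after the page has loaded, so they are missing from the HTML that cheerio parses and traverses. The part of the page I am asking for simply does not exist at that point. Any ideas on how to get around this?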

The versions I am using:

{
  "name": "discont",
  "version": "1.0.0",
  "description": "Find when the item is on sale",
  "main": "index.js",
  "license": "MIT",
  "devDependencies": {
    "express": "^4.16.4"
  },
  "dependencies": {
    "cheerio": "^1.0.0-rc.2",
    "ejs": "^2.6.1",
    "request": "^2.88.0"
  }
}

This is the HTML I am trying to scrape (shown as a screenshot in the original post):

The link is here: https://www.asos.com/new-look-wide-fit/new-look-wide-fit-court-shoe/prd/10675413?clr=oatmeal&SearchQuery=&cid=6461&gridcolumn=1&gridrow=9&gridsize=4&pge=1&pgesize=72&totalstyles=826

This is my code:

request(url, options, function(error, response, html) {
  if (!error) {
    var $ = cheerio.load(html, { withDomLvl1: false });
    // console.log("product-price", $("div.product-price")[0].attribs);
    console.log("product-price", $("div#product-price > div"));
  }
});

console.log prints an empty selection (length: 0); the nested div is not found.

This is the output I get:

initialize {
  options: 
   { withDomLvl1: false,
     normalizeWhitespace: false,
     xml: false,
     decodeEntities: true },
  _root: 
   initialize {
     '0': 
      { type: 'root',
        name: 'root',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: {},
        'x-attribsNamespace': {},
        'x-attribsPrefix': {},
        children: [Array],
        parent: null,
        prev: null,
        next: null },
     options: 
      { withDomLvl1: false,
        normalizeWhitespace: false,
        xml: false,
        decodeEntities: true },
     length: 1,
     _root: [Circular] },
  length: 0,
  prevObject: 
   initialize {
     '0': 
      { type: 'root',
        name: 'root',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: {},
        'x-attribsNamespace': {},
        'x-attribsPrefix': {},
        children: [Array],
        parent: null,
        prev: null,
        next: null },
     options: 
      { withDomLvl1: false,
        normalizeWhitespace: false,
        xml: false,
        decodeEntities: true },
     length: 1,
     _root: [Circular] } }

But if I change the code to

request(url, options, function(error, response, html) {
  if (!error) {
    var $ = cheerio.load(html, { withDomLvl1: false });
    // console.log("product-price", $("div.product-price")[0].attribs);
    console.log("product-price", $("div#product-price"));
  }
});

I get a selection containing a single element:

initialize {
  '0': 
   { type: 'tag',
     name: 'div',
     namespace: 'http://www.w3.org/1999/xhtml',
     attribs: 
      { class: 'product-price',
        id: 'product-price',
        'data-bind': 'component: { name: "product-price", params: {state: state, showGermanVatMessage: false }}' },
     'x-attribsNamespace': { class: undefined, id: undefined, 'data-bind': undefined },
     'x-attribsPrefix': { class: undefined, id: undefined, 'data-bind': undefined },
     children: [],
     parent: 
      { type: 'tag',
        name: 'div',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: [Object],
        'x-attribsNamespace': [Object],
        'x-attribsPrefix': [Object],
        children: [Array],
        parent: [Object],
        prev: [Object],
        next: [Object] },
     prev: 
      { type: 'text',
        data: '\n    ',
        parent: [Object],
        prev: [Object],
        next: [Circular] },
     next: 
      { type: 'text',
        data: '\n    ',
        parent: [Object],
        prev: [Circular],
        next: [Object] } },
  options: 
   { withDomLvl1: false,
     normalizeWhitespace: false,
     xml: false,
     decodeEntities: true },
  _root: 
   initialize {
     '0': 
      { type: 'root',
        name: 'root',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: {},
        'x-attribsNamespace': {},
        'x-attribsPrefix': {},
        children: [Array],
        parent: null,
        prev: null,
        next: null },
     options: 
      { withDomLvl1: false,
        normalizeWhitespace: false,
        xml: false,
        decodeEntities: true },
     length: 1,
     _root: [Circular] },
  length: 1,
  prevObject: 
   initialize {
     '0': 
      { type: 'root',
        name: 'root',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: {},
        'x-attribsNamespace': {},
        'x-attribsPrefix': {},
        children: [Array],
        parent: null,
        prev: null,
        next: null },
     options: 
      { withDomLvl1: false,
        normalizeWhitespace: false,
        xml: false,
        decodeEntities: true },
     length: 1,
     _root: [Circular] } }

However, I cannot see the element's children (the children array is empty), and I cannot call any methods on the object, such as find() or text().

Any help is welcome!

Tags: node.js, web-scraping, cheerio

Solution


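Cheerio only sees the DOM as it exists in the HTML response, before anything dynamic such as XHR happens. It does not run the page's JavaScript, so you will need puppeteer or nightmarejs to get the DOM after JS rendering.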

