Breaking out of `request` calls in a `forEach` loop for efficiency

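Question

I'm building a simple web scraper to automate a newsletter, which means I only need to go through a certain number of pages. In this example it's no big deal, because the script only crawls 3 extra pages. In a different case, though, this would be very inefficient.

So my question is: is there a way to stop request() from executing inside this forEach loop?

Or do I need to change my approach and crawl the pages one by one, as described in this guide?

Script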

'use strict';
var request = require('request');
var cheerio = require('cheerio');
var BASEURL = 'https://jobsite.procore.com';

scrape(BASEURL, getMeta);

function scrape(url, callback) {
  var pages = [];
  request(url, function(error, response, body) {
    if(!error && response.statusCode == 200) {

      var $ = cheerio.load(body);

      $('.left-sidebar .article-title').each(function(index) {
        var link = $(this).find('a').attr('href');
        pages[index] = BASEURL + link;
      });
      callback(pages, log);
    }
  });
}

function getMeta(pages, callback) {
  var meta = [];
  // using forEach's index does not work, it will loop through the array before the first request can execute
  var i = 0;
  // using a for loop does not work here
  pages.forEach(function(url) {
    request(url, function(error, response, body) {
      if(error) {
        console.log('Error: ' + error);
      }

      var $ = cheerio.load(body);

      var desc = $('meta[name="description"]').attr('content');
      meta[i] = desc.trim();

      i++;

      // Limit
      if (i == 6) callback(meta);
      console.log(i);
    });
  });
}

function log(arr) {
  console.log(arr);
}

Output

$ node crawl.js 
1
2
3
4
5
6
[ 'Find out why fall protection (or lack thereof) lands on the Occupational Safety and Health Administration (OSHA) list of top violations year after year.',
  'noneChances are you won’t be seeing any scented candles on the jobsite anytime soon, but what if it came in a different form? The allure of smell has conjured up some interesting scent technology in recent years. Take for example the Cyrano, a brushed-aluminum cylinder that fits in a cup holder. It’s Bluetooth-enabled and emits up to 12 scents or smelltracks that can be controlled using a smartphone app. Among the smelltracks: “Thai Beach Vacation.”',
  'The premise behind the hazard communication standard is that employees have a right to know the toxic substances and chemical hazards they could encounter while working. They also need to know the protective things they can do to prevent adverse effects of working with those substances. Here are the steps to comply with the standard.',
  'The Weitz Company has been using Procore on its projects for just under two years. Within that time frame, the national general contractor partnered with Procore to implement one of the largest technological advancements in its 163-year history.  Click here to learn more about their story and their journey with Procore.',
  'MGM Resorts International is now targeting Aug. 24 as the new opening date for the $960 million hotel and casino complex it has been building in downtown Springfield, Massachusetts.',
  'So, what trends are taking center stage this year? Below are six of the most prominent. Some of them are new, and some of them are continuations of current trends, but they are all having a substantial impact on construction and the structures people live and work in.' ]
7
8
9

Tags: javascript, node.js, asynchronous

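Answer

Besides using slice to limit your selection, you could also refactor the code to reuse some functionality.

Sorry, I couldn't help myself after thinking about it for a second.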
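For the slice part on its own, here is a minimal sketch that stays in the callback style of the question. It assumes a hypothetical `fetchPage(url, cb)` function standing in for `request`, and a `limit` parameter that is not in the original script. Trimming the array before the loop means the extra requests are never started, and storing results by index removes the need for the shared counter `i`.

```javascript
// Sketch only: `fetchPage` and `limit` are illustrative names, not part of the
// original script. `fetchPage(url, cb)` is expected to call cb(error, body).
function getMeta(pages, limit, fetchPage, callback) {
  // Trim the list first so no request is ever started for the extra pages.
  var selected = pages.slice(0, limit);
  var meta = [];
  var pending = selected.length;

  // Nothing to fetch: report an empty result right away.
  if (pending === 0) return callback(meta);

  selected.forEach(function (url, index) {
    fetchPage(url, function (error, body) {
      // Store by index so results keep the order of `selected`,
      // regardless of which response arrives first.
      meta[index] = error ? null : body;
      pending -= 1;
      // Fire the callback once, after the last response has come back.
      if (pending === 0) callback(meta);
    });
  });
}
```

With this shape the final callback runs exactly once, and nothing is logged or fetched after the limit is reached.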

We can start with the refactor:

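const rp = require('request-promise-native');
const {load} = require('cheerio');

const BASEURL = 'https://jobsite.procore.com';

// Fetch a page, parse it with cheerio, and pass the loaded document to `transform`.
function scrape(uri, transform) {
  const options = {
    uri,
    // Hand only the body to cheerio; the extra transform arguments are not load options.
    transform: (body) => load(body)
  };

  return rp(options).then(transform);
}

scrape(
  BASEURL,
  // Collect the hrefs of the first 6 article links only.
  ($) => $('.left-sidebar .article-title a').toArray().slice(0,6).map((linkEl) => linkEl.attribs.href)
).then((links) => Promise.all(
  links.map(
    (link) => scrape(
      // Resolve against the base URL so a leading slash in the href is not doubled.
      new URL(link, BASEURL).href,
      ($) => $('meta[name="description"]').attr('content').trim()
    )
  )
)).then(console.log).catch(console.error);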

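While this does make the code DRYer and more concise, it also points to a part that may need improvement: the requests for the links.

Currently, it fires the requests for all (or at most 6) of the links on the original page almost simultaneously. This may or may not be what you want, depending on how many links will be requested in the other cases you mentioned.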
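One way to keep the number of simultaneous requests down is to work through the links in small batches. The sketch below is an assumption about how that could look, not part of the refactor above: `mapInBatches`, `batchSize`, and `worker` are hypothetical names, and `worker` is any function that returns a promise (for example a call to `scrape`).

```javascript
// Sketch only: runs `worker` over `items`, at most `batchSize` at a time.
async function mapInBatches(items, batchSize, worker) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }

  const results = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Items inside one batch run concurrently; the next batch waits for this one.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  // Results come back in the same order as `items`.
  return results;
}
```

A batch size of 1 gives strictly sequential requests, which matches the "one page at a time" approach from the question. A batch only finishes when its slowest request does, so this trades some speed for a predictable upper bound on open requests.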

Another potential concern is error management. As the refactor stands, if any one of the requests fails, the results of all of them are discarded.
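If partial results are acceptable, one option is `Promise.allSettled` (available in Node.js 12.9 and later), which waits for every promise and reports each outcome instead of rejecting on the first failure. The sketch below assumes a hypothetical `scrapeOne(link)` function that returns a promise; `scrapeAllSettled` is likewise an illustrative name.

```javascript
// Sketch only: collects every outcome instead of failing fast.
async function scrapeAllSettled(links, scrapeOne) {
  const outcomes = await Promise.allSettled(links.map((link) => scrapeOne(link)));

  const results = [];
  const failures = [];
  outcomes.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      results.push(outcome.value);
    } else {
      // Keep the link next to the error so failed pages can be retried or logged.
      failures.push({ link: links[index], reason: outcome.reason });
    }
  });

  return { results, failures };
}
```

The caller can then decide whether a non-empty `failures` list should abort the newsletter or simply be logged.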

Those are just a couple of points to consider if you like this approach. Both can be resolved in a variety of ways.
