Extending script execution time beyond 5 minutes for reddit scraping

Problem description

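I am trying to collect all posts submitted to a particular subreddit using the code found here: https://www.labnol.org/internet/web-scraping-reddit/28369/ but the execution limit is reached before the script completes.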

I am looking for a way to extend the script's run time. Ideally, once I click Run, it would need no intervention from me at all.

const getThumbnailLink_ = url => {
  if (!/^http/.test(url)) return '';
  return `=IMAGE("${url}")`;
};

const getHyperlink_ = (url, text) => {
  if (!/^http/.test(url)) return '';
  return `=HYPERLINK("${url}", "${text}")`;
};

const writeDataToSheets_ = data => {
  const values = data.map(r => [
    new Date(r.created_utc * 1000),
    r.title,
    getThumbnailLink_(r.thumbnail),
    getHyperlink_(r.url, 'Link'),
    getHyperlink_(r.full_link, 'Comments')
  ]);
  const sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(sheet.getLastRow() + 1, 1, values.length, values[0].length).setValues(values);
  SpreadsheetApp.flush();
};

const isRateLimited_ = () => {
  const response = UrlFetchApp.fetch('https://api.pushshift.io/meta');
  const { server_ratelimit_per_minute: limit } = JSON.parse(response);
  return limit < 1;
};

const getAPIEndpoint_ = (subreddit, before = '') => {
  const fields = ['title', 'created_utc', 'url', 'thumbnail', 'full_link'];
  const size = 10000;
  const base = 'https://api.pushshift.io/reddit/search/submission';
  const params = { subreddit, size, fields: fields.join(',') };
  if (before) params.before = before;
  const query = Object.keys(params)
    .map(key => `${key}=${params[key]}`)
    .join('&');
  return `${base}?${query}`;
};

const scrapeReddit = (subreddit = 'AskMen') => {
  let before = '';
  do {
    const apiUrl = getAPIEndpoint_(subreddit, before);
    const response = UrlFetchApp.fetch(apiUrl);
    const { data } = JSON.parse(response);
    const { length } = data;
    before = length > 0 ? String(data[length - 1].created_utc) : '';
    if (length > 0) {
      writeDataToSheets_(data);
    }
  } while (before !== '' && !isRateLimited_());
};

Tags: google-apps-script

Solution


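In general, the best practice is to optimize your script so that it does not reach the execution time defined by the quotas. In your case, one option is to reduce the batch size of each request. The code in the reference you linked fetches 1000 posts per batch, whereas your code fetches 10000.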

Try a smaller value and see whether the script's execution time stays within the quota.

const getAPIEndpoint_ = (subreddit, before = '') => {
  const fields = ['title', 'created_utc', 'url', 'thumbnail', 'full_link'];
  const size = 1000;
  const base = 'https://api.pushshift.io/reddit/search/submission';
  const params = { subreddit, size, fields: fields.join(',') };
  if (before) params.before = before;
  const query = Object.keys(params)
    .map(key => `${key}=${params[key]}`)
    .join('&');
  return `${base}?${query}`;
};
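
The loop in scrapeReddit keeps requesting pages until the API returns no data, so a smaller batch size may not by itself keep a long scrape under the limit. One possible complement, sketched below, is to stop after a time budget and hand back the `before` cursor so a later run can resume from it. The names scrapeWithBudget, fetchPage, handlePage, budgetMs and now are hypothetical and not part of the original code, and the 4-minute default is an assumed safety margin, not a documented value.

```javascript
// A minimal sketch, not the original author's method.
// fetchPage(before) is assumed to return an array of posts, newest first,
// each with a created_utc field, like the pushshift responses in the question.
// handlePage(data) is assumed to write one page out (e.g. writeDataToSheets_).
const scrapeWithBudget = ({
  fetchPage,
  handlePage,
  before = '',
  budgetMs = 4 * 60 * 1000, // assumed margin below the execution limit
  now = Date.now            // injectable clock, useful for testing
}) => {
  const start = now();
  let cursor = before;
  // Only start another request while there is time budget left.
  while (now() - start < budgetMs) {
    const data = fetchPage(cursor);
    // An empty page means there is nothing older left to fetch.
    if (data.length === 0) return { done: true, before: '' };
    handlePage(data);
    // Remember the timestamp of the last post as the next cursor.
    cursor = String(data[data.length - 1].created_utc);
  }
  // Budget used up: report where to resume on the next run.
  return { done: false, before: cursor };
};
```

In Apps Script, fetchPage could wrap UrlFetchApp.fetch(getAPIEndpoint_(subreddit, before)) followed by JSON.parse. When the result has done set to false, the returned cursor could be stored with PropertiesService.getScriptProperties().setProperty and a follow-up run scheduled with ScriptApp.newTrigger('resumeScrape').timeBased().after(60 * 1000).create(), where resumeScrape is a hypothetical function that reads the stored cursor and calls scrapeWithBudget again. That way the scrape can continue across several executions without manual intervention.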

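However, if your business needs go beyond the quota, you can upgrade to one of the Google Workspace Basic, Business or Enterprise plans, depending on how much additional quota you need and how much you are willing to pay.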

More information on the different account types and pricing is available from Google.

