Using Google Sheets as a Web Scraper for a Site with Java

Problem description

I have been trying to use IMPORTXML() to scrape information from the following website.

https://libbyapp.com/library/lapl/search/audiobooks/query-john%20scalzi/language-en/page-1

After no success with the command and some digging online, it seems that this command will not work because the website uses java to render the information.

Is there any other command or method within Google Sheets to scrape some information from this web page? Specifically, I'm trying to get the titles and authors of the first 3 books listed.

Tags: web-scraping, google-sheets, google-sheets-formula

Solution


No, not Java. JavaScript. They are very different things.

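Fortunately, the site has a JSON API! You can see it for yourself in your browser's network monitor. This means there is no pure-formula solution, but we can write a simple user script (Tools > Script editor; in newer versions of Sheets, Extensions > Apps Script).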

We create this custom function (paste it into the editor):

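/**
 * Fetches JSON search results and returns one row per book.
 *
 * @param {string} url API URL that returns the JSON search results.
 * @param {boolean} showHeaders Whether to prepend a header row.
 * @return {Array<Array<string>>} Rows of [title, authors].
 * @customfunction
 */
function getAuthors(url, showHeaders) {
  const response = UrlFetchApp.fetch(url);
  const object = JSON.parse(response.getContentText());

  // Optional header row
  const header = showHeaders ? [['Title', 'Author(s)']] : [];

  const rows = object.items.map(item => [
    // Title is always first, followed by any authors:
    item.title,
    item.creators
      // Add more creator roles here:
      .filter(creator => creator.role === 'Author')
      // Take creator's name
      .map(author => author.name)
      // Comma-separated
      .join(',')
  ]);

  return header.concat(rows);
}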
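
To make the response shape that the function relies on explicit, here is a minimal sketch of the same mapping as a pure function that can be tried outside Sheets. The `sampleResponse` object is hand-written for illustration (it is not real API output) and assumes only the fields the function above reads: `items`, `title`, `creators`, `role`, and `name`. The helper name `rowsFromResponse` is hypothetical.

```javascript
// Hand-written sample mimicking only the fields getAuthors reads.
// This is NOT real API output; all values are illustrative.
const sampleResponse = {
  items: [
    {
      title: 'Example Book One',
      creators: [
        { role: 'Author', name: 'Alice Example' },
        { role: 'Narrator', name: 'Bob Reader' },
      ],
    },
    {
      title: 'Example Book Two',
      creators: [
        { role: 'Author', name: 'Alice Example' },
        { role: 'Author', name: 'Carol Sample' },
      ],
    },
  ],
};

// Hypothetical helper: same mapping as getAuthors, minus the network call.
function rowsFromResponse(response, showHeaders) {
  const header = showHeaders ? [['Title', 'Author(s)']] : [];
  const rows = response.items.map(item => [
    item.title,
    item.creators
      .filter(creator => creator.role === 'Author') // drop narrators etc.
      .map(author => author.name)
      .join(','),
  ]);
  return header.concat(rows);
}

console.log(rowsFromResponse(sampleResponse, true));
```

Each inner array becomes one row in the sheet, which is why a custom function that returns a two-dimensional array fills a block of cells rather than a single one.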

Using the network monitor, I determined that the URL of interest is

https://thunder.api.overdrive.com/v2/libraries/lapl/media?overdriveFormats=true&mediaType=audiobook&query=john%20scalzi&language=en&page=1&perPage=24&x-client-id=dewey
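
If you would rather not hard-code that URL, one option is to assemble it from its parts. The sketch below is only an assumption-based illustration: the helper name `buildSearchUrl` is hypothetical, the parameter names are copied from the URL above, and their meanings are inferred from those names rather than from any API documentation.

```javascript
// Hypothetical helper: rebuilds the API URL observed in the network monitor.
// Parameter names are copied from that URL; their meanings are inferred.
function buildSearchUrl(library, query, page) {
  const params = [
    ['overdriveFormats', 'true'],
    ['mediaType', 'audiobook'],
    ['query', query],
    ['language', 'en'],
    ['page', String(page)],
    ['perPage', '24'],
    ['x-client-id', 'dewey'],
  ];
  // Percent-encode each key and value, then join them as key=value pairs.
  const queryString = params
    .map(([key, value]) => encodeURIComponent(key) + '=' + encodeURIComponent(value))
    .join('&');
  return 'https://thunder.api.overdrive.com/v2/libraries/' +
    encodeURIComponent(library) + '/media?' + queryString;
}

console.log(buildSearchUrl('lapl', 'john scalzi', 1));
```

With these arguments the function reproduces the URL shown above, so the search term or page number can be changed without editing the string by hand.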

Then, in your sheet, with that URL in cell A1, you can use the following formula:

=getAuthors(A1)

or

=getAuthors(A1, 1)

if you want a header row.

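If you only want the first three books, just truncate the table with ARRAY_CONSTRAIN, for example =ARRAY_CONSTRAIN(getAuthors(A1), 3, 2) for three rows and two columns.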
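
As an alternative to ARRAY_CONSTRAIN, the truncation could also be done inside the script. This is a minimal sketch, not part of the original answer: the helper name `firstBooks` is hypothetical, and the `table` values are placeholders rather than real results.

```javascript
// Hypothetical helper: keep the first `count` books,
// plus the header row when one is present.
function firstBooks(rows, count, hasHeaders) {
  const limit = hasHeaders ? count + 1 : count;
  return rows.slice(0, limit);
}

// Placeholder table in the same shape that getAuthors returns.
const table = [
  ['Title', 'Author(s)'],
  ['Book A', 'Author W'],
  ['Book B', 'Author X'],
  ['Book C', 'Author Y'],
  ['Book D', 'Author Z'],
];

console.log(firstBooks(table, 3, true));
```

The header row counts as a row, so when headers are switched on, ARRAY_CONSTRAIN needs four rows rather than three to show three books.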

