Extracting data from multiple links with rvest and purrr

Question

I have a list of links in a data frame, and I want to run a function over them to extract data from each link.

Libraries and data:

library(rvest)
library(tidyverse)

link_df <- tribble(~title, ~episode, ~link,
        "a", "1", "https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country",
        "b", "2", "https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight",
        "c", "3", "https://www.backlisted.fm/episodes/3-david-nobbs-1")

I tried to piece this together from this answer and this answer, but I'm missing a step somewhere:

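recs_extract <- function(df){

  # despite its name, df is used here as a vector of URLs
  pages <- df %>% map(read_html, url = link)

  data <- pages %>% 
    map_dfr(. %>% 
              # one row per list item on the page
              html_nodes(css = "ul li") %>% 
              html_text() %>% 
              tibble(title = .) %>% 
              # note: 12:n()-2 parses as (12:n()) - 2, i.e. rows 10 to n() - 2
              slice(12:n()-2) %>% 
              # split "author - titles" into two columns
              separate(col = title,
                       into = c("author", "titles"),
                       sep = "-" ) %>% 
              # split the comma-separated titles into up to 15 columns
              separate(titles, 
                       into = c(paste("books", 1:15)),
                       sep = ",", 
                       extra = "drop") %>% 
              mutate(across(where(is.character), str_trim)) %>% 
              janitor::remove_empty(which = "cols") %>% 
              # back to long format: one row per author/title pair
              pivot_longer(cols = contains("books"),
                           names_to = NULL, 
                           values_to = "Title", 
                           values_drop_na = TRUE)
            )

}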

This function works for a single link:

link_df$link[1] %>% map(recs_extract)

[[1]]
# A tibble: 15 x 2
   author                       Title                                   
   <chr>                        <chr>                                   
 1 J L Carr                     A Month in the Country                  
 2 J L Carr                     Harpole and Foxberrow General Publishers
 3 J L Carr                     The Battle of Pollocks Crossing         
 4 Vasily Grossman              Life and Fate                           
 5 Mr Bingo                     Hate Mail                               
 6 William S Burroughs          Naked Lunch                             
 7 Nancy Mitford                Love in a Cold Climate                  
 8 J Arthur Gibbs               A Cotswold Village                      
 9 Giuseppe Tomasi di Lampedusa The Leopard                             
10 W N P Barbellion             Journal of a Disappointed Man           
11 Lissa Evans                  Their Finest Hour and a Half            
12 Lissa Evans                  Crooked Heart                           
13 Byron Rogers                 The Last Englishman                     
14 Andy Miller                  Tilting at Windmills                    
15 William Golding              Darkness Visible                        

Do I need to put the links in a nested data frame first? How do I run the function over every link and store the results?

#doesn't work
link_df %>% 
  group_by(title) %>%
  nest() %>% 
  mutate(data = map(data, recs_extract, link))

Thanks, and sorry for the long post.

Tags: r, dplyr, tidyr, rvest

Answer


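There is no need to nest first: after nest(), each element of data is a one-row tibble rather than a URL string. You can instead use map directly over the link column, so that each URL is passed to recs_extract: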

library(dplyr)
library(purrr)

link_df %>% mutate(data = map(link, recs_extract))


# A tibble: 3 x 4
#  title episode link                                                                 data             
#  <chr> <chr>   <chr>                                                                <list>           
#1 a     1       https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country <tibble [15 × 2]>
#2 b     2       https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight <tibble [17 × 2]>
#3 c     3       https://www.backlisted.fm/episodes/3-david-nobbs-1                   <tibble [18 × 2]>
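
If one flat table is more convenient than a list-column, the nested tibbles can be expanded with tidyr::unnest(). The following is only a minimal sketch: fake_extract() and the example.com URLs are hypothetical stand-ins for recs_extract() and the real links, so that it runs without network access.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for recs_extract(): takes one URL, returns a tibble.
fake_extract <- function(url) {
  tibble(author = "someone", Title = basename(url))
}

# Hypothetical links, mirroring the shape of link_df
links <- tibble(
  title = c("a", "b"),
  link  = c("https://example.com/x", "https://example.com/y")
)

result <- links %>%
  mutate(data = map(link, fake_extract)) %>%  # one tibble per row, as a list-column
  unnest(data)                                # expand the list-column into regular columns

result
```

With the real data the same pattern would presumably be link_df %>% mutate(data = map(link, recs_extract)) %>% unnest(data), which gives one row per recommended book next to the episode it came from.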
