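r - Extracting data from multiple links using rvest and purrr
Question
I have a list of links in a data frame, and I want to run a function over it to extract data from each link.
Libraries and data: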
library(rvest)
library(tidyverse)
link_df <- tribble(~title, ~episode, ~link,
                   "a", "1", "https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country",
                   "b", "2", "https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight",
                   "c", "3", "https://www.backlisted.fm/episodes/3-david-nobbs-1")
recs_extract <- function(df){
  # read each URL into a parsed HTML document
  pages <- df %>% map(read_html)
  data <- pages %>%
    map_dfr(. %>%
      html_nodes(css = "ul li") %>%
      html_text() %>%
      tibble(title = .) %>%
      # note: `12:n()-2` parses as `(12:n()) - 2`, i.e. rows 10 to n() - 2
      slice(12:n()-2) %>%
      # split "author - titles" on the hyphen
      separate(col = title,
               into = c("author", "titles"),
               sep = "-" ) %>%
      # split the comma-separated titles into up to 15 columns
      separate(titles,
               into = c(paste("books", 1:15)),
               sep = ",",
               extra = "drop") %>%
      mutate(across(where(is.character), str_trim)) %>%
      janitor::remove_empty(which = "cols") %>%
      # one row per author/title pair
      pivot_longer(cols = contains("books"),
                   names_to = NULL,
                   values_to = "Title",
                   values_drop_na = TRUE)
    )
}
This function works for a single link:
link_df$link[1] %>% map(recs_extract)
[[1]]
# A tibble: 15 x 2
   author                       Title
   <chr>                        <chr>
 1 J L Carr                     A Month in the Country
 2 J L Carr                     Harpole and Foxberrow General Publishers
 3 J L Carr                     The Battle of Pollocks Crossing
 4 Vasily Grossman              Life and Fate
 5 Mr Bingo                     Hate Mail
 6 William S Burroughs          Naked Lunch
 7 Nancy Mitford                Love in a Cold Climate
 8 J Arthur Gibbs               A Cotswold Village
 9 Giuseppe Tomasi di Lampedusa The Leopard
10 W N P Barbellion             Journal of a Disappointed Man
11 Lissa Evans                  Their Finest Hour and a Half
12 Lissa Evans                  Crooked Heart
13 Byron Rogers                 The Last Englishman
14 Andy Miller                  Tilting at Windmills
15 William Golding              Darkness Visible
Should I first put the links in a nested data frame? How do I run the function over every link and store the results?
# doesn't work
link_df %>%
  group_by(title) %>%
  nest() %>%
  mutate(data = map(data, recs_extract, link))
Thanks, and sorry for the long post.
Answer
You can use map directly on the link column, like this:
library(dplyr)
library(purrr)
link_df %>% mutate(data = map(link, recs_extract))
# A tibble: 3 x 4
# title episode link data
# <chr> <chr> <chr> <list>
#1 a 1 https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country <tibble [15 × 2]>
#2 b 2 https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight <tibble [17 × 2]>
#3 c 3 https://www.backlisted.fm/episodes/3-david-nobbs-1 <tibble [18 × 2]>
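If one flat table is more convenient than a list column, the nested result can probably be expanded with tidyr's unnest(). The following is only a minimal sketch under stated assumptions: fake_extract() and demo_df are hypothetical stand-ins (not part of the original post) that mimic the shape of recs_extract() and link_df, so the example runs without network access.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical offline stand-in for recs_extract(): returns a small tibble
# with the same two columns (author, Title) instead of scraping a page.
fake_extract <- function(link) {
  tibble(author = c("Author A", "Author B"),
         Title  = paste("Book from", basename(link)))
}

# Hypothetical data frame with the same columns as link_df.
demo_df <- tribble(~title, ~episode, ~link,
                   "a", "1", "https://example.com/episodes/1",
                   "b", "2", "https://example.com/episodes/2")

# One call per link; each result is stored in the list column `data`.
nested <- demo_df %>% mutate(data = map(link, fake_extract))

# unnest() expands the list column into one row per extracted book,
# repeating title, episode and link alongside each row.
flat <- nested %>% unnest(cols = data)
```

With the real data, replacing fake_extract with recs_extract and demo_df with link_df should give one row per recommended book, with the episode information carried along.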