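r - Why can't I read clickable links when web scraping with rvest?
Question
I am trying to scrape this website.
The content I need becomes available after clicking on each title. For example, if I do the following (I am using SelectorGadget to find the selector), I get the content I want for a single speech: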
library("rvest")
url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I need the text behind every clickable link on the site. So I would normally do this:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
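I get an empty object. I have tried different variations of the code, with the same result.
How can I read these links and then apply the code from the first part to all of them?
Can anyone help me?
Thanks!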
Answer
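As @KonradRudolph pointed out earlier, the links are inserted into the page dynamically, so read_html() on the listing URL only sees the HTML before they are added. I therefore wrote code that uses RSelenium together with rvest to solve the problem: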
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get links names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
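The question also asks how to apply the first snippet to every link. A minimal sketch of one way to do that follows. It assumes the individual speech pages can be read with plain read_html(), as the first snippet in the question suggests, and that the selector "#output .page-section" works on all of them. The helper names extract_speech_text and scrape_speech, and the one-second pause, are hypothetical choices for illustration, not part of the original answer.

```r
library(rvest)

# Extract the speech body from an already parsed page.
# The default selector is the one used in the question.
extract_speech_text <- function(page, selector = "#output .page-section") {
  sections <- html_text(html_nodes(page, selector), trim = TRUE)
  paste(sections, collapse = "\n")
}

# Download one speech page and return its text, or NA if reading it fails.
# `pause` is an assumed politeness delay (in seconds) before each request.
scrape_speech <- function(link, pause = 1) {
  Sys.sleep(pause)
  tryCatch(
    extract_speech_text(read_html(link)),
    error = function(e) NA_character_
  )
}

# Usage, assuming `df` from the code above is still in the workspace:
# df$text <- vapply(as.character(df$links), scrape_speech,
#                   character(1), USE.NAMES = FALSE)
```

Keeping the parsing step separate from the download makes it possible to check the selector on a small piece of HTML before running it against the whole list of links. Pages that fail to load come back as NA rather than stopping the loop.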