Why can't I read clickable links when web scraping with rvest?

Question

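I am trying to scrape the Bank of England speeches website.

The content I need becomes available after clicking on each title. For example, if I do the following (I am using SelectorGadget to find the selector), I get the content I want: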


library("rvest")

url_boe = "https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"

sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))

However, I need the text behind every clickable link on the website, so I would normally do this:


url_boe = "https://www.bankofengland.co.uk/news/speeches"


html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")


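This returns an empty object. I have tried several variations of the code, with the same result.

How can I read these links and then apply the code from the first part to all of them?

Can anyone help me?

Thanks!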

Tags: r, web-scraping, rvest

Solution


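As @KonradRudolph pointed out earlier, the links are inserted into the page dynamically, so they are not present in the HTML that read_html() downloads. I therefore wrote some code using RSelenium and rvest to solve this problem: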

library(rvest)
library(RSelenium)

# URL
url = "https://www.bankofengland.co.uk/news/speeches"

# Base URL
base_url = "https://www.bankofengland.co.uk"

# Start a Selenium server and a Chrome browser
# (chromever must be a chromedriver version compatible with the installed Chrome)
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")

# Assign the client to an object
rem_dr <- rD[["client"]]

# Navigate to the URL
rem_dr$navigate(url)

# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])

# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)

# Get links names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()

# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]

# Create a data.frame with the results
df <- data.frame(links_names, links)

# Close the client and the server
rem_dr$close()
rD$server$stop()
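
As a side note, the empty result in the question can be reproduced offline. The sketch below parses a made-up, heavily simplified stand-in for the HTML the server sends (the markup and the `#SearchResults a` selector are assumptions for illustration, not the real page source). Because rvest does not run JavaScript, the container that the script would fill is still empty when it is parsed:

```r
library(rvest)

# Hypothetical, simplified stand-in for the static page source:
# the results container is empty until JavaScript fills it in the browser.
static_html <- '<html><body>
  <div id="SearchResults"></div>
  <script>/* JavaScript would insert the result links here */</script>
</body></html>'

static_page <- read_html(static_html)

# No <a> elements exist inside #SearchResults in the static HTML,
# so the selector matches nothing and html_attr() returns character(0).
static_links <- html_attr(html_nodes(static_page, "#SearchResults a"), "href")
length(static_links)
```

This is why a real browser session (here via RSelenium) is used to obtain the page source after the scripts have run.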

The data.frame df produced by the RSelenium code looks like this:

> head(df)
                                                                                         links_names
1                           Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2                       Tackling climate for real: progress and next steps - speech by Andrew Bailey
3                     Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5                              Responsible openness in the Insurance Sector - speech by Anna Sweeney
6                           Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
                                                                                                                       links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2           https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3             https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4   https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5         https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6     https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
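
To cover the second half of the question (applying the first snippet to every link), one possible sketch is shown below. It assumes the individual speech pages can be read with plain rvest, as the first snippet in the question suggests, and it reuses the question's `#output .page-section` selector. The helper names `extract_speech_text()`, `scrape_speech()` and `add_speech_text()` and the one-second pause are my own choices, not part of any package:

```r
library(rvest)

# Hypothetical helper: pull the speech text out of one parsed page,
# using the selector from the question.
extract_speech_text <- function(page, css = "#output .page-section") {
  paste(html_text(html_nodes(page, css), trim = TRUE), collapse = "\n")
}

# Hypothetical helper: download one speech page and extract its text.
# Returns NA if the page cannot be read, so one bad link does not stop the run.
scrape_speech <- function(url, pause = 1) {
  Sys.sleep(pause)  # small delay between requests to be polite to the server
  tryCatch(
    extract_speech_text(read_html(url)),
    error = function(e) NA_character_
  )
}

# Hypothetical helper: add a text column to a data.frame with a links column.
add_speech_text <- function(df, pause = 1) {
  df$text <- vapply(as.character(df$links), scrape_speech, character(1),
                    pause = pause, USE.NAMES = FALSE)
  df
}

# Usage with the data.frame built above (requires network access):
# df <- add_speech_text(df)
```

Keeping the parsing step in `extract_speech_text()` separate from the download makes it possible to check the selector against a saved or hand-written HTML snippet before running it over every link.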
