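r - Scraping consecutive pages with a "View more" button using R
Question
I'm new to R and need to scrape the titles and dates of the posts on this website: https://www.healthnewsreview.org/news-release-reviews/
Using rvest I was able to write some basic code to get the information: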
library(rvest)

url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)

# Dates of the posts shown on the page
date_data_html <- html_nodes(webpage, 'span.date')
date_data <- html_text(date_data_html)
head(date_data)

# Titles of the posts shown on the page (the page is already loaded, so it is not read again)
title_data_html <- html_nodes(webpage, 'h2')
title_data <- html_text(title_data_html)
head(title_data)
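But since the website only shows 10 items at first and then you have to click "View more", I don't know how to scrape the whole site. Thank you!!
Answer
Bringing in third-party dependencies should be a last resort. RSelenium (which r2evans originally assumed was the only solution) is unnecessary the vast majority of the time, including now. (It is necessary for gnarly sites that use horrible tech like SharePoint, since maintaining state without a browser context there is more pain than it's worth.)
If we start with the main page: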
library(rvest)
pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")
we can get the first set of links (10 of them):
pg %>%
  html_nodes("div.item-content") %>%
  html_attr("onclick") %>%
  gsub("^window.location.href='|'$", "", .)
## [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"
## [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
## [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"
## [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"
## [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"
## [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"
## [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"
## [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"
## [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"
I'm guessing you want to scrape the content of those ^^, so have at it.
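The `gsub()` step in that pipeline can be tried in isolation. The sketch below is only an illustration: the sample `onclick` value is made up (it merely follows the shape of the values shown above), and the pattern escapes the dots, which is slightly stricter than the original but matches the same strings.

```r
# Sample onclick values in the shape seen on the listing page (the URL is made up).
onclick <- c(
  "window.location.href='https://example.org/news-release-review/some-post/'",
  NA  # html_attr() returns NA for a node that lacks the attribute
)

# Strip the JavaScript prefix and the closing quote, leaving only the URL.
urls <- gsub("^window\\.location\\.href='|'$", "", onclick)

# Drop entries that had no onclick attribute at all.
urls <- urls[!is.na(urls)]
print(urls)
```

Filtering out the `NA` entries is optional, but it keeps later `read_html()` calls from being handed a missing URL.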
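But there's that pesky "View more" button.
When you click it, the page issues a POST request to the site's admin-ajax.php endpoint.
With curlconverter we can convert that request into a callable httr function (which may not exist, given the impossibility of this task). We can wrap that function call in another function with a pagination parameter: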
view_more <- function(current_offset=10) {

  # Replay the POST request that the "View more" button sends
  httr::POST(
    url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest"
    ),
    body = list(
      action = "viewMore",
      current_offset = as.character(as.integer(current_offset)),
      page_id = "22332",
      btn = "btn btn-gray",
      active_filter = "latest"
    ),
    encode = "form"
  ) -> res

  # Return the article links in the response plus the offset for the next call
  list(
    links = httr::content(res) %>%
      html_nodes("div.item-content") %>%
      html_attr("onclick") %>%
      gsub("^window.location.href='|'$", "", .),
    next_offset = current_offset + 4
  )

}
Now we can run it (the default offset of 10 is the one sent on the first "View more" click):
x <- view_more()
str(x)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
## $ next_offset: num 14
We can pass the new offset to another call:
y <- view_more(x$next_offset)
str(y)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
## $ next_offset: num 18
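You can do the hard part yourself: scrape the initial article count (it's on the main page) and do the math to put this in a loop that stops at the right point.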
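One possible shape for that loop is sketched below. It is not part of the original answer: `collect_links()` and `fake_view_more()` are hypothetical names, and `total` is assumed to be the article count you have already read from the main page (the selector for it is not shown here). The fake fetcher only stands in for `view_more()` so that the sketch runs without touching the site.

```r
# Keep requesting pages until the offset reaches the known article count.
# `fetch_page` is any function with the same contract as view_more():
# it takes an offset and returns list(links = <chr>, next_offset = <num>).
collect_links <- function(fetch_page, total, first_offset = 10, max_requests = 1000) {
  links <- character(0)
  offset <- first_offset
  requests <- 0
  while (offset < total && requests < max_requests) {
    res <- fetch_page(offset)
    if (length(res$links) == 0) break  # nothing came back, so stop early
    links <- c(links, res$links)
    offset <- res$next_offset
    requests <- requests + 1
    # Sys.sleep(1)  # uncomment to pause between real requests
  }
  unique(links)
}

# Offline stand-in for view_more(): pretends the site has 23 articles
# and hands out at most 4 links per call, like the real button.
fake_view_more <- function(current_offset = 10) {
  all_links <- sprintf("https://example.org/item-%02d/", 1:23)
  idx <- seq_along(all_links)
  idx <- idx[idx > current_offset & idx <= current_offset + 4]
  list(links = all_links[idx], next_offset = current_offset + 4)
}

rest <- collect_links(fake_view_more, total = 23)
length(rest)
```

Against the real site you would call `collect_links(view_more, total = n)` with your scraped count `n`, then combine the result with the 10 links from the main page. The `max_requests` cap and the empty-result check are safety nets in case the count is wrong.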
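NOTE: If you are doing this scrape to archive the whole site (whether for them or independently), since it is dying at the end of the year, you should comment to that effect; I have better suggestions for that use case than hand-coding anything in any programming language. There are free, industrial-strength "site preservation" frameworks designed to preserve exactly these kinds of dying resources. If you only need the article content, then an iterator and a custom scraper may well be a good (but, obviously, impossible) choice.
Also note that the pagination increment of 4 is what the site itself does when you actually press the button, so this just mimics that behavior.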