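web-scraping - Scraping Yahoo Finance news headlines with RSelenium
Question
I want to get a company's news headlines from Yahoo. I use RSelenium to start a remote browser and accept the cookies. I found the surrounding CSS class "StretchedBox", and by inspecting the page in the browser I can literally see the headlines. How can I store these headlines? Next, I would like to scroll down with RSelenium and save more of these elements (say, several days' worth).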
library('RSelenium')
# Start Remote Browser
rD <- rsDriver(port = 4840L, browser = c("firefox"))
remDr <- rD[["client"]]
# Navigate to Yahoo Finance News for Specific Company
# This takes an unusually long time
remDr$navigate("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
# Get the "accept all cookies" button
webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'btn primary')]")
# We can check if we did get the proper button by checking the text of the element:
unlist(lapply(webElems, function(x) {x$getElementText()}))
# We found two buttons, and we want to click the first one:
webElems[[1]]$clickElement()
# Wait for the page to load
Sys.sleep(5)
# I am looking for the news headlines in or after the StretchedBox elements
boxes <- remDr$findElements(using = "class", "StretchedBox")
boxes[1] # empty
boxes[[1]]$browserName
Answer
In the end, I found an XPath whose matching elements return the news article titles when I call getElementText on them.
library('RSelenium')
# Start Browser
rD <- rsDriver(port = 4835L, browser = c("firefox"))
remDr <- rD[["client"]]
# Navigate to Yahoo Financial News
remDr$navigate("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
# Click Accept Cookies
webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'btn primary')]")
unlist(lapply(webElems, function(x) {x$getElementText()}))
webElems[[1]]$clickElement()
# Wait for the page to load, as in the question's code
Sys.sleep(5)
# Locate the headline links by XPath
headlines <- remDr$findElements(using = "xpath", "//h3[@class = 'Mb(5px)']//a")
# Extract the text of each headline (sapply returns a list here)
headlines <- sapply(headlines, function(x){x$getElementText()})
headlines[1]
# [[1]]
# [1] "What Kind Of Investors Own Most Of Apple Inc. (NASDAQ:AAPL)?"
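The answer above does not cover the second part of the question, scrolling down to collect more headlines. One possible approach, offered as a rough sketch rather than a tested solution, is to scroll to the bottom of the page with executeScript, pause, and collect the headline texts again, keeping only the unique ones. The function name collect_headlines and its parameters n_scrolls and pause are hypothetical, and the sketch assumes that Yahoo loads further articles when the page is scrolled and that the XPath from the answer still matches the page markup.

```r
# Sketch only: `collect_headlines`, `n_scrolls`, and `pause` are hypothetical
# names. `remDr` is assumed to be the open RSelenium client created above.
collect_headlines <- function(remDr, n_scrolls = 5, pause = 2,
                              xpath = "//h3[@class = 'Mb(5px)']//a") {
  # Read the text of every headline link currently present in the page
  grab <- function() {
    elems <- remDr$findElements(using = "xpath", value = xpath)
    as.character(unlist(lapply(elems, function(x) x$getElementText())))
  }

  seen <- grab()
  for (i in seq_len(n_scrolls)) {
    # Scroll to the bottom of the page (assumed to trigger loading of
    # further articles)
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    # Give the page time to load the new articles
    Sys.sleep(pause)
    # Keep each headline once, in the order it was first seen
    seen <- union(seen, grab())
  }
  seen
}
```

With the session from the answer still open, calling collect_headlines(remDr, n_scrolls = 10) would return a character vector of the headlines found. How many scrolls correspond to "several days" of news depends on the page, so n_scrolls and pause need some experimentation. When finished, remDr$close() and rD$server$stop() shut down the browser and the Selenium server.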