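r - Scraping JavaScript objects with rvest
Question
I am trying to scrape JavaScript-rendered objects from a web page. As suggested, I tried the JIRA API, but it did not return the activity log. I then found a site that explains how to scrape JavaScript-rendered content; see the link below.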
https://datascienceplus.com/scraping-javascript-rendered-web-content-using-r/
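I followed that example, but I am struggling to work out what XPath expression I need to pass in order to list the activity log. I am trying to scrape the activity log shown under the "All" tab of the tab container at the bottom of the page.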
library(rvest)
library(V8)
#URL with js-rendered content to be scraped
link<- 'https://issues.apache.org/jira/browse/AMQCPP-645'
#Read the html page content and extract all javascript codes that are inside a list
#html<- getURL(link, followlocation = TRUE)
emailjs <- read_html(link) %>% html_nodes(xpath = "//div") %>% html_text()
ct <- v8()
#parse the html content from the js output and print it as text
read_html(ct$eval(gsub('document.write','',emailjs))) %>%
html_text()
I would like to get output like this:
rows emailjs
1 S A created issue - 25/Apr/19 15:48 Highlight in document.
2 Justin Bertram made changes - 25/Apr/19 17:53 Field Original Value
New
Value Comment [ I'm using Firefox, and it's working no problem. It's
just HTML so there shouldn't be any browser compatibility issues.
My guess is that Firefox is holding on to an older, cached version or
something. Try opening a "private browsing" window and trying it from
there. ] Highlight in document.
3 Timothy Bish made changes - 25/Apr/19 18:10 Resolution Fixed [ 1 ]
Status
Open [ 1 ] Closed [ 6 ] Highlight in document.
4 Timothy Bish made transition - 25/Apr/19 18:10 Open Closed 2h 22m 1
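Any suggestions would be greatly appreciated. Thanks!
Answer
You can mimic the POST request that the page itself makes and add the one required header. Then parse the response as HTML to extract the content you want. You may need to do some further string cleanup.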
library(httr)
library(rvest)
library(magrittr)

# Header sent with the page's own XHR call; this is the one required header
headers <- c('X-Requested-With' = 'XMLHttpRequest')
# Request body mimicking the one the page posts
data <- '[{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel","tabPosition":1},"timeDelta":-4904},{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel","tabPosition":0},"timeDelta":-4178}]'
endpoint <- 'https://issues.apache.org/jira/browse/AMQCPP-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&_=1570029676497'

# POST the request, parse the response as HTML, and collect the text of each activity block
rows <- httr::POST(url = endpoint, httr::add_headers(.headers = headers), body = data) %>%
  read_html() %>%
  html_nodes('.issuePanelWrapper .issue-data-block') %>%
  html_text() %>%
  gsub('\\s+', ' ', .)  # collapse all whitespace runs, including newlines, into one space
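
As one possible form of the extra string cleanup mentioned above, the sketch below splits each entry into an event, a timestamp, and the remaining details. It is only an illustration under assumptions: the sample strings are copied from the expected output in the question and not from a live response, the layout "<who did what> - <dd/Mon/yy HH:MM> <details>" is assumed, and `parse_activity` is a hypothetical helper name.

```r
# Sample entries copied from the expected output in the question; the real
# `rows` vector returned by the request may be formatted differently.
rows <- c(
  "S A created issue - 25/Apr/19 15:48 Highlight in document.",
  "Timothy Bish made changes - 25/Apr/19 18:10 Resolution Fixed [ 1 ] Status Open [ 1 ] Closed [ 6 ] Highlight in document.",
  "Timothy Bish made transition - 25/Apr/19 18:10 Open Closed 2h 22m 1"
)

# Assumed layout: "<who did what> - <dd/Mon/yy HH:MM> <details>"
pattern <- "^\\s*(.*?) - (\\d{1,2}/[A-Za-z]{3}/\\d{2} \\d{1,2}:\\d{2})\\s*(.*)$"

# Hypothetical helper: split each entry into event, timestamp, and details
parse_activity <- function(x) {
  parts <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Entries that do not match the assumed layout keep their full text in `details`
  fields <- t(vapply(seq_along(x), function(i) {
    p <- parts[[i]]
    if (length(p) == 4) p[2:4] else c(NA_character_, NA_character_, x[i])
  }, character(3)))
  data.frame(
    event = trimws(fields[, 1]),
    when = fields[, 2],
    details = trimws(fields[, 3]),
    stringsAsFactors = FALSE
  )
}

activity <- parse_activity(rows)
print(activity)
```

The timestamp is left as a string here because parsing month abbreviations with `%b` depends on the session locale; convert it with `as.POSIXct()` only if the locale matches.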