Scraping JavaScript objects with rvest

Problem description

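I am trying to scrape JavaScript objects from a web page. As suggested, I tried the JIRA API, but it did not return the activity log. I found a site that explains how to scrape JavaScript-rendered content; see the link below.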

https://datascienceplus.com/scraping-javascript-rendered-web-content-using-r/

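I followed that example, but I am struggling to work out which XPath I need to supply to list the activity log. I am trying to scrape the activity log under all of the tab containers at the bottom of the page.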

library(rvest)
library(V8)

# URL with JS-rendered content to be scraped
link <- 'https://issues.apache.org/jira/browse/AMQCPP-645'

# Read the HTML page content and extract all JavaScript code inside a list
# html <- getURL(link, followlocation = TRUE)
emailjs <- read_html(link) %>% html_nodes(xpath = "//div") %>% html_text()

ct <- v8()
# Parse the HTML content from the JS output and print it as text
read_html(ct$eval(gsub('document.write', '', emailjs))) %>%
  html_text()

I expected to get output like this:

rows  emailjs
1     S A created issue - 25/Apr/19 15:48 Highlight in document.
2     Justin Bertram made changes - 25/Apr/19 17:53 Field Original Value New Value Comment [ I'm using Firefox, and it's working no problem. It's just HTML so there shouldn't be any browser compatibility issues. My guess is that Firefox is holding on to an older, cached version or something. Try opening a "private browsing" window and trying it from there. ] Highlight in document.
3     Timothy Bish made changes - 25/Apr/19 18:10 Resolution Fixed [ 1 ] Status Open [ 1 ] Closed [ 6 ] Highlight in document.
4     Timothy Bish made transition - 25/Apr/19 18:10 Open Closed 2h 22m 1

Any suggestions would be greatly appreciated. Thanks!

Tags: r, web-scraping, v8, rvest

Solution


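You can mimic the POST request that the page makes and add one required header. Then parse the HTML response to extract the content you need. You may need to do some further string tidying.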

library(httr)
library(rvest)
library(magrittr)

# Header that marks the request as an AJAX (XMLHttpRequest) call
headers <- c('X-Requested-With' = 'XMLHttpRequest')

# Request body copied from the POST request the page itself sends
data <- '[{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel","tabPosition":1},"timeDelta":-4904},{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel","tabPosition":0},"timeDelta":-4178}]'

url <- 'https://issues.apache.org/jira/browse/AMQCPP-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&_=1570029676497'

# POST the request, parse the HTML response, and pull the text of each activity block
rows <- httr::POST(url = url, httr::add_headers(.headers = headers), body = data) %>%
  read_html() %>%
  html_nodes('.issuePanelWrapper .issue-data-block') %>%
  html_text() %>%
  gsub('\\s+', ' ', .)  # collapse runs of whitespace, including newlines
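
As one possible take on the "further string tidying" mentioned above, here is a minimal sketch in base R that trims each entry and arranges the result in the rows/emailjs layout the question asks for. The sample strings stand in for the scraped `rows` vector so the snippet runs offline. The names `activity`, `cleaned`, `when` and `stamp_pattern` are illustrative, and the timestamp pattern is an assumption based on the dates shown in the question.

```r
# Sample entries standing in for the scraped `rows` vector
# (text taken from the expected output in the question).
rows <- c(
  " S A created issue - 25/Apr/19 15:48 Highlight in document. ",
  " Timothy Bish made transition - 25/Apr/19 18:10 Open Closed 2h 22m 1 "
)

# Strip leading and trailing whitespace left over after collapsing runs of spaces.
cleaned <- trimws(rows)

# Assumed timestamp format, e.g. "25/Apr/19 15:48".
stamp_pattern <- "[0-9]{2}/[A-Za-z]{3}/[0-9]{2} [0-9]{2}:[0-9]{2}"

# Extract the first timestamp from each entry; entries without one stay NA.
m <- regexpr(stamp_pattern, cleaned)
when <- rep(NA_character_, length(cleaned))
when[m > 0] <- regmatches(cleaned, m)

# One row per activity entry, numbered as in the question's expected output.
activity <- data.frame(
  rows = seq_along(cleaned),
  emailjs = cleaned,
  when = when,
  stringsAsFactors = FALSE
)

print(activity)
```

Keeping the timestamp in its own column makes it easier to sort or filter the activity log later; it can be dropped if only the raw text is needed.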
