r - Scraping with rvest when tags vary between items

Problem description

My problem

I am trying to scrape documents from this URL:

url <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=01.01.2017&to_date=05.01.2017&x=0&y=0"

The markup for a single document of interest looks like this:

<span class="rank_title">
                  <a href="https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&amp;type=highlight_simple_query&amp;page=1&amp;from_date=01.01.2017&amp;to_date=05.01.2017&amp;sort=relevance&amp;insertion_date=&amp;top_subcollection_aza=all&amp;query_words=&amp;rank=5&amp;azaclir=aza&amp;highlight_docid=aza%3A%2F%2F05-01-2017-2C_826-2015&amp;number_of_ranks=67" title="Seite mit hervorgehobenen Suchbegriffen öffnen">05.01.2017 2C 826/2015</a>
</span>
   <span class="published_info small normal">
      <a href="https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&amp;type=highlight_simple_query&amp;page=1&amp;from_date=01.01.2017&amp;to_date=05.01.2017&amp;sort=relevance&amp;insertion_date=&amp;top_subcollection_aza=all&amp;query_words=&amp;highlight_docid=atf%3A%2F%2F143-I-73%3Ade&amp;azaclir=aza">publiziert</a>
   </span>
<div class="rank_data">
      <div class="court small normal">
      IIe Cour de droit public
   </div>

      <div class="subject small normal">
      Finances publiques &amp; droit fiscal
   </div>

      <div class="object small normal">
      Impôts communal et cantonal 2009, impôt sur la fortune; estimation de titres non cotés, garantie de la propriété
   </div>
   </div>               </li>

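I am interested in the following classes: "rank_title", "published_info small normal", "subject small normal" and "object small normal". I would like to store this information in a data frame.

However, not all documents have all of these classes (on this page, for example, only one document has the "published_info small normal" class).

If "published_info small normal" is available, I am mainly interested in extracting the title of the published document, which in this example is:

143 I 73

Edit: It would also be fine if the script only extracted "publiziert" (when "published_info small normal" is available).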
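To illustrate why the missing class is a problem, here is a minimal sketch on a made-up two-item list (the markup and the names "Doc A" and "Doc B" are invented for illustration, not taken from the real page). Selecting each class across the whole page gives vectors of different lengths, so they cannot be combined into a data frame row by row.

```r
library(rvest)

# Invented markup: only the first item carries a published_info span
page <- read_html('
<ol>
  <li><span class="rank_title"><a href="#a">Doc A</a></span>
      <span class="published_info small normal"><a href="#p">publiziert</a></span></li>
  <li><span class="rank_title"><a href="#b">Doc B</a></span></li>
</ol>')

titles    <- page %>% html_nodes(".rank_title a") %>% html_text()
published <- page %>% html_nodes(".published_info a") %>% html_text()

length(titles)     # 2
length(published)  # 1, and nothing says which item it belongs to
```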

My approach

I found a post that looks very relevant to my problem: Scraping with rvest - complete with NAs when tag is not present

I started implementing it like this:

library(XML)
doc <- xmlTreeParse(url, asText = TRUE, useInternalNodes = TRUE)

However, I do not know how to write the code for nodes that are only present in some of the items.

Tags: r, xml, web-scraping, rvest

Found a solution:

library(rvest)  # read_html(), html_nodes(), html_attr(), html_text(), %>%
library(purrr)  # map_df(); needs dplyr to be installed

url <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=01.01.2017&to_date=05.01.2017&x=0&y=0"

# read the html
pg <- read_html(url)

xdf <- pg %>%
  html_nodes("div.ranklist_content ol li") %>%   # select enclosing nodes
  # iterate over each, pulling out desired parts and coerce to data.frame
  map_df(~list(
    link = html_nodes(.x, ".rank_title a") %>%
      html_attr("href") %>%
      {if (length(.) == 0) NA else .},           # replace length-0 elements with NA
    title = html_nodes(.x, ".rank_title a") %>%
      html_text() %>%
      {if (length(.) == 0) NA else .},
    publication_link = html_nodes(.x, ".published_info a") %>%
      html_attr("href") %>%
      {if (length(.) == 0) NA else .},
    publication = html_nodes(.x, ".published_info a") %>%
      html_text() %>%
      {if (length(.) == 0) NA else .},
    court = html_nodes(.x, ".rank_data .court") %>%
      html_text(trim = TRUE) %>%
      {if (length(.) == 0) NA else .},
    subject = html_nodes(.x, ".rank_data .subject") %>%
      html_text(trim = TRUE) %>%
      {if (length(.) == 0) NA else .},
    object = html_nodes(.x, ".rank_data .object") %>%
      html_text(trim = TRUE) %>%
      {if (length(.) == 0) NA else .}
  ))

I would still appreciate it if someone could help me extract the title for class="published_info small normal".
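One way to get at that title, sketched under an assumption: in the example markup the reference appears URL-encoded in the highlight_docid parameter of the publication link (atf%3A%2F%2F143-I-73%3Ade), so it might be recovered from the href with base R. The helper name extract_citation is hypothetical, and the pattern assumes every published link follows the same atf://volume-part-page:language form, which has only been checked against the single example in the question.

```r
# Hypothetical helper: pull e.g. "143 I 73" out of a publication link.
# Assumes the reference is encoded as highlight_docid=atf://<vol>-<part>-<page>:<lang>
extract_citation <- function(href) {
  if (is.na(href)) return(NA_character_)
  decoded <- utils::URLdecode(href)  # "%3A%2F%2F" becomes "://"
  m <- regmatches(decoded, regexec("highlight_docid=atf://([^:&]+)", decoded))[[1]]
  if (length(m) < 2) return(NA_character_)  # pattern not found
  gsub("-", " ", m[2], fixed = TRUE)        # "143-I-73" becomes "143 I 73"
}

# The href from the example markup, as html_attr() would return it (with plain "&")
href <- "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=01.01.2017&to_date=05.01.2017&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&highlight_docid=atf%3A%2F%2F143-I-73%3Ade&azaclir=aza"

extract_citation(href)  # "143 I 73"
```

Applied to the scraped data frame, this could look like `xdf$citation <- vapply(xdf$publication_link, extract_citation, character(1), USE.NAMES = FALSE)`, which leaves NA for documents without a publication link.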
