Problem scraping a web page with R and rvest

Problem description

I am using the code below to extract a table from a web page:

library(rvest)
library(dplyr)

#Link to site and then getting html code. 
link <- "https://www.stats.gov.sa/en/915"
page <- read_html(link)

#extract table from html
files <- page %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()

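However, the result I get differs from what the web page shows. It looks like this:

# A tibble: 1 × 4
  Name            `Report Period`  Periodicity     Download
1 Please wait...  Please wait...   Please wait...  Please wait...

I would like to know whether there is a way to get the table as it appears in a web browser without using RSelenium, because RSelenium does not seem to work on RStudio online.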

Tags: r, web-scraping, rvest

Solution


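One possible solution is RSelenium. The placeholder text in your result suggests that the table is filled in by JavaScript after the page loads, so read_html() only sees the initial HTML. Driving a real browser lets that script run before the page is parsed.

Here is a simple example: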
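
To illustrate why plain rvest returned only placeholder text, here is a minimal offline sketch. The HTML string is hypothetical and merely stands in for what a server might send before any JavaScript has run; it is not the actual markup of the site in question.

```r
library(rvest)

# Hypothetical static HTML: the table exists, but its only data row
# holds placeholder text that a script would normally replace later.
static_html <- minimal_html('
  <table>
    <tr><th>Name</th><th>Report Period</th></tr>
    <tr><td>Please wait...</td><td>Please wait...</td></tr>
  </table>
')

# rvest parses exactly what is in the HTML; it does not execute JavaScript.
placeholder <- static_html %>%
  html_element("table") %>%
  html_table()

print(placeholder)
```

Since rvest never runs the page's scripts, the placeholder row is all it can return here, which matches the one-row result in the question.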

library(RSelenium)
library(rvest)
library(dplyr)
#Your URL
URL <- "https://www.stats.gov.sa/en/915"
#Start a Selenium server and open a Firefox session via RSelenium
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
remDr <- rD[["client"]]
#Load the page in the browser so that its JavaScript runs
remDr$navigate(URL)
#Parse the rendered page source and extract the tables you see
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()

[[1]]
# A tibble: 13 x 4
   Name                           `Report Period` Periodicity Download
   <chr>                                    <int> <chr>       <lgl>   
 1 Ar-Riyad Region                           2017 Annual      NA      
 2 Makkah Al-Mokarramah Region               2017 Annual      NA      
 3 Al-Madinah Al-Monawarah Region            2017 Annual      NA      
 4 Al-Qaseem Region                          2017 Annual      NA      
 5 Eastern Region                            2017 Annual      NA      
 6 Aseer Region                              2017 Annual      NA      
 7 Tabouk Region                             2017 Annual      NA      
 8 Hail Region                               2017 Annual      NA      
 9 Northern Borders Region                   2017 Annual      NA      
10 Jazan Region                              2017 Annual      NA      
11 Najran Region                             2017 Annual      NA      
12 Al-Baha Region                            2017 Annual      NA      
13 Al-Jouf Region                            2017 Annual      NA 
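
If the page source is read immediately after navigate(), the table may still contain its placeholder row. The following is a rough sketch of one way to wait for the content, not a definitive implementation. The helper names `wait_until` and `scrape_rendered_tables`, the CSS selector, and the minimum row count are all assumptions made for illustration and may need adjusting for the real page.

```r
# Hypothetical helper: call `check()` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Hypothetical wrapper around the RSelenium steps from the answer.
# `row_selector` and `min_rows` are assumptions about the page markup:
# the placeholder state had a single row, so we wait for at least two.
scrape_rendered_tables <- function(remDr, url,
                                   row_selector = "table tbody tr",
                                   min_rows = 2) {
  remDr$navigate(url)
  loaded <- wait_until(function() {
    rows <- remDr$findElements(using = "css selector", value = row_selector)
    length(rows) >= min_rows
  })
  if (!loaded) warning("Timed out waiting for table rows; result may be incomplete")
  rvest::html_table(rvest::read_html(remDr$getPageSource()[[1]]))
}

# Offline demonstration of the polling helper: a counter stands in
# for the browser and reports "ready" on the third check.
attempts <- 0
ready <- wait_until(function() {
  attempts <<- attempts + 1
  attempts >= 3
}, timeout = 5, interval = 0.1)
```

With a live session, the call would look like `scrape_rendered_tables(remDr, URL)`. When you are done, `remDr$close()` ends the browser session and `rD$server$stop()` shuts down the Selenium server.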
