Scraping text and tables from multiple pages with rvest and combining them

Problem description

I have a situation where I want to scrape multiple tables across different URLs. I managed to scrape a single page, but when I try to scrape across pages and stack the tables into a data frame/list, my function fails.

library(rvest)
library(tidyverse)
library(purrr)

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol = str_c(temp))
}
# results <- map_df(urls, get_gram)
# Have commented this out, but this is what I used to get the table
# when the index had just one element, and it worked.

results <- list()
results[[i]] <- map_df(urls,get_gram)

I think I am stumbling at the step where the map_df output has to be stacked. Thanks in advance for your help!

Tags: r, web-scraping, screen-scraping, purrr, rvest

Solution


You are passing `url` to the function but using `urls` inside the function body, so every call re-scrapes the whole vector of URLs instead of one page. Try this version:

library(rvest)
library(dplyr)
library(tibble)   # add_column() comes from tibble, not dplyr

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  # Parse the page once and reuse the parsed document
  webpage <- url %>% read_html()
  # Grab the link text to attach as a new column
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  # Extract the table and append the link text as a column
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>%
    add_column(newcol = temp)
}
result <- purrr::map_df(urls, get_gram)
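Note that no extra list or `results[[i]]` bookkeeping is needed: `map_df()` applies the function to each URL and row-binds the per-page data frames into one. A minimal, self-contained sketch of that stacking behavior (the `fake_gram` function and its toy data are invented for illustration; no network access is involved):

```r
library(purrr)
library(tibble)

# Toy stand-in for get_gram(): returns one small data frame per "page"
fake_gram <- function(i) {
  tibble(member = paste0("member_", i),
         newcol = paste0("panchayat_", i))
}

# map_df() calls fake_gram() on each element and row-binds the results
stacked <- map_df(1:3, fake_gram)
nrow(stacked)   # 3 rows, one per simulated page
```

The same pattern scales to the real `get_gram()`: each element of `urls` contributes its table's rows, already tagged with its own `newcol` value.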
