for loop: cannot store data frames in a list

Problem description

I created a function using a for loop (I know this is not the best way to build loops in R, but I don't know any other way to do this task). The purpose of the function is to take a set of Google search URLs, scrape them as SERPs (search engine results pages) and run a topic analysis on them.

The function works; my problem is in this line of code:

out_list[(4*(url - 1)) + (topics - 1)] <- df_LDA

It does not store the complete df_LDA data frame in the list (out_list), but only its first column (topic).

df_LDA has 4 columns: topic, term, beta and term2.

These are the warning messages I get:

There were 32 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In out_list[(4 * (url - 1)) + (topics - 1)] <- df_LDA :
  number of items to replace is not a multiple of replacement length
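The same warning can be reproduced outside the scraping code, with a plain list and a small data frame shaped like df_LDA (a toy example, not my actual data):

out <- vector(mode = "list", length = 1)
df  <- data.frame(topic = 1, term = "a", beta = 0.5, term2 = "a")

out[1]  <- df   # produces the same "number of items to replace" warning
out[[1]]        # only the topic column was stored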

Here is my code, to make the example reproducible.

library(tidyverse); library(rvest); library(googlesheets4); library(tidytext)
library(topicmodels)

# Read Google Sheet with queries
gs4_deauth()
queries <- read_sheet("1LarbbUErfnJIe00XKvXXkkfuTFOSNzDNUeuXTkxeIe0", 
                      sheet = 1, range = "A2:B")

scrap_google <- function(df){
  out_list <- vector(mode = "list", length = nrow(df)*4)
  for(url in 1:nrow(df)){
    df_scrap <- read_html(as.character(df[url,2])) %>% 
      html_nodes(xpath = "//div/div/div/div/div/div/div/text()") %>% 
      html_text() %>% 
      .[. != ">"] %>% 
      .[!. %in% c("Todos los resultados", "De cualquier fecha", 
                  "Buscar páginas en Español", " · ", " ")] %>% 
      .[!str_detect(., pattern = "[\n]")] %>% 
      .[nzchar(.)] %>% 
      tibble(id = seq(1:length(.)), title  = ., text = .) %>% 
      unnest_tokens(word, text) %>% 
      anti_join(tibble(word = tm::stopwords("spanish"))) %>% 
      count(word, id) %>%
      cast_dtm(id, word, n) %>% 
      as.matrix()
    for(topics in 2:5){
      df_LDA <- LDA(x = df_scrap, k = topics, method = "Gibbs", 
          control = list(seed = 1)) %>% 
        tidy(matrix = "beta") %>% 
        group_by(topic) %>% 
        top_n(7, beta) %>% 
        ungroup() %>% 
        mutate(term2 = fct_reorder(term, beta)) 
      names(out_list)[(4*(url - 1)) + (topics - 1)] <- 
        paste(as.character(df[url, 1]), topics, sep = "_")
      out_list[(4*(url - 1)) + (topics - 1)] <- df_LDA
      print(ggplot(df_LDA, aes(term2, beta, fill = as.factor(topic))) + 
              geom_col(show.legend = FALSE) + 
              facet_wrap(~ topic, scales = 'free') +
              coord_flip() + ggtitle(df[url, 1]))  
    }
  }
}

# Run analysis  
beta_list <- scrap_google(df = queries) 

Tags: r, list, web-scraping

Solution

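The warning comes from the single-bracket assignment: out_list[i] <- df_LDA treats the data frame as a list of its four columns and recycles them into the selected slot, which is why only topic survives. Double-bracket assignment stores the whole data frame as a single list element. A minimal sketch of that fix, using a toy stand-in for df_LDA (the scraping and LDA parts are unrelated to the problem):

library(tibble)

# Toy stand-in for df_LDA, with the same four columns as in the question
df_LDA <- tibble(topic = c(1, 1), term = c("a", "b"),
                 beta  = c(0.4, 0.3), term2 = factor(c("a", "b")))

out_list <- vector(mode = "list", length = 4)

# [[ ]] replaces ONE list element with the whole data frame;
# [ ] would instead recycle the data frame's columns into the list
out_list[[1]] <- df_LDA
names(out_list)[1] <- "query_2"   # hypothetical name, mirroring paste(df[url, 1], topics, sep = "_")

str(out_list[[1]])   # all four columns (topic, term, beta, term2) are kept

Applied to the function, that means changing the assignment to out_list[[(4*(url - 1)) + (topics - 1)]] <- df_LDA. Independently of the indexing, scrap_google() never returns out_list: the for loop is its last expression and a for loop evaluates to NULL, so out_list (or return(out_list)) should be added as the final line of the function body, otherwise beta_list ends up NULL even after the indexing is fixed.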
