for loop: cannot store data frames in a list

Problem description

I created a function using a for loop (I know this is not the best way to build loops in R, but I don't know any other way to do this task). The purpose of the function is to take a set of Google search URLs, scrape them as SERPs (search engine results pages) and run a topic analysis on them.

The function works; my problem is in this line of code:

out_list[(4*(url - 1)) + (topics - 1)] <- df_LDA

It does not store the complete df_LDA data frame in the list (out_list), but only its first column (topic).

df_LDA has 4 columns: topic, term, beta and term2.

These are the warning messages I get:

There were 32 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In out_list[(4 * (url - 1)) + (topics - 1)] <- df_LDA :
  number of items to replace is not a multiple of replacement length
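The same warning can be reproduced outside the scraping code, with a plain list and a small data frame shaped like df_LDA (a toy example, not my actual data):

out <- vector(mode = "list", length = 1)
df  <- data.frame(topic = 1, term = "a", beta = 0.5, term2 = "a")

out[1]  <- df   # produces the same "number of items to replace" warning
out[[1]]        # only the topic column was stored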

Here is my code, to make the example reproducible.

library(tidyverse); library(rvest); library(googlesheets4); library(tidytext)
library(topicmodels)

# Read Google Sheet with queries
gs4_deauth()
queries <- read_sheet("1LarbbUErfnJIe00XKvXXkkfuTFOSNzDNUeuXTkxeIe0", 
                      sheet = 1, range = "A2:B")

scrap_google <- function(df){
  out_list <- vector(mode = "list", length = nrow(df)*4)
  for(url in 1:nrow(df)){
    df_scrap <- read_html(as.character(df[url,2])) %>% 
      html_nodes(xpath = "//div/div/div/div/div/div/div/text()") %>% 
      html_text() %>% 
      .[. != ">"] %>% 
      .[!. %in% c("Todos los resultados", "De cualquier fecha", 
                  "Buscar páginas en Español", " · ", " ")] %>% 
      .[!str_detect(., pattern = "[\n]")] %>% 
      .[nzchar(.)] %>% 
      tibble(id = seq(1:length(.)), title  = ., text = .) %>% 
      unnest_tokens(word, text) %>% 
      anti_join(tibble(word = tm::stopwords("spanish"))) %>% 
      count(word, id) %>%
      cast_dtm(id, word, n) %>% 
      as.matrix()
    for(topics in 2:5){
      df_LDA <- LDA(x = df_scrap, k = topics, method = "Gibbs", 
          control = list(seed = 1)) %>% 
        tidy(matrix = "beta") %>% 
        group_by(topic) %>% 
        top_n(7, beta) %>% 
        ungroup() %>% 
        mutate(term2 = fct_reorder(term, beta)) 
      names(out_list)[(4*(url - 1)) + (topics - 1)] <- 
        paste(as.character(df[url, 1]), topics, sep = "_")
      out_list[(4*(url - 1)) + (topics - 1)] <- df_LDA
      print(ggplot(df_LDA, aes(term2, beta, fill = as.factor(topic))) + 
              geom_col(show.legend = FALSE) + 
              facet_wrap(~ topic, scales = 'free') +
              coord_flip() + ggtitle(df[url, 1]))  
    }
  }
}

# Run analysis  
beta_list <- scrap_google(df = queries) 

Tags: r, list, web-scraping

Solution

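The warning comes from the single-bracket assignment: out_list[i] <- df_LDA treats the data frame as a list of its four columns and recycles them into the selected slot, which is why only topic survives. Double-bracket assignment stores the whole data frame as a single list element. A minimal sketch of that fix, using a toy stand-in for df_LDA (the scraping and LDA parts are unrelated to the problem):

library(tibble)

# Toy stand-in for df_LDA, with the same four columns as in the question
df_LDA <- tibble(topic = c(1, 1), term = c("a", "b"),
                 beta  = c(0.4, 0.3), term2 = factor(c("a", "b")))

out_list <- vector(mode = "list", length = 4)

# [[ ]] replaces ONE list element with the whole data frame;
# [ ] would instead recycle the data frame's columns into the list
out_list[[1]] <- df_LDA
names(out_list)[1] <- "query_2"   # hypothetical name, mirroring paste(df[url, 1], topics, sep = "_")

str(out_list[[1]])   # all four columns (topic, term, beta, term2) are kept

Applied to the function, that means changing the assignment to out_list[[(4*(url - 1)) + (topics - 1)]] <- df_LDA. Independently of the indexing, scrap_google() never returns out_list: the for loop is its last expression and a for loop evaluates to NULL, so out_list (or return(out_list)) should be added as the final line of the function body, otherwise beta_list ends up NULL even after the indexing is fixed.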
