Wide format: a function that calculates row means for a specific batch of columns, then scaling up to multiple batches

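Problem description

This is a follow-up to my earlier post about building a function that calculates row means.

I would like to use any function from the apply family to iterate over my data set and, on each pass, compute the row means for a group of columns that I specify (which is what the function does). Unfortunately, I'm missing something crucial about how apply() should be set up, because I get an error that I can't resolve.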

Example data

capital_cities_df <-
 data.frame("europe_paris" = 1:10, 
           "europe_london" = 11:20, 
           "europe_rome" = 21:30,
           "asia_bangkok" = 31:40,
           "asia_tokyo" = 41:50,
           "asia_kathmandu" = 51:60)

set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df, 
function(cc) cc[ sample(c(TRUE, NA),
                         prob = c(0.70, 0.30),
                         size = length(cc), 
                         replace = TRUE) ]))

> capital_cities_df

   europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1             1            NA          NA           NA         41             NA
2            NA            12          22           NA         42             52
3             3            NA          23           33         43             NA
4            NA            14          NA           NA         NA             NA
5            NA            15          25           35         45             NA
6             6            NA          NA           36         NA             56
7            NA            17          NA           NA         NA             57
8            NA            18          NA           38         48             NA
9            NA            19          NA           39         49             NA
10           10            NA          30           40         NA             60

Custom function

library(dplyr)
library(rlang)

continent_mean <- function(df, continent)  {
  df %>%
    select(starts_with(continent)) %>%
    dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}

## works for a single case:
continent_mean(capital_cities_df, "europe")

   europe_paris europe_london europe_rome europe
1             1            NA          21     11
2             2            12          22     12
3             3            NA          23     13
4             4            14          NA      9
5            NA            15          25     20
6             6            16          26     16
7            NA            17          NA     17
8            NA            18          NA     18
9            NA            19          NA     19
10           10            20          30     20

Trying to apply the function across the data, unsuccessfully

apply(
  capital_cities_df,
  MARGIN = 2,
  FUN = continent_mean(capital_cities_df, continent = "europe")
)

Error in match.fun(FUN) : 
  'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol

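No other combination of arguments to apply() worked either, and neither did sapply(). This unsuccessful attempt with apply() covers only one type of column I want means for ("europe"). My ultimate goal, however, is to be able to pass c("europe", "asia", etc.) to apply(), so that in one go my custom function creates a row-means column for every group of columns I specify.

What is wrong with my code?

Thanks!

Edit, 19 August 2019

I'm trying the solution suggested by A. Suliman (see below). It does work for the example data I posted here, but not when I try to scale it up to my real data set, where I need to subset additional columns (not only the "continent" batches). More specifically, my real data has an ID column that I would like to be output along with the other data when I apply my custom function.

Example data including an "ID" column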

capital_cities_df <- data.frame(
    "europe_paris" = 1:10,
    "europe_london" = 11:20,
    "europe_rome" = 21:30,
    "asia_bangkok" = 31:40,
    "asia_tokyo" = 41:50,
    "asia_kathmandu" = 51:60)
  
set.seed(123)
## Iterate over the data frame itself; a bare `df` here would refer to stats::df
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
function(cc) cc[ sample(c(TRUE, NA),
                         prob = c(0.70, 0.30),
                         size = length(cc), 
                         replace = TRUE) ]))

id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)

> capital_cities_df

   id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1   1            1            NA          NA           NA         41             NA
2   2           NA            12          22           NA         42             52
3   3            3            NA          23           33         43             NA
4   4           NA            14          NA           NA         NA             NA
5   5           NA            15          25           35         45             NA
6   6            6            NA          NA           36         NA             56
7   7           NA            17          NA           NA         NA             57
8   8           NA            18          NA           38         48             NA
9   9           NA            19          NA           39         49             NA
10 10           10            NA          30           40         NA             60

My function (also edited to select id)

continent_mean <- function(df, continent)  {
  df %>%
    select(., id, starts_with(continent)) %>%
    dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}

> continent_mean(capital_cities_df, "europe") ## works in a single run

   id europe_paris europe_london europe_rome    europe
1   1            1            NA          NA  1.000000
2   2           NA            12          22 12.000000
3   3            3            NA          23  9.666667
4   4           NA            14          NA  9.000000
5   5           NA            15          25 15.000000
6   6            6            NA          NA  6.000000
7   7           NA            17          NA 12.000000
8   8           NA            18          NA 13.000000
9   9           NA            19          NA 14.000000
10 10           10            NA          30 16.666667

Trying to apply the function beyond a single use (based on A. Suliman's method), unsuccessfully

continents <- c("europe", "asia") 
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))

## or:
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))

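In either case, I get various error messages:

Error in inds_combine(.vars, ind_list) : Position must be between 0 and n

At other times:

Error: invalid column index : NA for variable: 'NA' = 'NA'

All I want is a simple function that lets me compute row means based on a specification of which columns to run on, but for some reason this has become a nuisance. While I'd very much like to figure out what's wrong with my code, I'd also be grateful if anyone has a better overall solution for the whole process.

Thanks!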

Tags: r, dplyr, apply, sapply

Solution

Use lapply to loop over continents, then use grep to select the columns for the current continent.

continents <- c("europe", "asia") 
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
#To a dataframe not a list
do.call(cbind, lst)
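As background on the error in the question: apply() and its relatives expect FUN to be a function, whereas the question passed the result of calling continent_mean(...), which is a data frame. Wrapping the call in an anonymous function, as above, is what lets lapply call it once per continent. The sketch below shows the same pattern in base R only; the toy data and the helper prefix_row_mean are hypothetical and not part of the original post.

```r
# Toy data for illustration only (not the data from the question)
toy <- data.frame(europe_a = c(1, NA, 3),
                  europe_b = c(3, 4, NA),
                  asia_a   = c(10, 20, NA))

# Hypothetical helper: row means over the columns whose names start with `prefix`
prefix_row_mean <- function(df, prefix) {
  cols <- startsWith(names(df), prefix)
  rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
}

prefixes <- c("europe", "asia")

# FUN is an anonymous function, so lapply() can call it once per prefix.
# Passing prefix_row_mean(toy, "europe") directly would hand lapply() a
# numeric vector instead of a function and trigger the match.fun() error.
means <- lapply(prefixes, function(p) prefix_row_mean(toy, p))
names(means) <- prefixes
```

Note that a row whose selected columns are all NA yields NaN when na.rm = TRUE, as in the third row of the toy asia group.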

Using map_dfc from purrr, we can get the result in one step

purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))

Update:

#grep will return column positions when they match with "europe" or "asia", e.g
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep 
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris"  "europe_london" "europe_rome" 

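So to include an extra column such as id, we can combine its name with the grep result using c() as usual, and then call the function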

#NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, c('id',grep(x, names(capital_cities_df), value = TRUE))], continent=x))
do.call(cbind, lst)
   id europe_paris europe_london europe_rome   europe id asia_bangkok asia_tokyo asia_kathmandu     asia
1   1            1            NA          NA  1.00000  1           NA         41             51 31.00000
2   2           NA            12          22 12.00000  2           NA         42             52 32.00000
3   3            3            13          23 10.50000  3           33         43             NA 26.33333
4   4           NA            14          NA  9.00000  4           NA         44             54 34.00000
5   5           NA            15          25 15.00000  5           35         45             55 35.00000
6   6            6            NA          NA  6.00000  6           36         46             56 36.00000
7   7            7            17          27 14.50000  7           NA         47             57 37.00000
8   8           NA            18          28 18.00000  8           38         48             NA 31.33333
9   9            9            19          29 16.50000  9           39         49             NA 32.33333
10 10           10            NA          30 16.66667 10           40         NA             60 36.66667

#We have one problem, id column gets duplicated, map_dfc with select will solve this issue
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, c('id',grep(.x, names(capital_cities_df), value = TRUE))], continent=.x)) %>%
#Don't select any column name ends with id followed by one digit
select(-matches('id\\d')) 
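
For the do.call(cbind, lst) result shown earlier, where the repeated column keeps the identical name id, a base-R alternative is to drop columns whose names are duplicated. This is only a sketch on hypothetical toy data; it keeps the first occurrence of each name and assumes the repeated columns hold the same values.

```r
# Toy per-group results, each carrying its own copy of id (illustration only)
left  <- data.frame(id = 1:3, europe = c(2, 4, 3))
right <- data.frame(id = 1:3, asia = c(10, 20, NaN))

# cbind() on data frames keeps duplicate column names as they are
combined <- cbind(left, right)

# Keep only the first column for each name
deduped <- combined[, !duplicated(names(combined)), drop = FALSE]
```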

If you want to use the new function with select, then simply pass capital_cities_df without grep, e.g. using map_dfc

purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df, continent=.x)) %>% 
select(-matches('id\\d'))

Correction: in continent_mean

continent_mean <- function(df, continent)  {
  df %>%
    select(., id, starts_with(continent)) %>%
    #Exclude id from the rowMeans calculation 
    dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
} 
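
As a different route to the same goal, not the approach used in this answer, the mean columns can be appended to the original data frame in a single pass, so id never gets duplicated and never enters the means. The sketch below uses base R only; the toy data and the helper add_prefix_means are hypothetical, and it assumes every group's columns are named with the group name followed by an underscore.

```r
# Toy data for illustration only (not the data from the question)
toy <- data.frame(id = 1:3,
                  europe_a = c(1, NA, 3),
                  europe_b = c(3, 4, NA),
                  asia_a   = c(10, 20, NA))

# Hypothetical helper: append one row-mean column per prefix
add_prefix_means <- function(df, prefixes) {
  for (p in prefixes) {
    # Match "<prefix>_" so that id and the new mean columns are excluded
    cols <- startsWith(names(df), paste0(p, "_"))
    df[[p]] <- rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
  }
  df
}

result <- add_prefix_means(toy, c("europe", "asia"))
```

Unlike the map_dfc output, this keeps the original column order and places the new mean columns at the end rather than next to each group.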
