mutate_if only applied to one column (can't see my error)

Question

As the title suggests, I can't see where I've gone wrong using mutate_if.

Reproducible example

# Create a data frame
 df <- structure(list(dates = structure(c(17897, 17898, 17899, 17900, 17901, 17902, 17903, 17904, 17905, 17906),
                                   class = "Date"),
                 item_1 = c(NA, 1, 2, 3, 4, 5, 6, 7, 0, 8),
                 item_2 = c(NA, NA, NA, NA, 1, 2, 3, 0, 0, 9),
                 item_3 = c(NA, NA, NA, 8, 9, 10, 11, 0, 0, 2),
                 item_4 = c(NA, NA, 1, 2, 3, 4, 5, 6, 0, 0)),
            class = "data.frame", row.names = c(NA, -10L)) 

> df
        dates item_1 item_2 item_3 item_4
1  2019-01-01     NA     NA     NA     NA
2  2019-01-02      1     NA     NA     NA
3  2019-01-03      2     NA     NA      1
4  2019-01-04      3     NA      8      2
5  2019-01-05      4      1      9      3
6  2019-01-06      5      2     10      4
7  2019-01-07      6      3     11      5
8  2019-01-08      7      0      0      6
9  2019-01-09      0      0      0      0
10 2019-01-10      8      9      2      0


# Load dplyr, which provides if_else(), mutate_if() and %>%
library(dplyr)

# Create a function to be used with mutate_if
my_fx <- function(x) {
    if_else(!is.na(x), cumprod( c(100, 1 + x[-1] / 100) ), NA_real_)
}


# Create a new data frame using mutate_if on original data frame
new_df <- df %>%
mutate_if(.predicate = is.numeric,
          .funs      = funs(index_val = my_fx)
          ) 

> new_df
        dates item_1 item_2 item_3 item_4 item_1_index_val item_2_index_val item_3_index_val item_4_index_val
1  2019-01-01     NA     NA     NA     NA               NA               NA               NA               NA
2  2019-01-02      1     NA     NA     NA         101.0000               NA               NA               NA
3  2019-01-03      2     NA     NA      1         103.0200               NA               NA               NA
4  2019-01-04      3     NA      8      2         106.1106               NA               NA               NA
5  2019-01-05      4      1      9      3         110.3550               NA               NA               NA
6  2019-01-06      5      2     10      4         115.8728               NA               NA               NA
7  2019-01-07      6      3     11      5         122.8251               NA               NA               NA
8  2019-01-08      7      0      0      6         131.4229               NA               NA               NA
9  2019-01-09      0      0      0      0         131.4229               NA               NA               NA
10 2019-01-10      8      9      2      0         141.9367               NA               NA               NA

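My expected output was that the function would also be applied to the other columns (for example, "item_2", creating a new "item_2_index_val"), but those columns all come out as NA.

I can't see what I'm missing here, but I hope it's something simple. Thanks for your help!

Tags: r, dplyr

Answer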


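The problem is how cumprod handles NA values: a single NA makes that element and every later element of the result NA. To make sure cumprod is applied only to the non-NA elements, build a logical index ('i1') of the non-NA positions, extract those elements with x[i1], drop the first one, concatenate with 100, apply cumprod, and then use replace to write the result ('val') into an all-NA vector ('new') at the positions given by 'i1'.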

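my_fx <- function(x) {
  new <- rep(NA_real_, length(x))               # result starts as all NA
  i1  <- !is.na(x)                              # TRUE where x is not NA
  val <- cumprod(c(100, 1 + x[i1][-1] / 100))   # index built from the non-NA values only
  replace(new, i1, val)                         # write the values back at the non-NA positions
}

df %>%
  mutate_if(is.numeric, list(index_val = ~ my_fx(.)))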
#        dates item_1 item_2 item_3 item_4 item_1_index_val item_2_index_val item_3_index_val item_4_index_val
#1  2019-01-01     NA     NA     NA     NA               NA               NA               NA               NA
#2  2019-01-02      1     NA     NA     NA         100.0000               NA               NA               NA
#3  2019-01-03      2     NA     NA      1         102.0000               NA               NA         100.0000
#4  2019-01-04      3     NA      8      2         105.0600               NA         100.0000         102.0000
#5  2019-01-05      4      1      9      3         109.2624         100.0000         109.0000         105.0600
#6  2019-01-06      5      2     10      4         114.7255         102.0000         119.9000         109.2624
#7  2019-01-07      6      3     11      5         121.6091         105.0600         133.0890         114.7255
#8  2019-01-08      7      0      0      6         130.1217         105.0600         133.0890         121.6091
#9  2019-01-09      0      0      0      0         130.1217         105.0600         133.0890         121.6091
#10 2019-01-10      8      9      2      0         140.5314         114.5154         135.7508         121.6091

Also, since the NAs are all at the top of each column, this can be done more simply:

f1 <- function(x) cumprod( c(100, 1 + x[-1] / 100))
df %>%
    mutate_if(is.numeric, list(index_val = ~ 
                c(rep(NA_real_, sum(is.na(.))), f1(na.omit(.)))))
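
The shortcut above relies on every NA sitting at the top of its column. As a minimal sketch, that assumption could be checked before using it; the helper name nas_only_at_top is hypothetical and not part of the original answer.

```r
# Hypothetical helper (an illustration, not from the original answer):
# TRUE when every NA in x comes before every non-NA value.
nas_only_at_top <- function(x) {
  # is.na(x) should go from TRUE to FALSE at most once and never back,
  # so its successive differences must never be positive
  all(diff(is.na(x)) <= 0)
}

nas_only_at_top(c(NA, NA, 1, 2))   # NAs lead, so the shortcut applies
nas_only_at_top(c(NA, 1, NA, 2))   # an NA in the middle, so it does not
```

When the check fails for a column, the index-based my_fx with replace is the safer choice, because it puts each value back at its original position.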

Another option is data.table:

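library(data.table)
nm1 <- names(df)[-1]                  # the item columns (everything except dates)
nm2 <- paste0(nm1, "_indexval")       # names of the new index columns
setDT(df)[, (nm2) := NA_real_]        # convert by reference and pre-fill the new columns with NA
f1 <- function(x) cumprod( c(100, 1 + x[-1] / 100))
for(j in seq_along(nm1)) {
  i1 <- which(!is.na(df[[nm1[j]]]))   # rows where the source column is not NA
  set(df, i = i1, j = nm2[j], value = f1(df[[nm1[j]]][i1]))
}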

df
#         dates item_1 item_2 item_3 item_4 item_1_indexval item_2_indexval item_3_indexval item_4_indexval
# 1: 2019-01-01     NA     NA     NA     NA              NA              NA              NA              NA
# 2: 2019-01-02      1     NA     NA     NA        100.0000              NA              NA              NA
# 3: 2019-01-03      2     NA     NA      1        102.0000              NA              NA        100.0000
# 4: 2019-01-04      3     NA      8      2        105.0600              NA        100.0000        102.0000
# 5: 2019-01-05      4      1      9      3        109.2624        100.0000        109.0000        105.0600
# 6: 2019-01-06      5      2     10      4        114.7255        102.0000        119.9000        109.2624
# 7: 2019-01-07      6      3     11      5        121.6091        105.0600        133.0890        114.7255
# 8: 2019-01-08      7      0      0      6        130.1217        105.0600        133.0890        121.6091
# 9: 2019-01-09      0      0      0      0        130.1217        105.0600        133.0890        121.6091
#10: 2019-01-10      8      9      2      0        140.5314        114.5154        135.7508        121.6091

The key point is that x[-1] removes only the first element, which may be NA, but other elements are NA as well.
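
To make this concrete, here is a small base R sketch using the item_2 values from the question. The variable names original, kept and fixed are chosen only for this illustration.

```r
# item_2 from the question's data frame
x <- c(NA, NA, NA, NA, 1, 2, 3, 0, 0, 9)

# Question's approach: x[-1] drops only the first NA, so cumprod() meets
# an NA at position 2 and every element from there on becomes NA
original <- cumprod(c(100, 1 + x[-1] / 100))
original   # 100 followed by nine NAs

# Keeping only the non-NA values first gives a usable index series
kept  <- x[!is.na(x)]
fixed <- cumprod(c(100, 1 + kept[-1] / 100))
fixed      # 100, 102, 105.06, 105.06, 105.06, 114.5154 (rounded)
```

In the question's function the one surviving value (100, at position 1) is then masked by if_else because x[1] is itself NA, which is why the whole item_2_index_val column came out as NA.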

