R: creating a new variable by group with a conditional applied only to the first matching value

Problem description

I have the following data frame, with several task scores per user:

sampDF<-structure(list(User = structure(c(1L, 1L, 2L, 2L, 2L), .Label = 
c("A1", 
"A2"), class = "factor"), Task.Name = structure(c(1L, 2L, 1L, 
3L, 4L), .Label = c("T1", "T2", "T3", "T4"), class = "factor"), 
Max.Score = c(0.93, 0.95, 0.78, 0.87, 0.96)), class = "data.frame", row.names = 
c(NA, 
-5L))

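I want to compute a new variable (Score.Plus5) that adds a constant (0.05) to Max.Score only when the value is < 0.90, and only for the first value < 0.90 per User. Otherwise I want the original Max.Score value.

I tried the following with dplyr: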

sampDF2 <- sampDF %>% 
group_by(User) %>%
arrange(User, Max.Score) %>%
mutate(Score.Plus5 = ifelse(first(Max.Score <0.90), Max.Score + 0.05, 
Max.Score))

This gives each ID a single repeated value: either the Max.Score plus the constant or the original Max.Score.

The result I want is:

sampDF3<-structure(list(User = structure(c(1L, 1L, 2L, 2L, 2L), .Label = 
c("A1", 
"A2"), class = "factor"), Task.Name = structure(c(1L, 2L, 1L, 
3L, 4L), .Label = c("T1", "T2", "T3", "T4"), class = "factor"), 
Max.Score = c(0.93, 0.95, 0.78, 0.87, 0.96), Score.Plus5 = c(0.93, 
0.95, 0.83, 0.87, 0.96)), class = "data.frame", row.names = c(NA, 
-5L))

What is the most efficient way to achieve this result using dplyr or data.table?

Tags: r, dplyr, data.table

Solution


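In your case I don't think dplyr::first is the right choice. first(Max.Score < 0.90) returns a single logical value per group, and ifelse returns a result with the same length as its test, so every row of the group ends up with the same value. You are better off using row_number() == 1. The solution would then be: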

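library(dplyr)

sampDF %>% 
  group_by(User) %>%
  # sort so that each user's lowest score comes first
  arrange(User, Max.Score) %>%
  # add 0.05 only on the first row of each group, and only if it is below 0.90
  mutate(Score.Plus5 = ifelse(row_number() == 1 & Max.Score < 0.90,
                              Max.Score + 0.05,
                              Max.Score))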

# # A tibble: 5 x 4
# # Groups: User [2]
#   User   Task.Name Max.Score Score.Plus5
#   <fctr> <fctr>        <dbl>       <dbl>
# 1 A1     T1            0.930       0.930
# 2 A1     T2            0.950       0.950
# 3 A2     T1            0.780       0.830
# 4 A2     T3            0.870       0.870
# 5 A2     T4            0.960       0.960
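
Since the question also asks about data.table, here is a minimal sketch of one possible approach there. It is not from the original answer: the name sampDT is made up for illustration, the data is rebuilt inline so the block runs on its own, and the cumsum trick is just one way to flag the first value below 0.90. Unlike the dplyr version above, it does not sort, so "first" here means first in the existing row order.

```r
library(data.table)

# Same values as sampDF in the question (sampDT is a hypothetical name).
sampDT <- data.table(
  User      = factor(c("A1", "A1", "A2", "A2", "A2")),
  Task.Name = factor(c("T1", "T2", "T1", "T3", "T4")),
  Max.Score = c(0.93, 0.95, 0.78, 0.87, 0.96)
)

# Within each User, flag only the first row whose Max.Score is below 0.90
# and add 0.05 to that row; all other rows keep their original value.
sampDT[, Score.Plus5 := {
  below       <- Max.Score < 0.90
  first_below <- below & cumsum(below) == 1
  Max.Score + 0.05 * first_below
}, by = User]

print(sampDT)
```

If the smallest score per user should be treated as the "first" one, as in the dplyr answer, the table could be sorted beforehand, for example with setorder(sampDT, User, Max.Score).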
