Summing across rows for groups with dplyr using select, group_by, and mutate

Problem description

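Question: I am building a total market share variable for a car market in which 286 distinct models were sold, 501 cars in total. This group share is based only on car characteristics: cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81, plus the share s, a small double. There are 15 groups in the market.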

The closest answer I found: mishabalyasin on community.rstudio, "Calculating rowwise totals and proportions using tidyeval?" (linked post on community.rstudio)

Applying the select-split-combine principle, the closest I have come to the right answer is a table of the 15 groups (15 x 3: cat, yr, s):

df<- blp %>% 
  select(cat,yr,s) %>%
  group_by(cat,yr) %>% 
  summarise(group_share = sum(s))

#in my actual data, this is what fills in the group share to get what I want, but this isn't the desired pipeline-based answer
blp$group_share=0 #initializing the group_share, the 50th col
for(i in 1:501){
  for(j in 1:15){
    if((blp[i,31]==df[j,1])&&(blp[i,3]==df[j,2])){ #if(sameCat & sameYr){blpGS=dfGS}
      blp[i,50]=df[j,3]
      }
  }
}
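
The nested loop above matches every row to its (cat, yr) group and copies that group's total across. A join does the same matching in one step. The following is only a sketch on made-up toy data (the `toy` data frame and its values are hypothetical, and it assumes the columns are named cat, yr, and s as in the question); it is not the asker's real 50-column table.

```r
library(dplyr)

# Toy stand-in for blp (hypothetical values; only cat, yr, s are used)
toy <- data.frame(
  cat = c("compact", "compact", "large", "large", "large"),
  yr  = c(77, 77, 77, 81, 81),
  s   = c(0.001, 0.002, 0.0005, 0.0001, 0.0002),
  stringsAsFactors = FALSE
)

# Step 1: one row per (cat, yr) group, like the summarise() call above
group_totals <- toy %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s), .groups = "drop")

# Step 2: copy each group's total back onto its member rows
toy_filled <- toy %>%
  left_join(group_totals, by = c("cat", "yr"))

print(toy_filled)
```

Joining by column name rather than by position (31, 3, 50) also means the code keeps working if columns are reordered.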

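This works, but I know it can be done in one go... hopefully the idea is clear from what I described above. A simple fix might be a single loop with conditions on cat and yr, which would help, but I really want to get better at handling data with dplyr, so any insight toward a pipeline-based answer along these lines would be wonderful.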

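Example for the site: the example below does not work with the code I provided, but it shows what my data "looks like". One problem is that the share ends up as a factor.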

#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))

names(blp)<-c("cat","yr","s")
head(blp)

#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))

#works thanks to akrun: applying the code I provided for what leads to the 15 groups 
df <- blp %>% 
    select(cat,yr,s) %>%
    group_by(cat,yr) %>% 
    summarise(group_share = sum(as.numeric(as.character(s)))) 
#manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share=0
for(i in 1:45){
        if( ((blp[i,1])==(df[j,1])) && (as.numeric(blp[i,2])==as.numeric(df[j,2]))){ #if(sameCat & sameYr){blpGS=dfGS}
          blp[i,4]=df[j,3];
    }
  }
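
The factor problem mentioned above most likely comes from how the sample data frame is built: cbind() produces a matrix, a matrix can hold only one type, so the numeric columns are turned into text (and, in R versions before 4.0, into factors once as.data.frame() is applied). A small sketch with hypothetical `*_demo` vectors shows the difference between the two construction routes:

```r
cat_demo <- c("compact", "midsize", "large")
yr_demo  <- c(77, 78, 79)
s_demo   <- c(0.001, 0.0005, 0.002)

# Route 1: cbind() first. Everything is coerced to a single (character) type.
via_cbind <- as.data.frame(cbind(cat_demo, yr_demo, s_demo))
str(via_cbind)  # s_demo is no longer numeric

# Route 2: data.frame() directly. Each column keeps its own type.
direct <- data.frame(cat = cat_demo, yr = yr_demo, s = s_demo,
                     stringsAsFactors = FALSE)
str(direct)     # s is numeric, so sum(direct$s) works without conversion
```

With the second route, the `as.numeric(as.character(s))` workaround is no longer needed.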

Tags: r, dplyr

Solution


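If I understand your question correctly, this should help! The only difference here is that you use mutate, which keeps the original columns and adds the aggregated column to them, instead of summarise, which returns only the grouping columns and the summary column.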

# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")

yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)

s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

# Calculation
library(dplyr) # provides %>%, group_by(), mutate(), ungroup()
blp <- 
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # To create dataframe
  group_by(cat, yr) %>% # Grouping by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>% # Calculating sum share per category/year 
  ungroup()

[Image: expected output]
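
As a cross-check on the grouped mutate, base R's ave() computes the same per-row group total without dplyr. This is a different technique from the answer's, shown only for comparison, and the `toy` data below is hypothetical:

```r
# Toy data (hypothetical values) with the same column names as the question
toy <- data.frame(
  cat = c("compact", "compact", "large", "large", "large"),
  yr  = c(77, 77, 77, 81, 81),
  s   = c(0.001, 0.002, 0.0005, 0.0001, 0.0002),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each (cat, yr) combination and returns a vector
# of the same length as s, so every row receives its group's total
toy$group_share <- ave(toy$s, toy$cat, toy$yr, FUN = sum)

print(toy)
```

Like mutate after group_by, ave() leaves the number of rows unchanged, which is what distinguishes both from summarise.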

