How do I group by variables to compute new fields in R?

Question

I have a sample data frame that looks like this:

year state district individual_vote total_vote candidate
2010 AZ    1        200             600        a
2010 AZ    1        400             600        b
2010 AZ    2        100             300        c
2010 AZ    2        200             300        d
2010 MA    1        100             200        e
2010 MA    2        100             200        f
2005 AZ    1        100             150        g
2005 AZ    1        150             200        h

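I want to compute:

1. Who the winner is.

2. The winner's margin of victory (the winner's votes minus the runner-up's).

How can I group the rows by year, state, district and compute these two fields for each candidate? Thanks!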

Tags: r

Answer

This code should point you toward what you need; some of the definitions are not clear to me:

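library(dplyr)
#Code
# Within each (year, state, district) group, compute every candidate's
# vote share and label the group with the candidate holding the largest share
new <- df %>%
  group_by(year, state, district) %>%
  mutate(Ratio = individual_vote / total_vote,
         Winner = candidate[which.max(Ratio)])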

Output:

# A tibble: 8 x 8
# Groups:   year, state, district [5]
   year state district individual_vote total_vote candidate Ratio Winner
  <int> <chr>    <int>           <int>      <int> <chr>     <dbl> <chr> 
1  2010 AZ           1             200        600 a         0.333 b     
2  2010 AZ           1             400        600 b         0.667 b     
3  2010 AZ           2             100        300 c         0.333 d     
4  2010 AZ           2             200        300 d         0.667 d     
5  2010 MA           1             100        200 e         0.5   e     
6  2010 MA           2             100        200 f         0.5   f     
7  2005 AZ           1             100        150 g         0.667 h     
8  2005 AZ           1             150        200 h         0.75  h     
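
One detail worth knowing about the Winner column: which.max() returns the position of the first maximum, so a tie inside a group is broken silently in favor of whichever candidate is listed first. The sketch below uses made-up vote vectors (the names x and y are hypothetical and not part of the sample data) to show that behavior, along with one possible way to flag a tie.

```r
# Clear winner: which.max() picks the position of the largest value
votes <- c(a = 200L, b = 400L)
winner <- names(votes)[which.max(votes)]

# Hypothetical tie: which.max() returns the first of the tied positions
tied <- c(x = 100L, y = 100L)
tied_winner <- names(tied)[which.max(tied)]

# One way to detect a tie instead of letting it be broken silently
is_tied <- sum(tied == max(tied)) > 1
```

Here winner is "b", tied_winner is "x", and is_tied is TRUE. Whether a tie should yield a winner at all depends on definitions the question does not give.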

The data used:

#Data
df <- structure(list(year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2005L, 2005L), state = c("AZ", "AZ", "AZ", "AZ", "MA", "MA", 
"AZ", "AZ"), district = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L), individual_vote = c(200L, 
400L, 100L, 200L, 100L, 100L, 100L, 150L), total_vote = c(600L, 
600L, 300L, 300L, 200L, 200L, 150L, 200L), candidate = c("a", 
"b", "c", "d", "e", "f", "g", "h")), class = "data.frame", row.names = c(NA, 
-8L))

Update:

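#Code 2
library(tidyr)  # fill() comes from tidyr, not dplyr
newdf <- df %>%
  # Sort so the top vote-getter is the first row of each group
  arrange(year, state, district, desc(individual_vote)) %>%
  group_by(year, state, district) %>%
  mutate(Winner = candidate[which.max(individual_vote)],
         # Gap between each row and the row above it
         Diff = c(NA, abs(diff(individual_vote))),
         # Row 2 holds the gap between the winner and the runner-up
         Margin = ifelse(row_number() == 2, Diff, NA)) %>%
  # Copy that gap to every row of the group
  fill(Margin, .direction = "downup") %>%
  # Single-candidate groups have no runner-up: use the candidate's own votes
  mutate(Margin = ifelse(is.na(Margin), individual_vote, Margin)) %>%
  select(-Diff)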

Output:

# A tibble: 8 x 8
# Groups:   year, state, district [5]
   year state district individual_vote total_vote candidate Winner Margin
  <int> <chr>    <int>           <int>      <int> <chr>     <chr>   <int>
1  2005 AZ           1             150        200 h         h          50
2  2005 AZ           1             100        150 g         h          50
3  2010 AZ           1             400        600 b         b         200
4  2010 AZ           1             200        600 a         b         200
5  2010 AZ           2             200        300 d         d         100
6  2010 AZ           2             100        300 c         d         100
7  2010 MA           1             100        200 e         e         100
8  2010 MA           2             100        200 f         f         100
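
If one row per district is enough, rather than the winner and margin repeated on every candidate's row, a shorter variant with summarise() may be worth considering. This is only a sketch under two assumptions that the question does not confirm: the margin is the winner's votes minus the runner-up's, and a district with a single candidate gets NA instead of that candidate's own vote count. The column names winner and margin are my own.

```r
library(dplyr)

# Same sample data as in the answer, rebuilt so the block runs on its own
df <- data.frame(
  year            = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2005L, 2005L),
  state           = c("AZ", "AZ", "AZ", "AZ", "MA", "MA", "AZ", "AZ"),
  district        = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L),
  individual_vote = c(200L, 400L, 100L, 200L, 100L, 100L, 100L, 150L),
  total_vote      = c(600L, 600L, 300L, 300L, 200L, 200L, 150L, 200L),
  candidate       = c("a", "b", "c", "d", "e", "f", "g", "h"),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(year, state, district) %>%
  summarise(
    # Candidate with the most votes (ties go to the first one listed)
    winner = candidate[which.max(individual_vote)],
    # Top two vote counts; NA when there is no runner-up (assumption)
    margin = {
      v <- sort(individual_vote, decreasing = TRUE)
      if (length(v) > 1) v[1] - v[2] else NA_integer_
    },
    .groups = "drop"
  )
```

On the sample data this gives h by 50, b by 200, d by 100, and NA margins for the two single-candidate MA districts. Joining result back onto df by year, state and district would restore the one-row-per-candidate layout.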
