首页 > 解决方案 > 如何计算同一组中同一列中 2 个非日期值之间的差异?

问题描述

这是这篇文章的后续问题:在 R 中,我如何计算其中一行的列值大于另一行的列值的分组对的数量?

这是我对数据集 df1 的输入:

structure(list(Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski", 
"A.J. Pierzynski", "Aaron Boone", "Adam Kennedy", "Adam Melhuse", 
"Adrian Beltre", "Adrian Beltre", "Adrian Gonzalez", "Alan Zinter", 
"Albert Pujols", "Albert Pujols"), Age = c(37, 36, 37, 36, 36, 
36, 36, 37, 36, 36, 36, 37, 36), Year = c(2018, 2017, 2014, 2013, 
2009, 2012, 2008, 2016, 2015, 2018, 2004, 2017, 2016), Tm = c("SDP", 
"MIA", "TOT", "TEX", "HOU", "LAD", "TOT", "TEX", "TEX", "NYM", 
"ARI", "LAA", "LAA"), Lg = c("NL", "NL", "ML", "AL", "NL", "NL", 
"ML", "AL", "AL", "NL", "NL", "AL", "AL"), G = c(66, 51, 102, 
134, 10, 86, 15, 153, 143, 54, 28, 149, 152), PA = c(183, 163, 
362, 529, 14, 201, 32, 640, 619, 187, 40, 636, 650)), row.names = c(NA, 
13L), class = "data.frame")

这是我之前的问题的代码正确匹配对:

df1 %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  filter(last(G) < first(G))

每个分组对有两个观察值。每个还有一个名为 G 的列和一个名为 Year 的列。

以下是使用上述代码对数据进行分组后的样子:https ://www.dropbox.com/s/hh2qgkbn​​4cy4k4l/Data%20after%20grouping.png?dl=0

现在,我想知道每个匹配对的是“37 岁”值和“36 岁”值之间“G 列”值的差异:(36 岁值)-(37 岁)价值)。阴性结果是可以的。

此外,对于数据集中所有匹配的对,我想要这些差异的总和。

标签: rdplyr

解决方案


如果我理解正确:

df <- structure(list(Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski", 
                        "A.J. Pierzynski", "Aaron Boone", "Adam Kennedy", "Adam Melhuse", 
                        "Adrian Beltre", "Adrian Beltre", "Adrian Gonzalez", "Alan Zinter", 
                        "Albert Pujols", "Albert Pujols"), Age = c(37, 36, 37, 36, 36, 
                                                                   36, 36, 37, 36, 36, 36, 37, 36), Year = c(2018, 2017, 2014, 2013, 
                                                                                                             2009, 2012, 2008, 2016, 2015, 2018, 2004, 2017, 2016), Tm = c("SDP", 
                                                                                                                                                                           "MIA", "TOT", "TEX", "HOU", "LAD", "TOT", "TEX", "TEX", "NYM", 
                                                                                                                                                                           "ARI", "LAA", "LAA"), Lg = c("NL", "NL", "ML", "AL", "NL", "NL", 
                                                                                                                                                                                                        "ML", "AL", "AL", "NL", "NL", "AL", "AL"), G = c(66, 51, 102, 
                                                                                                                                                                                                                                                         134, 10, 86, 15, 153, 143, 54, 28, 149, 152), PA = c(183, 163, 
                                                                                                                                                                                                                                                                                                              362, 529, 14, 201, 32, 640, 619, 187, 40, 636, 650)), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                  13L), class = "data.frame")
df1 <- df %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  filter(last(G) < first(G)) %>% 
  mutate(g_diff = G[1] - G[2]) %>% 
  ungroup() %>% 
  mutate(sum_g_diff = sum(unique(g_diff)))

> df1
# A tibble: 4 x 9
  Name              Age  Year Tm    Lg        G    PA g_diff sum_g_diff
  <chr>           <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>      <dbl>
1 A.J. Pierzynski    36  2013 TEX   AL      134   529     32         35
2 A.J. Pierzynski    37  2014 TOT   ML      102   362     32         35
3 Albert Pujols      36  2016 LAA   AL      152   650      3         35
4 Albert Pujols      37  2017 LAA   AL      149   636      3         35

或者,如果 的累积总和(运行总和)g_diff是所需的输出(不汇总数据):

df1 %>%
  group_by(Name) %>%
  mutate(cols = c(g_diff[1], rep(0, n() -1))) %>%
  ungroup() %>%
  mutate(cum_sum = cumsum(cols)) %>%
  select(-cols)

# A tibble: 4 x 9
  Name              Age  Year Tm    Lg        G    PA g_diff cum_sum
  <chr>           <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>   <dbl>
1 A.J. Pierzynski    36  2013 TEX   AL      134   529     32      32
2 A.J. Pierzynski    37  2014 TOT   ML      102   362     32      32
3 Albert Pujols      36  2016 LAA   AL      152   650      3      35
4 Albert Pujols      37  2017 LAA   AL      149   636      3      35

(此解决方案基于问题)


推荐阅读