首页 > 解决方案 > R summarise_at 动态地按条件:某些列的平均值,其他列的总和

问题描述

我愿意,但条件是summarise_at()

编辑#1:我在标题中动态添加了这个词:当我在其中使用vars(c())它时,summarise_at()它是为了快速清晰的示例,但实际上它是为了使用contains(),starts_with()matches(,, perl=TRUE),因为我有 50 列,有很多sum()和一些mean().

目标是使用tbl()..%>% group_by() ... %>% summarise_at()...%>% collect().

编辑#2:我添加了第二个示例中生成的 SQL 示例

library(tidyverse)
(mtcars 
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
  # I don't want this line below, I would like a conditional in summarise_at() because I have 50 columns in my real case
  %>% select(carb,cyl_mean,disp_mean,mpg_sum)
)
#> # A tibble: 6 x 4
#>    carb cyl_mean disp_mean mpg_sum
#>   <dbl>    <dbl>     <dbl>   <dbl>
#> 1     1     4.57      134.   177. 
#> 2     2     5.6       208.   224  
#> 3     3     8         276.    48.9
#> 4     4     7.2       309.   158. 
#> 5     6     6         145     19.7
#> 6     8     8         301     15

Created on 2020-02-19 by the reprex package (v0.3.0)

这可行,但我只想要 mpg 的总和,并且只意味着 cyl 和 disp:

library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)

db <- dbConnect(SQLite(),":memory:")

dbCreateTable(db, "mtcars_table", mtcars)

(tbl( db, build_sql( con=db,"select * from mtcars_table" ))
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
  %>% select(carb,cyl_mean,disp_mean,mpg_sum)                   
  %>% show_query()
)
#> <SQL>
#> Warning: Missing values are always removed in SQL.[...]  to silence this warning
#> SELECT `carb`, `cyl_mean`, `disp_mean`, `mpg_sum`
#> FROM (SELECT `carb`, AVG(`mpg`) AS `mpg_mean`, AVG(`cyl`) AS `cyl_mean`, AVG(`disp`) AS `disp_mean`, SUM(`mpg`) AS `mpg_sum`, SUM(`cyl`) AS `cyl_sum`, SUM(`disp`) AS `disp_sum`
#> FROM (select * from mtcars_table)
#> GROUP BY `carb`)
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.30.1 [:memory:]
#> # … with 4 variables: carb <dbl>, cyl_mean <lgl>, disp_mean <lgl>,
#> #   mpg_sum <lgl>

我尝试了所有这样的可能性,但它不起作用或产生错误。

(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse(vars(contains(names(.),"mpg")),list(sum(.)),list(mean(.)))) )

不好,列太多

library(tidyverse)
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse ((names(.)=="mpg"), list(~sum(.)) , list(~mean(.)))))
#> # A tibble: 6 x 34
#>    carb mpg_sum cyl_sum disp_sum mpg_mean..2 cyl_mean..2 disp_mean..2
#>   <dbl>   <dbl>   <dbl>    <dbl>       <dbl>       <dbl>        <dbl>
#> 1     1   177.       32     940.        25.3        4.57         134.
#> 2     2   224        56    2082.        22.4        5.6          208.
#> 3     3    48.9      24     827.        16.3        8            276.
#> 4     4   158.       72    3088.        15.8        7.2          309.
#> 5     6    19.7       6     145         19.7        6            145 
#> 6     8    15         8     301         15          8            301 
#> # … with 27 more variables: mpg_mean..3 <dbl>, cyl_mean..3 <dbl>,
#> #   disp_mean..3 <dbl>, mpg_mean..4 <dbl>, cyl_mean..4 <dbl>,
#> #   disp_mean..4 <dbl>, mpg_mean..5 <dbl>, cyl_mean..5 <dbl>,
#> #   disp_mean..5 <dbl>, mpg_mean..6 <dbl>, cyl_mean..6 <dbl>,
#> #   disp_mean..6 <dbl>, mpg_mean..7 <dbl>, cyl_mean..7 <dbl>,
#> #   disp_mean..7 <dbl>, mpg_mean..8 <dbl>, cyl_mean..8 <dbl>,
#> #   disp_mean..8 <dbl>, mpg_mean..9 <dbl>, cyl_mean..9 <dbl>,
#> #   disp_mean..9 <dbl>, mpg_mean..10 <dbl>, cyl_mean..10 <dbl>,
#> #   disp_mean..10 <dbl>, mpg_mean..11 <dbl>, cyl_mean..11 <dbl>,
#> #   disp_mean..11 <dbl>

其他一些尝试和评论:我希望有条件sum(.)mean(.)取决于summarise().

如果它不仅接受原始函数,那就太好了。

最后,使用andtbl()..%>% group_by() ... %>% summarise_at()...%>% collect()生成条件 SQL 。AVG()SUM()

T-SQL 函数类似~(convert(varchar())适用于mutate_at()和类似~AVG()适用于,summarise_at()但我到达同一点:summarise_at()根据列的名称,条件不起作用。

:)

标签: sqlrsql-serverdplyrdbplyr

解决方案


一个选项是group_by“carb”,然后将sum“mpg”创建为另一个分组变量,然后summarise_at与所需的其余变量一起使用

library(dplyr)
mtcars %>%
    group_by(carb) %>%
    group_by(mpg_sum = sum(mpg), .add = TRUE) %>%
    summarise_at(vars(cyl, disp), list(mean = mean))
# A tibble: 6 x 4
# Groups:   carb [6]
#   carb mpg_sum cyl_mean disp_mean
#  <dbl>   <dbl>    <dbl>     <dbl>
#1     1   177.      4.57      134.
#2     2   224       5.6       208.
#3     3    48.9     8         276.
#4     4   158.      7.2       309.
#5     6    19.7     6         145 
#6     8    15       8         301 

或者使用 的devel版本dplyr,这可以summarise通过将列块across和单列自己包装并在其上应用不同的功能来完成

mtcars %>%
  group_by(carb) %>% 
  summarise(across(one_of(c("cyl", "disp")), list(mean = mean)), 
            mpg_sum = sum(mpg))
# A tibble: 6 x 4
#   carb cyl_mean disp_mean mpg_sum
#  <dbl>    <dbl>     <dbl>   <dbl>
#1     1     4.57      134.   177. 
#2     2     5.6       208.   224  
#3     3     8         276.    48.9
#4     4     7.2       309.   158. 
#5     6     6         145     19.7
#6     8     8         301     15  

注意:在即将发布的版本中, etc. 将被具有默认功能 ( )summarise_at/summarise_if/mutate_at/mutate_if/...的动词取代acrosssummarise/mutate/filter/...


推荐阅读