首页 > 解决方案 > 如何使用顺序过滤器的结果来识别更大的组

问题描述

这个问题是这个问题的延续

这是我的原始示例代码。我试图首先识别我的较大数据集中的所有组,其中恰好有一种情况 x = "Yes" 并且 y 是所有 x = "Yes" 中的最小值(可能有多个 x = "Yes" 在一个给定组)。

理想情况下,我想找到一种更好的方法来管理其他情况,以及出现需要区别对待的多种情况。

structure(list(type = c(7345L, 7345L, 7345L, 7345L, 7345L, 
7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 
7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 7345L, 7345L
), x= structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("No", "Yes"), class = "factor"), y = c(1.66703903751618, 
0, 0.899002060282742, 1.77844476717205, 0.858205995526113, 1.77844476717205, 
0.894654725714929, 2.28497216539696, 0, 0.899002060282742, 2.28497216539696, 
2.85895315127563, 2.85895315127563, 0, 2.85895315127563, 0.858205995526113, 
0.894654725714929, 1.66703903751618, 1.66703903751618, 0, 0, 
1.66703903751618, 0.894654725714929), z = c(6.67, 
0, 3.33, 6.67, 3.33, 6.67, 2, 6.67, 3.33, 3.33, 2, 3.33, 3.33, 
2, 3.33, 6.67, 6.67, 6.67, 2, 6.67, 3.33, 6.67, 2)), row.names = c(NA, 
-23L), class = c("tbl_df", "tbl", "data.frame"))

# And the code that I attempted:

test <- test %>%
          group_by(type) %>%
          arrange(type) %>%
          filter(sum(y == min(y)  & x == "Yes") == 1) %>%
          ungroup()

    test <- test %>%
    group_by(type) %>%
      mutate(x = case_when(y == min(y)  & x == "Yes" ~ "Yes",
                           TRUE ~ "No"))

基本上我试图给一个 x 分配一个“是”。如果有多个,则平局被 y 打破。如果还有平局,则平局被 z 打破。等等,我希望。我最初问题的解决方案帮助我从组中识别出正确的行,但作为副作用,该行与组分离。这使我无法进行下一步:将 x = "No" 分配给顺序条件/过滤器不适用的所有行。

这是较长代码的一部分,我试图在其中确定每个组的 x =“是”。前面的步骤(未在此处粘贴)将所有类型组收集到test(sub ) 数据框。

最初,我尝试制作一长串 ifelse() 语句,这些语句在每次分叉后变得更加具体,从而产生多个端节点,然后我可以从中分配正确的 x =“是”。IE

还尝试使用 case_when() 和 if() 来解决这个问题,但它很快变得难以管理并且不起作用。

test <- test %>%
          group_by(type) %>%
          arrange(type) %>%
          filter(x == "Yes") %>%
             filter(y == min(y)) %>%
               ungroup()

    test <- test %>%
    group_by(type) %>%
      mutate(x = case_when(y == min(y)  & x == "Yes" ~ "Yes",
                           TRUE ~ "No"))

更新 这里是另外4组

structure(list(type= c(104554L, 104554L, 104554L, 104554L, 
104554L, 104554L, 104421L, 104421L, 104421L, 104421L, 104421L, 
104421L, 104421L, 104421L, 104421L, 104421L, 104421L, 104421L, 
104421L, 104421L, 104421L, 104421L, 104421L, 104421L, 104421L, 
104421L, 104421L, 104421L, 104421L, 104421L, 104421L, 104421L, 
104421L, 104421L, 104421L, 104421L, 104421L, 102112L, 102112L, 
102112L, 60782L, 60782L, 60782L, 60782L, 60782L, 60782L, 60782L, 
60782L, 60782L, 60782L, 60782L, 60782L, 60782L, 60782L, 60782L, 
60782L), x = structure(c(1L, 1L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 
2L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), y = c(0, 
0, 0, 0, 0, 0, 1.2946042141164, 2.67234885411271, 1.40369726013573, 
1.98085120336689, 2.74669393960781, 0.886078834754662, 1.40369726013573, 
0, 0.678362545627373, 0, 1.66399485899006, 1.66399485899006, 
0, 1.98085120336689, 1.66399485899006, 1.2946042141164, 1.66399485899006, 
0, 0.875240408099169, 0, 1.2946042141164, 1.2946042141164, 1.2946042141164, 
1.96099458530063, 1.28997868511804, 0.886078834754662, 1.66399485899006, 
2.32273877338854, 1.66399485899006, 1.2946042141164, 0, 0, 1.28997868511804, 
0, 0.720735853734961, 1.86840611019821, 1.48552201407479, 0, 
0.970616854290719, 0.720735853734961, 0.720735853734961, 0.721774434375742, 
1.86840611019821, 0, 0.539800019998205, 0, 1.86840611019821, 
0, 1.48552201407479, 0), z = c(7.1428, 10.7142, 
25, 17.8571, 14.2857, 25, 1.6666, 3.9583, 1.25, 3.9583, 3.9583, 
3.9583, 3.9583, 3.125, 5, 0, 1.25, 3.9583, 3.125, 5, 0, 2.7083, 
3.9583, 1.25, 1.25, 5.2083, 3.9583, 5, 2.7083, 3.9583, 2.7083, 
3.9583, 5.2083, 5, 2.7083, 5, 1.25, 33.3333, 10, 25, 4.31, 4.31, 
4.31, 10.15, 4.31, 4.31, 10.15, 4.31, 4.31, 7.61, 4.31, 4.31, 
4.31, 10.15, 4.31, 10.15), a = structure(c(1L, 1L, 
2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 
2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L, 2L, 1L, 2L), .Label = c("No", "Yes"), class = "factor")), row.names = c(NA, 
-56L), class = c("tbl_df", "tbl", "data.frame"))

这是期望的结果:

# A tibble: 56 x 5
        type   x                        y                z   a
        <int> <fct>                     <dbl>          <dbl> <fct>      
 1     104554 No                        0               7.14 No         
 2     104554 No                        0              10.7  No         
 3     104554 Yes                       0              25    Yes        
 4     104554 No                        0              17.9  Yes        
 5     104554 No                        0              14.3  No         
 6     104554 No                        0              25    Yes        
 7     104421 No                        1.29            1.67 No         
 8     104421 No                        2.67            3.96 No         
 9     104421 No                        1.40            1.25 No         
10     104421 No                        1.98            3.96 No         
11     104421 No                        2.75            3.96 No         
12     104421 No                        0.886           3.96 No         
13     104421 No                        1.40            3.96 No         
14     104421 No                        0               3.12 Yes        
15     104421 No                        0.678           5    No         
16     104421 No                        0               0    Yes        
17     104421 No                        1.66            1.25 No         
18     104421 No                        1.66            3.96 No         
19     104421 No                        0               3.12 Yes        
20     104421 No                        1.98            5    No         
21     104421 No                        1.66            0    Yes        
22     104421 No                        1.29            2.71 No         
23     104421 No                        1.66            3.96 No         
24     104421 No                        0               1.25 No         
25     104421 No                        0.875           1.25 No         
26     104421 Yes                       0               5.21 Yes        
27     104421 No                        1.29            3.96 No         
28     104421 No                        1.29            5    No         
29     104421 No                        1.29            2.71 No         
30     104421 No                        1.96            3.96 No         
31     104421 No                        1.29            2.71 No         
32     104421 No                        0.886           3.96 No         
33     104421 No                        1.66            5.21 Yes        
34     104421 No                        2.32            5    No         
35     104421 No                        1.66            2.71 No         
36     104421 No                        1.29            5    No         
37     104421 No                        0               1.25 No         
38     102112 Yes                       0              33.3  Yes        
39     102112 No                        1.29           10    No         
40     102112 No                        0              25    Yes        
41      60782 No                        0.721           4.31 No         
42      60782 No                        1.87            4.31 No         
43      60782 No                        1.49            4.31 No         
44      60782 Yes                       0              10.2  Yes        
45      60782 No                        0.971           4.31 No         
46      60782 No                        0.721           4.31 No         
47      60782 No                        0.721          10.2  Yes        
48      60782 No                        0.722           4.31 No         
49      60782 No                        1.87            4.31 No         
50      60782 No                        0               7.61 Yes        
51      60782 No                        0.540           4.31 Yes        
52      60782 No                        0               4.31 No         
53      60782 No                        1.87            4.31 No         
54      60782 Yes                       0              10.2  Yes        
55      60782 No                        1.49            4.31 No         
56      60782 Yes                       0              10.2  Yes 

最后一组结果是 3 行并列。我选组的时候没有注意到。但是,这些无法确定单个 x = Yes 的情况发生在实际数据中。有超过 10 000 个组,最终需要手动确定一些组。

在一般层面上,我希望找到一种更智能的方式来编辑具有这些要求的数据,即我试图用上面的要点说明的结构。我认为这通常很有用,而不仅仅是特定于这些特定数据。

标签: rif-statementdplyr

解决方案


您可以尝试根据您的条件建立组。如果x == "Yes"只返回一个元素,则该元素满足该子集的min(y)andmax(z)条件。因此,您可以应用所有条件,而无需使用多种情况:

library(dplyr)
library(purrr)

test %>%
  group_by(type, x_yes = x == "Yes") %>%
  mutate(y_min = min(y) == y) %>%
  group_by(y_min, .add = TRUE) %>%
  mutate(z_max = max(z) == z) %>%
  group_by(z_max, .add = TRUE) %>%
  group_split() %>%
  map_df(~ .x %>%
           add_count() %>% 
           mutate(outcome = ifelse(
             all(x_yes, y_min, z_max, n == 1),
             "Yes",
             "No")
             )
         )

返回

# A tibble: 56 x 10
     type x         y     z a     x_yes y_min z_max     n outcome
    <int> <fct> <dbl> <dbl> <fct> <lgl> <lgl> <lgl> <int> <chr>  
 1  60782 No    0.721  4.31 No    FALSE FALSE TRUE      9 No     
 2  60782 No    1.87   4.31 No    FALSE FALSE TRUE      9 No     
 3  60782 No    1.49   4.31 No    FALSE FALSE TRUE      9 No     
 4  60782 No    0.971  4.31 No    FALSE FALSE TRUE      9 No     
 5  60782 No    0.721  4.31 No    FALSE FALSE TRUE      9 No     
 6  60782 No    0.722  4.31 No    FALSE FALSE TRUE      9 No     
 7  60782 No    1.87   4.31 No    FALSE FALSE TRUE      9 No     
 8  60782 No    1.87   4.31 No    FALSE FALSE TRUE      9 No     
 9  60782 No    1.49   4.31 No    FALSE FALSE TRUE      9 No     
10  60782 No    0      4.31 No    FALSE TRUE  TRUE      1 No     
11  60782 Yes   0.540  4.31 Yes   TRUE  FALSE FALSE     1 No     
12  60782 Yes   0.721 10.2  Yes   TRUE  FALSE TRUE      1 No     
13  60782 Yes   0      7.61 Yes   TRUE  TRUE  FALSE     1 No     
14  60782 Yes   0     10.2  Yes   TRUE  TRUE  TRUE      3 No     
15  60782 Yes   0     10.2  Yes   TRUE  TRUE  TRUE      3 No     
16  60782 Yes   0     10.2  Yes   TRUE  TRUE  TRUE      3 No     
17 102112 No    1.29  10    No    FALSE TRUE  TRUE      1 No     
18 102112 Yes   0     25    Yes   TRUE  TRUE  FALSE     1 No     
19 102112 Yes   0     33.3  Yes   TRUE  TRUE  TRUE      1 Yes    
20 104421 No    1.29   1.67 No    FALSE FALSE FALSE    18 No     
21 104421 No    2.67   3.96 No    FALSE FALSE FALSE    18 No     
22 104421 No    1.40   1.25 No    FALSE FALSE FALSE    18 No     
23 104421 No    1.98   3.96 No    FALSE FALSE FALSE    18 No     
24 104421 No    2.75   3.96 No    FALSE FALSE FALSE    18 No     
25 104421 No    0.886  3.96 No    FALSE FALSE FALSE    18 No     
26 104421 No    1.40   3.96 No    FALSE FALSE FALSE    18 No     
27 104421 No    1.66   1.25 No    FALSE FALSE FALSE    18 No     
28 104421 No    1.66   3.96 No    FALSE FALSE FALSE    18 No     
29 104421 No    1.29   2.71 No    FALSE FALSE FALSE    18 No     
30 104421 No    1.66   3.96 No    FALSE FALSE FALSE    18 No     
31 104421 No    0.875  1.25 No    FALSE FALSE FALSE    18 No     
32 104421 No    1.29   3.96 No    FALSE FALSE FALSE    18 No     
33 104421 No    1.29   2.71 No    FALSE FALSE FALSE    18 No     
34 104421 No    1.96   3.96 No    FALSE FALSE FALSE    18 No     
35 104421 No    1.29   2.71 No    FALSE FALSE FALSE    18 No     
36 104421 No    0.886  3.96 No    FALSE FALSE FALSE    18 No     
37 104421 No    1.66   2.71 No    FALSE FALSE FALSE    18 No     
38 104421 No    0.678  5    No    FALSE FALSE TRUE      5 No     
39 104421 No    1.98   5    No    FALSE FALSE TRUE      5 No     
40 104421 No    1.29   5    No    FALSE FALSE TRUE      5 No     
41 104421 No    2.32   5    No    FALSE FALSE TRUE      5 No     
42 104421 No    1.29   5    No    FALSE FALSE TRUE      5 No     
43 104421 No    0      1.25 No    FALSE TRUE  TRUE      2 No     
44 104421 No    0      1.25 No    FALSE TRUE  TRUE      2 No     
45 104421 Yes   1.66   0    Yes   TRUE  FALSE FALSE     1 No     
46 104421 Yes   1.66   5.21 Yes   TRUE  FALSE TRUE      1 No     
47 104421 Yes   0      3.12 Yes   TRUE  TRUE  FALSE     3 No     
48 104421 Yes   0      0    Yes   TRUE  TRUE  FALSE     3 No     
49 104421 Yes   0      3.12 Yes   TRUE  TRUE  FALSE     3 No     
50 104421 Yes   0      5.21 Yes   TRUE  TRUE  TRUE      1 Yes    
51 104554 No    0      7.14 No    FALSE TRUE  FALSE     3 No     
52 104554 No    0     10.7  No    FALSE TRUE  FALSE     3 No     
53 104554 No    0     14.3  No    FALSE TRUE  FALSE     3 No     
54 104554 No    0     25    Yes   FALSE TRUE  TRUE      1 No     
55 104554 Yes   0     17.9  Yes   TRUE  TRUE  FALSE     1 No     
56 104554 Yes   0     25    Yes   TRUE  TRUE  TRUE      1 Yes  

如果对于给定的类型x == "Yes"y == min(y)此子组和z == max(z)此子子组,并且结果组中只有一个元素,则结果为"Yes",否则为"No"

这与您想要的输出不同,我认为实际情况存在一些问题,但我认为您可以使用类似的方法解决它们。


推荐阅读