首页 > 解决方案 > Collapse rows if present in same groups

问题描述

Hello everyone one I have a dataframe such as :

  Group       family
1     A      Canidae
2     B      Canidae
3     A      Felidae
4     B      Canidae
5     C Elephantidae
6     C    Galinacae
7     D    Galinacae
8     D     Siuridae
9     E       Apidae

And I would like to collapse Group groups where family is present (for instance :

Canidae is present in A, and B so I collaspe and add in family2 all unique values of each groups

Group family2
A,B   Canidae,Felidae 

then I continue and I see that Elephantidae and Galinacae are both in C and thta Galinacae is also in D so I collapse :

Group family2
A,B   Canidae,Felidae 
C,D   Elephantidae,Galinacae,Siuridae 

At the end we should get :

Group family2
A,B   Canidae,Felidae 
C,D   Elephantidae,Galinacae,Siuridae 
E     Apidae 

do someone have an idea ?

Here are the data in order to to a such as thing please ? thank you very much for your help and time .

Here is the data if it can helps :

structure(list(Group = structure(c(1L, 2L, 1L, 2L, 3L, 3L, 4L, 
4L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
    family = structure(c(2L, 2L, 4L, 2L, 3L, 5L, 5L, 6L, 1L), .Label = c("Apidae", 
    "Canidae", "Elephantidae", "Felidae", "Galinacae", "Siuridae"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))

标签: rdplyr

解决方案


Here is my solution with some lookup function

# A lookup function that look for intersect between group 
# if there are at least one intersect - those group will be combined
look_up_group <- function(one_group, lookup_list) {
  matched_list <- map(lookup_list, function(x) {
    tryCatch(
      {
        intersect(x, one_group)
      }, error = function(e) {
        stop(paste0("Error in lookup function: one_group=", one_group, "; x=", x))
      }) 
  })
  
  index <- which(unlist(map(matched_list, function(x) { length(x) > 0 })))
  sort(unique(unlist(lookup_list[index])))
}


df %>%
  # First remove all duplicated rows - exactly the same for both Group, Family
  filter(!duplicated(.)) %>%
  # arrange in alphabetical order
  arrange(Group, family) %>%
  # create a Group_2 which is combination of all Group for each family
  group_by(family) %>%
  mutate(Group_2 = list(Group)) %>%
  ungroup() %>%
  # Create Group_3 which is the full combined Group for all intersect Group
  mutate(Group_3 = map(.[["Group_2"]], function(x) { look_up_group(one_group = x, lookup_list = .[["Group_2"]]) })) %>%
  # Combine all Group_3 into a Group_final
  mutate(Group_final = unlist(map(Group_3, function(x) { paste (x, collapse = ",")} ))) %>%
  # Finally put them all together.
  select(Group_final, family) %>%
  group_by(Group_final) %>%
  summarize(family = paste(family, collapse = ","), .groups = "drop")

Here is the final output

  Group_final family                                   
* <chr>       <chr>                                    
1 A,B         Canidae,Felidae,Canidae                  
2 C,D         Elephantidae,Galinacae,Galinacae,Siuridae
3 E           Apidae   

To make it easier to understand here is the detail of each steps

First step

# remove duplicate & create variable Group_2
tmp <- df %>%
  filter(!duplicated(.)) %>%
  arrange(Group, family) %>%
  group_by(family) %>%
  mutate(Group_2 = list(Group)) %>%
  ungroup()

we have this data -

  Group family       Group_2  
  <fct> <fct>        <list>   
1 A     Canidae      <fct [2]>
2 A     Felidae      <fct [1]>
3 B     Canidae      <fct [2]>
4 C     Elephantidae <fct [1]>
5 C     Galinacae    <fct [2]>
6 D     Galinacae    <fct [2]>
7 D     Siuridae     <fct [1]>
8 E     Apidae       <fct [1]>

Where Group_2 would look like this

> tmp$Group_2
[[1]]
[1] A B
Levels: A B C D E

[[2]]
[1] A
Levels: A B C D E

[[3]]
[1] A B
Levels: A B C D E

[[4]]
[1] C
Levels: A B C D E

[[5]]
[1] C D
Levels: A B C D E

[[6]]
[1] C D
Levels: A B C D E

[[7]]
[1] D
Levels: A B C D E

[[8]]
[1] E
Levels: A B C D E

Then next step create Group_3 and combine them into Group_final

# Create Group_3
tmp <- tmp %>% 
  mutate(Group_3 = map(.[["Group_2"]],
    function(x) { look_up_group(one_group = x, lookup_list = .[["Group_2"]]) })) %>%
  mutate(Group_final = unlist(map(Group_3, function(x) { paste (x, collapse = ",")} )))

Here is the new tmp

# A tibble: 8 x 5
  Group family       Group_2   Group_3   Group_final
  <fct> <fct>        <list>    <list>    <chr>      
1 A     Canidae      <fct [2]> <fct [2]> A,B        
2 A     Felidae      <fct [1]> <fct [2]> A,B        
3 B     Canidae      <fct [2]> <fct [2]> A,B        
4 C     Elephantidae <fct [1]> <fct [2]> C,D        
5 C     Galinacae    <fct [2]> <fct [2]> C,D        
6 D     Galinacae    <fct [2]> <fct [2]> C,D        
7 D     Siuridae     <fct [1]> <fct [2]> C,D        
8 E     Apidae       <fct [1]> <fct [1]> E     

Then final step combine put family in Group_final together

tmp %>%
  select(Group_final, family) %>%
  group_by(Group_final) %>%
  summarize(family = paste(family, collapse = ","), .groups = "drop")

The final results

# A tibble: 3 x 2
  Group_final family                                   
* <chr>       <chr>                                    
1 A,B         Canidae,Felidae,Canidae                  
2 C,D         Elephantidae,Galinacae,Galinacae,Siuridae
3 E           Apidae 

[Update: added tryCatch for debug]


推荐阅读