Labeling Parent ID then Merging Back with Dataframe

Problem description

I am trying to label rows with the id of the row above it (their parent id). In the example below, I have a tibble with different things you would say to a person. They are classified into greetings, farewells, questions, etc. Assuming the first entry of each classification/group is the root, I am trying to label the first child (the second entry) with the ID of the root.

The code below is able to label the first child; however, the tibble I end up with is missing two entries because of the filter. The filter matters because the real dataset is more complicated, so it (most likely) needs to stay.

How can I merge the newly labeled tibble back with the original tibble? Also, if there is a way to do this within a pipe chain, even better.

library(dplyr)

test_df <- tibble(msg_id = as.character(c(1, 2, 3, 4, 5, 6, 7, 8)), 
                  msg_group = c("greeting", "greeting", "greeting", "greeting", 
                                "farewell", "farewell", "question", "question"),
                  content = c("hello", "hey there", "morning", "howdy", "bye", 
                              "see ya", "how are you", "who are you"),
                  parent_id = NA_character_)

labeling_test <- test_df %>%
  group_by(msg_group) %>%
  mutate(rank = rank(msg_id)) %>%
  filter(rank <= 2)

#sorts these into ranks within each group
#rank 1 is the root, rank 2 will be the first child of root

for(i in seq(1, nrow(labeling_test), 2)){
  labeling_test[i + 1,]$parent_id <- labeling_test[i,]$msg_id
}

#label the even number items with id of the item before it
#in terms of this code, label rank 2 with the id of rank 1,
#rank 4 with the id of rank 3...

labeling_test

The end goal would be a dataframe that looks like:

# A tibble: 8 x 5
   msg_id     msg_group content     parent_id  rank
    <chr>     <chr>     <chr>       <chr>     <dbl>
1       1     greeting  hello       NA            1
2       2     greeting  hey there   1             2
3       3     greeting  morning     NA            3
4       4     greeting  howdy       NA            4
5       5     farewell  bye         NA            1
6       6     farewell  see ya      5             2
7       7     question  how are you NA            1
8       8     question  who are you 7             2

The end goal is actually to turn an email thread into a tree structure. Labeling the first two emails is easy because they are the oldest and the second oldest. After that it becomes more complicated. The tricky part with gmail threads is they don't store the parent message (or I haven't found where it's stored). So you have to use the content of the message to label the parents. Additionally, using the timestamp of the email doesn't work either because people can reply individually to messages and start new branches where time doesn't relate to their position in the branch.

Not that this is important to the question above. If you know of something around this topic that would be cool too.
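As a rough illustration of the tree idea: once every message has a `parent_id`, the parent/child structure can be read off with base R alone. The sketch below assumes a small made-up thread (the `thread` data frame and the `depth_of` helper are hypothetical, not part of the data above) and does not address the harder problem of inferring parents from message content.

```r
# Hypothetical thread: parent_id is NA for the root message
thread <- data.frame(
  msg_id    = c("1", "2", "3", "4"),
  parent_id = c(NA, "1", "1", "3"),
  stringsAsFactors = FALSE
)

# Adjacency list: parent id -> vector of child ids (the root's NA parent is dropped)
children <- split(thread$msg_id, thread$parent_id)

# Depth of a message = number of steps up the parent chain to the root
depth_of <- function(id) {
  depth <- 0
  parent <- thread$parent_id[match(id, thread$msg_id)]
  while (!is.na(parent)) {
    depth <- depth + 1
    parent <- thread$parent_id[match(parent, thread$msg_id)]
  }
  depth
}

thread$depth <- vapply(thread$msg_id, depth_of, numeric(1))
```

Here `children[["1"]]` holds the two replies to the root, and `depth` gives each message's level in the tree, which is enough to print or indent a thread.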

Tags: r, dplyr, data.tree

Solution


I think a join/merge operation is the most effective approach:

test_df %>%
  group_by(msg_group) %>%
  # test_df has no group_id column, so rank on msg_id as in the question
  mutate(rank = rank(msg_id)) %>%
  filter(rank <= 2) %>%
  ungroup() %>%
  select(msg_id, rank) %>%
  # join the ranks back onto the full, unfiltered tibble
  left_join(test_df, ., by = "msg_id")
# # A tibble: 8 x 5
#   msg_id msg_group content     parent_id  rank
#   <chr>  <chr>     <chr>       <chr>     <dbl>
# 1 1      greeting  hello       <NA>          1
# 2 2      greeting  hey there   <NA>          2
# 3 3      greeting  morning     <NA>         NA
# 4 4      greeting  howdy       <NA>         NA
# 5 5      farewell  bye         <NA>          1
# 6 6      farewell  see ya      <NA>          2
# 7 7      question  how are you <NA>          1
# 8 8      question  who are you <NA>          2
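The join above brings `rank` back but leaves `parent_id` unfilled. A minimal sketch of one way to label the filtered subset inside the pipe and then join both columns back, assuming `msg_id` is unique and that the root of each group is the row with rank 1 (the intermediate name `labeled` is only illustrative):

```r
library(dplyr)

test_df <- tibble(
  msg_id    = as.character(1:8),
  msg_group = c(rep("greeting", 4), rep("farewell", 2), rep("question", 2)),
  content   = c("hello", "hey there", "morning", "howdy", "bye",
                "see ya", "how are you", "who are you"),
  parent_id = NA_character_
)

# Label only the rows that survive the filter
labeled <- test_df %>%
  group_by(msg_group) %>%
  mutate(rank = rank(msg_id)) %>%
  filter(rank <= 2) %>%
  # rank 2 gets the id of the rank 1 row in the same group
  mutate(parent_id = if_else(rank == 2, msg_id[rank == 1][1], NA_character_)) %>%
  ungroup() %>%
  select(msg_id, rank, parent_id)

# Drop the empty parent_id first so the join does not create parent_id.x / parent_id.y
result <- test_df %>%
  select(-parent_id) %>%
  left_join(labeled, by = "msg_id")
```

Rows removed by the filter come back with `NA` in both `rank` and `parent_id`, so `result` has all eight messages.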

Edit: perhaps you don't need a join/merge at all and can simply mutate in place:

test_df %>%
  group_by(msg_group) %>%
  mutate(parent_id = if_else(row_number() == 2, msg_id[1], NA_character_))
# # A tibble: 8 x 4
# # Groups:   msg_group [3]
#   msg_id msg_group content     parent_id
#   <chr>  <chr>     <chr>       <chr>    
# 1 1      greeting  hello       <NA>     
# 2 2      greeting  hey there   1        
# 3 3      greeting  morning     <NA>     
# 4 4      greeting  howdy       <NA>     
# 5 5      farewell  bye         <NA>     
# 6 6      farewell  see ya      5        
# 7 7      question  how are you <NA>     
# 8 8      question  who are you 7        

(I don't think creating rank is necessary for this, but it does no harm.)
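If the filter condition has to stay because the real data is more complicated, one option is to move that condition into `mutate()` as a logical column instead of dropping rows. This is only a sketch: `eligible` is a hypothetical helper column standing in for whatever the real filter tests, and here it simply repeats `rank <= 2`.

```r
library(dplyr)

test_df <- tibble(
  msg_id    = as.character(1:8),
  msg_group = c(rep("greeting", 4), rep("farewell", 2), rep("question", 2)),
  content   = c("hello", "hey there", "morning", "howdy", "bye",
                "see ya", "how are you", "who are you"),
  parent_id = NA_character_
)

result <- test_df %>%
  group_by(msg_group) %>%
  mutate(
    rank = rank(msg_id),
    # stand-in for the condition that used to live in filter()
    eligible = rank <= 2,
    # only eligible rank 2 rows get a parent; every row is kept
    parent_id = if_else(eligible & rank == 2, msg_id[rank == 1][1], NA_character_)
  ) %>%
  ungroup() %>%
  select(-eligible)
```

Because no rows are removed, nothing needs to be merged back afterwards.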

