r - Labeling Parent ID then Merging Back with Dataframe
Problem description
I am trying to label rows with the ID of the row above them (their parent ID). In the example below, I have a tibble of things you might say to a person, classified into greetings, farewells, questions, etc. Assuming the first entry of each group is the root, I want to label the first child (the second entry) with the ID of the root.
The code below labels the first child, but the tibble I end up with is missing two rows because of the filter. The filter matters because the real dataset is more complicated, so it (most likely) needs to stay.
How can I merge the newly labeled tibble back into the original tibble? If there is a way to do this within a pipe chain, even better.
library(dplyr)

test_df <- tibble(msg_id = as.character(c(1, 2, 3, 4, 5, 6, 7, 8)),
                  msg_group = c("greeting", "greeting", "greeting", "greeting",
                                "farewell", "farewell", "question", "question"),
                  content = c("hello", "hey there", "morning", "howdy", "bye",
                              "see ya", "how are you", "who are you"),
                  parent_id = NA_character_)

labeling_test <- test_df %>%
  group_by(msg_group) %>%
  mutate(rank = rank(msg_id)) %>%
  filter(rank <= 2)
# sorts these into ranks within each group
# rank 1 is the root, rank 2 will be the first child of root

for (i in seq(1, nrow(labeling_test), 2)) {
  labeling_test[i + 1, ]$parent_id <- labeling_test[i, ]$msg_id
}
# label the even-numbered items with the id of the item before them
# in terms of this code, label rank 2 with the id of rank 1,
# rank 4 with the id of rank 3...

labeling_test
The end goal would be a dataframe that looks like:
# A tibble: 8 x 5
  msg_id msg_group content     parent_id  rank
  <chr>  <chr>     <chr>       <chr>     <dbl>
1 1      greeting  hello       NA            1
2 2      greeting  hey there   1             2
3 3      greeting  morning     NA            3
4 4      greeting  howdy       NA            4
5 5      farewell  bye         NA            1
6 6      farewell  see ya      5             2
7 7      question  how are you NA            1
8 8      question  who are you 7             2
The end goal is actually to turn an email thread into a tree structure. Labeling the first two emails is easy because they are the oldest and the second oldest. After that it becomes more complicated. The tricky part with Gmail threads is that they don't store the parent message (or I haven't found where it's stored), so you have to use the content of the message to label the parents. Using the timestamp of the email doesn't work either, because people can reply to individual messages and start new branches where time doesn't relate to their position in the branch.
None of this is essential to the question above, but if you know of anything on this topic, that would be welcome too.
Solution
I think a join/merge operation is the most efficient approach:
test_df %>%
  group_by(msg_group) %>%
  mutate(rank = rank(msg_id)) %>%   # rank within each group
  filter(rank <= 2) %>%
  ungroup() %>%
  select(msg_id, rank) %>%
  left_join(test_df, ., by = "msg_id")   # join rank back onto all 8 rows
# # A tibble: 8 x 5
#   msg_id msg_group content     parent_id  rank
#   <chr>  <chr>     <chr>       <chr>     <dbl>
# 1 1      greeting  hello       <NA>          1
# 2 2      greeting  hey there   <NA>          2
# 3 3      greeting  morning     <NA>         NA
# 4 4      greeting  howdy       <NA>         NA
# 5 5      farewell  bye         <NA>          1
# 6 6      farewell  see ya      <NA>          2
# 7 7      question  how are you <NA>          1
# 8 8      question  who are you <NA>          2
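The join above brings back only rank; parent_id is still NA everywhere. Here is a minimal sketch of carrying the label back as well, assuming the labeling really has to happen on the filtered subset. The names labeled and result are illustrative, and rank() on a character msg_id sorts as text, which is fine for "1" to "8" but would misorder "10" and "2".

```r
library(dplyr)

test_df <- tibble(
  msg_id = as.character(1:8),
  msg_group = c("greeting", "greeting", "greeting", "greeting",
                "farewell", "farewell", "question", "question"),
  content = c("hello", "hey there", "morning", "howdy", "bye",
              "see ya", "how are you", "who are you"),
  parent_id = NA_character_
)

# Label only the filtered subset: within each group, the rank-2 row
# gets the msg_id of the rank-1 row.
labeled <- test_df %>%
  group_by(msg_group) %>%
  mutate(rank = rank(msg_id)) %>%
  filter(rank <= 2) %>%
  mutate(parent_id = if_else(rank == 2,
                             lag(msg_id, order_by = rank),
                             NA_character_)) %>%
  ungroup() %>%
  select(msg_id, parent_id, rank)

# Drop the placeholder parent_id from the original before joining;
# otherwise the join would produce parent_id.x and parent_id.y.
result <- test_df %>%
  select(-parent_id) %>%
  left_join(labeled, by = "msg_id")
```

Rows removed by the filter come back with NA in both parent_id and rank, so all eight messages are kept.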
Edit: perhaps you don't need a join/merge at all; just mutate in place:
test_df %>%
  group_by(msg_group) %>%
  mutate(parent_id = if_else(row_number() == 2, msg_id[1], NA_character_))
# # A tibble: 8 x 4
# # Groups: msg_group [3]
# msg_id msg_group content parent_id
# <chr> <chr> <chr> <chr>
# 1 1 greeting hello <NA>
# 2 2 greeting hey there 1
# 3 3 greeting morning <NA>
# 4 4 greeting howdy <NA>
# 5 5 farewell bye <NA>
# 6 6 farewell see ya 5
# 7 7 question how are you <NA>
# 8 8 question who are you 7
(I don't think there is any need to create a rank column for this, but it does no harm.)
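If the rank column from the question's target table is wanted as well, one possible variant keeps it for every row in the same pipe. This is only a sketch, assuming the rows are already ordered oldest first within each group so that row_number() can stand in for rank; result is an illustrative name.

```r
library(dplyr)

test_df <- tibble(
  msg_id = as.character(1:8),
  msg_group = c("greeting", "greeting", "greeting", "greeting",
                "farewell", "farewell", "question", "question"),
  content = c("hello", "hey there", "morning", "howdy", "bye",
              "see ya", "how are you", "who are you"),
  parent_id = NA_character_
)

result <- test_df %>%
  group_by(msg_group) %>%
  mutate(
    # position of each row within its group (assumes rows are pre-sorted)
    rank = row_number(),
    # only the second row of each group points at the group's first msg_id
    parent_id = if_else(rank == 2L, first(msg_id), NA_character_)
  ) %>%
  ungroup()
```

Note that rank is an integer here rather than the double shown in the target table.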