Using dplyr to create new variables from specific factor levels in time series data

Question

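I have some time series data in which the steps of each sequence (between 1 and 8 per sequence) and the sequence's topic (more than 100 possible values) are both encoded as character factor levels in a single variable. Here is a minimal example (I have omitted the timestamps, which increase within each id):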

id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward", "resolved", "call", "agent", "beach", "resolved")
(df <- data.frame(id,step))
   id     step
1   1     call
2   2     call
3   2    agent
4   2   forest
5   2  forward
6   2 resolved
7   3     call
8   3    agent
9   3    beach
10  3 resolved

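I now want to split this information into two dedicated variables (step and topic). This makes the data frame shorter and wider: the topic is repeated on every row of its time series, and NA is filled in where a series has no topic. Splitting the data into two data frames with base R and merging them back together does the job: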

step <- subset(df, step %in% c("call", "agent", "forward", "resolved"))
topic <- subset(df, step %in% c("forest", "beach"))
topic$topic <- topic$step
topic$step <- NULL
(newdf <- merge(step,topic, all=TRUE))
  id     step  topic
1  1     call   <NA>
2  2     call forest
3  2    agent forest
4  2  forward forest
5  2 resolved forest
6  3     call  beach
7  3    agent  beach
8  3 resolved  beach

This is a bit clunky, though, and I am looking for a more elegant dplyr/tidyverse approach. pivot_wider() does not seem to be able to do this on its own. Any ideas?

Tags: r, dplyr

Answer


Thank you for providing a minimal example of the problem.

id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward",
  "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id,step)
df
#>    id     step
#> 1   1     call
#> 2   2     call
#> 3   2    agent
#> 4   2   forest
#> 5   2  forward
#> 6   2 resolved
#> 7   3     call
#> 8   3    agent
#> 9   3    beach
#> 10  3 resolved

Here is one possible solution using the tidyverse:

library(dplyr)
library(tidyr)

df %>% 
  # type_c records whether each row holds a step or a topic;
  # pivot_wider needs a unique id per row here so that no rows are collapsed
  mutate(
    type_c = if_else(step %in% c("forest", "beach"), "topic", "step"), 
    unique_id = row_number()) %>% 
  # spreads id and step into id_step, id_topic, step_step and step_topic
  pivot_wider(names_from = type_c, values_from = c(id, step)) %>% 
  # each row has exactly one of id_step / id_topic filled in, so merge them back
  mutate(id = coalesce(id_step, id_topic)) %>%
  select(id, step = step_step, topic = step_topic) %>% 
  # group_by is needed so that fill works within each id
  group_by(id) %>%
  # within each id, fill replaces NA with the nearest value, looking down first and then up ("downup")
  fill(topic, .direction = "downup") %>% 
  # get rid of the rows where step is NA, which pivot_wider created for each topic
  filter(!is.na(step)) 
#> # A tibble: 8 x 3
#> # Groups:   id [3]
#>      id step     topic 
#>   <dbl> <chr>    <chr> 
#> 1     1 call     <NA>  
#> 2     2 call     forest
#> 3     2 agent    forest
#> 4     2 forward  forest
#> 5     2 resolved forest
#> 6     3 call     beach 
#> 7     3 agent    beach 
#> 8     3 resolved beach

Created on 2021-06-08 by the reprex package (v0.3.0)
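
For comparison, the same result can probably be reached without pivoting at all. The sketch below is not part of the original answer. It assumes that the full set of topic labels is known up front (stored here in a hypothetical vector called topics) and that each id has at most one topic row. It looks up the topic within each id and then drops the topic rows.

```r
library(dplyr)

id <- c(1, rep(2, 5), rep(3, 4))
step <- c("call", "call", "agent", "forest", "forward",
          "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id, step, stringsAsFactors = FALSE)

# Assumption: all topic labels are known in advance (hypothetical helper vector)
topics <- c("forest", "beach")

result <- df %>%
  group_by(id) %>%
  # first topic label found within the id; subsetting an empty vector with [1] yields NA
  mutate(topic = step[step %in% topics][1]) %>%
  ungroup() %>%
  # drop the rows that only carried the topic label
  filter(!step %in% topics)

result
```

With more than 100 topics it may be easier to list the handful of step labels instead and treat everything else as a topic, that is, to test `!step %in% c("call", "agent", "forward", "resolved")` in both places.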

