Realigning time series columns around the point where a value is met - R data frame

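Question

I have some time series data that I am trying to run a survival analysis on, and I am interested in the trends over the X years leading up to an event.

In other words, I want to take the values that are currently mapped to specific calendar years and shift them so that they represent the number of years before the event.

For example, if I have an observation for every year between 1990 and 2010, my current data frame looks like this: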

+------+------+------+------+------+------+------+-----+
| Unit | 1990 | 1991 | 1992 | 1994 | 1995 | 1996 | ... |
+------+------+------+------+------+------+------+-----+
| A    |   80 |   75 |   45 |    0 |    0 |    0 |     |
| B    |   50 |   40 |    0 |    0 |    0 |    0 |     |
| C    |   90 |   90 |   89 |   87 |    0 |    0 |     |
+------+------+------+------+------+------+------+-----+

I would like it to look like this:

+------+-----+-----+-----+-----+-----+---+-----+
| Unit | X-5 | X-4 | X-3 | X-2 | X-1 | X | ... |
+------+-----+-----+-----+-----+-----+---+-----+
| A    | NA  | NA  | 80  |  75 |  45 | 0 |     |
| B    | NA  | NA  | NA  |  50 |  40 | 0 |     |
| C    | NA  | 90  | 90  |  89 |  87 | 0 |     |
+------+-----+-----+-----+-----+-----+---+-----+

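Alternatively, if there is an R package that does this automatically (i.e., a survival analysis package that analyzes this kind of trend), I would welcome suggestions.

Tags: r, dataframe, survival-analysis

Answer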


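It's a bit messy and could probably be improved, but it may be useful to you as a starting point. I have prefixed each function with the name of its package.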

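# Load the pipe operator (%>%); the package-prefixed calls below
# do not attach it on their own
library(magrittr)

# Create tibble / data frame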
df <- tibble::tibble("Unit" = c("A","B","C"),
                     "1990" = c(80,50,90),
                     "1991" = c(75,40,90),
                     "1992" = c(45,0,89),
                     "1994" = c(0,0,87),
                     "1995" = c(0,0,0))

# Transform from wide to long format
# and add an index per unit
df_g <- df %>%
  tidyr::gather(key = "year", value = "val", 2:6) %>% 
  dplyr::arrange(Unit, year) %>% 
  dplyr::group_by(Unit) %>% 
  dplyr::mutate(.index = 1 : dplyr::n())

df_g
# # A tibble: 15 x 4
# # Groups:   Unit [3]
#    Unit  year    val .index
#    <chr> <chr> <dbl>  <int>
#  1 A     1990     80      1
#  2 A     1991     75      2
#  3 A     1992     45      3
#  4 A     1994      0      4
#  5 A     1995      0      5
#  6 B     1990     50      1
#  7 B     1991     40      2
#  8 B     1992      0      3
#  9 B     1994      0      4
# 10 B     1995      0      5
# 11 C     1990     90      1
# 12 C     1991     90      2
# 13 C     1992     89      3
# 14 C     1994     87      4
# 15 C     1995      0      5

# Identify the first year per unit with the value 0
zeroes <- df_g %>% 
  dplyr::filter(val == 0) %>% 
  dplyr::group_by(Unit) %>% 
  dplyr::filter(dplyr::row_number() == 1) %>% 
  dplyr::select(-c(year, val)) %>% 
  dplyr::rename(zero = .index)

zeroes
# # A tibble: 3 x 2
# # Groups:   Unit [3]
#   Unit   zero
#   <chr> <int>
# 1 A         4
# 2 B         3
# 3 C         5

# Add that information with a join operation
# and create the new column names
df_z <- df_g %>% 
  dplyr::left_join(zeroes, by="Unit") %>% 
  dplyr::mutate(step = .index - zero,
                new_name = paste0("X", ifelse(step >= 0, "+", "-"), abs(step))) %>% 
  dplyr::select(Unit, new_name, val)

df_z
# # A tibble: 15 x 3
# # Groups:   Unit [3]
#    Unit  new_name   val
#    <chr> <chr>    <dbl>
#  1 A     X-3         80
#  2 A     X-2         75
#  3 A     X-1         45
#  4 A     X+0          0
#  5 A     X+1          0
#  6 B     X-2         50
#  7 B     X-1         40
#  8 B     X+0          0
#  9 B     X+1          0
# 10 B     X+2          0
# 11 C     X-4         90
# 12 C     X-3         90
# 13 C     X-2         89
# 14 C     X-1         87
# 15 C     X+0          0

# Spread to wide format again
df_transformed <- df_z %>% 
  tidyr::spread(key = "new_name", value = "val")

df_transformed
# # A tibble: 3 x 8
# # Groups:   Unit [3]
#   Unit  `X-1` `X-2` `X-3` `X-4` `X+0` `X+1` `X+2`
#   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A        45    75    80    NA     0     0    NA
# 2 B        40    50    NA    NA     0     0     0
# 3 C        87    89    90    90     0    NA    NA
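
The same reshaping can also be sketched with `tidyr::pivot_longer()` and `tidyr::pivot_wider()`, which supersede `gather()` and `spread()`. This is only an illustration under a few assumptions, not a replacement for the code above: the names `df_long` and `df_wide` are made up here, the event column is labelled `X` instead of `X+0`, and `step` counts rows rather than calendar years, as in the original. Sorting by `step` before widening puts the columns in chronological order.

```r
library(dplyr)
library(tidyr)

# Same example data as above
df <- tibble::tibble(Unit   = c("A", "B", "C"),
                     `1990` = c(80, 50, 90),
                     `1991` = c(75, 40, 90),
                     `1992` = c(45, 0, 89),
                     `1994` = c(0, 0, 87),
                     `1995` = c(0, 0, 0))

df_long <- df %>%
  pivot_longer(cols = -Unit, names_to = "year", values_to = "val") %>%
  mutate(year = as.integer(year)) %>%
  arrange(Unit, year) %>%
  group_by(Unit) %>%
  # Row position relative to the first zero within each unit.
  # Assumption: every unit has at least one zero; otherwise step is NA.
  mutate(step = row_number() - which(val == 0)[1]) %>%
  ungroup()

df_wide <- df_long %>%
  # "X-3", "X+1", ... and plain "X" for the event itself
  mutate(new_name = ifelse(step == 0, "X", sprintf("X%+d", step))) %>%
  # pivot_wider() creates columns in order of first appearance
  arrange(step) %>%
  pivot_wider(id_cols = Unit, names_from = new_name, values_from = val) %>%
  arrange(Unit)

df_wide
```

Cells with no observation at a given step are filled with `NA`, matching the layout asked for in the question.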

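If you find that you prefer working with the long format, you can skip the last transformation and perhaps use the "step" column instead of the "new_name" column.

Hope this helps :)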
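
As a rough illustration of working in the long format, the sketch below keeps a numeric `step` column and averages the values across units at each step before the event. The names `df_long`, `df_step`, and `trend` are made up for this example, and the data are rebuilt inline so the block runs on its own. It assumes every unit contains at least one zero.

```r
library(dplyr)

# Long-format data with the same values as the example above
df_long <- tibble::tibble(
  Unit = rep(c("A", "B", "C"), each = 5),
  year = rep(c(1990, 1991, 1992, 1994, 1995), times = 3),
  val  = c(80, 75, 45, 0, 0,
           50, 40, 0, 0, 0,
           90, 90, 89, 87, 0)
)

# step = row position relative to the first zero within each unit
df_step <- df_long %>%
  group_by(Unit) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(step = row_number() - which(val == 0)[1]) %>%
  ungroup()

# Mean value and number of contributing units at each step before the event
trend <- df_step %>%
  filter(step < 0) %>%
  group_by(step) %>%
  summarise(mean_val = mean(val), n_units = n(), .groups = "drop")

trend
```

Keeping `step` numeric means it can be filtered, sorted, and plotted directly, which is harder once the offsets are encoded in column names.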

