首页 > 解决方案 > 半融化数据的 Tidyr 解决方案

问题描述

我正在尝试清理一些泵站数据,这些数据来自工厂操作员手动输入基于 excel 的日志工作簿值的日期和停止/启动量。使这件事变得棘手的是,这三个值是作为跨多列的重复行输入的。有点难以口头描述(更不用说寻找是否有人有类似的问题),因此称其为“半融化”。这是其中一些的 dput 的样子:

structure(list(X1 = c("DATE", "STOP", "START", "DATE", "STOP", 
"START", "DATE", "STOP", "START", "DATE"), X2 = c(43466, 654896, 
654276, 43470, 657669, 656819, 43474, 660160, 659368, 43478), 
X3 = c("DATE", "STOP", "START", "DATE", "STOP", "START", 
"DATE", "STOP", "START", "DATE"), X4 = c(43467, 655298, 654896, 
43471, 658268, 657669, 43475, 660977, 660160, 43479), X5 = c("DATE", 
"STOP", "START", "DATE", "STOP", "START", "DATE", "STOP", 
"START", "DATE"), X6 = c("43468", "655959", "655298", "43472", 
"658620", "658268", "43476", "661774", "660977", "43480"), 
X7 = c("DATE", "STOP", "START", "DATE", "STOP", "START", 
"DATE", "STOP", "START", "DATE"), X8 = c("43469", "656819", 
"655959", "43473", "659368", "658620", "43477", "662673", 
"661774", "43481")), row.names = c(NA, 10L), class = "data.frame")

我想把它清理成一个包含三列 DATE、START 和 STOP 的时间序列。看起来像这样的东西:

     Date  Start   Stop
1  43466 654276 654896
2  43470 656819 657669
3  43474 659368 660160
4  43478 662673 663168
5  43482 665148 665951
6  43486 667944 668537
7  43490 670950 671692
8  43494 673621 674418
9  43497 676090 676884
10 43501 678559 679399

我从来没有对收集和传播函数有很好的感觉(仍然更喜欢melt和dcast),但令我高兴的是,我看到了更新的函数pivot_longer和pivot_wider。我在上述任何函数中都有一个整洁的解决方案,但我一直被那些希望当前列名(“X1”到“X8”)有意义的函数所困扰,但实际上它们是任意的。

有什么建议吗?

标签: rtidyr

解决方案


这是一种方法 -

df2 <- as.matrix(df)
rbind(df2[,1:2], df2[,3:4], df2[,5:6], df2[,7:8]) %>% 
  as_tibble() %>%
  mutate(id = cumsum(X1 == "DATE")) %>% 
  spread(X1, X2, convert = T) %>% 
  arrange(DATE, START, STOP)

# A tibble: 16 x 4
      id  DATE  START   STOP
   <int> <int>  <int>  <int>
 1     1 43466 654276 654896
 2     5 43467 654896 655298
 3     9 43468 655298 655959
 4    13 43469 655959 656819
 5     2 43470 656819 657669
 6     6 43471 657669 658268
 7    10 43472 658268 658620
 8    14 43473 658620 659368
 9     3 43474 659368 660160
10     7 43475 660160 660977
11    11 43476 660977 661774
12    15 43477 661774 662673
13     4 43478     NA     NA
14     8 43479     NA     NA
15    12 43480     NA     NA
16    16 43481     NA     NA

原始数据 -

df
      X1     X2    X3     X4    X5     X6    X7     X8
1   DATE  43466  DATE  43467  DATE  43468  DATE  43469
2   STOP 654896  STOP 655298  STOP 655959  STOP 656819
3  START 654276 START 654896 START 655298 START 655959
4   DATE  43470  DATE  43471  DATE  43472  DATE  43473
5   STOP 657669  STOP 658268  STOP 658620  STOP 659368
6  START 656819 START 657669 START 658268 START 658620
7   DATE  43474  DATE  43475  DATE  43476  DATE  43477
8   STOP 660160  STOP 660977  STOP 661774  STOP 662673
9  START 659368 START 660160 START 660977 START 661774
10  DATE  43478  DATE  43479  DATE  43480  DATE  43481

推荐阅读