Reshaping some columns to long format using a regular expression

Question

I have a data frame in wide format.

df <- data.frame(
time = as.Date('2009-01-01') + 0:5,
D.13.JA = rnorm(6, 0, 1),
D.40.JA = rnorm(6, 0, 1),
D.90.JA = rnorm(6, 0, 1),
A.13.JA = rnorm(6, 0, 1),
R.13.JA = rnorm(6, 0, 1)
)
        time    D.13.JA    D.40.JA    D.90.JA      A.13.JA     R.13.JA
1 2009-01-01 -2.2529442  0.1341954  0.3024757 -0.465533145 -0.49755117
2 2009-01-02  1.0698570 -1.3597724  0.6607091  0.001913148  0.92522135
3 2009-01-03  1.7558374 -1.0280084 -0.1446586 -0.355776775  0.12556738
4 2009-01-04 -0.2571767 -0.9065826  0.9340532 -0.150408270 -0.57386938
5 2009-01-05  0.2389923 -1.2818616  0.5643812 -1.272623868 -0.05700965
6 2009-01-06  1.6444592 -1.5610767 -1.4377561 -0.701273356  0.29777858

I would like to convert the data frame to this format:

        time DirDegree Type         Wh
1 2009-01-01   D.13   JA         -2.2529442
2 2009-01-02   D.13   JA          1.0698570
3 2009-01-03   D.13   JA          1.7558374
4 2009-01-04   D.13   JA         -0.2571767
5 2009-01-05   D.13   JA          0.2389923
6 2009-01-06   D.13   JA          1.6444592

So far, I have managed to convert it to a tidy format:

df.tidy = df %>%
    gather(key, Wh, -time) %>%
    separate(key, c("Dir", "Degree", "Type"), "\\.")
        time Dir Degree Type          Wh
1 2009-01-01   D     13   JA -1.18105757
2 2009-01-02   D     13   JA  1.34437449
3 2009-01-03   D     13   JA -0.08451173
4 2009-01-04   D     13   JA -1.88959285
5 2009-01-05   D     13   JA  1.25388470
6 2009-01-06   D     13   JA -1.24286611

I tried to format it based on this answer:

test1 = df %>%
    gather(key, value, -time) %>%
    extract(key, c("DirDeg", "Type"), "(..\\..)\\.(.)")

test2 = df %>%
    gather(key, value, -time) %>%
    extract(key, c("DirDeg", "Type"), "(\\.)\\.()")

Both gave me:

         time DirDeg Type       value
1  2009-01-01   <NA> <NA> -1.18105757
2  2009-01-02   <NA> <NA>  1.34437449
3  2009-01-03   <NA> <NA> -0.08451173
4  2009-01-04   <NA> <NA> -1.88959285
5  2009-01-05   <NA> <NA>  1.25388470
6  2009-01-06   <NA> <NA> -1.24286611
7  2009-01-01   <NA> <NA> -0.55782526

Tags: r, dplyr, reshape, tidyr

Answer


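We can also use separate(). In these column names, "." occurs in two places: 1) "." followed by a digit, and 2) "." followed by an uppercase letter. If we supply a regex lookahead so that sep only matches the "." that comes before an uppercase letter, i.e. the second occurrence, each name is split at that point only: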

library(tidyverse)
df %>% 
  gather(key, Wh, -time) %>% 
  separate(key, into = c("DirDeg", "Type"), sep = "\\.(?=[A-Z])") %>%
  as_tibble
# A tibble: 30 x 4
#   time       DirDeg Type        Wh
#   <date>     <chr>  <chr>    <dbl>
# 1 2009-01-01 D.13   JA    -0.546  
# 2 2009-01-02 D.13   JA     0.537  
# 3 2009-01-03 D.13   JA     0.420  
# 4 2009-01-04 D.13   JA    -0.584  
# 5 2009-01-05 D.13   JA     0.847  
# 6 2009-01-06 D.13   JA     0.266  
# 7 2009-01-01 D.40   JA     0.445  
# 8 2009-01-02 D.40   JA    -0.466  
# 9 2009-01-03 D.40   JA    -0.848  
#10 2009-01-04 D.40   JA     0.00231
# … with 20 more rows
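
Since the question started from extract(), here is a minimal sketch of how that approach could be made to work instead. Both patterns in the question return NA because they never match a name like D.13.JA: the first needs a "." two characters after another ".", and the second needs two consecutive dots. The pattern below is my own suggestion, not part of the original answer: it captures everything before the last "." as the first group and the trailing run of uppercase letters as the second. The set.seed() call and the names res and res2 are assumptions added for illustration, and the pivot_longer() variant assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so reruns give the same values
df <- data.frame(
  time = as.Date("2009-01-01") + 0:5,
  D.13.JA = rnorm(6),
  D.40.JA = rnorm(6),
  D.90.JA = rnorm(6),
  A.13.JA = rnorm(6),
  R.13.JA = rnorm(6)
)

# Group 1: everything before the last dot; group 2: trailing uppercase letters
pat <- "^(.*)\\.([A-Z]+)$"

# Same gather() step as in the question, followed by extract() with a matching pattern
res <- df %>%
  gather(key, Wh, -time) %>%
  extract(key, into = c("DirDeg", "Type"), regex = pat)

# Variant without gather(): pivot_longer() splits the names while reshaping
res2 <- df %>%
  pivot_longer(-time,
               names_to = c("DirDeg", "Type"),
               names_pattern = pat,
               values_to = "Wh")

unique(res$DirDeg)
unique(res$Type)
```

Both variants should give 30 rows with the columns time, DirDeg, Type and Wh. The rows come out in a different order: gather() stacks one original column after another, while pivot_longer() goes through the data row by row.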
