Gathering a dataset with multiple header rows

Question

I am trying to gather a dataset that has multiple header rows:

Sample Date Site Code   Arsenic Barium  Boron   Filterable Antimony
.   .   ug/L    ug/L    ug/L    ug/L
.   .   TM66TW  TM66TW  TM66TW  TM66FW
21/07/2016  1   0.7 88.6    24  <0.3
21/07/2016  A   0.7 110 19  <0.3
15/09/2016  1   0.5 67  32  <0.3
15/09/2016  A   0.4 96.5    22  <0.3
15/09/2016  2   0.6 122 26  <0.3

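I would like to gather my dataset so that the first three rows of columns 3:6 become variables, i.e. "Metal", "Detection unit", and "Test method code".

Is there a simple way to achieve this in the tidyverse?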

Tags: r, dataframe, tidyverse

Solution


Note: the original dataset (i.e. df0) can be found at the bottom.


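This can be done in the tidyverse by building a character vector that pastes the metal name, unit, and method together, and then using that vector to rename the columns. Once the data is reshaped to long format, a single column (the name column) holds all three pieces, and it can be separated into three columns to give the desired output: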

library(dplyr)
library(tidyr)

df1 <- df0
## creating a character vector with the desired column names
df1 %>% 
  select(-c(1,2)) %>% 
  slice(1:2) %>% 
  mutate_if(is.factor, as.character) %>%
  rbind(names(.), .) %>% 
  summarise_all(toString) %>% 
  as.character -> comp.header

## renaming columns and then converting to long format  
df1 %>% 
  slice(-c(1,2)) %>% 
  rename_at(3:ncol(.), list(~comp.header)) %>% 
  pivot_longer(-c(Sample_Date, Site_Code)) %>% 
  separate(name, sep = ", ", into = c("Metal", "Detection", "Method"))

#> # A tibble: 20 x 6
#>   Sample_Date Site_Code Metal    Detection Method value
#>   <fct>       <fct>     <chr>    <chr>     <chr>  <fct>
#> 1 21/07/2016  1         Arsenic  ug/L      TM66TW 0.7  
#> 2 21/07/2016  1         Barium   ug/L      TM66TW 88.6 
#> 3 21/07/2016  1         Boron    ug/L      TM66TW 24  
#> 4 21/07/2016  1         Antimony ug/L      TM66FW <0.3 
#> 5 21/07/2016  A         Arsenic  ug/L      TM66TW 0.7  
#> 6 21/07/2016  A         Barium   ug/L      TM66TW 110
#> # ... with 14 more rows 
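
As a variation on the steps above, the splitting can be folded into the reshaping step, since pivot_longer() accepts several names in names_to together with names_sep. The following is only a sketch under a few assumptions: the data is re-read here with colClasses = "character" so the block is self-contained, and value_cols, comp_header, and df_long are illustrative names that do not appear in the original answer.

```r
library(tidyr)

# Assumed copy of the example data, read as character so the unit and
# method rows are plain strings regardless of the R version.
df0 <- read.table(text = "Sample_Date Site_Code   Arsenic Barium  Boron   Antimony
.   .   ug/L    ug/L    ug/L    ug/L
.   .   TM66TW  TM66TW  TM66TW  TM66FW
21/07/2016  1   0.7 88.6    24  <0.3
21/07/2016  A   0.7 110 19  <0.3
15/09/2016  1   0.5 67  32  <0.3
15/09/2016  A   0.4 96.5    22  <0.3
15/09/2016  2   0.6 122 26  <0.3", header = TRUE, colClasses = "character")

# Columns that hold measurements (everything after the two id columns).
value_cols <- 3:ncol(df0)

# Composite header: metal name, unit (row 1) and method code (row 2).
comp_header <- paste(names(df0)[value_cols],
                     unlist(df0[1, value_cols]),
                     unlist(df0[2, value_cols]),
                     sep = ", ")

# Drop the two extra header rows and apply the composite names.
df_long <- df0[-(1:2), ]
names(df_long)[value_cols] <- comp_header

# Reshape to long format and split the composite name in a single step.
df_long <- pivot_longer(df_long,
                        cols = -c(Sample_Date, Site_Code),
                        names_to = c("Metal", "Detection", "Method"),
                        names_sep = ", ")
```

The value column stays character here because entries such as "<0.3" cannot be parsed as numbers directly.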


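The same result can be achieved in base R by appending the unit and the method code to the header names, converting the data to long format, and finally splitting the column that holds the combined names into three columns: "Metal", "Detection", and "Code":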

df1 <- df0
## making sure that everything is character and not factor
df1[] <- lapply(df1, as.character)

## appending unit and method code to the header names
names(df1)[3:ncol(df1)] <- paste(names(df1)[3:ncol(df1)], 
                                 df1[1,3:ncol(df1)], 
                                 df1[2,3:ncol(df1)], sep = ";")

## removing the unit and method rows (the first two data rows)
df1 <- df1[3:nrow(df1),]

## converting wide to long
df2 <- cbind(df1[1:2], stack(lapply(df1[-c(1, 2)], as.character)))

## separate ind column to three desired variables and append to the dataframe
df2 <- data.frame(subset(df2, select = -ind), 
                  setNames(as.data.frame(do.call(rbind, strsplit(as.character(df2$ind), ';'))), 
                           c("Metal", "Detection", "Code")))

## rearranging the columns
df2 <- df2[c(1,2,4:ncol(df2), 3)]

## result
head(df2)
#>   Sample_Date Site_Code   Metal Detection   Code values
#> 1  21/07/2016         1 Arsenic      ug/L TM66TW    0.7
#> 2  21/07/2016         A Arsenic      ug/L TM66TW    0.7
#> 3  15/09/2016         1 Arsenic      ug/L TM66TW    0.5
#> 4  15/09/2016         A Arsenic      ug/L TM66TW    0.4
#> 5  15/09/2016         2 Arsenic      ug/L TM66TW    0.6
#> 6  21/07/2016         1  Barium      ug/L TM66TW   88.6
#> ...



Data:

## reading data as df0 to preserve it for both solutions
read.table(text="Sample_Date Site_Code   Arsenic Barium  Boron   Antimony
.   .   ug/L    ug/L    ug/L    ug/L
.   .   TM66TW  TM66TW  TM66TW  TM66FW
21/07/2016  1   0.7 88.6    24  <0.3
21/07/2016  A   0.7 110 19  <0.3
15/09/2016  1   0.5 67  32  <0.3
15/09/2016  A   0.4 96.5    22  <0.3
15/09/2016  2   0.6 122 26  <0.3", header = TRUE) -> df0
