Merge replicate scores but flag differences

Problem description

This is what I have:

df <- structure(list(
    Sample = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), .Label = c("19-0001", "19-0002", "19-0003", "19-0004"), class = "factor"),
    Replicate = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
    X24854000 = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "CC"), class = "factor"),
    X24854056 = structure(c(3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "AA", "GG"), class = "factor"),
    X24854764 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "TA", class = "factor"),
    X24854903 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "CT"), class = "factor"),
    X24855066 = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L), .Label = c("", "CA", "CC"), class = "factor"),
    X24855114 = structure(c(2L, 1L, 3L, 3L, 2L, 2L, 2L, 2L), .Label = c("", "GA", "GG"), class = "factor"),
    X24855316 = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("", "TC"), class = "factor"),
    X24855449 = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("CC", "GG"), class = "factor"),
    X24855925 = structure(c(2L, 1L, 1L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "GA", "GG"), class = "factor"),
    X24856070 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CC", "CT"), class = "factor"),
    X24856086 = structure(c(2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("CC", "CT"), class = "factor"),
    X24856329 = structure(c(2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "AG"), class = "factor"),
    X24856389 = structure(c(2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("", "GG"), class = "factor"),
    X24857235 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("", "CT"), class = "factor"),
    X24857350 = structure(c(3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "GA", "GG"), class = "factor"),
    X24857404 = structure(c(1L, 3L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "AT", "TT"), class = "factor")
), class = "data.frame", row.names = c(NA, -8L))

This produces the following table (blank cells are empty strings):

Sample   Replicate  X24854000  X24854056  X24854764  X24854903  X24855066  X24855114  X24855316  X24855449  X24855925  X24856070  X24856086  X24856329  X24856389  X24857235  X24857350  X24857404
19-0001  1                     GG         TA                               GA         TC         CC         GA         CT         CT         AG         GG                    GG
19-0001  2          CC         GG         TA                                          TC         GG                    CC         CC                                          GG         TT
19-0002  1          CC         AA         TA                    CC         GG                    GG                    CC         CT         AG
19-0002  2                                TA                    CC         GG                    GG         GG         CC         CT         AG
19-0003  1          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0003  2          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0004  1                                TA                    CA         GA         TC         CC                    CC         CT         AG         GG         CT
19-0004  2                                TA                    CA         GA                    CC                    CC         CT         AG         GG

This is what I want:

Sample   Replicate  X24854000  X24854056  X24854764  X24854903  X24855066  X24855114  X24855316  X24855449  X24855925  X24856070  X24856086  X24856329  X24856389  X24857235  X24857350  X24857404
19-0001  1          CC         GG         TA                               GA         TC         99         GA         99         99         AG         GG                    GG         TT
19-0002  1          CC         AA         TA                    CC         GG                    GG         GG         CC         CT         AG
19-0003  1          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0004  1                                TA                    CA         GA         TC         CC                    CC         CT         AG         GG         CT

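Replicates 1 and 2 should be merged into a single row per sample name. If a score is missing in one replicate, the score from the other replicate can be used, and identical scores collapse to one value. Any mismatch between the two replicates should be replaced with "99" so that it can be removed later.

I tried: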

data_merge <- data %>%
    group_by(Sample) %>%
    summarise_all(ifelse(statement), (if_true), (if_false))

I have only shown a subset of the data; the real data has 44 X-numbered columns.

Tags: r, merge, dplyr

Solution


Here is one option:

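library(dplyr)

df %>%
    # Convert factors to character so values can be compared with "" and "99"
    mutate_if(is.factor, as.character) %>%
    group_by(Sample) %>%
    summarise_at(
        vars(starts_with("X")),
        # Exactly one distinct non-empty value: keep it. Anything else
        # (a mismatch, or no non-empty value at all) becomes "99".
        ~if_else(length(unique(.x[.x != ""])) == 1, first(.x[.x != ""]), "99"))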
## A tibble: 4 x 17
#  Sample X24854000 X24854056 X24854764 X24854903 X24855066 X24855114 X24855316
#  <chr>  <chr>     <chr>     <chr>     <chr>     <chr>     <chr>     <chr>
#1 19-00… CC        GG        TA        99        99        GA        TC
#2 19-00… CC        AA        TA        99        CC        GG        99
#3 19-00… CC        99        TA        CT        CA        GA        TC
#4 19-00… 99        99        TA        99        CA        GA        TC
## … with 9 more variables: X24855449 <chr>, X24855925 <chr>, X24856070 <chr>,
##   X24856086 <chr>, X24856329 <chr>, X24856389 <chr>, X24857235 <chr>,
##   X24857350 <chr>, X24857404 <chr>
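
Note that the output above also writes "99" where neither replicate has a score (for example X24854903 in the first sample), and it drops the Replicate column, whereas the requested table leaves such cells blank and keeps Replicate. The following is a minimal sketch of a variant that handles both points. It assumes dplyr 1.0.0 or later (for `across()` and `.groups`); the helper `merge_calls()` and the small data frame `df_small` are hypothetical names introduced only for this illustration and are not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: merge the scores of one marker within one sample.
merge_calls <- function(x) {
    x <- as.character(x)
    calls <- unique(x[!is.na(x) & x != ""])
    if (length(calls) == 0) {
        ""        # no replicate has a score: leave the cell empty
    } else if (length(calls) == 1) {
        calls     # replicates agree, or only one replicate has a score
    } else {
        "99"      # replicates disagree: flag for later removal
    }
}

# Three markers for two samples, copied from the sample data (illustration only)
df_small <- data.frame(
    Sample    = c("19-0001", "19-0001", "19-0002", "19-0002"),
    Replicate = c(1L, 2L, 1L, 2L),
    X24854000 = c("", "CC", "CC", ""),
    X24854903 = c("", "", "", ""),
    X24855449 = c("CC", "GG", "GG", "GG"),
    stringsAsFactors = FALSE
)

merged <- df_small %>%
    group_by(Sample) %>%
    summarise(
        Replicate = first(Replicate),            # keep the first replicate number
        across(starts_with("X"), merge_calls),   # apply the rule to every marker
        .groups = "drop"
    )

print(merged)
```

Applied to the full data, the same `summarise()` call should work unchanged, since `starts_with("X")` selects all marker columns regardless of how many there are.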

Sample data

df <- read.table(text =
    "Sample  Replicate   X24854000   X24854056   X24854764   X24854903   X24855066   X24855114   X24855316   X24855449   X24855925   X24856070   X24856086   X24856329   X24856389   X24857235   X24857350   X24857404
19-0001 1   ''  GG  TA  ''  ''  GA  TC  CC  GA  CT  CT  AG  GG  ''  GG  ''
19-0001 2   CC  GG  TA  ''  ''  ''  TC  GG  ''  CC  CC  ''  ''  ''  GG  TT
19-0002 1   CC  AA  TA  ''  CC  GG  ''  GG  ''  CC  CT  AG  ''  ''  ''  ''
19-0002 2   ''  ''  TA  ''  CC  GG  ''  GG  GG  CC  CT  AG  ''  ''  ''  ''
19-0003 1   CC  ''  TA  CT  CA  GA  TC  CC  GA  CC  CT  AG  GG  CT  GA  AT
19-0003 2   CC  ''  TA  CT  CA  GA  TC  CC  GA  CC  CT  AG  GG  CT  GA  AT
19-0004 1   ''  ''  TA  ''  CA  GA  TC  CC  ''  CC  CT  AG  GG  CT  ''  ''
19-0004 2   ''  ''  TA  ''  CA  GA  ''  CC  ''  CC  CT  AG  GG  ''  ''  ''", header = TRUE)
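
Because every sample in this data has exactly two replicates, the merge rule can also be sketched in base R as a vectorised `ifelse()` over the two replicate tables, which is close in spirit to the `ifelse()` attempt in the question. This is only an illustration under stated assumptions: each sample has exactly one row for replicate 1 and one for replicate 2, missing scores are empty strings rather than `NA`, and the names `df_small`, `r1`, `r2` and `merged` are made up for the example. Unlike the dplyr answer, this version leaves a cell empty when both replicates are empty.

```r
# Three markers for two samples, copied from the sample data (illustration only)
df_small <- data.frame(
    Sample    = c("19-0001", "19-0001", "19-0002", "19-0002"),
    Replicate = c(1L, 2L, 1L, 2L),
    X24854000 = c("", "CC", "CC", ""),
    X24854903 = c("", "", "", ""),
    X24855449 = c("CC", "GG", "GG", "GG"),
    stringsAsFactors = FALSE
)

# Sort so that row i of replicate 1 and row i of replicate 2 are the same sample
df_small <- df_small[order(df_small$Sample, df_small$Replicate), ]
marker_cols <- grep("^X", names(df_small), value = TRUE)

# One character matrix per replicate, with identical shape and column order
r1 <- as.matrix(df_small[df_small$Replicate == 1, marker_cols])
r2 <- as.matrix(df_small[df_small$Replicate == 2, marker_cols])

# Element-wise rule:
#   replicate 1 empty                  -> take replicate 2 (which may also be empty)
#   replicate 2 empty, or both equal   -> take replicate 1
#   otherwise (two different scores)   -> "99"
merged_calls <- ifelse(r1 == "", r2, ifelse(r2 == "" | r1 == r2, r1, "99"))

merged <- data.frame(
    Sample    = df_small$Sample[df_small$Replicate == 1],
    Replicate = 1L,
    merged_calls,
    stringsAsFactors = FALSE,
    row.names = NULL
)

print(merged)
```

If some samples had one or three replicates, the two matrices would no longer line up, and a grouped approach such as the dplyr answer would be the safer choice.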
