Summarizing a data frame with NA values

Question

I have 3 different classes of mutations, for example

 "CNA"                "MUTATIONS"          "STRUCTURAL_VARIANT"


# Output of dput(e), assigned back to `e` so the example is reproducible
e <-
structure(list(track_name = c("AR", "ASCL1", "ATOH1", "PRDM1", 
"DLX1", "DLX2", "EPAS1", "ETV2", "EYA2", "FOXG1", "FOXC2", "GATA1", 
"GATA2", "GATA3", "GATA4", "GATA6", "GBX1", "GLI2", "GLI3", "MNX1"
), track_type = c("CNA", "CNA", "CNA", "CNA", "CNA", "CNA", "CNA", 
"CNA", "CNA", "CNA", "CNA", "CNA", "CNA", "CNA", "CNA", "CNA", 
"CNA", "CNA", "CNA", "CNA"), `TCGA-AB-2929` = c("amp_rec", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "Amplification", NA, NA, 
NA, NA, NA, NA, NA, NA), aml_ohsu_2018_1408 = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    aml_ohsu_2018_1992 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

This gives me a data frame like this

track_name track_type `TCGA-AB-2929` aml_ohsu_2018_1408 aml_ohsu_2018_1992
   <chr>      <chr>      <chr>          <chr>              <chr>             
 1 AR         CNA        amp_rec        NA                 NA                
 2 ASCL1      CNA        NA             NA                 NA                
 3 ATOH1      CNA        NA             NA                 NA                
 4 PRDM1      CNA        NA             NA                 NA                
 5 DLX1       CNA        NA             NA                 NA                
 6 DLX2       CNA        NA             NA                 NA                
 7 EPAS1      CNA        NA             NA                 NA                
 8 ETV2       CNA        NA             NA                 NA                
 9 EYA2       CNA        NA             NA                 NA                
10 FOXG1      CNA        NA             NA                 NA                
11 FOXC2      CNA        NA             NA                 NA                
12 GATA1      CNA        Amplification  NA                 NA                
13 GATA2      CNA        NA             NA                 NA                
14 GATA3      CNA        NA             NA                 NA                
15 GATA4      CNA        NA             NA                 NA                
16 GATA6      CNA        NA             NA                 NA                
17 GBX1       CNA        NA             NA                 NA                
18 GLI2       CNA        NA             NA                 NA                
19 GLI3       CNA        NA             NA                 NA                
20 MNX1       CNA        NA             NA                 NA    

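This is a small subset of my data. The first column contains the gene, the second column contains the mutation class, and each remaining column is one sample.

I am trying to find, for each gene, the distribution of mutations across the samples within these classes. The columns after the second one contain various mutation types, for example

Amplification, Inframe Mutation (putative passenger), Deep Deletion, Missense Mutation (putative passenger), spread across the sample columns.

In my example data frame I have an observation like this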

GATA1 CNA Amplification

This is what I am doing

table(Store2df$track_name, Store2df$track_type) %>% prop.table() %>% round(2)

Is there a better way to summarize this?

Tags: r

Solution


Not necessarily a better way, but if you are using dplyr you can do this:

library(dplyr)

e %>%
  count(track_name, track_type) %>%    # one row per gene/class pair, with its count in n
  mutate(n = round(prop.table(n), 2))  # turn the counts into proportions of all rows

This returns the data in long format.
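The counts above only use `track_name` and `track_type`, so the mutation labels stored in the sample columns (for example `Amplification`) do not enter the summary. If the goal is the distribution of those labels per gene, one possible approach is to reshape the sample columns to long format first. The following is a minimal sketch, not part of the original answer. It assumes tidyr is available, and `e_small`, `sample`, `mutation`, and `n_samples` are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in with the same layout as `e`:
# gene, mutation class, then one column per sample.
e_small <- tibble(
  track_name = c("AR", "GATA1", "GATA2"),
  track_type = c("CNA", "CNA", "CNA"),
  `TCGA-AB-2929` = c("amp_rec", "Amplification", NA),
  aml_ohsu_2018_1408 = c(NA_character_, NA_character_, NA_character_)
)

mutation_counts <- e_small %>%
  # Stack every sample column into (sample, mutation) pairs and drop the NAs
  pivot_longer(
    cols = -c(track_name, track_type),
    names_to = "sample",
    values_to = "mutation",
    values_drop_na = TRUE
  ) %>%
  # Count how many samples carry each mutation label per gene and class
  count(track_name, track_type, mutation, name = "n_samples")

print(mutation_counts)
```

Genes whose sample columns are all `NA` drop out of this result. Set `values_drop_na = FALSE` if they should be kept and counted under `NA`.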

