Filter rows with similar IDs by lowest p-value

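Question

I have a data frame of peaks that I mapped to various genomic regions, which gives me the peaks and their respective genes. Within a given distance, several peaks can map to the same gene, so I end up with something like this: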

     Peak        annotation         ENSEMBL log2FoldChange         padj UP_DOWN
Peak13361 Distal Intergenic ENSG00000000457       3.458416 1.429138e-03      UP
Peak13362 Distal Intergenic ENSG00000000457       2.208152 3.153138e-10      UP
Peak13356 Distal Intergenic ENSG00000000457      -2.092536 1.693891e-03    DOWN
Peak13329 Distal Intergenic ENSG00000000460       3.862953 2.713778e-05      UP
Peak13331 Distal Intergenic ENSG00000000460       2.535419 3.064567e-02      UP
 Peak2767          Promoter ENSG00000000938       2.664457 2.362797e-03      UP
 Peak2769 Distal Intergenic ENSG00000000938       1.588538 3.678620e-07      UP
 Peak2771 Distal Intergenic ENSG00000000938       1.818130 5.232734e-03      UP
 Peak2772 Distal Intergenic ENSG00000000938       1.800501 2.102107e-02      UP
Peak15396 Distal Intergenic ENSG00000000971       1.577753 1.045814e-02      UP

For example, from the first three peaks:

     Peak        annotation         ENSEMBL log2FoldChange         padj UP_DOWN
Peak13361 Distal Intergenic ENSG00000000457       3.458416 1.429138e-03      UP
Peak13362 Distal Intergenic ENSG00000000457       2.208152 3.153138e-10      UP
Peak13356 Distal Intergenic ENSG00000000457      -2.092536 1.693891e-03    DOWN

I only want to keep the most significant peak:

  Peak13362 Distal Intergenic ENSG00000000457       2.208152 3.153138e-10      UP

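Whenever an ENSEMBL ID has multiple peaks, the logic I have to follow is to keep the peak with the highest significance, i.e. the smallest padj.

Any suggestions or help would be much appreciated.

Tags: r, dataframe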

Answer


Without a minimal reproducible example I can't test this, but something along these lines should work:

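# For one ENSEMBL ID, return the row(s) with the smallest adjusted p-value
subsetting = function(x, df){
  df2 = subset(df, ENSEMBL == x) # rows mapped to this specific gene
  df2 = subset(df2, padj == min(padj)) # keep the smallest padj (ties are all kept)
  return(df2)
}

# Apply to every gene and stack the per-gene results into one data frame
do.call(rbind, lapply(unique(df$ENSEMBL), subsetting, df = df))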
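
Since the question wants one peak per ENSEMBL ID, here is a small self-contained base R sketch that can be run on its own. The data frame only includes five rows and three columns copied from the question, and the names `sorted` and `best` are placeholders chosen for illustration. It assumes `padj` is numeric and contains no `NA` values.

```r
# Sample rows copied from the question (only the columns needed here)
df <- data.frame(
  Peak    = c("Peak13361", "Peak13362", "Peak13356", "Peak13329", "Peak13331"),
  ENSEMBL = c("ENSG00000000457", "ENSG00000000457", "ENSG00000000457",
              "ENSG00000000460", "ENSG00000000460"),
  padj    = c(1.429138e-03, 3.153138e-10, 1.693891e-03,
              2.713778e-05, 3.064567e-02),
  stringsAsFactors = FALSE
)

# Sort by gene, then by ascending padj, so that the most significant
# peak is the first row within each gene
sorted <- df[order(df$ENSEMBL, df$padj), ]

# duplicated() flags every row after the first one per ENSEMBL ID;
# dropping the flagged rows leaves one peak per gene
best <- sorted[!duplicated(sorted$ENSEMBL), ]
print(best)
```

This variant always keeps exactly one row per gene, so if two peaks tie on `padj` only the first in sort order survives. If dplyr is available, grouping with `group_by(ENSEMBL)` and then calling `slice_min(padj, n = 1)` should express the same idea.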
