Filter grouped rows by the highest number of occurrences of a string using dplyr

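Problem description

I'm working on collapsing a transcriptomics dataset from the transcript level to the gene level for downstream analysis. In this dataset, each row has a gene identifier (qry_gene_id), and there can be multiple qry_transcript_ids per qry_gene_id. I'd like to filter the dataset to select, for each qry_gene_id, the qry_transcript_id that has the greatest number of go_ids (GO:XXXXXXX). The go_id column is a list of GO ids separated by ",".

Here is a subset of my data: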

structure(list(qry_transcript_id = c("TU22", "TU20", "TU27", 
"TU29", "TU25", "TU26", "TU28", "TU31", "TU24", "TU30"), go_id = c(NA, 
NA, "GO:0004672,GO:0005515,GO:0005524,GO:0006468", "GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169"
), ref_gene_id = c("LOC108906571", "LOC108906589", "LOC108906588", 
"LOC108906588", "LOC108906588", "LOC108906588", "LOC108906588", 
"LOC108906588", "LOC108906588", "LOC108906588"), qry_gene_id = c("G10", 
"G9", "G12", "G12", "G12", "G12", "G12", "G12", "G12", "G12"), 
    ref_gene_name = c("uncharacterized LOC108906571", "uncharacterized LOC108906589", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B"
    ), gene_annotation = c("refseq", "refseq", "refseq", "refseq", 
    "refseq", "refseq", "refseq", "refseq", "refseq", "refseq"
    ), ref_transcript_id = c("XM_018709871.1", "XM_018709894.2", 
    "XM_018709891.1", "XM_018709891.1", "XM_018709891.1", "XM_018709891.1", 
    "XM_018709891.1", "XM_018709891.1", "XM_018709891.1", "XM_018709891.1"
    ), ref_transcript_name = c("uncharacterized LOC108906571", 
    "uncharacterized LOC108906589", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2"), class_code = c("i", 
    "k", "j", "j", "=", "j", "j", "j", "j", "j")), row.names = 21:30, class = "data.frame")

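As you can see for qry_gene_id = G12, the first transcript is missing several GO ids. I want to make sure my filter selects a transcript with the complete set of GO ids.

However, I'm stuck on how to filter it appropriately. This is where I am so far: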

test_data <- test_data %>% group_by(qry_gene_id) %>% filter()

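It seems logical to me to do this either by 1) the total length of the string (which I think should capture the longest list of GO terms) or 2) counting the occurrences of a string (e.g. "GO") and selecting the row with the highest count. Basically, I don't want to miss any GO terms associated with each gene.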

Tags: r, dplyr

Solution


Here is one way to keep the row(s) with the highest count of "GO" in each group:

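library(dplyr)
library(stringr)
test_data %>% 
  # count how many times "GO" appears in each row's go_id string
  mutate(go_count = str_count(go_id, "GO")) %>%
  group_by(qry_gene_id) %>% 
  # keep the row(s) with the highest count within each gene
  slice_max(go_count)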

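Take a look at ?slice_max if you want to fine-tune this, for example to adjust what happens when there are ties. By default, all rows tied for the most "GO" occurrences in a group are kept.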
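
As a minimal sketch of that fine-tuning, assuming dplyr 1.0.0 or later and stringr are installed: the data frame `toy` below is hypothetical and only mimics the shape of the question's data. Setting `with_ties = FALSE` keeps a single row per gene, and the `coalesce()` step is an added assumption that treats a missing `go_id` as zero GO terms, so genes without annotation are not at risk of being dropped (how `slice_max()` handles `NA` may depend on the dplyr version).

```r
library(dplyr)
library(stringr)

# Hypothetical toy data mimicking the question's columns
toy <- data.frame(
  qry_transcript_id = c("TU22", "TU27", "TU29", "TU25"),
  qry_gene_id       = c("G10", "G12", "G12", "G12"),
  go_id = c(NA,
            "GO:0004672,GO:0005515",
            "GO:0005003,GO:0005515,GO:0005524",
            "GO:0005003,GO:0005515,GO:0005524"),
  stringsAsFactors = FALSE
)

best <- toy %>%
  # Count "GO:" occurrences; a missing go_id counts as 0 terms
  mutate(go_count = coalesce(str_count(go_id, "GO:"), 0L)) %>%
  group_by(qry_gene_id) %>%
  # with_ties = FALSE returns exactly one row per group
  slice_max(go_count, n = 1, with_ties = FALSE) %>%
  ungroup()

best
```

With ties broken this way, which of the equally annotated transcripts is returned is arbitrary from a biological point of view; if that matters, a second ordering criterion could be added before slicing.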

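You could also use something like slice(which.max(nchar(go_id))) to keep the row with the most characters. Note that which.max() returns a row position, so it belongs in slice(); filter() expects a logical condition.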
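
A minimal sketch of the character-count variant, assuming dplyr is installed; `toy` is again a hypothetical data frame, not the question's data. Because `which.max()` returns a row position, it is passed to `slice()` here, and counting a missing `go_id` as length 0 is an added assumption so that genes with no annotation still yield one row.

```r
library(dplyr)

# Hypothetical toy data mimicking the question's columns
toy <- data.frame(
  qry_transcript_id = c("TU22", "TU27", "TU29", "TU25"),
  qry_gene_id       = c("G10", "G12", "G12", "G12"),
  go_id = c(NA,
            "GO:0004672,GO:0005515",
            "GO:0005003,GO:0005515,GO:0005524",
            "GO:0005003,GO:0005515,GO:0005524"),
  stringsAsFactors = FALSE
)

longest <- toy %>%
  # nchar() of a missing string is NA; treat it as length 0
  mutate(go_chars = coalesce(nchar(go_id), 0L)) %>%
  group_by(qry_gene_id) %>%
  # which.max() gives the position of the first maximum in each group
  slice(which.max(go_chars)) %>%
  ungroup()

longest
```

String length works as a proxy for the number of terms only because every GO id has the same width ("GO:" plus seven digits) and the separator is consistent; counting the ids directly is the more explicit choice.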

