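python - R - Matching combinations of nested list values against an index and returning values
Question
Hi, I have two datasets. The first is a list of genes linked to a given cluster (0-7):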
# gene output
Cluster <- rep(0:7, each = 10)
Gene <- c("LMO3", "NEUROD6", "NFIB", "SNAP25", "RTN1", "CPE", "SOX11", "CSRP2", "VAMP2", "ID2", "EMX2", "LHX5-AS1","PEG10",
"HES1", "TRH", "WLS", "TPBG", "RPS29", "CRABP2", "RSPO3", "RPL17", "RPL7", "PTMA", "RPL36A", "HMGN2", "H2AFZ",
"NFIB", "PABPC1", "NEUROD6", "HNRNPH1", "PTN", "FABP7", "IGFBP2", "ID4", "C1orf61", "VIM", "RPS27L", "FABP5",
"SDCBP", "BNIP3", "TCF7L2", "NEFL", "HMGCS1", "GAP43", "GPM6A", "SQLE", "ID4", "MSMO1", "SCOC", "BASP1", "TTR",
"MEST", "TPBG", "MDK", "TMBIM6", "RCN1", "C8orf59","ID3","PKM", "PTN", "NCOR1", "ELAVL4", "NNAT", "ETFB",
"STMN2", "TUBA1A", "GNG3", "MALAT1", "SOX4", "TUBB2B", "CRYAB", "GFAP", "CHCHD2", "HOPX", "LGALS1", "SCRG1", "ISG15",
"AC090498.1", "B2M", "CLU")
df <- data.frame(Cluster, Gene)  # no cbind(): it would coerce Cluster to character
The second is an index that provides cell type annotations for specific combinations of genes:
# index
Type <- c("Radial Glia", "Excitatory Neuron ", "Inhibitory Neuron","Inhibitory Neuron",
"IPC","Excitatory Neuron ","Radial Glia","Microglia","IPC","Inhibitory Neuron")
Subtype <- c("early", "Layer IV", "SST-MGE1", "SST-MGE1", "IPC-div2",
"Parietal and Temporal", "oRG/Astrocyte", "Microglia", "IPC-new", "MGE2")
Markers <- c("TOP2A AURK HMGB CTNNB1", "PPP1R1B SCN2A RORB CRYM", "DLX6-AS1 DLX1 SST DCX", "ERBB4 SST DLX2 DLX5 DLX6-AS1",
"CCNB2 NEUROD4 KIF15 PENK HES6 ZFHX4 GLI3", "MEF2C STMN2 FLT ROBO CRYM", "AQP4 GFAP AGT DIO2 IL33",
"C1QB AIF1 CCL4 C1QC", "CENPK EOMES", "CCK LHX6 SCGN SST")
index <- data.frame(Type, Subtype, Markers)  # cbind() is not needed to build the data frame
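I'm trying to find the specific combinations outlined in Markers within the gene lists in my df. When such a match is found, the corresponding Type and Subtype should be returned. However, there are a few caveats that I'm struggling to get my head around.
- The gene list for each cluster may contain more than one marker combination, so the function should iterate over every marker combination rather than stop at the first match.
- The index matching should run on each cluster separately, i.e. check the genes in cluster 0 for marker matches and return the Type/Subtype, then repeat the step for cluster 1, and so on.
My project data consists of dozens of df-like outputs, each made up of a varying number of clusters, and each cluster contains hundreds to thousands of genes. I've done my best to search online for a solution, but unfortunately I'm drawing a complete blank here.
Any help or advice would be greatly appreciated.
Edit:
The output might look like this: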
  Cluster    Gene        Type Subtype
1       0    LMO3 Radial Glia   early
2       0 NEUROD6        <NA>    <NA>
3       0    NFIB        <NA>    <NA>
4       0  SNAP25        <NA>    <NA>
5       0    RTN1        <NA>    <NA>
6       0     CPE        <NA>    <NA>
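Here, one correct match would add a row to the df with the corresponding Type and Subtype for each cluster, with the rest left empty (NAs).
Answer
I'm assuming you want to annotate each gene cluster with the types from the index, where a type applies when all of its markers are present in the cluster's gene pool.
I'll also use simplified data sets. Here are two simplified types in the index, stored with one marker per row: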
library(tidyverse)
index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C")),
)
index
#> # A tibble: 4 x 3
#>   type  subtype markers
#>   <chr> <chr>   <chr>
#> 1 AB    X       A
#> 2 AB    X       B
#> 3 BC    Y       B
#> 4 BC    Y       C
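The question's index stores each marker set as a single space-separated string, so it needs reshaping before it matches the one-marker-per-row layout above. The following is a minimal sketch of that step, assuming tidyr is available. The names index_wide, index_long and combo are hypothetical and not part of the original answer. The combo id is added because the question's index has two rows that share the same Type and Subtype (SST-MGE1) but list different markers.

```r
library(dplyr)
library(tidyr)

# Three rows copied from the question's index, in its original wide layout
index_wide <- tibble(
  Type    = c("Excitatory Neuron ", "Inhibitory Neuron", "Inhibitory Neuron"),
  Subtype = c("Layer IV", "SST-MGE1", "SST-MGE1"),
  Markers = c("PPP1R1B SCN2A RORB CRYM", "DLX6-AS1 DLX1 SST DCX",
              "ERBB4 SST DLX2 DLX5 DLX6-AS1")
)

index_long <- index_wide %>%
  mutate(
    Type  = trimws(Type),   # drop the trailing space in "Excitatory Neuron "
    combo = row_number()    # hypothetical id: one per marker combination
  ) %>%
  # Split on spaces only, so hyphenated symbols such as "DLX6-AS1" stay intact
  separate_rows(Markers, sep = " +") %>%
  rename(type = Type, subtype = Subtype, markers = Markers)

index_long
```

With this layout, grouping by combo in addition to type and subtype would keep the two SST-MGE1 marker sets separate instead of pooling them into one.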
Three different clusters illustrate the different matching scenarios:
clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")), # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")), # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E")), # No matches
)
clusters
#> # A tibble: 9 x 2
#>   cluster genes
#>     <dbl> <chr>
#> 1       0 A
#> 2       0 B
#> 3       0 C
#> 4       1 B
#> 5       1 C
#> 6       1 D
#> 7       2 C
#> 8       2 D
#> 9       2 E
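I'd approach this by first creating a function that returns the matching types for a given pool of genes:
# Return the (type, subtype) pairs whose markers are all present in `genes`.
# Note: `index` is read from the enclosing environment.
match_index <- function(genes) {
  matches <- index %>%
    group_by(type, subtype) %>%
    # Keep a group only if every one of its markers is in the gene pool
    filter(all(markers %in% genes)) %>%
    distinct(type, subtype)
  # If none matched, return a row of NAs
  if (nrow(matches)) matches else matches[NA_integer_, ]
}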
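To see what such a function returns before applying it per cluster, it can be called directly on a gene vector. The sketch below uses a hypothetical variant, match_types, that takes the index as an argument rather than reading a global, and that ungroups before building the result. It is an illustration under those assumptions, not the function from the answer itself.

```r
library(dplyr)

index <- tibble(
  type    = c("AB", "AB", "BC", "BC"),
  subtype = c("X", "X", "Y", "Y"),
  markers = c("A", "B", "B", "C")
)

# Hypothetical variant of match_index(): the index is passed in explicitly
match_types <- function(genes, index) {
  matches <- index %>%
    group_by(type, subtype) %>%
    # all() gives one TRUE/FALSE per group: are all markers in the gene pool?
    filter(all(markers %in% genes)) %>%
    ungroup() %>%
    distinct(type, subtype)
  # No match: return a single row of NAs so the gene pool is still reported
  if (nrow(matches) > 0) {
    matches
  } else {
    tibble(type = NA_character_, subtype = NA_character_)
  }
}

both <- match_types(c("A", "B", "C"), index)  # contains both marker sets
none <- match_types(c("D", "E"), index)       # contains neither marker set
both
none
```

Passing the index as an argument makes it easier to reuse the same function across several index tables, which may help when there are dozens of df-like outputs to annotate.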
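Then you can summarise each cluster with the function. The output below comes from dplyr 1.0.x; from dplyr 1.1.0 onward, reframe() is the verb intended for results with more than one row per group.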
clusters %>%
  group_by(cluster) %>%
  summarise(match_index(genes))
#> `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups:   cluster [3]
#>   cluster type  subtype
#>     <dbl> <chr> <chr>
#> 1       0 AB    X
#> 2       0 BC    Y
#> 3       1 BC    Y
#> 4       2 <NA>  <NA>
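For readers on a newer dplyr, the same per-cluster step can be sketched with reframe(). This assumes dplyr 1.1.0 or later, where reframe() is available and returns an ungrouped result. The helper match_types is a hypothetical stand-in for the matching function, taking the index as an argument, and the data are the simplified sets from above.

```r
library(dplyr)  # reframe() requires dplyr >= 1.1.0

index <- tibble(
  type    = c("AB", "AB", "BC", "BC"),
  subtype = c("X", "X", "Y", "Y"),
  markers = c("A", "B", "B", "C")
)
clusters <- tibble(
  cluster = rep(0:2, each = 3),
  genes   = c("A", "B", "C",  "B", "C", "D",  "C", "D", "E")
)

# Hypothetical helper: types whose markers are all present in `genes`
match_types <- function(genes, index) {
  matches <- index %>%
    group_by(type, subtype) %>%
    filter(all(markers %in% genes)) %>%
    ungroup() %>%
    distinct(type, subtype)
  if (nrow(matches) > 0) {
    matches
  } else {
    tibble(type = NA_character_, subtype = NA_character_)
  }
}

# One or more rows per cluster; clusters without a match get a row of NAs
annotations <- clusters %>%
  group_by(cluster) %>%
  reframe(match_types(genes, index))

annotations
```

Cluster 0 yields two rows, cluster 1 one row, and cluster 2 a single row of NAs, matching the summarise() result shown above.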