R - Matching combinations of nested list values from an index and returning values

Problem description

Hi, I have two datasets. The first is a list of genes linked to a given cluster (0-7):

# gene output

Cluster <- rep(0:7, each = 10)

Gene <- c("LMO3", "NEUROD6", "NFIB", "SNAP25", "RTN1", "CPE", "SOX11", "CSRP2", "VAMP2", "ID2", "EMX2", "LHX5-AS1","PEG10",
          "HES1", "TRH", "WLS", "TPBG", "RPS29", "CRABP2", "RSPO3", "RPL17", "RPL7", "PTMA", "RPL36A", "HMGN2", "H2AFZ",
          "NFIB", "PABPC1", "NEUROD6", "HNRNPH1", "PTN", "FABP7", "IGFBP2", "ID4", "C1orf61", "VIM", "RPS27L", "FABP5",
          "SDCBP", "BNIP3", "TCF7L2", "NEFL", "HMGCS1", "GAP43", "GPM6A", "SQLE", "ID4", "MSMO1", "SCOC", "BASP1", "TTR",
          "MEST", "TPBG", "MDK", "TMBIM6", "RCN1", "C8orf59","ID3","PKM", "PTN", "NCOR1", "ELAVL4", "NNAT", "ETFB",
          "STMN2", "TUBA1A", "GNG3", "MALAT1", "SOX4", "TUBB2B", "CRYAB", "GFAP", "CHCHD2", "HOPX", "LGALS1", "SCRG1", "ISG15",
          "AC090498.1", "B2M", "CLU")

df <- data.frame(cbind(Cluster, Gene))

The second is an index that provides cell type annotations for specific combinations of genes:

# index

Type <- c("Radial Glia", "Excitatory Neuron ", "Inhibitory Neuron","Inhibitory Neuron",
          "IPC","Excitatory Neuron ","Radial Glia","Microglia","IPC","Inhibitory Neuron")

Subtype <- c("early", "Layer IV", "SST-MGE1", "SST-MGE1", "IPC-div2", 
             "Parietal and Temporal", "oRG/Astrocyte", "Microglia", "IPC-new", "MGE2")

Markers <- c("TOP2A AURK HMGB CTNNB1", "PPP1R1B SCN2A RORB CRYM", "DLX6-AS1 DLX1 SST DCX", "ERBB4 SST DLX2 DLX5 DLX6-AS1",
             "CCNB2 NEUROD4 KIF15 PENK HES6 ZFHX4 GLI3", "MEF2C STMN2 FLT ROBO CRYM", "AQP4 GFAP AGT DIO2 IL33",
             "C1QB AIF1 CCL4 C1QC", "CENPK EOMES", "CCK LHX6 SCGN SST")

index <- data.frame(cbind(Type, Subtype, Markers))

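I am trying to find the specific combinations outlined in Markers within the gene list in my df. When such a match is found, the corresponding Type and Subtype should be returned. However, there are a couple of caveats that I am finding hard to work around.

  1. Each cluster's gene list may contain more than one marker combination, so the function should iterate through every marker combination rather than stop at the first match.
  2. The index matching should run on each cluster separately, i.e. check the genes in cluster 0 for marker matches and return the Type/Subtype, then repeat the step for cluster 1, and so on.

My project data consists of dozens of df-like outputs, made up of varying numbers of individual clusters, each containing hundreds to thousands of genes. I have done my best to search for a solution online, but unfortunately I am drawing a complete blank here.

Any help, advice, or suggestions would be greatly appreciated.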

Edit:

The output might look like this:

  Cluster    Gene        Type Subtype
1       0    LMO3 Radial Glia   early
2       0 NEUROD6        <NA>    <NA>
3       0    NFIB        <NA>    <NA>
4       0  SNAP25        <NA>    <NA>
5       0    RTN1        <NA>    <NA>
6       0     CPE        <NA>    <NA>

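Here a correct match adds the corresponding Type and Subtype to a row of df for each cluster, and the remaining rows are left empty (NA).

Tags: python, r, dataframe, seurat

Solution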


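I am assuming that you want to annotate each cluster of genes with a type from the index whenever all of that type's markers are present in the cluster's gene pool.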
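
As a minimal sketch of that matching rule in base R: a marker set counts as a match only when every one of its markers is found in the gene pool. The vectors `genes`, `markers_ab`, and `markers_bd` are made-up names for illustration and do not come from the question's data.

```r
# Hypothetical gene pool for one cluster
genes <- c("A", "B", "C")

# Two hypothetical marker sets
markers_ab <- c("A", "B")  # both markers are in the pool
markers_bd <- c("B", "D")  # "D" is missing from the pool

# %in% tests each marker individually; all() requires every test to pass
match_ab <- all(markers_ab %in% genes)
match_bd <- all(markers_bd %in% genes)

match_ab  # TRUE
match_bd  # FALSE
```

A partial overlap, as with `markers_bd`, is treated as no match under this assumption.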

I am also going to use some simplified datasets. First, two simplified types in the index:

library(tidyverse)

index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C")),
)

index
#> # A tibble: 4 x 3
#>   type  subtype markers
#>   <chr> <chr>   <chr>  
#> 1 AB    X       A      
#> 2 AB    X       B      
#> 3 BC    Y       B      
#> 4 BC    Y       C
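
The simplified index has one marker per row, whereas the index in the question stores each marker set as a single space-separated string. One possible way to reshape it, sketched here with `tidyr::separate_rows()` on two rows copied from the question, is shown below. The names `index_wide` and `index_long` are made up for this example, and the lower-case column names are an assumption chosen to match the simplified index.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the question's index; the other rows are omitted here
index_wide <- tibble(
  Type    = c("Radial Glia", "IPC"),
  Subtype = c("oRG/Astrocyte", "IPC-new"),
  Markers = c("AQP4 GFAP AGT DIO2 IL33", "CENPK EOMES")
)

index_long <- index_wide %>%
  # Split on a literal space. The default separator splits on any
  # non-alphanumeric character, which would break symbols such as DLX6-AS1.
  separate_rows(Markers, sep = " ") %>%
  # Rename to the column names used in the simplified example
  rename(type = Type, subtype = Subtype, markers = Markers)

index_long
```

This gives one row per marker (seven rows for the two types above), which is the shape the grouped `filter()` approach expects.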

And three different clusters that illustrate the different matching scenarios:

clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")), # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")), # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E")), # No matches
)

clusters
#> # A tibble: 9 x 2
#>   cluster genes
#>     <dbl> <chr>
#> 1       0 A    
#> 2       0 B    
#> 3       0 C    
#> 4       1 B    
#> 5       1 C    
#> 6       1 D    
#> 7       2 C    
#> 8       2 D    
#> 9       2 E

I would approach this by first creating a function that returns the matching types for a given gene pool:

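# Return the (type, subtype) pairs whose markers are all found in `genes`.
# Note: this reads `index` from the enclosing (global) environment.
match_index <- function(genes) {
  matches <- index %>% 
    group_by(type, subtype) %>% 
    # Keep a group only if every one of its markers is in the gene pool
    filter(all(markers %in% genes)) %>% 
    ungroup() %>% 
    distinct(type, subtype)

  # If none matched, return a row of NAs  
  if (nrow(matches)) matches else matches[NA_integer_, ]
}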

Then you can summarise each cluster with the function:

clusters %>% 
  group_by(cluster) %>% 
  summarise(match_index(genes))
#> `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups:   cluster [3]
#>   cluster type  subtype
#>     <dbl> <chr> <chr>  
#> 1       0 AB    X      
#> 2       0 BC    Y      
#> 3       1 BC    Y      
#> 4       2 <NA>  <NA>
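
The output above comes from an older dplyr release. From dplyr 1.1.0 onwards, `summarise()` is deprecated for results with more than one row per group, and `reframe()` is the suggested replacement. The following is a hedged, self-contained sketch of the same idea using `reframe()`. The function `match_index2()` is a made-up variant that takes the index as an argument instead of reading it from the global environment; it is not part of the original answer.

```r
library(dplyr)

# Simplified data, mirroring the example above
index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C"))
)

clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")),  # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")),  # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E"))   # no matches
)

# Hypothetical variant of match_index() with the index passed explicitly
match_index2 <- function(genes, index) {
  matches <- index %>%
    group_by(type, subtype) %>%
    filter(all(markers %in% genes)) %>%
    ungroup() %>%
    distinct(type, subtype)

  if (nrow(matches) > 0) {
    matches
  } else {
    # No type matched: return a single row of NAs so the cluster is kept
    tibble(type = NA_character_, subtype = NA_character_)
  }
}

# reframe() allows any number of rows per group (requires dplyr >= 1.1.0)
result <- clusters %>%
  reframe(match_index2(genes, index), .by = cluster)

result
```

Passing the index explicitly makes it easier to run the same function over several df-like outputs with different indexes, at the cost of one extra argument.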
