Is there a way to cluster multiple sets of strings by similarity?

Question

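I have next-generation sequencing data for several patients (Patient1, Patient2, Patient3, ...).

Patient samples can come from the same disease or from different diseases. We know that certain mutations occur more frequently in certain diseases; some variants are disease-causing, while others are associated with a disease without our really knowing how they contribute to it. I am looking for a way to cluster these patients by their altered genes to see whether they share any common features. A gene can carry several different alterations (e.g. NRAS G12D vs. NRAS G13D vs. NRAS Q61K). The order of the altered genes within a patient does not matter. A typical patient has about 500 alterations, and there are about 100 patients.

I checked previous posts, but those questions were about clustering the strings that make up a single list, whereas this is about clustering across multiple lists of strings.

Thanks for your help.

The data for one patient looks like this: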

    #Patient1
    chromosome <- c("X",    "7",    "10",   "1",    "X",    "5",    "5",    "X",    "10",   "7")
    position <- c("70360589","128829066","89692923","11206853","70360680","176637576","176637471","70360648","89692913","148543694")
    reference <- c("AGC","A","G","AC","GCA","T","G","CAG","G","AA")
    alter <- c("","G","A","","","C","A","","A","")
    gene <- c("MED12","SMO","PTEN","MTOR","MED12","NSD1","NSD1","MED12","PTEN","EZH2")
    cdot <- c("c.6165_6167delGCA","c.74A>G","c.407G>A","c.4571-6_4571-5delGT","c.6256_6258delCAG","c.2176T>C","c.2071G>A","c.6226_6228delCAG","c.397G>A","c.118-5_118-4delTT")
    pdot <- c("Q2076del","D25G","C136Y"," ","Q2086del","S726P","A691T","Q2076del","V133I"," ")
    patient1 <- data.frame(chromosome, position, reference, alter, gene, cdot, pdot)

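Alterations can be expressed in different ways: a gene with its cdot, a gene with its gdot, a chromosome with ref and alter, and so on. The most convenient for me is gene and pdot, because it is more informative: it tells me which gene is altered and what the alteration is (e.g. PTEN is the gene, and C25G means that the reference amino acid "C" at position 25 is changed to amino acid "G").

I tried concatenating each gene and pdot pair into one string, so a patient with 10 alterations, as in the data frame above, gives 10 strings. I would do this for all patients and then cluster all patients by their alterations. My question is: what is the best way to cluster multiple patients in this example?

Here are two more patients: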

    #Patient2
    chromosome <- c("X","6","1","1","6","12","5","X","1","10")
    position <- c("47424495","157100024","78429978","242023898","30858801","49427266","176637576","70360648","78435702","89692913")
    reference <- c("A","GGA","T","A","C","TGC","T","CAG","AA","G")
    alter <- c("","","","G","","","C","","","A")
    gene <- c("ARAF","ARID1B","FUBP1","EXO1","DDR1","KMT2D","NSD1","MED12","FUBP1","PTEN")
    cdot <- c("c.416delA","c.983_985delGAG","c.901delA","c.836A>G","c.474delC","c.11220_11222delGCA","c.2176T>C","c.6226_6228delCAG","c.121-4_121-3delTT","c.397G>A")
    pdot <- c("K139fs","G328del","I301fs","N279S","M159fs","Q3745del","S726P","Q2076del","","V133I")
    patient2 <- data.frame(chromosome, position,  reference, alter, gene, cdot, pdot)


    #Patient3
    chromosome <- c("1","2","11","14","14","12","2","19","12","17","X","1","10")
    position <- c("120539781","141259448","64572018","35871217","102551161","49426952","29416366","18273047","49426730","29490295","70360648","78435702","89692913")
    reference <- c("G","A","T","G","TCT","C","G","T","GCT","G","CAG","AA","G")
    alter <- c("A","","C","A","","T","C","C","","A","","","A")
    gene <- c("NOTCH2","LRP1B","MEN1","NFKBIA","HSP90AA1","KMT2D","ALK","PIK3R2","KMT2D","NF1","MED12","FUBP1","PTEN")
    cdot <- c("c.590C>T","c.8663-5delT","c.1621A>G","c.*2C>T","c.1202_1204delAGA","c.11536G>A","c.4587C>G","c.937T>C","c.11756_11758delAGC","c.380G>A","c.6226_6228delCAG","c.121-4_121-3delTT","c.397G>A")
    pdot <- c("T197I","","T541A","","K401del","G3846S","D1529E","S313P","Q3919del","G127E","Q2076del","","V133I")
    patient3 <- data.frame(chromosome, position,  reference, alter, gene, cdot, pdot)

To make things simpler, I made this example:

    #Simple Example
    modules1 <- c("maths", "physics", "geometry", "languages", "science", "geology")
    scores1 <- c("A+", "A", "A", "B+", "B", "B")
    student1 <- data.frame(modules1, scores1)
    modules2 <- c("music", "dance", "languages", "science")
    scores2 <- c("A+", "A+", "A+", "B")
    student2 <- data.frame(modules2, scores2)
    modules3 <- c("languages", "science", "physics", "maths")
    scores3 <- c("A+", "A+", "A+", "A")
    student3 <- data.frame(modules3, scores3)

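How can I cluster students 1, 2 and 3 based on their scores? I would expect students 1 and 3 to be closer to each other in the dendrogram than either is to student 2.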

Tags: cluster-analysis, dna-sequence, mutation

Answer


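I suggest encoding the data in a numeric format, probably one-hot encoding, since this is categorical data.

I would also encode genes and mutations separately, because different mutations in the same gene may be equivalent.

For the following genes and mutations: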

    list_genes = ["gene1", "gene2", "gene3"]
    list_disease = ["disease1", "disease2"]
    list_mutations_patient1 = ["c25g", "g149e", "t543k"]
    list_mutations_patient2 = ["a50g", "", "t543k"]

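In each gene's sublist, the first position is true/false for any mutation in that gene, and the following positions are true/false for each mutation of that gene identified in the dataset. The last sublist (in each list of lists) is the disease status: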

    coded_list_gene_mutation_patient1 = [[1, 1, 0], [1, 1], [1, 1], [1, 0]]
    coded_list_gene_mutation_patient2 = [[1, 0, 1], [0, 0], [1, 1], [0, 1]]

Flatten the lists and append the data for all patients:

    all_patient_lists = [[1, 1, 0, 1, 1, 1, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1, 0, 1]]
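The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene-to-mutation mapping in `CATALOGUE` and the disease assigned to each patient are not given explicitly in the answer and are inferred here from the shape of the coded lists, and `encode_patient` is a hypothetical helper name.

```python
# Hypothetical gene -> known-mutation catalogue, inferred from the coded lists
# (gene1 has two known mutations, gene2 and gene3 have one each).
CATALOGUE = {
    "gene1": ["c25g", "a50g"],
    "gene2": ["g149e"],
    "gene3": ["t543k"],
}
DISEASES = ["disease1", "disease2"]


def encode_patient(mutations, disease):
    """Return a flat 0/1 vector: per gene [any mutation, each known mutation],
    followed by one flag per disease."""
    observed = {m for m in mutations if m}  # ignore empty strings
    vector = []
    for gene, known in CATALOGUE.items():
        flags = [1 if m in observed else 0 for m in known]
        vector.append(1 if any(flags) else 0)  # gene-level flag
        vector.extend(flags)                   # mutation-level flags
    vector.extend(1 if d == disease else 0 for d in DISEASES)
    return vector


# Disease labels per patient are assumptions read off the last coded sublist.
patient1 = encode_patient(["c25g", "g149e", "t543k"], "disease1")
patient2 = encode_patient(["a50g", "", "t543k"], "disease2")
all_patient_lists = [patient1, patient2]
print(all_patient_lists)
```

Every patient gets a vector of the same length, so the rows can be stacked into a matrix and handed to whatever reduction or clustering step comes next.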

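Because the lists could get very long, you should consider dimensionality reduction (PCA, LDA, or MDS). You can then plot the first 2 or 3 components to see how well they partition the data, and pass the PCA components to a true clustering algorithm (rather than a partitioning algorithm), such as hierarchical density-based clustering (HDBSCAN).

HDBSCAN assigns each sample to a cluster, given a minimum number of members required to form a cluster. This works well if you expect some noise in the data, because noise is classified as outliers rather than being forced into a cluster.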
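To illustrate why it helps to encode the gene and the specific mutation separately, here is a small sketch on the simplified student example from the question, where a module plays the role of a gene and a score plays the role of a mutation. Jaccard distance is my own choice for the illustration and is not part of the answer above. Three samples are too few for PCA or HDBSCAN to be meaningful, so this only compares pairwise distances. The helper names `features` and `jaccard_distance` are hypothetical.

```python
from itertools import combinations

# Student data copied from the simplified example in the question.
students = {
    "student1": {"maths": "A+", "physics": "A", "geometry": "A",
                 "languages": "B+", "science": "B", "geology": "B"},
    "student2": {"music": "A+", "dance": "A+", "languages": "A+", "science": "B"},
    "student3": {"languages": "A+", "science": "A+", "physics": "A+", "maths": "A"},
}


def features(record):
    """One feature per module taken, plus one per exact module/score pair."""
    feats = set(record)                                # e.g. "maths"
    feats |= {f"{m}:{s}" for m, s in record.items()}   # e.g. "maths:A+"
    return feats


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0 for identical sets, 1 for disjoint sets."""
    return 1 - len(a & b) / len(a | b)


feature_sets = {name: features(rec) for name, rec in students.items()}
distances = {
    (x, y): jaccard_distance(feature_sets[x], feature_sets[y])
    for x, y in combinations(feature_sets, 2)
}
closest_pair = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, round(d, 3))
print("closest:", closest_pair)
```

With module-level features included, students 1 and 3 come out as the closest pair, which matches the expectation stated in the question. If only the concatenated module/score strings were used, students 1 and 3 would share no features at all. A pairwise distance table like this one can also be passed to a hierarchical clustering routine to draw a dendrogram.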

