Calculating the mean only for matching strings

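Question

I have a task in which I have a file containing many chromosomes, and I need to calculate the mutation level for each of them. The problem is that each chromosome can appear multiple times, and I need the mean of all the mutation levels for that chromosome. On top of that, the mutations being averaged must be the same nucleotide change (T->C or G->A). The mutation level is calculated from DP4 in the INFO column, which contains four numbers representing [ref+, ref-, alt+, alt-]. Example of the file: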

  #CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Aligned.sortedByCoord.out.bam
    chr1    143755378   .   T   C   62  .   DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20    GT:PL:DP    1/1:89,179,0:548
    chr3    57644487    .   T   C   16.4448 .   DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20 GT:PL:DP    0/1:49,0,63:300
    chr3    80706912    .   T   C   212 .   DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20 GT:PL:DP    1/1:239,255,0:298

This is what I have done so far. I am a bit stuck and don't know how to continue from this point:

def vcf(file):
 with open(file, "r+") as my_file:
    """First I wanted to clear the headline"""
    for columns in my_file:
        if columns.startswith("#"):
            continue
        """Then I split the file into columns"""
        for columns in my_file:
            columns=columns.rstrip('\n').split('\t')
            """This is the info column"""
            for row in columns[7]:
                row = columns[7].split(";")
                """Using slicing I extracted the DP4 part and removed the str DP4"""
            DP4 = [row[-2]]
            new_DP4 = [x.replace("DP4=","") for x in DP4]
            """Then I took all the int outs and put them under the categories"""
            for x in new_DP4:
                xyz = x.split(",")
            ref_plus = int(xyz[0])
            ref_minus = int(xyz[1])
            alt_plus = int(xyz[2])
            alt_minus = int(xyz[3])
            """calculated the mean for each one"""
            formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
            """made a list of the chromosomes and their means"""
            chr_form = [columns[0] , columns[3], columns[4], (formula)]

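So basically I think I now have all the data in a list, and I could somehow group the rows with the same chr and compute the mean for each group, but I can't figure out how to do it. I also tried using regular expressions, but I'm not very familiar with them. This is my current chr_form output: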

['chr3', 'T', 'C', 0.44666666666666666]
['chr3', 'T', 'C', 0.9932885906040269]
['chr5', 'A', 'G', 0.42073170731707316]
['chr5', 'A', 'G', 0.5772870662460567]
['chr6', 'A', 'G', 0.5153061224489796]
['chr6', 'A', 'G', 0.8934010152284264]

and so on. But the output I ultimately want looks like this:

{1: {'T->C': 0.802}, 3: {'T->C': 0.446}}

I would be glad for any ideas or examples of how to calculate the mean for each chr.

Tags: python-3.x, string, function

Answer


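You have a lot of unnecessary for loops. The only loop you need is the one over the lines of the file. When you split a field, or remove something from a whole field, you don't need to iterate over the characters in that field.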
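To illustrate that point, here is a minimal sketch that handles a single INFO string with no inner loop over characters. The INFO value is copied from the first sample row in the question; `parse_info` and `mutation_level` are hypothetical helper names, and looking DP4 up by key instead of by position is an assumption made for robustness, not something the question requires.

```python
# INFO string copied from the first sample row in the question.
info = ("DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;"
        "MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20")

# Iterating over a string visits one character at a time,
# which is why `for row in columns[7]` does not walk over INFO entries.
first_three = list(info)[:3]


def parse_info(info_field):
    """Hypothetical helper: turn 'K1=V1;K2=V2' into a dict of strings."""
    result = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        # Entries without '=' are stored as True (treated as flags).
        result[key] = value if sep else True
    return result


def mutation_level(info_field):
    """Hypothetical helper: (alt+ + alt-) / (ref+ + ref- + alt+ + alt-)."""
    dp4 = parse_info(info_field)["DP4"]
    ref_plus, ref_minus, alt_plus, alt_minus = map(int, dp4.split(","))
    alt = alt_plus + alt_minus
    return alt / (alt + ref_plus + ref_minus)


level = mutation_level(info)  # 440 / 548 for this row
```

The only loop here runs over the `;`-separated entries, and the split and lookup each act on the whole field at once.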

Finally, you should add the results of the calculation to a dictionary.

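def vcf(file):
    # chromosome number -> {"REF->ALT": [sum of mutation levels, number of rows]}
    totals = {}
    with open(file, "r") as my_file:
        for line in my_file:
            if line.startswith("#"):  # skip header and comment lines
                continue
            fields = line.rstrip('\n').split('\t')
            # INFO is the eighth column
            info = fields[7].split(";")
            # DP4 is the second-to-last INFO entry in the sample rows; drop the "DP4=" prefix
            DP4 = info[-2].replace("DP4=", "")
            # Convert the four read counts to integers
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(','))
            # Mutation level: fraction of reads that support the alternate allele
            formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
            # Get the chromosome number from the first field (assumes names like chr1, chr3)
            chr_num = int(fields[0].replace('chr', ''))
            change = f'{fields[3]}->{fields[4]}'
            # Accumulate instead of overwriting, so repeated chromosomes can be averaged
            entry = totals.setdefault(chr_num, {}).setdefault(change, [0.0, 0])
            entry[0] += formula
            entry[1] += 1

    # Turn the running sums into means
    return {chr_num: {change: total / count for change, (total, count) in changes.items()}
            for chr_num, changes in totals.items()}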
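
The averaging step can also be looked at in isolation. Below is a minimal sketch, assuming rows shaped like the chr_form lists from the question (the values are copied from that output). `average_by_chromosome` is a hypothetical helper name, and the chromosome names are kept as strings such as 'chr3' rather than converted to integers.

```python
from collections import defaultdict

# Rows copied from the chr_form output shown in the question.
rows = [
    ['chr3', 'T', 'C', 0.44666666666666666],
    ['chr3', 'T', 'C', 0.9932885906040269],
    ['chr5', 'A', 'G', 0.42073170731707316],
    ['chr5', 'A', 'G', 0.5772870662460567],
    ['chr6', 'A', 'G', 0.5153061224489796],
    ['chr6', 'A', 'G', 0.8934010152284264],
]


def average_by_chromosome(rows):
    """Hypothetical helper: mean mutation level per chromosome and substitution."""
    # chromosome -> "REF->ALT" -> list of mutation levels
    grouped = defaultdict(lambda: defaultdict(list))
    for chrom, ref, alt, level in rows:
        grouped[chrom][f"{ref}->{alt}"].append(level)
    # Collapse each list of levels into its arithmetic mean.
    return {
        chrom: {change: sum(levels) / len(levels) for change, levels in changes.items()}
        for chrom, changes in grouped.items()
    }


means = average_by_chromosome(rows)
print(means)
```

Because the grouping key is the pair of chromosome and substitution, T->C and G->A rows on the same chromosome are averaged separately. For the rows above, chr3 with T->C comes out at about 0.72.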
