How to correctly clear lists so that each output row holds only the matches from its own file (file read -> file write)

Problem description

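I have a series of files from which I am extracting strings with re.search().

The file I am writing is a tab-delimited list containing the name, hsa, matSEQID, and matACC. matSEQID and matACC are each a comma-delimited list that I build with ','.join().

I am struggling to work out how to write the new file so that each row includes only the values extracted from its own input file (the first row should contain only the values from the first file, and so on). Currently the name and hsa columns are correct, but the other two columns contain either every match from all files or only the matches from the last file, depending on how I clear the lists. How do I make each row unique? I tried clearing the lists after each file, but that does not seem to work. Am I thinking about this correctly? Thanks.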

import re
import os
import sys

List_of_names = []
List_of_hsa = []
List_of_matSEQID = []
List_of_matACC = []
fileList = []

path = "/blah/blah/blah/stuff/"
dirs = os.listdir(path)

for file in dirs:
    fileList.append(path+file)

outname = sys.argv[1]

output_fhandle = open(outname, "w")


def linesplitter(M_object):
    Temp_storage = M_object.group()
    new_storage1 = Temp_storage.split(">")
    new_storage2 = new_storage1[1].split("<")
    List_of_names.append(new_storage2[0])

def stem_patternID(M_object):
    string = M_object.group()
    List_of_hsa.append(string)

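def mat_seqID(M_object):
    string = M_object.group()
    List_of_matSEQID.append(string)

# mature_acc_func is called below but is not shown in the post;
# it is assumed here to mirror mat_seqID.
def mature_acc_func(M_object):
    string = M_object.group()
    List_of_matACC.append(string)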

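# acc_pattern, stem_pattern, mature_seq_pattern and mature_acc_pattern are
# regular expressions that are not shown in the post.
for file in fileList: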
    fh_html = open(file).readlines()
    for line in fh_html:
        temp_string = line
        match_object = re.search(acc_pattern, temp_string)
        match_obj2 = re.search(stem_pattern, temp_string)
        match_obj3 = re.search(mature_seq_pattern, temp_string)
        match_obj4 = re.search(mature_acc_pattern, temp_string)
        if (match_object):
            linesplitter(match_object)
        if(match_obj2):
            stem_patternID(match_obj2)
        if(match_obj3):
            mat_seqID(match_obj3)
        if(match_obj4):
            mature_acc_func(match_obj4)

    my_matseqid_string = ','.join(List_of_matSEQID)
    mat_acc_string = ','.join(List_of_matACC)

    with open(outname, "w") as f:
        for (name, hsa) in zip(List_of_names, List_of_hsa):
            f.write("{0}\t{1}\t{2}\t{3}\n".format(name, hsa, my_matseqid_string, mat_acc_string))
    List_of_matSEQID.clear()
    List_of_matACC.clear()

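If I remove the .clear() calls at the end, every match from every file ends up in those lists. If I keep them, I only get the matches from the last file. How can I make columns 2 and 3 (counting from zero: matSEQID and matACC) contain only the values that belong to the given file? Thanks.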

Here are the two incorrect output files. 1) With .clear() left in place:

MI0023620       Stem-loop sequence hsa-mir-7159 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023613       Stem-loop sequence hsa-mir-7153 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023562       Stem-loop sequence hsa-mir-6077-2       Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023619       Stem-loop sequence hsa-mir-7161 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023616       Stem-loop sequence hsa-mir-7156 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023612       Stem-loop sequence hsa-mir-7152 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023565       Stem-loop sequence hsa-mir-6511a-3      Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023622       Stem-loop sequence hsa-mir-486-2        Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023431       Stem-loop sequence hsa-mir-6511b-2      Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023611       Stem-loop sequence hsa-mir-7151 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023618       Stem-loop sequence hsa-mir-7158 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023563       Stem-loop sequence hsa-mir-6089-2       Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023564       Stem-loop sequence hsa-mir-6511a-2      Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023561       Stem-loop sequence hsa-mir-3690-2       Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023623       Stem-loop sequence hsa-mir-7162 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023610       Stem-loop sequence hsa-mir-7150 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023614       Stem-loop sequence hsa-mir-7154 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023615       Stem-loop sequence hsa-mir-7155 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023617       Stem-loop sequence hsa-mir-7157 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023566       Stem-loop sequence hsa-mir-6511a-4      Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231
MI0023621       Stem-loop sequence hsa-mir-7160 Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p MIMAT0028230,MIMAT0028231

2) With .clear() removed (first 3 lines):

MI0023620       Stem-loop sequence hsa-mir-7159 Mature sequence hsa-miR-7159-5p,Mature sequence hsa-miR-7159-3p,Mature sequence hsa-miR-7153-5p,Mature sequence hsa-miR-7153-3p,Mature sequence hsa-miR-6077,Mature sequence hsa-miR-7161-5p,Mature sequence hsa-miR-7161-3p,Mature sequence hsa-miR-7156-5p,Mature sequence hsa-miR-7156-3p,Mature sequence hsa-miR-7152-5p,Mature sequence hsa-miR-7152-3p,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-486-5p,Mature sequence hsa-miR-486-3p,Mature sequence hsa-miR-6511b-5p,Mature sequence hsa-miR-6511b-3p,Mature sequence hsa-miR-7151-5p,Mature sequence hsa-miR-7151-3p,Mature sequence hsa-miR-7158-5p,Mature sequence hsa-miR-7158-3p,Mature sequence hsa-miR-6089,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-3690,Mature sequence hsa-miR-7162-5p,Mature sequence hsa-miR-7162-3p,Mature sequence hsa-miR-7150,Mature sequence hsa-miR-7154-5p,Mature sequence hsa-miR-7154-3p,Mature sequence hsa-miR-7155-5p,Mature sequence hsa-miR-7155-3p,Mature sequence hsa-miR-7157-5p,Mature sequence hsa-miR-7157-3p,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p       MIMAT0028228,MIMAT0028229,MIMAT0028216,MIMAT0028217,MIMAT0023702,MIMAT0028232,MIMAT0028233,MIMAT0028222,MIMAT0028223,MIMAT0028214,MIMAT0028215,MIMAT0025478,MIMAT0025479,MIMAT0002177,MIMAT0004762,MIMAT0025847,MIMAT0025848,MIMAT0028212,MIMAT0028213,MIMAT0028226,MIMAT0028227,MIMAT0023714,MIMAT0025478,MIMAT0025479,MIMAT0018119,MIMAT0028234,MIMAT0028235,MIMAT0028211,MIMAT0028218,MIMAT0028219,MIMAT0028220,MIMAT0028221,MIMAT0028224,MIMAT0028225,MIMAT0025478,MIMAT0025479,MIMAT0028230,MIMAT0028231
MI0023613       Stem-loop sequence hsa-mir-7153 Mature sequence hsa-miR-7159-5p,Mature sequence hsa-miR-7159-3p,Mature sequence hsa-miR-7153-5p,Mature sequence hsa-miR-7153-3p,Mature sequence hsa-miR-6077,Mature sequence hsa-miR-7161-5p,Mature sequence hsa-miR-7161-3p,Mature sequence hsa-miR-7156-5p,Mature sequence hsa-miR-7156-3p,Mature sequence hsa-miR-7152-5p,Mature sequence hsa-miR-7152-3p,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-486-5p,Mature sequence hsa-miR-486-3p,Mature sequence hsa-miR-6511b-5p,Mature sequence hsa-miR-6511b-3p,Mature sequence hsa-miR-7151-5p,Mature sequence hsa-miR-7151-3p,Mature sequence hsa-miR-7158-5p,Mature sequence hsa-miR-7158-3p,Mature sequence hsa-miR-6089,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-3690,Mature sequence hsa-miR-7162-5p,Mature sequence hsa-miR-7162-3p,Mature sequence hsa-miR-7150,Mature sequence hsa-miR-7154-5p,Mature sequence hsa-miR-7154-3p,Mature sequence hsa-miR-7155-5p,Mature sequence hsa-miR-7155-3p,Mature sequence hsa-miR-7157-5p,Mature sequence hsa-miR-7157-3p,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p       MIMAT0028228,MIMAT0028229,MIMAT0028216,MIMAT0028217,MIMAT0023702,MIMAT0028232,MIMAT0028233,MIMAT0028222,MIMAT0028223,MIMAT0028214,MIMAT0028215,MIMAT0025478,MIMAT0025479,MIMAT0002177,MIMAT0004762,MIMAT0025847,MIMAT0025848,MIMAT0028212,MIMAT0028213,MIMAT0028226,MIMAT0028227,MIMAT0023714,MIMAT0025478,MIMAT0025479,MIMAT0018119,MIMAT0028234,MIMAT0028235,MIMAT0028211,MIMAT0028218,MIMAT0028219,MIMAT0028220,MIMAT0028221,MIMAT0028224,MIMAT0028225,MIMAT0025478,MIMAT0025479,MIMAT0028230,MIMAT0028231
MI0023562       Stem-loop sequence hsa-mir-6077-2       Mature sequence hsa-miR-7159-5p,Mature sequence hsa-miR-7159-3p,Mature sequence hsa-miR-7153-5p,Mature sequence hsa-miR-7153-3p,Mature sequence hsa-miR-6077,Mature sequence hsa-miR-7161-5p,Mature sequence hsa-miR-7161-3p,Mature sequence hsa-miR-7156-5p,Mature sequence hsa-miR-7156-3p,Mature sequence hsa-miR-7152-5p,Mature sequence hsa-miR-7152-3p,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-486-5p,Mature sequence hsa-miR-486-3p,Mature sequence hsa-miR-6511b-5p,Mature sequence hsa-miR-6511b-3p,Mature sequence hsa-miR-7151-5p,Mature sequence hsa-miR-7151-3p,Mature sequence hsa-miR-7158-5p,Mature sequence hsa-miR-7158-3p,Mature sequence hsa-miR-6089,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-3690,Mature sequence hsa-miR-7162-5p,Mature sequence hsa-miR-7162-3p,Mature sequence hsa-miR-7150,Mature sequence hsa-miR-7154-5p,Mature sequence hsa-miR-7154-3p,Mature sequence hsa-miR-7155-5p,Mature sequence hsa-miR-7155-3p,Mature sequence hsa-miR-7157-5p,Mature sequence hsa-miR-7157-3p,Mature sequence hsa-miR-6511a-5p,Mature sequence hsa-miR-6511a-3p,Mature sequence hsa-miR-7160-5p,Mature sequence hsa-miR-7160-3p       MIMAT0028228,MIMAT0028229,MIMAT0028216,MIMAT0028217,MIMAT0023702,MIMAT0028232,MIMAT0028233,MIMAT0028222,MIMAT0028223,MIMAT0028214,MIMAT0028215,MIMAT0025478,MIMAT0025479,MIMAT0002177,MIMAT0004762,MIMAT0025847,MIMAT0025848,MIMAT0028212,MIMAT0028213,MIMAT0028226,MIMAT0028227,MIMAT0023714,MIMAT0025478,MIMAT0025479,MIMAT0018119,MIMAT0028234,MIMAT0028235,MIMAT0028211,MIMAT0028218,MIMAT0028219,MIMAT0028220,MIMAT0028221,MIMAT0028224,MIMAT0028225,MIMAT0025478,MIMAT0025479,MIMAT0028230,MIMAT0028231

Tags: python, list, file-io

Solution
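The likely cause is that the script keeps all four lists at module level and rebuilds the whole output file for every input file. The name and hsa lists grow across files, but the two joined strings are single values that zip() pairs with every row, and opening the output with "w" discards whatever was written before. Whatever the joined strings contain on the last pass therefore appears on every row.

A minimal sketch of one way around this is to collect the matches for each file in fresh local lists, return one finished row per file, and open the output only once. The four regular expressions below are hypothetical placeholders, since the post does not show its patterns, and extract_record and write_table are illustrative names rather than anything from the original script. The sketch also assumes one name and one stem-loop ID per file, as the sample output suggests.

```python
import re

# Hypothetical patterns: the post does not show its regexes, so these are
# placeholders chosen only to make the sketch runnable.
ACC_PATTERN = re.compile(r"\bMI\d{7}\b")
STEM_PATTERN = re.compile(r"Stem-loop sequence hsa-mir-[\w-]+")
MATURE_SEQ_PATTERN = re.compile(r"Mature sequence hsa-miR-[\w-]+")
MATURE_ACC_PATTERN = re.compile(r"MIMAT\d{7}")


def extract_record(lines):
    """Collect the matches for ONE file into fresh, local lists."""
    names, hsas, seq_ids, accs = [], [], [], []
    buckets = (
        (ACC_PATTERN, names),
        (STEM_PATTERN, hsas),
        (MATURE_SEQ_PATTERN, seq_ids),
        (MATURE_ACC_PATTERN, accs),
    )
    for line in lines:
        for pattern, bucket in buckets:
            match = pattern.search(line)  # first match per line, as in the post
            if match:
                bucket.append(match.group())
    # Assumption: one name and one stem-loop ID per file.
    name = names[0] if names else ""
    hsa = hsas[0] if hsas else ""
    return name, hsa, ",".join(seq_ids), ",".join(accs)


def write_table(files, out):
    """Write one tab-separated row per input file to an already open handle."""
    for lines in files:
        out.write("\t".join(extract_record(lines)) + "\n")
```

In the original script this would mean opening outname once, before the loop over fileList, and passing each file's lines to extract_record inside that loop. Because the lists are created inside the function, nothing needs to be cleared between files.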


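The two incorrect outputs in the question can be reproduced in miniature, which may help confirm the diagnosis. The sketch below is a simplified model, not the original script: it assumes the write block runs once per input file, it replaces the regex matching with ready-made (name, ids) pairs, and final_rows is a hypothetical helper name.

```python
def final_rows(files, clear_after_file):
    """Model the rows left in the output file after the last input file."""
    names = []    # grows across files, never cleared (like List_of_names)
    seq_ids = []  # optionally cleared per file (like List_of_matSEQID)
    rows = []
    for name, ids in files:
        names.append(name)
        seq_ids.extend(ids)
        joined = ",".join(seq_ids)
        # Reopening the output with "w" truncates it, so each pass replaces
        # all earlier rows; the same joined string is paired with every name.
        rows = [(n, joined) for n in names]
        if clear_after_file:
            seq_ids.clear()
    return rows


files = [("A", ["a1", "a2"]), ("B", ["b1"])]
print(final_rows(files, clear_after_file=True))
print(final_rows(files, clear_after_file=False))
```

With clearing, every row carries only the last file's IDs. Without clearing, every row carries the IDs of all files. Neither variant can give per-file values, because a single joined string is shared by all rows.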