How to conditionally remove sequences of lines from a txt file using Python

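Problem description

I downloaded a large text file from the MS-DIAL metabolomics MSP spectral kit, which contains EI-MS and MS/MS spectra.

Opened as plain text, the file lists compounds like this: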

NAME: C11H11NO5; PlaSMA ID-967
PRECURSORMZ: 238.0712
PRECURSORTYPE: [M+H]+
FORMULA: C11H11NO5
Ontology: Formula predicted
INCHIKEY:
SMILES:
RETENTIONTIME: 1.74
CCS: -1
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos
Num Peaks: 2
192.06602   53
238.0757    31

NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141
PRECURSORMZ: 656.19415
PRECURSORTYPE: [M+H]+
FORMULA: C29H35O17
Ontology: Anthocyanidin O-glycosides
INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O
SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1
RETENTIONTIME: 2.81
CCS: 241.3010517
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only
Num Peaks: 0

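Each compound's data runs from its NAME line up to the next NAME line.

What I am trying to do is remove every compound whose Num Peaks: value is zero. That is, if a compound's Num Peaks: line reads Num Peaks: 0 (in the sample above it follows the 12 header lines NAME through Comment), delete that line together with the 12 lines above it.

In the sample above, that means deleting the lines from NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141 through Num Peaks: 0. Afterwards, I need to save the data back in txt or msp format.

All I have done so far is import the data as a list: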

with open(r'path\to\MSMS-Public-Pos-VS15.msp') as f:  # raw string, so that \t is not read as a tab
    lines = f.readlines()

Then I create a list of the indices at which each compound starts:

indices = [i for i, s in enumerate(lines) if 'NAME' in s]

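I think I now need to pick out the consecutive indices whose difference is greater than 14 (meaning the number of peaks is greater than zero):

import numpy as np

# find the difference between consecutive indices
v = np.diff(indices)

Select those with a difference of 14 and prepend the element zero: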


diff14 = np.where(v == 14)

diff14 = np.append([0],diff14[0])

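Now I want to select only the values that are not in diff14, so that I can build a new list containing the compounds whose peak count is greater than zero.

I need some kind of loop to select the right indices, but I don't know how to write it: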

lines[indices[diff14[0]]: indices[diff14[1]]]

lines[indices[diff14[1]+1] : indices[diff14[2]]]

lines[indices[diff14[2]+1] : indices[diff14[3]]]

lines[indices[diff14[3]+1] : indices[diff14[4]]]

Any better ideas or hints are much appreciated.

Tags: python, for-loop

Solution


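This is not as compact or memory-efficient as the other answers, but hopefully it is easier to understand and extend.

The approach I suggest is to parse your input into, for example, a list of lists, with each element holding one compound. I suggest 3 steps: (1) parse the data into a list of compounds, (2) iterate over this list and remove the unwanted compounds, (3) write the list back to a file. Depending on the size of the file, this can be done in a single pass over the data or in 3 separate loops.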
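For a file too large to hold in memory, the single-pass variant might look roughly like the sketch below. It assumes, as in the sample, that records are separated by blank lines and that the peak count sits on a line starting with Num Peaks:. The names filter_msp and declares_zero_peaks are made up for illustration, and io.StringIO stands in for real files.

```python
import io


def declares_zero_peaks(record):
    """Return True if the record has a 'Num Peaks:' line whose value is 0."""
    for line in record:
        if line.startswith('Num Peaks:'):
            return line.split(':', 1)[1].strip() == '0'
    return False  # assumption: records without a Num Peaks line are kept


def filter_msp(src, dst):
    """Copy records from src to dst, skipping those that declare zero peaks.

    src is any iterable of lines and dst any object with write/writelines,
    so only one record is buffered at a time.
    """
    record = []

    def flush():
        # Write the buffered record (if any) unless it declares zero peaks
        if record and not declares_zero_peaks(record):
            dst.writelines(record)
            dst.write('\n')  # blank line separates records
        record.clear()

    for line in src:
        if line.strip() == '':  # a blank line ends the current record
            flush()
        else:
            # make sure every stored line ends with a newline
            record.append(line if line.endswith('\n') else line + '\n')
    flush()  # the last record may not be followed by a blank line


# Small in-memory demonstration with two shortened records
sample = io.StringIO(
    'NAME: A\nNum Peaks: 2\n192.06602   53\n238.0757    31\n'
    '\n'
    'NAME: B\nNum Peaks: 0\n'
)
out = io.StringIO()
filter_msp(sample, out)
print(out.getvalue())
```

With real files, the same function could be called as filter_msp(src, dst) inside a with open(...) as src, open(..., 'w') as dst: block. The rest of this answer uses the three-loop variant, which is easier to inspect step by step.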

# Step (1) Parse the file
compounds = list()  # store all compounds
with open('compound.txt', 'r') as f:
    # stores a single compound as a list of rows for a given compound.
    # Note: can be improved to e.g. a dictionary or a custom class
    current_compound = list()
    for line in f:
        if line.strip() == '':  # assumes each compound is split by empty line(s)
            print('Empty line')
            # Store previous compound
            if len(current_compound) != 0:
                compounds.append(list(current_compound))

            # prepare for next compound
            current_compound = list()
        else:
            # At this point we could parse this more,
            # e.g. separate into key/value, but let's just append the whole line with trailing newline
            print('Adding', line.strip())
            current_compound.append(line)

    # Store the last compound if the file does not end with an empty line
    if len(current_compound) != 0:
        compounds.append(list(current_compound))

OK, now let's check our progress:

for item in compounds:
    print('\n===Compound===\n', item)

The result is:

===Compound===
 ['NAME: C11H11NO5; PlaSMA ID-967\n', 'PRECURSORMZ: 238.0712\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C11H11NO5\n', 'Ontology: Formula predicted\n', 'INCHIKEY:\n', 'SMILES:\n', 'RETENTIONTIME: 1.74\n', 'CCS: -1\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n']

===Compound===
 ['NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141\n', 'PRECURSORMZ: 656.19415\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C29H35O17\n', 'Ontology: Anthocyanidin O-glycosides\n', 'INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O\n', 'SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1\n', 'RETENTIONTIME: 2.81\n', 'CCS: 241.3010517\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only\n', 'Num Peaks: 0\n']

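You can then iterate over this list of compounds and remove those whose Num Peaks is set to 0 before writing the list back to a file. Let me know if you need help with that part too.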
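For completeness, steps (2) and (3) could look roughly like the sketch below. It is only one way to do it: has_peaks, write_compounds and out_path are names invented here, the short compounds list stands in for the one built in step (1), and the output goes to a placeholder file in the system temp directory.

```python
import os
import tempfile


def has_peaks(compound):
    """Return False only if the record has a 'Num Peaks:' line with value 0."""
    for line in compound:
        if line.startswith('Num Peaks:'):
            return line.split(':', 1)[1].strip() != '0'
    return True  # assumption: keep records that lack a Num Peaks line


def write_compounds(compounds, path):
    """Write the records to path, with one blank line after each record."""
    with open(path, 'w') as f:
        for compound in compounds:
            for line in compound:
                # normalise line endings in case the last input line had none
                f.write(line.rstrip('\n') + '\n')
            f.write('\n')


# Shortened stand-in for the `compounds` list produced by step (1)
compounds = [
    ['NAME: A\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n'],
    ['NAME: B\n', 'Num Peaks: 0\n'],
]

# Step (2) keep only the compounds that have at least one peak
kept = [c for c in compounds if has_peaks(c)]

# Step (3) write the remaining compounds back out (placeholder path)
out_path = os.path.join(tempfile.gettempdir(), 'filtered.msp')
write_compounds(kept, out_path)
print(len(compounds) - len(kept), 'compound(s) removed')
```

Building a new list with a comprehension avoids deleting items from compounds while looping over it, which is a common source of skipped elements.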

