How to extract data from a file without de-duplicating it in Python

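Problem description

I currently have a script that works well for extracting data from one file based on key words in a second file (a white list), and writing the extracted data to a third file: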

import sys
import csv

input_file = csv.DictReader(open(sys.argv[1], "rU"))

white_list_file = csv.DictReader(open(sys.argv[2], "rU"))

output_file = csv.DictWriter(open(sys.argv[3], "w"), input_file.fieldnames)

output_file.writeheader()

white_list = {} #load empty dictionary

for record in white_list_file:
    white_list[record["key_word"]] = None

for record in input_file: #for every item in my input file
    record_id = record["key_word"] #assign column with key word from input file as a variable
    if record_id in white_list:  # if this key word is in my white list,
        output_file.writerow(record)  # then I write the whole line to my output file
    else:  # if not, then ignore this line and move on to the next line
        continue

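However, the resulting output file is a de-duplicated version of my original input file. That worked well for me in the past, but now I need a new script that does not de-duplicate my results.

So, if my input file has a key word on 3 different lines, I want my output file to contain that key word and its associated information 3 times as well.

I tried to solve this by modifying my script with a "counter" approach, counting how many times a key word is found in my white list, but that neither worked nor produced the expected results.

Is there a simple way to modify my script so that the output file is not de-duplicated?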

Tags: python-2.7, dictionary

Solution


With the code given here you can get the desired output. Suppose you have an input file named data.csv, which may also contain blank lines:

HEADER    Signaling Protein                       03-May-12   4F0A
TITLE     Crystal Structure Of Xwnt8 In Complex With The Cysteine    
TITLE    2 rich Domain Of Frizzled 8                                  
AUTHOR    C.Y.Janda,D.Waghray,A.M.Levin,C.Thomas,K.C.Garcia          
REMARK = 1 NCBI PDB FORMAT VERSION 6.0
REMARK = 2 NOTE:  NCBI-MMDB PDB-Format File derived from ASN.1
REMARK = 3 Refer to original ASN.1 file or PDB file for data records


HELIX    1   1 GLN A   62  HIS A   70  1                                    9
HELIX    2   2 PHE A   72  GLN A   79  1                                    8
HELIX    3   3 LEU A   84  TYR A   92  1                                    9
HELIX    4   4 SER A  109  TYR A  125  1                                   17 
HELIX    1   1 PRO B   34  ALA B   42  1                                    9
HELIX    2   2 SER B   43  PHE B   59  1                                   17
HELIX    3   3 ARG B   84  SER B  106  1                                   23
HELIX    4   4 ALA B  137  PHE B  147  1                                   11
HELIX    5   5 ALA B  157  GLU B  175  1                                   19
HELIX    6   6 PHE B  202  GLN B  215  1                                   14
HELIX    7   7 GLY B  236  SER B  244  1                                    9
ATOM      1  N   CYS A  35     -46.772 -32.953  13.444  1.00118.86           N  
ATOM      2  CA  CYS A  35     -45.589 -33.712  13.063  1.00132.02           C  
ATOM      3  C   CYS A  35     -45.956 -34.934  12.237  1.00141.34           C  
ATOM      4  O   CYS A  35     -47.000 -35.548  12.450  1.00140.11           O  
SEQRES = 1 A  132  ALA SER ALA LYS GLU LEU ALA CYS GLN GLU ILE THR VAL
SEQRES = 2 A  132  PRO LEU CYS LYS GLY ILE GLY TYR ASN TYR THR TYR MET
SEQRES = 25 B  316  HIS PHE CYS ALA
ATOM      5  CB  CYS A  35     -44.802 -34.155  14.301  1.00137.04           C  
ATOM      6  SG  CYS A  35     -43.999 -32.812  15.204  1.00163.69           S  
ATOM      7  N   GLN A  36     -45.100 -35.263  11.277  1.00149.21           N  
ATOM      8  CA  GLN A  36     -45.159 -36.550  10.594  1.00144.14           C  
ATOM      9  C   GLN A  36     -43.746 -37.119  10.503  1.00143.70           C  
SHEET    1   A 1 CYS A  35  ILE A 38  0
SHEET    2   A 1 ASN A  49  TYR A 52  0
SHEET    1   B 1 GLY B 121  ARG B126  0
SHEET    2   B 1 GLY B 127  GLY B131  0
SHEET    3   B 1 THR B 176  HIS B184  0

From it, you want to extract the lines that start with any of the following keys, which are listed in the file keys.txt:

REMARK
HELIX
SEQRES
SHEET

To do this, you can use the following code:

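#!/usr/bin/python
# Read every line of the input file, without the trailing newlines
with open('data.csv', 'r') as sourcefile:
    source = sourcefile.read().splitlines()

# Read the whitespace-separated keys to keep
with open('keys.txt', 'r') as keyfile:
    keys = keyfile.read().split()

# Write each line whose first field is a key; blank lines are skipped.
# The with block closes the file, so no explicit close() is needed.
with open('MyOutFile', 'w') as outfile:
    for line in source:
        fields = line.split()
        if fields and fields[0] in keys:
            outfile.write(line + "\n")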

This writes every line whose first field matches a key in keys.txt to MyOutFile, giving:

REMARK = 1 NCBI PDB FORMAT VERSION 6.0
REMARK = 2 NOTE:  NCBI-MMDB PDB-Format File derived from ASN.1
REMARK = 3 Refer to original ASN.1 file or PDB file for data records
HELIX    1   1 GLN A   62  HIS A   70  1                                    9
HELIX    2   2 PHE A   72  GLN A   79  1                                    8
HELIX    3   3 LEU A   84  TYR A   92  1                                    9
HELIX    4   4 SER A  109  TYR A  125  1                                   17 
HELIX    1   1 PRO B   34  ALA B   42  1                                    9
HELIX    2   2 SER B   43  PHE B   59  1                                   17
HELIX    3   3 ARG B   84  SER B  106  1                                   23
HELIX    4   4 ALA B  137  PHE B  147  1                                   11
HELIX    5   5 ALA B  157  GLU B  175  1                                   19
HELIX    6   6 PHE B  202  GLN B  215  1                                   14
HELIX    7   7 GLY B  236  SER B  244  1                                    9
SEQRES = 1 A  132  ALA SER ALA LYS GLU LEU ALA CYS GLN GLU ILE THR VAL
SEQRES = 2 A  132  PRO LEU CYS LYS GLY ILE GLY TYR ASN TYR THR TYR MET
SEQRES = 25 B  316  HIS PHE CYS ALA
SHEET    1   A 1 CYS A  35  ILE A 38  0
SHEET    2   A 1 ASN A  49  TYR A 52  0
SHEET    1   B 1 GLY B 121  ARG B126  0
SHEET    2   B 1 GLY B 127  GLY B131  0
SHEET    3   B 1 THR B 176  HIS B184  0
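
To check that repeated lines survive the filter, the same logic can be wrapped in a small function and run on an in-memory sample. This is only a minimal sketch, assuming Python 3; `filter_lines` and the `sample` list are hypothetical names made up for illustration and are not part of the answer above.

```python
def filter_lines(lines, keys):
    """Return the lines whose first whitespace-separated field is in keys."""
    wanted = set(keys)  # a set gives fast membership tests
    kept = []
    for line in lines:
        fields = line.split()
        # Blank lines produce an empty list and are skipped
        if fields and fields[0] in wanted:
            kept.append(line)  # every match is kept, including repeats
    return kept


# Hypothetical sample: the HELIX line appears twice on purpose
sample = [
    "HELIX    1   1 GLN A   62  HIS A   70  1",
    "",
    "ATOM      1  N   CYS A  35",
    "HELIX    1   1 GLN A   62  HIS A   70  1",
    "SHEET    1   A 1 CYS A  35  ILE A 38  0",
]
result = filter_lines(sample, ["HELIX", "SHEET"])
print(len(result))  # prints 3: both HELIX lines and the SHEET line
```

Nothing in the loop remembers which lines it has already seen, so the output keeps the input order and keeps each matching line as many times as it occurs.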

This should solve your problem.
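
If the input really is a CSV file with a key_word column, as in the question, the same idea may be easier to express with the csv module. The following is a rough sketch rather than a drop-in replacement, assuming Python 3; the function name `filter_rows` and the in-memory demo data are hypothetical, and only the `key_word` column name comes from the question.

```python
import csv
import io


def filter_rows(input_stream, white_list_stream, output_stream, column="key_word"):
    """Copy every row whose `column` value is in the white list.

    Returns the number of rows written (not counting the header).
    """
    reader = csv.DictReader(input_stream)
    # The white list is a set of key words; the input rows are never de-duplicated
    white_list = {row[column] for row in csv.DictReader(white_list_stream)}
    writer = csv.DictWriter(output_stream, fieldnames=reader.fieldnames)
    writer.writeheader()
    written = 0
    for row in reader:
        if row[column] in white_list:
            writer.writerow(row)
            written += 1
    return written


# Hypothetical demo data: key word "a" appears on two different rows
data = io.StringIO("key_word,value\na,1\nb,2\na,3\n")
white = io.StringIO("key_word\na\n")
out = io.StringIO()
count = filter_rows(data, white, out)
print(count)  # prints 2: both "a" rows are written
```

With real files, the three streams would come from open(), and the csv documentation recommends passing newline="" when opening files for the csv module.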

