python - How do I print lines after a match in Python?
Problem description
I have a file with several lines (I'm only showing two of them):
UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87
UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2
And another file (I'm only showing some of its lines):
>UniRef90_A0A095VQ09 - Cluster: LOW QUALITY PROTEIN: titin
MTTKAPTFTQPLQSVVALEGSAATFEAHISGSPVPEVSWYRDGQVLSAATLPGVQISFSD
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
VRLDVRVTGIPTPVVKFYRDRAEIQSSPDFQILQEGDLYSLIIAEAYPEDSGTYSVNATN
>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT
>UniRef90_A0A0C1UI80 - Cluster: LOW QUALITY PROTEIN: lafev
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGLARQQSPSPIRHSPSPVRHVRAPT
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
>UniRef90_A0A1M4ZSK2 - Cluster: titin isoform X54
SVGRATSTAELLVQGEEVVPAKKTKTIVSTSTAELLVTAETAPPNFSQRLQSTTARQGSQ
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT
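For each line of my first file, I need to match the Uniref90_XXXXXX IDs against the Uniref90_XXXXXX IDs of the second file. Once the match is done, I need to return the sequence (the letters ...TNGSGQATS... = the sequence) for the corresponding ID.
For example, the first line of the first file has 2 Uniref90_XXXXX IDs, and I would like to get output like this: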
>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##first ID of the first line
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ ##second ID of the first line
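I need to do this for every line of my first file.
Solution
So it seems you need to order the Uniref90_XXXXXX records according to the order in which their IDs appear in the first file.
Here UniRef_ids.txt is your first file, UniRef_data.txt is your second file, and UniRef_data_ordered.txt is the output file.
I noticed that each Uniref90_XXXXXX record seems to start with a > and continues, spanning a variable number of lines, until the next > or, I assume, the end of the file.
I've only handled one exception: a Uniref90_XXXXXX that appears in your first file but not in your second file. In that case a warning is printed to your console (not to your file).
If the rest of your files are formatted differently, this may not work. Likewise, if your files are several gigabytes in size, my approach may not be suitable, because I read the entire contents of your second file into memory.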
# We first go through the second file, get all the Uniref90_XXXXXX IDs, and
# put their sequences (including the Uniref90_XXXXXX header line) into a dict.
# A sequence can be accessed like so: uniref_dict["UniRef90_A0A0K2VG56"]
with open("UniRef_data.txt", "rt") as f:
    data = f.read()

uniref_dict = {}
for chunk in data.split(">"):
    chunk = chunk.strip()
    if not chunk:
        # Skip the empty piece before the first ">" (and any blank chunks).
        continue
    # The ID is the first whitespace-delimited token of the header line.
    uniref_id = chunk.split(None, 1)[0]
    uniref_dict[uniref_id] = f">{chunk}"

# Then we go through the first file, line by line, id by id, and write to
# a new file the corresponding sequence (again, including the Uniref90_XXXXXX
# header line, as per your output) and append the Uniref90_XXXXXX at the end.
with open("UniRef_ids.txt", "rt") as fin:
    with open("UniRef_data_ordered.txt", "wt") as fout:
        for line in fin:
            # split() with no argument ignores blank lines and repeated spaces.
            for uniref_id in line.split():
                try:
                    fout.write("{} ##{}\n".format(uniref_dict[uniref_id], uniref_id))
                except KeyError:
                    print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")
UniRef_data_ordered.txt:
>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A0K2VG56
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ ##UniRef90_A0A0P5UY87
>UniRef90_A0A095VQ09 - Cluster: LOW QUALITY PROTEIN: titin
MTTKAPTFTQPLQSVVALEGSAATFEAHISGSPVPEVSWYRDGQVLSAATLPGVQISFSD
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
VRLDVRVTGIPTPVVKFYRDRAEIQSSPDFQILQEGDLYSLIIAEAYPEDSGTYSVNATN ##UniRef90_A0A095VQ09
>UniRef90_A0A0C1UI80 - Cluster: LOW QUALITY PROTEIN: lafev
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A0C1UI80
>UniRef90_A0A1M4ZSK2 - Cluster: titin isoform X54
SVGRATSTAELLVQGEEVVPAKKTKTIVSTSTAELLVTAETAPPNFSQRLQSTTARQGSQ
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A1M4ZSK2
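If the data file is too large to read with f.read(), one option is to stream it line by line and keep only the records whose IDs are actually requested. The following is a minimal sketch, not part of the answer above: iter_records and load_wanted are names made up for illustration, the sample sequences are shortened, and it assumes the same layout as described earlier (a header line starting with >, followed by sequence lines).

```python
import io


def _finish(record):
    """Turn a list of lines (header first) into an (id, text) pair."""
    parts = record[0][1:].split(None, 1)
    uniref_id = parts[0] if parts else ""
    return uniref_id, "\n".join(record)


def iter_records(lines):
    """Yield (uniref_id, record_text) pairs from an iterable of lines.

    A record starts at a '>' header line and runs until the next '>' line
    or the end of the input. Blank lines and anything before the first
    header are skipped.
    """
    record = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">") and record:
            yield _finish(record)
            record = []
        if line and (record or line.startswith(">")):
            record.append(line)
    if record:
        yield _finish(record)


def load_wanted(lines, wanted_ids):
    """Build a dict holding only the records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    return {uid: rec for uid, rec in iter_records(lines) if uid in wanted}


# Tiny in-memory stand-in for UniRef_data.txt (sequences shortened).
sample = io.StringIO(
    ">UniRef90_A0A095VQ09 - Cluster: LOW QUALITY PROTEIN: titin\n"
    "MTTKAPTF\n"
    "GRAKLMIP\n"
    ">UniRef90_A0A0K2VG56 - Cluster: titin isoform X29\n"
    "MATQAPTF\n"
)
records = load_wanted(sample, ["UniRef90_A0A0K2VG56"])
print(records["UniRef90_A0A0K2VG56"])
```

With real files you would pass the open file object for UniRef_data.txt instead of the StringIO, and collect wanted_ids from UniRef_ids.txt first. Memory use then scales with the requested records rather than with the whole data file.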
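Is it possible to create a separate file for each iteration of the loop? I mean, for each line of the first file, I would like to create a file with the IDs and their corresponding sequences?
Yes, that's possible. We just need to move the code that opens and writes the output file inside the for loop that iterates over the lines of the first file, and give each file a unique name.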
# We first go through the second file, get all the Uniref90_XXXXXX IDs, and
# put their sequences (including the Uniref90_XXXXXX header line) into a dict.
# A sequence can be accessed like so: uniref_dict["UniRef90_A0A0K2VG56"]
with open("UniRef_data.txt", "rt") as f:
    data = f.read()

uniref_dict = {}
for chunk in data.split(">"):
    chunk = chunk.strip()
    if not chunk:
        # Skip the empty piece before the first ">" (and any blank chunks).
        continue
    # The ID is the first whitespace-delimited token of the header line.
    uniref_id = chunk.split(None, 1)[0]
    uniref_dict[uniref_id] = f">{chunk}"

# Then we go through the first file, line by line, and write to a new
# file the ids and their corresponding sequences (again, including the
# Uniref90_XXXXXX header line, as per your output)
with open("UniRef_ids.txt", "rt") as fin:
    # Each iteration of this for loop is a new line of Uniref90_XXXXXX ids,
    # so we've moved the file writing code inside of this loop.
    # enumerate gives us a counter - i - that starts at 1, and increments by 1
    # after each iteration. We use this to give each file a unique name.
    for i, line in enumerate(fin, start=1):
        # split() with no argument ignores repeated spaces and the newline.
        uniref_ids = line.split()
        with open(f"UniRef_data_by_id_row_{i:03}.txt", "wt") as fout:
            for uniref_id in uniref_ids:
                try:
                    fout.write(uniref_dict[uniref_id] + "\n")
                except KeyError:
                    print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")
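The script only prints a warning when it reaches a missing ID. To list every missing ID before writing anything, a set difference is enough. This is a small sketch with made-up sample data: find_missing_ids is a hypothetical helper and UniRef90_MISSING is an invented ID used only to trigger the check.

```python
def find_missing_ids(id_lines, known_ids):
    """Return the sorted IDs that appear in id_lines but not in known_ids."""
    requested = {uid for line in id_lines for uid in line.split()}
    return sorted(requested - set(known_ids))


# Stand-ins for the lines of UniRef_ids.txt and the keys of uniref_dict.
id_lines = [
    "UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87\n",
    "UniRef90_A0A095VQ09 UniRef90_MISSING\n",
]
known_ids = {"UniRef90_A0A0K2VG56", "UniRef90_A0A0P5UY87", "UniRef90_A0A095VQ09"}

missing = find_missing_ids(id_lines, known_ids)
print(missing)  # ['UniRef90_MISSING']
```

In the real script, id_lines would be the open UniRef_ids.txt file and known_ids would be uniref_dict itself.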
By the way, this is the code that generates our file names:
f"UniRef_data_by_id_row_{i:03}.txt"
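The f prefix tells Python that it is an f-string. It evaluates what is inside the {}s and returns a string. What comes before the : is the value, and what comes after it is the format specifier. In this case, my format specifier zero-pads i to a width of 3, giving me file names like this: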
UniRef_data_by_id_row_001.txt
UniRef_data_by_id_row_999.txt
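That way, it's easy to sort the files in a file manager.
You can name the files differently. For example, if you don't want underscores, and want to pad the number with spaces instead of 0s: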
f"UniRef Data Ordered by ID - Row {i: >4}.txt"
UniRef Data Ordered by ID - Row    1.txt
UniRef Data Ordered by ID - Row 9999.txt
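To see what the two format specifiers do, and why zero-padding helps with sorting, here is a short check you can run on its own. The row number 7 and the row_ file names are arbitrary values chosen for illustration.

```python
i = 7

# Zero-pad to width 3, as in the script above.
zero_padded = f"UniRef_data_by_id_row_{i:03}.txt"
# Right-align in a field of width 4, filling with spaces.
space_padded = f"UniRef Data Ordered by ID - Row {i: >4}.txt"

print(zero_padded)         # UniRef_data_by_id_row_007.txt
print(repr(space_padded))  # repr makes the padding spaces visible

# Zero-padded names sort in numeric order when compared as plain strings.
padded = [f"row_{n:03}.txt" for n in (1, 2, 10)]
unpadded = [f"row_{n}.txt" for n in (1, 2, 10)]
print(sorted(padded) == padded)      # True
print(sorted(unpadded) == unpadded)  # False: 'row_10.txt' sorts before 'row_2.txt'
```

Note that a width of 3 only keeps the order up to 999 rows; for more rows, pick a larger width.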