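python - How can I find the full strings in List2 by searching for substrings from List1?
Problem description
I have two lists of amino acid sequences (this is not a biology question; the detail is only there for context). List1 contains substrings (partial sequences) of the full strings (full sequences). List2 contains the full strings; some of them contain a substring from List1 and some do not.
The lists are large, which is why they were built in the first place, but now I don't know how to trace the substrings back to their full strings.
Below is a modified example of the two lists, using real data from my own dataset. For the two substrings in List1 there should be two matches in List2. I have confirmed that the matches are in List2.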
List1 = ['QSLNQNVVSRTCPAVVSHRARRAVRVMATGSPLTFSKYQGLGNDFILIDNRHTSEPVVTPEQAVKICDRNFGVGGDGVIFALPPVGETDLTMRIFNSDGSEPEMCGNGIRCLAKFVADIDKSSPRKYKIHTLAGLIQPELLADGQVRVDMGAPILDGSKVPTTLTPTEGNTVVQQDLVVDGKTYKVTCVSMGNPHAVIYTCNGKTIKIDDLESDLAALGPKFERNTVFPARTNTEFVEVISPSHVRMVVWERGAGRTLACGTGACALVVAGILEGRVDRSKTCRVDLPGGPLQIEWSTVDNHIYMTGPAELVFGGSLRV', 'DMRISYERGGLEEAAFRGRDPMQVFDEWFKAAVAGKVCEEPNAISLASSNPSGQPSVRVVLLKGYDERGFVFYTNYSSRKGTELESGSAAFSIYWEKLQRQIRVEGTVEHVSEEESTAYFHSRPRGSQIGAWVSAQSQPCRNRGEMEARNAELQQRFSDESVPVPKPPHWGGYLIRPTRIEFWQGRPSRLHDRIRFRRPSPNESWVMERLQP']
List2 = [Seq('SSLPSNSVWASGKSYLGHLY*CVHPAHTVTFTLPLVAA*YRALSYDVRRSKFLT...LHL', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('PLYHLILSGPLENPT*DTYTDAFILLTRSLSPSLS*PRNTALCHMTFAVQNFLL...CIF', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('LSTI*FCLGLWKILPRTPILMRSSCSHGHFHPPSRSRVIPRFVI*RSPFKISYS...TAS', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('EDAVIESKCGQSHMPGCCQPPGTQGCARNGYGIAPDVLQVSGPW*RFHLD*QSP...VER', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KMQSLNQNVVSRTCPAVVSHRARRAVRVMATGSPLTFSKYQGLGNDFILIDNRH...*RG', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('RCSH*IKMWSVAHARLLSATGHAGLCA*WLRDRP*RSPSIRALVTISS*LTIAT...GRE', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('VLTHVVASDKELLARAVRWEALPSRKNLSGLHHPSAPKPLSNSQYYSKKKPIRL...DFV', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('FLHTWLLPTRSCSRVQSAGKHCQAEKTSQVCITHRRLSH*ATLNITVKKNQSVS...QTS', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('SYTRGCFRQGVARACSPLGSIAKQKKPLRSASPIGA*AIKQLSILQ*KKTNPSH...RLR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('HEVCVSVT*QHYVLP*RTNLWGHPSSELLSRVRINC*LQLLSVLNQCSIAHHRA...CKN', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('TKSAFQ*HNNIMFFPNAQIYGDTPAPSCYHVCA*IANCNYYLCSINAV*HIIAP...CVR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('RSLRFSNITTLCSSLTHKFMGTPQLRVAITCAHKLLTATIICAQSMQYSTSSRQ...V*E', HasStopCodon(ExtendedIUPACProtein(), '*'))]
Here is a heavily condensed version of my script, for more context:
import os
import xml.etree.ElementTree as ET
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

path_to_allxmlfiles = "path/to/xml/file/dir/"  # Path to a directory where a bunch of XML files are found.
xml_dir = os.listdir(path_to_allxmlfiles)
path_to_transcriptome = "path/to/transcriptome/file.fasta"  # This is just a giant fasta file.
transcriptomefile = open(path_to_transcriptome, 'r')

List1 = []
for file in xml_dir:
    if file.endswith(".xml"):
        xml_file_path = os.path.join(path_to_allxmlfiles, file)
        xml_files = open(xml_file_path, 'r')
        for lines in xml_files:
            tree = ET.parse(xml_files)
            root = tree.getroot()
            for substring in root.findall("./BlastOutput_iterations/Iteration/Iteration_hits/Hit[1]/Hit_hsps/Hsp[1]/substring"):
                SUBSTRING = substring.text
                List1.append(SUBSTRING)

fullstrlist1 = []
fullstrlist2 = []
fullstrlist3 = []
fullstrlist4 = []
fullstrlist5 = []
fullstrlist6 = []
for line in transcriptomefile:
    if (stuff_was_done_here):
        A_lot_of_stuff_done_here_where_I_appended_full_strings_to_six_lists.  # I am translating in 6 reading frames so this is necessary because each reading frame is unique.

List2 = [fullstrlist1, fullstrlist2, fullstrlist3, fullstrlist4, fullstrlist5, fullstrlist6]  # List2 is a combination of the six lists above.

for item in List2:
    if any(x in item for x in List1):
        print(item)
The print(item) call does not return the items that I know contain the substrings from List1.
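One likely reason, assuming List2 really is a list of six lists as in the script above: when item is a list, `x in item` tests whether x is equal to one of its elements, not whether x occurs inside one of its strings. A minimal sketch of the difference, using made-up short sequences rather than the real data (the names `substrings` and `frame` are illustrative only):

```python
# Made-up short sequences for illustration; not taken from the real dataset.
substrings = ["ACDEF", "GHIKL"]
frame = ["MKACDEFVW", "PQRSTVWY"]  # one inner list of full strings

# List membership: True only if an element equals a substring exactly.
membership_hit = any(sub in frame for sub in substrings)

# Substring test: check each string inside the list.
substring_hit = any(sub in full for full in frame for sub in substrings)

print(membership_hit)  # False
print(substring_hit)   # True
```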
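This is my first question on StackOverflow. Please let me know if more detail is needed. Thank you in advance for your help.
Solution
I'm not quite sure what you're asking, but here is an example of two ways of searching: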
# Sub Strings
List1 = ["Apple", "Mulberry"]

# List of lists
List2 = [
    ["Apple", "Grapefruit", "Guava"],
    ["Banana", "Blueberry", "Grape"],
    ["Lemon", "Lime"],
    ["Loquat", "Lychee", "Mango"],
    ["Mulberry", "Nectarine", "Strawberry"],
    ["Pomegranate", "Raspberry"]
]

# a substring to search for.
List3 = ["berry", "ime"]

print("Search for whole matching strings", List1)
for item in List2:
    if any(x in item for x in List1):
        print(item)

print('\n Substring search for substrings', List3)
for group in List2:
    for item in group:
        if any(sbs in item for sbs in List3):
            print(group)
            break  # no need to keep searching the group
And the output:
Search for whole matching strings ['Apple', 'Mulberry']
['Apple', 'Grapefruit', 'Guava']
['Mulberry', 'Nectarine', 'Strawberry']
Substring search for substrings ['berry', 'ime']
['Banana', 'Blueberry', 'Grape']
['Lemon', 'Lime']
['Mulberry', 'Nectarine', 'Strawberry']
['Pomegranate', 'Raspberry']
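Applied to the layout in the question, the same nested search can also record which full string each substring came from. This is only a sketch under assumptions: the sequences are made up, `find_full_strings` is a hypothetical helper name, and every entry is passed through `str()` on the assumption that the Biopython `Seq` objects in List2 convert to their plain sequence that way.

```python
def find_full_strings(substrings, groups):
    """Map each substring to the full strings that contain it.

    groups is a list of lists of sequences (e.g. one list per reading frame).
    """
    matches = {sub: [] for sub in substrings}
    for group in groups:
        for full in group:
            text = str(full)  # assumption: str() yields the plain sequence
            for sub in substrings:
                if sub in text:
                    matches[sub].append(text)
    return matches


# Made-up example data, not the real sequences from the question.
list1 = ["ACDEF", "GHIKL"]
list2 = [
    ["MKACDEFVW*", "PQRST*VWY"],
    ["NNGHIKLNN", "TTTTSSSS"],
]

result = find_full_strings(list1, list2)
for sub, fulls in result.items():
    print(sub, "->", fulls)
```

Each substring is compared against every full string, so the running time grows with the product of the two list sizes; for very large lists a dedicated sequence search tool may be a better fit.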