python - Scaling problem with finding words in a text in parallel (Python)
Problem description
I am working in Python and I have to solve a simple task (at least one with a simple definition):
I have a set of names, each name being a sequence of tokens: names_to_find = ['York', 'New York', 'Dustin']
I have a corpus, consisting of a list of sentences: corpus = [' I love New York but also York ', ' Dustin is my cat ', ' I live in New York with my friend Dustin ']
The output I want is a dictionary with the names_to_find as keys and, for every occurrence in the corpus, a pair (#sentence_index, #word_index).
The desired output for the example is:
output = { 'York' : [(0, 3), (0, 6), (2, 4)], 'New York' : [(0, 2), (2, 3)], 'Dustin' : [(1, 0), (2, 8)]}
As you can see, if a name_to_find appears twice in the same sentence I want both occurrences, and for compound names (e.g. 'New York') I want the index of the first word.
The problem is that I have 1 million names_to_find and a corpus of 4.8 million sentences.
I wrote some code that does not scale, just to check whether the running time would be acceptable (it is not): to find all the names in 100,000 (100k) sentences my code takes 12 hours :'(
My question is twofold: I am asking you either to help me fix my code or to paste completely different code, it does not matter; the only thing that matters is that it scales.
I report my (parallel) code below; here I only find single words, while compound names (e.g. 'New York') are found in another function that checks whether the word indices are consecutive:
from multiprocessing import Pool
import time

def parallel_find(self, n_proc):
    """
    takes the entities in self.entities_token_in_corpus and calls
    self.create_word_occurrence_index on each of them.
    this method (and the one it calls) is meant to run in parallel, so a reduce
    is applied after the map
    :param
    n_proc: the number of processes used for the computation
    """
    p = Pool(n_proc)
    print('start indexing')
    t = time.time()
    index_list = p.map(self.create_word_occurrence_index, self.entities_token_in_corpus)
    t = time.time() - t
    index_list_dict = {k: v for elem in index_list for k, v in elem.items() if v}
    p.close()
    return index_list_dict, n_proc, len(self.corpus), t

def create_word_occurrence_index(self, word):
    """
    loops over the whole corpus, calling self.find_in_sentence to find the
    occurrences of word in each sentence, and returns a dict
    :param
    word: the word to find
    :return: a dict with the structure {entity_name: list of tuples (row, [occurrences in row])}
    """
    returning_list = []
    for row_index, sent in enumerate(self.joined_corpus):
        if sent.find(' ' + word + ' ') != -1:
            indices = self.find_in_sentence(word=word, sentence=sent)
            if indices:
                returning_list.append((row_index, indices))
    return {word: returning_list}

def find_in_sentence(self, word, sentence):
    """
    returns the indices at which the word appears in a sentence
    :params
    word: the word to find
    sentence: the sentence in which to find the word
    :return: a list of indices
    """
    indices = [i for i, x in enumerate(sentence.split()) if x == word]
    return indices
Thanks in advance
Solution
Here is an attempt using generators, but I am not sure how well it will perform on a large target set. The problematic part is the multi-word matching, but I tried to build in several short-circuits and some early-termination code (I think there is more that could be done, but the complexity starts to grow, too):
def matcher(words, targets):
    for word in words:
        result = {word: []}  # empty dict to hold each word
        if len(word.split()) == 1:  # check to see if word is single
            for t, target in enumerate(targets):
                foo = target.split()
                bar = [(t, i) for i, x in enumerate(foo) if x == word]  # collect the indices
                if bar:
                    result[word].extend(bar)  # update the dict
            yield result  # returns a generator
        else:
            consecutive = word.split()
            end = len(consecutive)
            starter = consecutive[0]  # only look for first word match
            for t, target in enumerate(targets):
                foo = target.split()
                limit = len(foo)
                if foo.count(starter):  # skip entire target if 1st word missing
                    indices = [i for i, x in enumerate(foo) if (x == starter and
                               limit - end >= i)]  # don't try to match if index too high
                    bar = []
                    for i in indices:
                        if foo[i:i + end] == consecutive:  # do match (expensive)
                            bar.append((t, i))
                    result[word].extend(bar)
            yield result
If you want to collect everything at once, then for this modified example:
targets = [' I love New York but also York ',
           ' Dustin is my cat ',
           ' I live in New York with my friend Dustin ',
           ' New York State contains New York City aka New York']
values = ['York', 'New York', 'Dustin', 'New York State']
zed = matcher(values, targets)
for result in zed:
    print(result)
this produces:
{'York': [(0, 3), (0, 6), (2, 4), (3, 1), (3, 5), (3, 9)]}
{'New York': [(0, 2), (2, 3), (3, 0), (3, 4), (3, 8)]}
{'Dustin': [(1, 0), (2, 8)]}
{'New York State': [(3, 0)]}
There may be ways to exploit concurrency here, but I am really not sure; I am not too familiar with it at the moment. See for example https://realpython.com/async-io-python/. Also, I have not carefully checked that code for bugs... I think it is OK. Some unit tests are probably warranted here.
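A different angle on the scaling problem, as a hedged sketch rather than part of the original answer: both solutions above rescan the whole corpus once per name, which is where the names × sentences blow-up comes from. An inverted index built in a single pass over the corpus makes each lookup touch only the positions of a name's first token. The helper names build_index and find_names below are hypothetical, and tokenization by str.split() is assumed to match the question's word indexing:

```python
from collections import defaultdict

def build_index(corpus):
    # One pass over the corpus: map each token to every
    # (sentence_index, word_index) position where it occurs.
    index = defaultdict(list)
    for s, sentence in enumerate(corpus):
        for w, token in enumerate(sentence.split()):
            index[token].append((s, w))
    return index

def find_names(names, index):
    # Start from the positions of each name's first token; for multi-word
    # names, verify that the remaining tokens occur at the immediately
    # following word positions in the same sentence.
    position_sets = {token: set(positions) for token, positions in index.items()}
    output = {}
    for name in names:
        first, *rest = name.split()
        output[name] = [
            (s, w)
            for s, w in index.get(first, [])
            if all((s, w + k) in position_sets.get(token, ())
                   for k, token in enumerate(rest, start=1))
        ]
    return output
```

Building the index costs one pass over the 4.8M sentences, and each of the 1M name lookups then only visits the occurrences of that name's first token, so the corpus is never rescanned per name. The per-name lookups are also independent, so they could be split across a multiprocessing Pool if needed.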