Scaling problem when finding words in text in parallel (Python)

Problem description

I'm working in Python and have to solve a task that is simple, at least in its definition:

I have a set of names, each of which is a sequence of tokens: names_to_find = ['York', 'New York', 'Dustin']. I have a corpus consisting of a list of sentences: corpus = [' I love New York but also York ', ' Dustin is my cat ', ' I live in New York with my friend Dustin ']

The output I want is a dictionary with the names_to_find as keys and, for each occurrence in the corpus, a pair (sentence_index, word_index).

The desired output for the example is:

output = { 'York' : [(0, 3), (0, 6), (2, 4)], 'New York' : [(0, 2), (2, 2)], 'Dustin' : [(1, 0), (2, 8)]}

As you can see, if a name_to_find appears twice in the same sentence, I want both occurrences; for compound names (e.g., 'New York') I want the index of the first word.

The problem is that I have 1 million names_to_find and 4.8 million sentences in the corpus.

I wrote some code that doesn't scale, just to see whether the time would be acceptable (it isn't): finding all the names in 100,000 (100k) sentences takes my code 12 hours :'(

My question is twofold: I'm asking for help either with my own code or with a completely different piece of code; it doesn't matter which, the only thing that matters is that it scales.

I'm posting my (parallel) code below; it only finds single words here, while compound words (e.g., 'New York') are found in another function that checks whether the word indices are consecutive:

# Module-level imports needed by this code:
# from multiprocessing import Pool
# import time

def parallel_find(self, n_proc):
    """
    Takes the entities in self.entities_token_in_corpus and calls
    self.create_word_occurrence_index on each of them.
    This method (and the ones it involves) is meant to run in parallel,
    so a reduce is applied after the call.
    :param
        n_proc: the number of processes used for the computation
    """
    p = Pool(n_proc)

    print('start indexing')

    t = time.time()

    index_list = p.map(self.create_word_occurrence_index, self.entities_token_in_corpus)

    t = time.time() - t

    index_list_dict = {k: v for elem in index_list for k, v in elem.items() if v}
    p.close()
    return index_list_dict, n_proc, len(self.corpus), t

def create_word_occurrence_index(self, word):
    """
    Loops over the whole corpus and calls self.find_in_sentence to find the
    occurrences of word in each sentence, returning a dict.
    :param
        word: the word to find
    :return: a dict with the structure {entity_name: [(row, [occurrences in row])]}
    """
    key = word
    returning_list = []
    for row_index, sent in enumerate(self.joined_corpus):
        if sent.find(' ' + word + ' ') != -1:  # cheap substring pre-filter before splitting
            indices = self.find_in_sentence(word=word, sentence=sent)
            if indices:
                returning_list.append((row_index, indices))
    return {key: returning_list}

def find_in_sentence(self, word, sentence):
    """
    Returns the indices at which word appears in a sentence.
    :params
        word: the word to find
        sentence: the sentence in which to find the word
    :return: a list of indices
    """
    indices = [i for i, x in enumerate(sentence.split()) if x == word]
    return indices
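As an aside on the pre-filter in create_word_occurrence_index above: sent.find(' ' + word + ' ') only matches a word that is surrounded by spaces, so it silently skips a word at the very start or end of a sentence unless every sentence is padded with spaces (as the example corpus happens to be). A minimal demonstration, with prefilter standing in for that check:

```python
def prefilter(word, sent):
    # Same pre-filter as in create_word_occurrence_index above.
    return sent.find(' ' + word + ' ') != -1

padded = ' Dustin is my cat '
unpadded = 'Dustin is my cat'
print(prefilter('Dustin', padded))    # True
print(prefilter('Dustin', unpadded))  # False: 'Dustin' at index 0 has no leading space
```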

Thanks in advance

Tags: python, python-3.x, parallel-processing, nlp

Solution


Here's an attempt using generators, but I'm not sure how well it will perform on large targets. The problematic part is the multi-word matching, but I tried to build in several short-circuits and some early-termination code (I think there's more to be done there, but the complexity starts to grow too):

def matcher(words, targets):
    for word in words:
        result = {word: []}                     # empty dict to hold each word
        if len(word.split()) == 1:              # check whether word is a single token
            for t, target in enumerate(targets):
                foo = target.split()
                bar = [(t, i) for i, x in enumerate(foo) if x == word]  # collect the indices
                if bar:
                    result[word].extend(bar)    # update the dict
            yield result                        # returns a generator

        else:
            consecutive = word.split()
            end = len(consecutive)
            starter = consecutive[0]            # only look for a first-word match
            for t, target in enumerate(targets):
                foo = target.split()
                limit = len(foo)
                if foo.count(starter):  # skip the entire target if the 1st word is missing
                    indices = [i for i, x in enumerate(foo) if (x == starter and
                               limit - end >= i)]  # don't try to match if the index is too high
                    bar = []
                    for i in indices:
                        if foo[i:i+end] == consecutive:   # do the match (expensive)
                            bar.append((t, i))
                    result[word].extend(bar)

                else:
                    continue
            yield result

If you want to collect everything in one go, then with this modified example:

targets = [ ' I love New York but also York ',
            ' Dustin is my cat ',
            ' I live in New York with my friend Dustin ',
            ' New York State contains New York City aka New York']  

values =  [ 'York', 'New York', 'Dustin', 'New York State' ]

zed = matcher(values, targets)
print(list(zed))

this produces:

{'York': [(0, 3), (0, 6), (2, 4), (3, 1), (3, 5), (3, 9)]}
{'New York': [(0, 2), (2, 3), (3, 0), (3, 4), (3, 8)]}
{'Dustin': [(1, 0), (2, 8)]}
{'New York State': [(3, 0)]} 
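At the scale in the question (1M names, 4.8M sentences), scanning the whole corpus once per name is the real bottleneck, whatever the per-scan constant. A different attack (a sketch of my own, not part of the code above) is to invert the loop: tokenize the corpus once into an index mapping each token to its (sentence, word) positions, then answer each name by looking up its first token and verifying the remaining tokens in place. That is one pass over the corpus plus, per name, only as many checks as its first token has occurrences:

```python
from collections import defaultdict

corpus = [' I love New York but also York ',
          ' Dustin is my cat ',
          ' I live in New York with my friend Dustin ']
names_to_find = ['York', 'New York', 'Dustin']

# One pass over the corpus: token -> [(sentence_index, word_index), ...]
index = defaultdict(list)
tokenized = [sent.split() for sent in corpus]
for s, toks in enumerate(tokenized):
    for w, tok in enumerate(toks):
        index[tok].append((s, w))

# Per name: look up the first token, then verify the remaining tokens.
output = {}
for name in names_to_find:
    parts = name.split()
    hits = []
    for s, w in index.get(parts[0], []):
        if tokenized[s][w:w + len(parts)] == parts:
            hits.append((s, w))
    output[name] = hits
```

Building the index is O(total tokens in the corpus), and each lookup touches only the positions of the name's first token, so adding more names barely costs anything. (Note that with the example corpus this gives (2, 3) for 'New York' in the third sentence, like the matcher output above.)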

There may be ways to exploit concurrency here; I'm honestly not sure, as I'm not very familiar with it yet. See, for example, https://realpython.com/async-io-python/. Also, I didn't check that code carefully for bugs... I think it's okay, but it could probably use some unit tests.
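On the concurrency point: this workload is CPU-bound, so multiprocessing is a better fit than asyncio. One rough sketch (my own assumption, not benchmarked at scale) is to split the name list into a few contiguous chunks and hand each worker process a whole chunk, so the per-task overhead of Pool.map is paid once per chunk rather than once per name; find_chunk below is a hypothetical worker handling single-token names only:

```python
from multiprocessing import Pool

def chunks(seq, n):
    """Split seq into up to n roughly equal contiguous chunks."""
    k, r = divmod(len(seq), n)
    out, start = [], 0
    for i in range(n):
        size = k + (1 if i < r else 0)
        out.append(seq[start:start + size])
        start += size
    return [c for c in out if c]

def find_chunk(args):
    """Worker: locate every single-token name of one chunk in all targets."""
    names, targets = args
    tokenized = [t.split() for t in targets]
    result = {}
    for name in names:
        result[name] = [(t, i)
                        for t, toks in enumerate(tokenized)
                        for i, x in enumerate(toks) if x == name]
    return result

if __name__ == '__main__':
    targets = [' I love New York but also York ', ' Dustin is my cat ']
    names = ['York', 'Dustin', 'cat', 'love']
    with Pool(2) as p:
        parts = p.map(find_chunk, [(c, targets) for c in chunks(names, 2)])
    merged = {k: v for part in parts for k, v in part.items()}
    print(merged['York'])  # -> [(0, 3), (0, 6)]
```

The worker is a module-level function so it can be pickled, and the Pool call sits under the __name__ == '__main__' guard, which the spawn start method requires.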

