Tagging all occurrences in a text

Question

I have a list of strings and 2 lookup tables.

Say, text: "Barrack Obama was president of the United States"

LookupA: ["Barrack", "Barrack Obama"]
LookupB: ["United", "United States", "president"]

I need a computationally cheap and Pythonic way to tag all occurrences, for example:

Result: [("Barrack", 0, "A"), ("Barrack Obama", 0, "A"), ("president", 18, "B"), ("United", 35, "B"), ("United States", 35, "B")]

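I currently have a very inefficient way of doing this. I think it could be done quickly with a trie, but I don't know how to use one over a stream of text in a Pythonic way. If it simplifies the problem, tagging at the word level (rather than the sub-word level) is also enough for my use case.

My inefficient code is below: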

annotations_all = []
for text_index, text in enumerate(texts):
    annotations = []
    found_uniq_entities_tup = {}

    for entity in lookupA:
        if entity not in found_uniq_entities_tup:
            start_index = str(text).find(entity)
            if not start_index == -1:
                found_uniq_entities_tup[entity] = 'A'

    for entity in lookupB:
        if entity not in found_uniq_entities_tup:
            start_index = str(text).find(entity)
            if not start_index == -1:
                found_uniq_entities_tup[entity] = 'B'

    def find_all(super_string: str, sub_string: str):
        start = 0
        while True:
            start = super_string.find(sub_string, start)
            if start == -1:
                return
            yield start
            start += len(sub_string)

    # Find all mentions of all found entities
    for key in found_uniq_entities_tup:
        start_index_list = find_all(str(text), str(key))
        for start_index in start_index_list:
            if not start_index == -1:
                annotations.append({"start": start_index, "end": start_index + len(key) - 1, "entity": key,
                                    "label": found_uniq_entities_tup[key]})
    annotations_all.append(annotations)

Any help is appreciated!

Tags: python, string, data-structures, time-complexity

Answer


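You can use regular expressions to combine all the keywords and map each match to a dictionary of labels. The only catch is that some of your keywords contain smaller keywords, and a single pattern reports only one match at any given position. This can be handled by generating a separate regular expression for each keyword word count and checking the text against each group of patterns.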
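To illustrate why the keywords are split by word count, here is a small sketch of what happens with one combined pattern. The name `combined` and the longest-first ordering of the alternatives are assumptions made for this illustration only.

```python
import re

text = "Barrack Obama was president of the United States"

# One alternation over every keyword, longer keywords listed first
# (hypothetical ordering chosen for this demonstration).
combined = re.compile(
    r"\b(barrack obama|barrack|united states|united|president)\b",
    flags=re.I,
)

# finditer yields non-overlapping matches, so the shorter keywords that
# start at the same position ("Barrack", "United") are never reported.
single_pass = [(m.group(), m.start()) for m in combined.finditer(text)]
print(single_pass)
# [('Barrack Obama', 0), ('president', 18), ('United States', 35)]
```

Running one pattern per word count gives each keyword length its own pass over the text, which is how the nested keywords get reported as well.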

Example:

import re
tags = {"barrack":"A", "barrack obama":"A", 
        "united":"B", "united states":"B", "president":"B"}

patterns = dict()
for tag in tags: # group keywords by number of words
    patterns.setdefault(tag.count(" "),[]).append(tag)
patterns = [re.compile(r"\b(" + "|".join(map(re.escape, tn)) + r")\b", flags=re.I)
            for tn in patterns.values()] # one regular expression per group (keywords escaped)

# generator function to find/return tagged words
def tagWords(text):
    for pattern in patterns: # lookup for each keyword group
        for match in pattern.finditer(text):    # go through matches
            word = match.group()                # matched keyword
            pos  = match.start()                # position in string
            yield (word,pos,tags[word.lower()]) # output tagged word

Output:

text = "Barrack Obama was president of the United States"
for tag in tagWords(text): print(tag)
('Barrack', 0, 'A')
('president', 18, 'B')
('United', 35, 'B')
('Barrack Obama', 0, 'A')
('United States', 35, 'B')
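Since the question mentions tries and says word-level tagging would be enough, a word-level trie could also work. The following is only a minimal sketch under stated assumptions, not part of the answer above: `build_trie`, `tag_words`, the `END` marker, and the `lookups` dictionary are hypothetical names, words are taken to be runs of `\w` characters, matching is case-insensitive, and if a keyword appears under two labels the last one wins.

```python
import re

END = ""  # hypothetical marker key: tokens are never empty, so it cannot collide

def build_trie(lookups):
    """Build a nested-dict trie keyed by lowercase words.

    lookups maps a label to its keywords, e.g. {"A": ["Barrack", ...]}.
    """
    root = {}
    for label, keywords in lookups.items():
        for keyword in keywords:
            node = root
            for word in keyword.lower().split():
                node = node.setdefault(word, {})
            node[END] = label  # a keyword ends at this node
    return root

def tag_words(text, trie):
    """Yield (matched_text, start_offset, label) for every keyword occurrence."""
    # Tokenize once, keeping each word's character offsets.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    for i in range(len(tokens)):
        node = trie
        start = tokens[i][1]
        # Walk the trie from token i for as long as the words keep matching.
        for j in range(i, len(tokens)):
            word, _, end = tokens[j]
            node = node.get(word)
            if node is None:
                break
            if END in node:
                yield (text[start:end], start, node[END])

lookups = {
    "A": ["Barrack", "Barrack Obama"],
    "B": ["United", "United States", "president"],
}
trie = build_trie(lookups)
text = "Barrack Obama was president of the United States"
result = list(tag_words(text, trie))
print(result)
# [('Barrack', 0, 'A'), ('Barrack Obama', 0, 'A'), ('president', 18, 'B'),
#  ('United', 35, 'B'), ('United States', 35, 'B')]
```

Each text is tokenized once, and the inner walk stops as soon as no keyword continues with the next word, so the work per starting word is bounded by the longest keyword rather than by the number of keywords. Results come out ordered by position, which matches the ordering shown in the question.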
