Iterate through text and find the distance between predefined substrings

Problem description

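I want to take a text and find out how close certain tags are to one another within it. Basically, the idea is to check whether two people are fewer than 14 words apart; if they are, we say they are related.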

My naive implementation works, but only when a person's name is a single word, because I iterate over the text word by word.

text = """At this moment Robert  who rises at seven and works before 
       breakfast   came in  He glanced at his wife  her cheek was 
       slightly flushed  he  patted it caressingly      What s the 
       matter  my dear   he asked      She objects to my doing nothing 
       and having red hair   said I  in an  injured tone      Oh  of 
       course he can t help his hair   admitted Rose      It generally 
       crops out once in a generation   said my brother   So does  the 
       nose  Rudolf has got them both I must premise that I am going  
       perforce  to rake up the  very scandal which my dear Lady 
       Burlesdon wishes forgotten--in the year  1733  George II  
       sitting then on the throne  peace reigning for  the moment  and 
       the King and the Prince of Wales being not yet at  loggerheads  
       there came on a visit to the English Court a certain  prince  
      who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third', 
            'a Knight of the Garter', 'James', 'Lady Burlesdon']

# my naive implementation
ws = text.split()
l = len(ws)
x = 14  # two people are related if they are fewer than x words apart
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check the next x - 1 words for any involved person
    for i in range(wi + 1, wi + x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])

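Now I would like to upgrade this script to allow multi-word names such as "Lady Burlesdon". I am not entirely sure what the best approach is. Any hints are welcome.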

Tags: python, text

Solution


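You can first preprocess the text so that every name in text is replaced with a single-word ID. Each ID must be a string that you do not expect to appear as any other word in the text. While preprocessing, keep a mapping from IDs to names so that you know which name corresponds to which ID. This lets you keep your current algorithm unchanged.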
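The preprocessing idea could be sketched roughly as follows. This is only one possible reading of the suggestion, not a definitive implementation: the function names `replace_names_with_ids` and `related_pairs` and the `__PERSON{n}__` ID format are assumptions made for illustration, and the IDs are assumed never to occur in the real text.

```python
import re


def replace_names_with_ids(text, names):
    """Replace each name in `text` with a single-word placeholder ID.

    Returns the rewritten text and a dict mapping each ID that was
    actually used back to the name it stands for.
    """
    id_to_name = {}
    # Handle longer names first so a long name is not broken up by a
    # shorter name that happens to be contained in it.
    for n, name in enumerate(sorted(names, key=len, reverse=True)):
        placeholder = f"__PERSON{n}__"  # hypothetical ID format
        # Allow any run of whitespace (including newlines) between the
        # words of a multi-word name.
        pattern = r"\b" + r"\s+".join(map(re.escape, name.split())) + r"\b"
        # Pad with spaces so the ID always ends up as its own token.
        text, count = re.subn(pattern, f" {placeholder} ", text)
        if count:
            id_to_name[placeholder] = name
    return text, id_to_name


def related_pairs(text, names, max_distance=14):
    """Return (name, name) pairs that are fewer than `max_distance` words apart."""
    processed, id_to_name = replace_names_with_ids(text, names)
    words = processed.split()
    pairs = []
    for i, word in enumerate(words):
        if word not in id_to_name:
            continue
        # Same window as the original loop: the next max_distance - 1 words.
        for other in words[i + 1 : i + max_distance]:
            if other in id_to_name:
                pairs.append((id_to_name[word], id_to_name[other]))
    return pairs
```

Calling `related_pairs(text, involved)` would then take the place of the original loop. Note that in this sketch every name counts as a single word when distances are measured, since each one has been collapsed into one ID.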

