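python - Iterating over a text and finding the distance between predefined substrings
Question
I want to take a text and find out how close certain labels in it are to each other. The idea is to check whether two people are mentioned fewer than 14 words apart; if they are, we say they are related.
My naive implementation works, but only when a person's name is a single word, because I iterate over the text word by word.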
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
x = 14  # size of the window of following words to check
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    for i in range(wi + 1, wi + x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])
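Now I want to upgrade this script to allow multi-word names such as "Lady Burlesdon". I'm not entirely sure what the best approach is. Any hints are welcome.
Answer
You can first preprocess the text so that every name in text is replaced with a single-word id. Each id must be a string that you do not expect to appear in the text as any other word. While preprocessing, keep a mapping from id to name so that you know which name each id stands for. This lets you keep your current algorithm unchanged.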
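A minimal sketch of that preprocessing idea, not a definitive implementation. The helper names `encode_names` and `find_related` and the `__PERSON<n>__` id format are hypothetical; the sketch assumes such ids never occur in the text and that names start and end with word characters.

```python
import re

def encode_names(text, names):
    """Replace each name in text with a single-word id.

    Returns the rewritten text and a dict mapping id -> original name.
    """
    id_to_name = {}
    # Longest names first, so a longer name is not broken up by a shorter one it contains
    for n, name in enumerate(sorted(names, key=len, reverse=True)):
        token = f"__PERSON{n}__"  # hypothetical id format, assumed absent from the text
        id_to_name[token] = name
        # Join the name's words with \s+ so it still matches across a line break
        pattern = r"\b" + r"\s+".join(re.escape(part) for part in name.split()) + r"\b"
        text = re.sub(pattern, token, text)
    return text, id_to_name

def find_related(text, names, window=14):
    """Run the original word-by-word scan on the encoded text."""
    encoded, id_to_name = encode_names(text, names)
    words = encoded.split()
    pairs = []
    for i, word in enumerate(words):
        if word not in id_to_name:
            continue
        # Same range as the original loop: the next window - 1 words
        for other in words[i + 1:i + window]:
            if other in id_to_name:
                pairs.append((id_to_name[word], id_to_name[other]))
    return pairs

sample = "Robert came in and greeted Lady\nBurlesdon before Rudolf the Third arrived"
for a, b in find_related(sample, ["Robert", "Rose", "Rudolf the Third", "Lady Burlesdon"]):
    print(a, b)
```

One consequence of this design is that a multi-word name counts as a single word when distances are measured, so names can end up slightly closer than in the raw text.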