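python - Adding spaces in Python
Problem description
We have recurring words such as Mr and Mrs in a text. We want to add a space before and after the keywords Mr and Mrs. However, the Mr inside Mrs is also being matched, so Mrs gets split. Please help with this:
Input: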
Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you.
import re

s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
words = ("Mr", "Mrs")

def add_spaces(string, words):
    for word in words:
        # pattern to match any non-space char before the word
        patt1 = re.compile(r'\S{}'.format(word))
        matches = re.findall(patt1, string)
        for match in matches:
            non_space_char = match[0]
            string = string.replace(match, '{} {}'.format(non_space_char, word))
        # pattern to match any non-space char after the word
        patt2 = re.compile(r'{}\S'.format(word))
        matches = re.findall(patt2, string)
        for match in matches:
            non_space_char = match[-1]
            string = string.replace(match, '{} {}'.format(word, non_space_char))
    return string

print(add_spaces(s, words))
Current output:
Hi This is Mr .Sam. Hello, this is Mr sPamela. Mr .Sam, what is your call about? Mr s.Pamela, I have a question for you.
Expected output:
Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.
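Solution
I don't have very extensive knowledge of the re module, but I came up with a solution that scales to any number of words and strings (tested in Python 3). It is fairly long-winded, and you may well find something more optimized and concise. On the other hand, the process is not hard to follow:
- First, the program orders the word list by descending length.
- Then it finds the matches of the longer words first and records the sections that have already been matched, so that they are not changed again. (Note that this introduces a limitation, but it is necessary, because the program has no way of knowing whether you want to allow one word in the variable words to be contained in another. In any case, it does not affect your case.)
- Once it has recorded all matches of a word (in the non-blocked parts of the string), it adds the corresponding spaces and corrects the blocked indices, which shift because of the inserted spaces.
- Finally, it trims the result to collapse runs of multiple spaces.
Note: I use a list rather than a tuple for the variable words.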
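Two of the steps above can be illustrated in isolation. The snippet below is only a sketch, not part of the solution itself: it uses `sorted` for the longest-first ordering, and a short made-up sample string to show why indices recorded earlier must be shifted by 2 after a space is inserted on each side of a match.

```python
words = ["Mr", "Mrs"]

# Longest keywords first; sorted() is stable, so words of equal length
# keep their original relative order.
ordered_words = sorted(words, key=len, reverse=True)

# Hypothetical sample string, chosen only for illustration.
s = "MrsPamela.Mr.Sam"
start, end = 0, len("Mrs")  # span of the "Mrs" match in s

# Insert one space before and one space after the match.
shifted = s[:start] + " " + s[start:end] + " " + s[end:]

# The standalone "Mr" sits two positions further right afterwards.
old_index = s.find("Mr", end)
new_index = shifted.find("Mr", end + 2)

print(ordered_words)
print(repr(shifted))
print(old_index, new_index)
```

Any blocked section that starts after an insertion point therefore needs both of its bounds increased by 2 for every word inserted in front of it.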
import re

def add_spaces(string, words):
    # Get the length of the longest word
    max_length = 0
    for word in words:
        if len(word) > max_length:
            max_length = len(word)
    print("max_length = ", max_length)

    # Order words by descending length
    ordered_words = []
    i = max_length
    while i > 0:
        for word in words:
            if len(word) == i:
                ordered_words.append(word)
        i -= 1
    print("ordered_words = ", ordered_words)

    # Iterate over words, adding spaces around each match and "blocking"
    # the matched section so that it is not modified again
    blocked_sections = []  # [start, end) index pairs in the current string
    for word in ordered_words:
        # re.escape makes the word match literally even if it contains regex metacharacters
        matches = [match.start() for match in re.finditer(re.escape(word), string)]
        print("matches of ", word, " are: ", matches)
        new_matches = []
        for match in matches:
            blocked = False
            for blocked_section in blocked_sections:
                # The match is blocked if it overlaps an already matched section
                if match < blocked_section[1] and match + len(word) > blocked_section[0]:
                    blocked = True
            if not blocked:
                new_matches.append(match)

        # Update the existing blocked_sections: every new match located
        # before a section inserts two spaces in front of it
        for blocked_section in blocked_sections:
            shift = 2 * sum(1 for match in new_matches if match < blocked_section[0])
            blocked_section[0] += shift
            blocked_section[1] += shift

        # Add the spaces and block the new sections
        spaces_added = 0
        for match in new_matches:
            # Space before the word
            start = match + spaces_added
            string = string[:start] + " " + string[start:]
            spaces_added += 1
            # Space after the word
            end = match + len(word) + spaces_added
            string = string[:end] + " " + string[end:]
            spaces_added += 1
            blocked_sections.append([start + 1, end])

    # Trim extra spaces
    string = re.sub(' +', ' ', string)
    return string

### MAIN ###
if __name__ == '__main__':
    s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
    words = ["Mr", "Mrs"]
    print(s)
    print(add_spaces(s, words))
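As a more concise alternative to the index bookkeeping above, the same idea (longest word first, never touch a matched section twice) can be sketched with a single regex alternation. This is a different technique from the solution above, not a restatement of it. The function name `add_spaces_regex` is made up for this example. It relies on the fact that Python's `re` tries alternatives from left to right, so listing "Mrs" before "Mr" keeps "Mrs" intact.

```python
import re

def add_spaces_regex(text, words):
    # Longest keywords first, so the alternation tries "Mrs" before "Mr".
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps keywords literal even if they contain regex metacharacters.
    pattern = re.compile("|".join(re.escape(word) for word in ordered))
    # Surround every match with spaces in one pass; matches never overlap,
    # so no section is modified twice.
    spaced = pattern.sub(lambda m: " " + m.group(0) + " ", text)
    # Collapse the runs of spaces created by the insertion.
    return re.sub(" +", " ", spaced)

if __name__ == "__main__":
    print(add_spaces_regex("this is MrsPamela.Mr.Sam", ["Mr", "Mrs"]))
    # this is Mrs Pamela. Mr .Sam
```

Like the solution above, this treats the keywords as plain substrings, so "Mr" inside an unrelated word would also be spaced out. A text that begins or ends with a keyword keeps one leading or trailing space.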