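Problem description

We have repeated words such as Mr and Mrs in a text. We want to add a space before and after the keywords Mr and Mrs. However, the word Mr is matched again inside Mrs. Please help with this:

Input: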

Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you.

import re

s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
words = ("Mr", "Mrs")


def add_spaces(string, words):
    for word in words:
        # pattern to match any non-space char before the word
        patt1 = re.compile(r'\S{}'.format(word))

        matches = re.findall(patt1, string)
        for match in matches:
            non_space_char = match[0]
            string = string.replace(match, '{} {}'.format(non_space_char, word))

        # pattern to match any non-space char after the word
        patt2 = re.compile(r'{}\S'.format(word))
        matches = re.findall(patt2, string)
        for match in matches:
            non_space_char = match[-1]
            string = string.replace(match, '{} {}'.format(word, non_space_char))

    return string


print(add_spaces(s, words))

Current output:

Hi This is Mr .Sam. Hello, this is Mr sPamela. Mr .Sam, what is your call about? Mr s.Pamela, I have a question for you.

Expected output:

Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.

Tags: python, split

Solution


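I don't have very extensive knowledge of the re module, but I came up with a solution that extends to any number of words and strings and worked in my tests (Python 3). It is a fairly long solution, and you may well find something more optimized and concise. On the other hand, the procedure is not hard to follow:

  1. First, the program sorts the word list in descending order of length.
  2. Then it finds the matches of the longer words first and records the sections where a match has already been made, so that they are not changed again. (Note that this introduces a limitation, but it is necessary because the program cannot know whether you want to allow one word in the variable words to be contained in another. In any case, it does not affect your case.)
  3. Once it has recorded all the matches of a word (in the non-blocked parts of the string), it adds the corresponding spaces and corrects the blocked indices, which shift because of the inserted spaces.
  4. Finally, it trims the result to collapse multiple spaces.

Note: I use a list instead of a tuple for the variable words.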
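Steps 1 and 2 can be looked at in isolation. The following is only a minimal sketch and not part of the original answer: the helper name `find_unblocked` and the short sample string are assumptions made for illustration. It shows that a plain search for "Mr" also hits the start of "Mrs", and that searching for longer words first while blocking the matched ranges avoids this.

```python
import re

text = "Mrs.Pamela.Mr.Sam"

# A plain search for "Mr" also matches the first two letters of "Mrs"
print([m.start() for m in re.finditer("Mr", text)])  # [0, 11]


def find_unblocked(text, words):
    """Hypothetical helper: return (start, end, word) for every match,
    longest words first, skipping matches that overlap an earlier match."""
    blocked = []  # (start, end) ranges that have already been claimed
    found = []
    for word in sorted(words, key=len, reverse=True):
        for m in re.finditer(re.escape(word), text):
            start, end = m.span()
            overlaps = any(start < b_end and end > b_start
                           for b_start, b_end in blocked)
            if not overlaps:
                blocked.append((start, end))
                found.append((start, end, word))
    return found


print(find_unblocked(text, ["Mr", "Mrs"]))
# [(0, 3, 'Mrs'), (11, 13, 'Mr')]
```

The "Mr" at index 0 is dropped because it lies inside the range already claimed by "Mrs". The complete function from the answer follows.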

import re

def add_spaces(string, words):
    # Order the words by descending length so that "Mrs" is handled before "Mr"
    ordered_words = sorted(words, key=len, reverse=True)

    # Iterate over the words, adding spaces around each match and "blocking"
    # the matched section so that it is not modified again
    blocked_sections = []  # [start, end) index pairs in the current string
    for word in ordered_words:
        # re.escape makes the word match literally even if it contains regex metacharacters
        matches = [match.start() for match in re.finditer(re.escape(word), string)]

        # Keep only the matches that do not overlap a blocked section
        sections_to_space = []
        for start in matches:
            end = start + len(word)
            blocked = any(start < b_end and end > b_start
                          for b_start, b_end in blocked_sections)
            if not blocked:
                sections_to_space.append((start, end))

        # Add the spaces and update the blocked sections
        spaces_added = 0
        for start, end in sections_to_space:
            # Positions move right by the number of spaces inserted so far
            start += spaces_added
            end += spaces_added

            # Add a space before and after the word
            string = string[:start] + " " + string[start:end] + " " + string[end:]
            spaces_added += 2

            # Blocked sections located after the word move two characters to the right
            for blocked_section in blocked_sections:
                if blocked_section[0] >= end:
                    blocked_section[0] += 2
                    blocked_section[1] += 2

            # Block the word itself at its new position (after the leading space)
            blocked_sections.append([start + 1, end + 1])

    # Collapse runs of spaces into a single space
    string = re.sub(' +', ' ', string)

    return string


###  MAIN  ###
if __name__ == '__main__':
    s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
    words = ["Mr", "Mrs"]

    print(s)
    print(add_spaces(s, words))
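
As for something more concise, one possible variant is sketched below. It is not the method of the answer above but a different technique: a single `re.sub` over an alternation of the words, with the longer words listed first so that "Mrs" is tried before "Mr". The function name `add_spaces_regex` is hypothetical, and the input string is an assumption, chosen so that the result can be compared with the expected output in the question.

```python
import re


def add_spaces_regex(string, words):
    # The regex engine tries alternatives from left to right,
    # so the longer words must come first in the pattern
    ordered = sorted(words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)

    # Surround every match with spaces, then collapse runs of spaces
    spaced = re.sub(pattern, lambda m: " " + m.group(0) + " ", string)
    return re.sub(" +", " ", spaced)


# Assumed input, reconstructed from the expected output in the question
s = "Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you."
print(add_spaces_regex(s, ["Mr", "Mrs"]))
# Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.
```

Because every position in the string is consumed by at most one match, no bookkeeping of blocked sections is needed here. Like the function above, this sketch also collapses runs of spaces that were already present in the input.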
