Return the strings contained in a block of text, without attached punctuation

Problem description

I need to match and return any word that contains at least one of the following strings/character combinations:

- tion (as in navigation, isolation, or mitigation)
- ex (as in explanation, exfiltrate, or expert)
- ph (as in philosophy, philanthropy, or ephemera)
- ost, ist, ast (as in hostel, distribute, past)

My function seems to do this:

TEXT_SAMPLE = """
Striking an average of observations taken at different times-- rejecting those
timid estimates that gave the object a length of 200 feet, and ignoring those
exaggerated views that saw it as a mile wide and three long--you could still
assert that this phenomenal creature greatly exceeded the dimensions of
anything then known to ichthyologists, if it existed at all.
Now then, it did exist, this was an undeniable fact; and since the human mind
dotes on objects of wonder, you can understand the worldwide excitement caused
by this unearthly apparition. As for relegating it to the realm of fiction,
that charge had to be dropped.
In essence, on July 20, 1866, the steamer Governor Higginson, from the
Calcutta & Burnach Steam Navigation Co., encountered this moving mass five
miles off the eastern shores of Australia.
"""

def latin_ish_words(text):
    # Split the input text on whitespace into a list of words
    text_list = text.split()
    # Create an empty list to collect the matching words
    match_list = []
    # List of Latin-ish substrings to look for
    part_list = ["tion", "ex", "ph", "ost", "ist", "ast"]
    # Add a word to match_list each time it contains one of the substrings
    for word in text_list:
        for part in part_list:
            if part in word:
                match_list.append(word)
    # Remove duplicates while preserving first-seen order
    match_list = list(dict.fromkeys(match_list))
    return match_list

latin_ish_words(TEXT_SAMPLE)

['observations', 'exaggerated', 'phenomenal', 'exceeded', 'ichthyologists,', 'existed', 'exist,', 'excitement', 'apparition.', 'fiction,', 'Navigation', 'eastern']

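However, when a word has punctuation attached, the function returns the punctuation as well.

For example: 'exist,'

How can I filter out this attached punctuation?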

Tags: python, regex, text

Solution


You can use the regular expression r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b". Explanation (see also the documentation):

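  • \b ... word boundary
  • \w ... word character
  • * ... 0 or more repetitions
  • \w* ... 0 or more word characters
  • (?:...) ... "plain" parentheses that do not create a group
  • | ... or
  • tion|ex|ph ... tion or ex or ph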
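To make the role of the non-capturing group and the word boundaries concrete, here is a minimal sketch on a short made-up sentence (the `sample` string and the variable names are illustrative, not part of the question). It uses only three of the parts to keep the pattern short.

```python
import re

# Hypothetical sample sentence, chosen so that matching words carry punctuation
sample = "It did exist, this apparition. A fact!"

# Non-capturing group (?:...): findall returns each whole match,
# and \b stops the match before the trailing punctuation.
whole = re.findall(r"\b\w*(?:tion|ex|ph)\w*\b", sample)
print(whole)  # ['exist', 'apparition']

# Capturing group (...): findall returns only the text of the group,
# which is why the non-capturing form is used in the pattern above.
parts = re.findall(r"\b\w*(tion|ex|ph)\w*\b", sample)
print(parts)  # ['ex', 'tion']
```

Because `\w` does not match commas or periods, the punctuation is never part of the match, so no separate stripping step is needed.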

Code:

import re
print(re.findall(r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b",TEXT_SAMPLE))

For convenience, you can build the pattern programmatically, adding the parts from a variable:

import re
part_list = [
    "tion", 
    "ex", 
    "ph", 
    "ost", 
    "ist", 
    "ast",
]
part_re = "|".join(part_list)
pattern = fr"\b\w*(?:{part_re})\w*\b"
# pattern = r"\b\w*(?:{})\w*\b".format(part_re) # for older versions not allowing f-string syntax
print(re.findall(pattern,TEXT_SAMPLE))

Output:

[
   'observations',
   'exaggerated',
   'phenomenal',
   'exceeded',
   'ichthyologists',
   'existed',
   'exist',
   'excitement',
   'apparition',
   'fiction',
   'Navigation',
   'eastern',
]
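
If the pattern is going to be reused, it can be wrapped in a small helper. The following is only a sketch under some assumptions: the function name `find_words_with_parts` and the `demo` string are made up for illustration, `re.escape` is added in case a part ever contains a regex metacharacter, and duplicates are dropped with `dict.fromkeys` as in the question's function (plain `re.findall` would keep them).

```python
import re

def find_words_with_parts(text, parts):
    """Return the unique whole words that contain any of the parts, in first-seen order."""
    # An empty alternation would match every position, so return early
    if not parts:
        return []
    # Escape each part so characters such as "." or "+" are matched literally
    alternation = "|".join(re.escape(p) for p in parts)
    pattern = re.compile(rf"\b\w*(?:{alternation})\w*\b")
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(pattern.findall(text)))

# Hypothetical demo text with a repeated word and attached punctuation
demo = "They exist, they exist! Navigation past the post."
result = find_words_with_parts(demo, ["tion", "ex", "ph", "ost", "ist", "ast"])
print(result)  # ['exist', 'Navigation', 'past', 'post']
```

Matching is case-sensitive here, as in the original pattern; passing `re.IGNORECASE` to `re.compile` would be one way to change that.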
