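python - Return words from a block of text that contain given strings, without attached punctuation
Problem description
I need to match and return any word that contains at least one of the following strings/character combinations: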
- tion (as in navigation, isolation, or mitigation)
- ex (as in explanation, exfiltrate, or expert)
- ph (as in philosophy, philanthropy, or ephemera)
- ost, ist, ast (as in hostel, distribute, past)
My function seems to do this:
TEXT_SAMPLE = """
Striking an average of observations taken at different times-- rejecting those
timid estimates that gave the object a length of 200 feet, and ignoring those
exaggerated views that saw it as a mile wide and three long--you could still
assert that this phenomenal creature greatly exceeded the dimensions of
anything then known to ichthyologists, if it existed at all.
Now then, it did exist, this was an undeniable fact; and since the human mind
dotes on objects of wonder, you can understand the worldwide excitement caused
by this unearthly apparition. As for relegating it to the realm of fiction,
that charge had to be dropped.
In essence, on July 20, 1866, the steamer Governor Higginson, from the
Calcutta & Burnach Steam Navigation Co., encountered this moving mass five
miles off the eastern shores of Australia.
"""
def latin_ish_words(text):
    # Split the input text into a list of words on whitespace
    text_list = text.split()
    # Create an empty list to collect the matching words
    match_list = []
    # List of Latin-ish substrings to look for
    part_list = ["tion", "ex", "ph", "ost", "ist", "ast"]
    # Add a word to match_list if it contains any of the substrings
    for word in text_list:
        for part in part_list:
            if part in word:
                match_list.append(word)
    # Drop duplicates while keeping first-seen order
    match_list = list(dict.fromkeys(match_list))
    return match_list
latin_ish_words(TEXT_SAMPLE)
['observations', 'exaggerated', 'phenomenal', 'exceeded', 'ichthyologists,', 'existed', 'exist,', 'excitement', 'apparition.', 'fiction,', 'Navigation', 'eastern']
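However, when a word has punctuation attached to it, the function returns the punctuation as well.
For example: 'exist,'
How can I filter out this attached punctuation?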
Solution
You can use the regular expression r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b".
Explanation (see also the documentation):
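- \b ... word boundary
- \w ... word character
- * ... 0 or more repetitions
- \w* ... 0 or more word characters
- (?:...) ... "plain" parentheses that do not create a group
- | ... or
- tion|ex|ph ... tion or ex or ph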
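The non-capturing group matters here because re.findall returns the contents of a group whenever the pattern has one. The following is only a small sketch to illustrate that difference; the sample sentence and the variable names are made up for the demonstration and are not part of the original answer.

```python
import re

# Illustrative sentence (an assumption, not taken from TEXT_SAMPLE verbatim)
sample = "it did exist, this was an undeniable fiction."

# Non-capturing group (?:...): findall returns each whole matched word
whole_words = re.findall(r"\b\w*(?:tion|ex)\w*\b", sample)

# Capturing group (...): findall returns only the text captured by the group
captured_parts = re.findall(r"\b\w*(tion|ex)\w*\b", sample)

print(whole_words)     # ['exist', 'fiction']
print(captured_parts)  # ['ex', 'tion']
```

Because \w does not match commas or periods, the trailing punctuation is left out of every match.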
Code:
import re
print(re.findall(r"\b\w*(?:tion|ex|ph|ost|ist|ast)\w*\b",TEXT_SAMPLE))
For convenience, you can build the pattern programmatically, adding the parts from a variable:
import re
part_list = [
"tion",
"ex",
"ph",
"ost",
"ist",
"ast",
]
part_re = "|".join(part_list)
pattern = fr"\b\w*(?:{part_re})\w*\b"
# pattern = r"\b\w*(?:{})\w*\b".format(part_re) # for older versions not allowing f-string syntax
print(re.findall(pattern,TEXT_SAMPLE))
Output:
[
'observations',
'exaggerated',
'phenomenal',
'exceeded',
'ichthyologists',
'existed',
'exist',
'excitement',
'apparition',
'fiction',
'Navigation',
'eastern',
]