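python - How to split text into sentences while handling edge cases
Problem description
I am using the code from this link to split text into sentences:
Here is the code: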
%%time
import re

# Building blocks for the patterns; raw strings keep \s from being read as a string escape.
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|ly)"
digits = r"([0-9])"

def split_into_sentences(text):
    text = " " + text + " "
    text = text.replace("\n"," ")
    # Periods that do not end a sentence are temporarily replaced with <prd>.
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    # Move closing quotes in front of the terminator so the quote stays with its sentence.
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    # Every remaining terminator marks a sentence boundary.
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    if len(sentences)==0:
        # No terminator found: return the whole text as a single sentence.
        return [text]
    else:
        return sentences
The code above works for most edge cases, but it fails in a few, for example:
text="Thank you for contacting back. Request you to please help us with the transaction ID for $<***>.92 ? - Charlie."
It breaks $<***>.92 into $<***>. and 92. How can I handle this case in the code above?
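The failure can be reproduced in isolation. The sketch below assumes that only the digit rule and the final period rule from the function matter for this input, and it uses a shortened version of the example text. The digit rule needs a digit on both sides of the period, but here the period follows the > of the masked amount, so the rule never fires and the period is later treated as a sentence end.

```python
import re

digits = "([0-9])"
text = "help us with the transaction ID for $<***>.92 ? - Charlie."

# The float rule only protects a period that sits between two digits.
protected = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
rule_fired = protected != text
print(rule_fired)

# With the period unprotected, the generic period rule splits inside the amount.
pieces = protected.replace(".", ".<stop>").split("<stop>")
print(pieces[:2])
```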
Solution
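If you want to extend the code, you can add the dollar sign ($) to the floating-point handling, right after the existing digit rule. The $ has to be escaped, because an unescaped $ in a regular expression matches the end of the string, and the pattern has to accept the masked part (<***>) between the dollar sign and the period:
text = re.sub(r"(\$\S*?)[.]" + digits, "\\1<prd>\\2", text)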
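A minimal, self-contained sketch of such a rule, under a few assumptions: the dollar sign is escaped (a bare $ anchors at the end of the string), any run of non-space characters may sit between the dollar sign and the period, and the helper name protect_dollar_amounts is hypothetical and not part of the original function. It is meant as an illustration of the idea, not a complete sentence splitter.

```python
import re

digits = "([0-9])"

def protect_dollar_amounts(text):
    # Hypothetical helper: replace the period in "$...<period><digit>" with <prd>
    # so the later ". -> .<stop>" step does not split inside the amount.
    return re.sub(r"(\$\S*?)[.]" + digits, r"\1<prd>\2", text)

sample = "the transaction ID for $<***>.92 ? - Charlie."
protected = protect_dollar_amounts(sample)

# Simplified version of the final steps of the splitter (periods only).
pieces = [s.strip() for s in protected.replace(".", ".<stop>").split("<stop>") if s.strip()]
restored = [p.replace("<prd>", ".") for p in pieces]
print(restored)
```

In the full function the same substitution would sit next to the existing digit rule, before the step that turns every remaining period into a sentence boundary.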