Splitting a string into sentences with Python

Question

I have the following string:

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

Now I want to split it into two sentences.

However, when I do this:

string.split('.')

I get:

['This is one sentence  ${w_{1},',
 '',
 ',w_{i}}$',
 ' This is another sentence',
 ' ']

Does anyone know how to improve this so that a "." inside $ $ is not treated as a sentence boundary?

Also, how would you handle:

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

Edit 1:

The desired output is:

For string 1:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence']

For string 2:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe !  ']

Tags: python, string

Solution


For the more general case, you can use re.split like this:

import re

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']

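The characters inside the brackets are the punctuation marks you want to split on, and the trailing \s{1,} requires at least one whitespace character after them, so a . that is not followed by whitespace (such as the ones inside $ $ here) is ignored. This also handles your exclamation mark case. Because both strings end with punctuation followed by whitespace, the result ends with an empty string.

Here is a (somewhat hacky) way to get the punctuation back: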
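Note that the whitespace check only works here because no "." inside $ $ happens to be followed by a space. If that cannot be guaranteed, one option is to match whole $...$ spans and skip over them while looking for sentence terminators. The following is a minimal sketch, not part of the original answer: the name split_outside_math is made up for illustration, and it assumes math spans are delimited by single, unescaped $ characters.

```python
import re

# Either a complete $...$ span (group 1, left untouched), or a run of
# terminators followed by whitespace or the end of the string.
TOKEN = re.compile(r"(\$[^$]*\$)|[.!?]+(?:\s+|$)")

def split_outside_math(text):
    """Split text into sentences, ignoring . ! ? inside $...$ spans."""
    sentences = []
    start = 0
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            continue  # math span: never a sentence boundary
        sentence = text[start:m.start()].strip()
        if sentence:
            sentences.append(sentence)
        start = m.end()
    tail = text[start:].strip()
    if tail:  # text after the last terminator, if any
        sentences.append(tail)
    return sentences

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '
print(split_outside_math(mystr))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']

print(split_outside_math('Let $a. b$ hold. Done.'))
# ['Let $a. b$ hold', 'Done']
```

This version also strips surrounding whitespace and drops empty pieces, so there is no trailing '' to clean up.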

punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '!  ']

# zip stops at the shorter list, which drops the trailing '' from re.split
sent = [x+y for x,y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence  ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe !  ']
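A possibly less hacky alternative is to split on the whitespace itself and use a lookbehind to require punctuation just before it, so the punctuation stays attached to each sentence. This is a sketch rather than part of the original answer. The helper name split_keep_punct is hypothetical, and it relies on the same assumption as above: a "." inside $ $ is never followed by whitespace.

```python
import re

def split_keep_punct(text):
    """Split on whitespace that directly follows ., ! or ?, keeping the punctuation."""
    # strip() first so trailing whitespace does not produce an empty last item
    return re.split(r"(?<=[.!?])\s+", text.strip())

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '
print(split_keep_punct(str2))
# ['This is one sentence  ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```

Unlike the zip approach, the separating whitespace is consumed by the split, so each sentence ends at its punctuation mark.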
