How do I extract data from a text file into sentences, where a sentence is the run of data rows between blank rows?

Problem description

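The data is in a text file, and I want to group it into sentences. A sentence is defined as a run of consecutive rows, each containing at least one character. Blank rows separate the rows of data, so I want a blank row to mark the start and end of a sentence. Is there a way to do this with a list comprehension?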

A sample from the text file. The data looks like this:

This is the
first sentence.

This is a really long sentence
and it just keeps going across many
rows there will not necessarily be 
punctuation
or consistency in word length
the only difference in ending sentence
is the next row will be blank

here would be the third sentence
as 
you see
the blanks between rows of data 
help define what a sentence is

this would be sentence 4
i want to pull data
from text file
as such (in sentences) 
where sentences are defined with
blank records in between

this would be sentence 5 since blank row above it
and continues but ends because blank row(s) below it

Tags: python, list, list-comprehension, data-mining

Solution


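You can read the entire file into a single string with file_as_string = file_object.read(). Since you want to split this string at each empty line, which is equivalent to splitting at two consecutive newline characters, you can use sentences = file_as_string.split("\n\n"). Finally, you will probably want to remove the newline characters that remain inside each sentence. A list comprehension does this. Replace each newline with a space rather than an empty string, so that the last word of one row does not run into the first word of the next: sentences = [s.replace('\n', ' ') for s in sentences]

Putting it all together:

# file_object is assumed to be a text file that is already open for reading.
file_as_string = file_object.read()
# An empty line shows up as two consecutive newline characters.
sentences = file_as_string.split("\n\n")
# Join the rows of each sentence with a single space.
sentences = [s.replace('\n', ' ') for s in sentences]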
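Splitting on "\n\n" assumes exactly one completely empty line between sentences. The question mentions "blank row(s)", so the file may contain several blank rows in a row, or rows that hold only spaces. A minimal sketch of one way to handle those cases, assuming that whitespace-only rows count as blank, uses itertools.groupby from the standard library. The function name read_sentences and the sample list are hypothetical and only serve as illustration:

```python
from itertools import groupby


def read_sentences(lines):
    """Group consecutive non-blank lines into sentences.

    `lines` is any iterable of strings, such as an open file object.
    Assumption: a line containing only whitespace counts as blank.
    """
    stripped = (line.strip() for line in lines)
    # groupby clusters consecutive lines by whether they are non-empty.
    # bool("") is False, so each run of blank lines forms one group,
    # and the `if has_text` filter skips it.
    return [
        " ".join(group)
        for has_text, group in groupby(stripped, key=bool)
        if has_text
    ]


# Hypothetical sample with two blank rows, one of them whitespace-only.
sample = [
    "This is the\n",
    "first sentence.\n",
    "\n",
    "   \n",
    "here would be the third sentence\n",
    "as \n",
    "you see\n",
]
print(read_sentences(sample))
# ['This is the first sentence.', 'here would be the third sentence as you see']
```

To use it on a real file, pass the open file object directly, for example sentences = read_sentences(file_object). This reads the file row by row instead of loading it into one string, and it still builds the result with a list comprehension.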
