首页 > 解决方案 > 如何使用python整理段落?

问题描述

我最近下载了一本 pdf 格式的书,并将 pdf 中的文本复制粘贴到记事本中,并将其保存为 book.txt。我现在遇到的问题是这本书有大约 29,000 行,一行一行。book.txt 中的段落不是连续的,而是在一组单词之后换行,如下所示

The project manager selected a professional team with a long practical
experience in this field which comes to around twenty years. This team is
unique in the fact that their area of experience and expertise is mainly in
dealing with this genre of texts

当您在“实用”一词之后注意到“经验”一词以换行开头但我希望这些段落是连续的,如下所示

The project manager selected a professional team with a long practical experience in this field which comes to around twenty years. This team is unique in the fact that their area of experience and expertise is mainly in dealing with this genre of texts. The team started their job keeping in mind the importance of the task and the objective the author (of the Arabic book) was trying to realize from this project.

book.txt 中的所有文本都采用相同的格式。

使用 python,我怎样才能为整本书纠正这个问题,以便每个段落都是连续的?

标签: python

解决方案


假设它很简单 \n 这将解决您的问题:

a_file = open("book.txt", "r")


string_without_line_breaks = ""

for line in a_file:

    stripped_line = line.replace('\n', " ")

    string_without_line_breaks += stripped_line



a_file.close()

formattedbookString = string_without_line_breaks.replace('  ', '\n\n')

text_file = open("formattedBook.txt", "w")
text_file.write(formattedbookString)
text_file.close()

我们在 python 中打开您的文件(假设它与您的 .py 脚本在同一目录中):

a_file = open("book.txt", "r")

我们创建一个空字符串来保存您的书,并在您的 txt 中的每一行中读取。虽然我们用空格替换单个换行符并将它们全部连接起来:

string_without_line_breaks = ""

for line in a_file:

    stripped_line = line.replace('\n', " ")

    string_without_line_breaks += stripped_line

我们正在关闭文件流,现在我们正在使用替换来获取由于用“”替换“\n”而产生的双空格,并用换行符“\n”替换它们

a_file.close()

formattedbookString = string_without_line_breaks.replace('  ', '\n\n')

最后,我们创建一个新文件来保存您的图书。

text_file = open("formattedBook.txt", "w")
text_file.write(formattedbookString)
text_file.close()

您的格式化书应该在您的 pythonscript 所在的文件系统中可用。


推荐阅读