python-3.x - How can I reorganize a PDF file into paragraphs?
Problem description
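I'm working on a captioning project, and for my dataset I have to extract images and their captions from PDF files.
I used pdftotext to extract the text from the PDFs, but now I have to reorganize these text files into paragraphs that match the PDF.
I'm using the code below, but I'm not satisfied with it, because the structure of the output text is not what I want.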
import os

text_directory = 'Texte'
for name in os.listdir(text_directory):
    filename = os.path.join(text_directory, name)
    # Open the text file
    with open(filename, encoding="utf8") as file:
        data = file.read()
    # Split the text on blank lines to get rough blocks
    paragraphs = [item for item in data.split('\n\n') if item]
    rawParagraphs = []
    for paragraph in paragraphs:
        newParagraph = []
        # Split the block into lines
        lines = paragraph.split('\n')
        for line in lines:
            # Split each line on 2 spaces to get rough columns,
            # dropping cells that are empty or whitespace only
            cols = [item.strip() for item in line.split('  ') if item.strip()]
            newParagraph.append(cols)
        # Find the maximum number of columns in this block
        maxcol = max([len(line) for line in newParagraph])
        # Pad short lines so that every line has maxcol cells
        for index, line in enumerate(newParagraph):
            if len(line) < maxcol:
                if lines[index].startswith(' '):
                    # Indented line: assume its text belongs to the right-hand columns
                    for i in range(maxcol - len(line)):
                        line.insert(0, '')
                else:
                    for i in range(maxcol - len(line)):
                        line.append('')
            newParagraph[index] = line
        rawParagraph = []
        # Read the block column by column to build one paragraph
        for i in range(maxcol):
            for j in range(len(newParagraph)):
                rawParagraph.append(newParagraph[j][i])
        # Join the cells and undo end-of-line hyphenation
        rawParagraph = ' '.join(rawParagraph).replace('- ', '')
        rawParagraphs.append(rawParagraph)
    # Keep only the non-empty paragraphs
    references = [paragraph for paragraph in rawParagraphs if paragraph]
    # Write the paragraphs to a new text file (closed automatically by "with")
    with open(os.path.join("Text_org", name), "w", encoding="utf8") as textcle:
        for index, reference in enumerate(references):
            text = f"{index} {reference}\n\n"
            textcle.write(text)
Can someone help me improve it? Or is there another way to do this easily?
Solution
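One possible way to tidy up the approach from the question is to move the column logic into a small function that works on a single block of text, so it can be tried on short strings before running it over whole files. The sketch below is only that, a sketch: it assumes the text was extracted with the layout preserved (for example with pdftotext's -layout option), so that columns are separated by runs of two or more spaces. The names split_columns and reflow_block are hypothetical helpers invented for this example, not part of any library.

```python
import re


def split_columns(line):
    """Split one layout line into cells on runs of 2+ spaces (assumed column gap)."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


def reflow_block(block):
    """Read a block column by column; return one paragraph string per column."""
    lines = [ln for ln in block.split("\n") if ln.strip()]
    rows = [split_columns(ln) for ln in lines]
    if not rows:
        return []
    width = max(len(row) for row in rows)

    padded = []
    for raw, row in zip(lines, rows):
        missing = [""] * (width - len(row))
        # Assumption: an indented short line belongs to the right-hand columns,
        # a non-indented one to the left-hand columns.
        padded.append(missing + row if raw.startswith(" ") else row + missing)

    paragraphs = []
    for col in range(width):
        # Skip padding cells so they do not add stray spaces.
        text = " ".join(row[col] for row in padded if row[col])
        # Rejoin words hyphenated across a line break: "exam- ple" -> "example".
        text = re.sub(r"(\w)- (\w)", r"\1\2", text)
        if text:
            paragraphs.append(text)
    return paragraphs


sample = (
    "The quick brown    Figure 1: A cat\n"
    "fox jumps over the    sitting on a mat\n"
    "lazy dog."
)
for paragraph in reflow_block(sample):
    print(paragraph)
```

For the sample above, this prints the left column as one paragraph and the caption as a second one. Two limitations are worth keeping in mind: a line with a single-space gap between columns will not be split, and the hyphen rule will also merge a genuine compound word that happens to break at a line end.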