python - Convert PDF files to .txt (Python 3)
Problem description
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

#converts pdf, returns its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    filepath = open(fname, 'rb')
    for page in PDFPage.get_pages(filepath, pagenums):
        interpreter.process_page(page)
    filepath.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text

def convertMultiple(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in
    for pdf in os.listdir(pdfDir): #iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf
            text = convert(pdfFilename) #get string of text content of pdf
            textFilename = txtDir + pdf + ".txt"
            textFile = open(textFilename, "w") #make text file
            textFile.write(text) #write text to text file
            #textFile.close

pdfDir = (r"FK_EPPS")
txtDir = (r"FK_txt")
convertMultiple(pdfDir, txtDir)
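I am trying to convert multiple PDF files in a folder named FK_EPPS to txt files and write them to a different folder named FK_txt. But it says there is no such file or directory. I have placed the folders exactly at those paths. I tried to find a solution, but the error persists. Can you help me understand why this happens?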
/usr/local/lib/python2.7/dist-packages/pdfminer/__init__.py:20: UserWarning: On January 1st, 2020, pdfminer.six will stop supporting Python 2. Please upgrade to Python 3. For more information see https://github.com/pdfminer/pdfminer.six/issues/194
  warnings.warn('On January 1st, 2020, pdfminer.six will stop supporting Python 2. Please upgrade to Python 3. For '
Traceback (most recent call last):
  File "/home/a1-re/Documents/pdftotext/1.py", line 44, in <module>
    convertMultiple(pdfDir, txtDir)
  File "/home/a1-re/Documents/pdftotext/1.py", line 36, in convertMultiple
    text = convert(pdfFilename) #get string of text content of pdf
  File "/home/a1-re/Documents/pdftotext/1.py", line 21, in convert
    filepath = file(fname, 'rb')
IOError: [Errno 2] No such file or directory: 'pdf1831150030.pdf'
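Solution
(The traceback you show cannot be correct. With your example input, the path in the error should begin with FK_EPPS.)
You forgot that the directory path and the file name must be joined with the separator appropriate for your operating system. The code concatenates pdfDir and pdf directly, so "FK_EPPS" + "pdf1831150030.pdf" becomes "FK_EPPSpdf1831150030.pdf", which does not exist.
If you print the value of fname at the start of the convert function, you will probably see this immediately. You made the same mistake with the text output file name, but that is harder to notice because it does not raise an error; it only creates files with the wrong names.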
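The fix can be sketched as follows, using os.path.join so the separator is chosen for the current operating system. This is a minimal sketch, not the original poster's code: the helper name build_paths is hypothetical, and stripping the .pdf extension from the output name is an assumption (the question's code would have produced names ending in .pdf.txt).

```python
import os


def build_paths(pdf_dir, txt_dir, pdf_name):
    """Return (input PDF path, output .txt path) for one file.

    build_paths is a hypothetical helper used only for illustration.
    """
    # os.path.join inserts the OS-specific separator between the parts,
    # unlike plain string concatenation (pdf_dir + pdf_name).
    pdf_path = os.path.join(pdf_dir, pdf_name)

    # Assumption: drop the ".pdf" extension before appending ".txt".
    stem, _ext = os.path.splitext(pdf_name)
    txt_path = os.path.join(txt_dir, stem + ".txt")
    return pdf_path, txt_path


if __name__ == "__main__":
    pdf_path, txt_path = build_paths("FK_EPPS", "FK_txt", "pdf1831150030.pdf")
    # Printing the paths is the quickest way to confirm they are well formed.
    print(pdf_path)
    print(txt_path)
```

Inside convertMultiple, the two returned paths would replace pdfDir + pdf and txtDir + pdf + ".txt". Opening the output file in a with block would also make sure it is closed after writing.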