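python - Reading a PDF in Python
Question
I am trying to rank the schools in my city by their final exam results. The data is available as a PDF here: https://oke.wroc.pl/wp-content/uploads/library/File/pdfy/Powiaty_E8_192/0264.pdf
Unfortunately, the simple ways of reading a PDF into Python do not work. I tried: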
- the PyPDF2 package, but it gives the error PdfReadWarning: Superfluous whitespace found in object header b'39' b'0' [pdf.py:1666]
- the textract package, but it gives the error stdout, stderr = pipe.communicate() UnboundLocalError: local variable 'pipe' referenced before assignment
- the tabula package, but it gives the error
Got stderr: sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
INFO: OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: Your current java version is: 1.8.0_131
sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: To get higher rendering speed on old java 1.8 or 9 versions,
sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),
sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: or
sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
sty 27, 2020 5:33:47 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:47 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:47 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:48 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
Traceback (most recent call last):
File "<ipython-input-17-2f598bf9e926>", line 7, in <module>
df.head()
AttributeError: 'list' object has no attribute 'head'
Solution
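The INFO lines in the tabula output are PDFBox log messages rather than failures. The traceback ends in an AttributeError on a list, which suggests that read_pdf returned a list of tables instead of a single DataFrame. Below is a minimal sketch of handling that, assuming tabula-py 2.0 or later (where read_pdf returns a list of DataFrames by default) and a local copy of the PDF. The helper names as_table_list, read_tables and preview_tables, and the file name 0264.pdf in the usage comment, are illustrative and do not come from the original post.

```python
def as_table_list(result):
    """Normalize a read_pdf-style result into a list of tables.

    Some tabula-py versions return a single DataFrame and others a list,
    so accept both shapes and always hand back a list.
    """
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


def read_tables(pdf_path, pages="all"):
    """Read every table tabula can find in the PDF (assumes tabula-py and Java)."""
    # Imported lazily so the helpers above and below work without tabula-py.
    import tabula

    return as_table_list(tabula.read_pdf(pdf_path, pages=pages))


def preview_tables(tables, n=5):
    """Call .head(n) on each table, not on the enclosing list."""
    return [table.head(n) for table in tables]


# Usage (assumes the PDF was downloaded as 0264.pdf):
# tables = read_tables("0264.pdf")
# for preview in preview_tables(tables):
#     print(preview)
```

If the tables on different pages share the same columns, they could then be combined with pandas.concat(tables, ignore_index=True). Whether that applies to this particular file is not something the post confirms.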