Reading a PDF in Python

Problem description

I am trying to rank the schools in my city by their final exam results. The data is available as a PDF here: https://oke.wroc.pl/wp-content/uploads/library/File/pdfy/Powiaty_E8_192/0264.pdf

Unfortunately, the simple ways of reading a PDF into Python do not work. I have tried:

  1. the PyPDF2 package, but it gives the error PdfReadWarning: Superfluous whitespace found in object header b'39' b'0' [pdf.py:1666]
  2. the textract package, but it gives the error stdout, stderr = pipe.communicate() UnboundLocalError: local variable 'pipe' referenced before assignment
  3. the tabula package, but it gives the error:

    Got stderr: sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2
    INFO: OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:46 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
    INFO: Your current java version is: 1.8.0_131
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
    INFO: To get higher rendering speed on old java 1.8 or 9 versions,
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
    INFO:   update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
    INFO:   or
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
    INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
    INFO:   or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:47 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:48 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
    sty 27, 2020 5:33:49 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
    INFO: OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored

    Traceback (most recent call last):

      File "<ipython-input-17-2f598bf9e926>", line 7, in <module>
        df.head()

    AttributeError: 'list' object has no attribute 'head'

Tags: python, pdf

Solution
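Looking at the tabula output above, the lines marked INFO are informational messages from PDFBox, not failures. The only actual error is the final AttributeError, which says that df is a list rather than a DataFrame. Recent versions of tabula-py return a list of DataFrames from read_pdf by default (one per table found), so calling .head() on the result fails. The sketch below is one possible way to handle that, not a confirmed fix for this particular file. The helper names as_table_list, load_tables and preview are hypothetical, and the local file name 0264.pdf assumes the PDF has been downloaded first.

```python
def as_table_list(result):
    """Return `result` as a list, whether read_pdf gave one table or many."""
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


def load_tables(pdf_path, pages="all"):
    """Read every table tabula can find in the PDF (needs tabula-py and Java)."""
    import tabula  # imported lazily so as_table_list works without tabula installed

    return as_table_list(tabula.read_pdf(pdf_path, pages=pages))


def preview(pdf_path):
    """Print the first rows of each extracted table."""
    for i, table in enumerate(load_tables(pdf_path)):
        print(f"--- table {i} ---")
        print(table.head())  # each element is a DataFrame, so .head() is valid here
```

Calling preview("0264.pdf") should print the first rows of each table instead of raising the AttributeError. If only one table is needed, load_tables("0264.pdf")[0] selects the first one. How well the columns come out depends on the layout of the PDF, so the extracted tables may still need cleaning.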

