Tabula error in Python regarding dependencies (Colab and local)

Problem description

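I'm extracting data from many PDF documents in Python, testing in Colab. A solution that works in Colab would be ideal, but if that isn't possible, a local one is fine too. Each page has many entries of interest, so I chose tabula.

The code works for most files, but crashes on others...

Can I somehow import the missing .jar etc. in Colab, or if not, how do I install it locally to get this running?

Thanks in advance!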

Got stderr: Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 17 fonts
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
... (multiple lines)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-987da78e7e88> in <module>()
      2 regions = []
      3 for i in range(0,len(regions_raw)):
----> 4     regions.append(regions_raw[i]['data'][0][0]['text'])
      5 

IndexError: list index out of range

Code (only one region is printed; mostly based on https://towardsdatascience.com/how-to-extract-tables-from-pdf-using-python-pandas-and-tabula-py-c65e43bd754):

import tabula as tb
import PyPDF2  # only used for the page count

# Region to extract, as [top, left, bottom, right], scaled by fc below
box = [2, 0, 4, 13]
fc = 28.28
for i in range(0, len(box)):
    box[i] *= fc

for filename in files:  # files: list of PDF paths, defined elsewhere
    with open(filename, 'rb') as pdftemp:
        pdfReader = PyPDF2.PdfFileReader(pdftemp)
        pagestmp = pdfReader.getNumPages()
    pages = [i + 3 for i in range(pagestmp - 2)]  # leave out first 2 pages

    regions_raw = tb.read_pdf(filename, pages=pages, area=[box], output_format="json")
    regions = []
    for i in range(0, len(regions_raw)):
        regions.append(regions_raw[i]['data'][0][0]['text'])

    print(regions)

Tags: python, pdf, tabula

Solution


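Oh, I got it. The code works; in some files the data simply starts one page later (on page 4). For a page where the box is empty, the 'data' entry is an empty list, so regions_raw[i]['data'][0] raises the IndexError.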
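
One way to guard against those empty entries is sketched below. It is a minimal sketch rather than the exact fix used here: the helper name first_cell_texts and the hand-written sample are hypothetical, and the sample only assumes the same nested shape that the question's code indexes into (a list of regions, each with a 'data' list of rows of cells carrying a 'text' key).

```python
def first_cell_texts(regions_raw):
    """Return the text of the first cell of each extracted region.

    Regions whose 'data' list (or first row) is empty yield None
    instead of raising IndexError.
    """
    texts = []
    for region in regions_raw:
        rows = region.get("data") or []
        if rows and rows[0]:
            texts.append(rows[0][0].get("text"))
        else:
            texts.append(None)  # nothing was found in this area on this page
    return texts


# Hand-written sample with the assumed shape; not real tabula output
sample = [
    {"data": [[{"text": "Entry A"}]]},
    {"data": []},  # e.g. a page where the box contains no text
    {"data": [[{"text": "Entry B"}]]},
]
print(first_cell_texts(sample))
```

In the loop from the question, regions = first_cell_texts(regions_raw) would replace the inner for loop. Keeping None as a placeholder preserves the one-entry-per-region alignment; filtering the None values out afterwards is an equally reasonable choice if only the found entries matter.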

