How to retrieve specific table data from multiple tables in a PDF using Python

Problem description

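I have 100 annual reports from different banks. All of these reports follow the same format. I want to extract the profit and loss statement and the balance sheet from all 100 PDFs and store them in a single Excel file. Is there a way to do this with Python?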

Below is my code, which extracts all tables in a PDF, page by page, and saves them as CSV files that can be opened in Excel.

import os

import pandas as pd
import PyPDF2
import tabula

filename = input("enter pdf name") + ".pdf"

# Split the PDF into one single-page file per page.
with open(filename, "rb") as source:
    pdf = PyPDF2.PdfReader(source)
    pag_no = len(pdf.pages)
    for i in range(pag_no):
        writer = PyPDF2.PdfWriter()
        writer.add_page(pdf.pages[i])
        with open("Page_" + str(i) + ".pdf", "wb") as outputStream:
            writer.write(outputStream)

# Run tabula on every page and report which pages contain a table.
for i in range(pag_no):
    page_pdf = "Page_" + str(i) + ".pdf"
    page_csv = "result_" + str(i) + ".csv"
    # tabula-py can write csv, json or tsv; xml is not a supported output format.
    tabula.convert_into(page_pdf, page_csv, output_format="csv")
    try:
        # tabula writes comma-separated files without a header row.
        df = pd.read_csv(page_csv, header=None)
        if df.empty:
            print("no table found in ---> PAGE " + str(i + 1))
        else:
            print("table found in ---> PAGE " + str(i + 1) + " and stored in ---> " + page_csv)
    except (pd.errors.EmptyDataError, FileNotFoundError):
        # No table on this page: delete the per-page files created above.
        for path in (page_pdf, page_csv):
            if os.path.exists(path):
                os.remove(path)

Tags: python, excel, tabular, tabula, pdf-parsing

Solution
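One possible approach, sketched under assumptions rather than as a definitive answer: extract every table with tabula-py, keep only the tables whose header or cells mention the statement you want, and write the matches to one workbook. The keyword lists, the function names (`classify_table`, `select_statements`, `export_reports`) and the sheet-naming scheme are all hypothetical. The sketch also assumes the statement title ends up inside the extracted table; if the title is printed above the table instead, the page text would have to be matched separately.

```python
import pandas as pd

# Hypothetical keyword lists; adjust them to the headings used in the reports.
STATEMENT_KEYWORDS = {
    "balance_sheet": ["balance sheet"],
    "profit_and_loss": ["profit and loss", "income statement"],
}


def classify_table(df, keywords=STATEMENT_KEYWORDS):
    """Return the statement name whose keyword appears in the table, else None."""
    # Flatten column labels and non-missing cell values into one lower-case string.
    parts = [str(c) for c in df.columns]
    parts += [str(v) for v in df.to_numpy().ravel() if pd.notna(v)]
    text = " ".join(parts).lower()
    for name, words in keywords.items():
        if any(word in text for word in words):
            return name
    return None


def select_statements(tables, keywords=STATEMENT_KEYWORDS):
    """Keep the first table that matches each statement name."""
    found = {}
    for df in tables:
        name = classify_table(df, keywords)
        if name is not None and name not in found:
            found[name] = df
    return found


def export_reports(pdf_paths, excel_path):
    """Write the matching tables from every PDF to one workbook; return the sheet count."""
    import tabula  # imported lazily; tabula-py needs a Java runtime

    sheets = {}
    for index, pdf_path in enumerate(pdf_paths):
        tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
        for name, df in select_statements(tables).items():
            # Excel limits sheet names to 31 characters.
            sheets[f"{index}_{name}"[:31]] = df
    if not sheets:
        return 0  # nothing matched, so no workbook is created
    with pd.ExcelWriter(excel_path) as writer:
        for sheet_name, df in sheets.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)
    return len(sheets)
```

A call might look like `export_reports(sorted(glob.glob("reports/*.pdf")), "statements.xlsx")`, where both paths are placeholders and writing `.xlsx` requires an Excel engine such as openpyxl. Because the reports share one format, it may also be enough to pass the known page numbers of the two statements to `tabula.read_pdf` through its `pages` argument instead of matching keywords.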

