How to convert a PDF to a pandas DataFrame in Python and extract values?

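Question

I downloaded a PDF file online and want to load it into a pandas DataFrame. The next step is to extract the CAS and REACH numbers from that DataFrame.

Can anyone help me?

Here is the PDF link (updated): https://msdspds.castrol.com/ussds/amersdsf.nsf/Files/109BFD5F3F227AE58025859100538A55/$File/2620961.pdf

I want the CAS numbers and REACH numbers from Section 3 of the PDF.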

Many thanks, Joan

Tags: python, pandas, string, pdf

Answer


I ran into this problem too. I found a solution that uses PyPDF2 and tabula.

FWIW, this was run in a Jupyter Notebook on Ubuntu.

The first cell imports everything.

# Import modules needed for this project
import tabula as tb
from PyPDF2 import PdfFileReader  # works on PyPDF2 < 3.0; PyPDF2 3.x replaces this class with PdfReader
import pandas as pd
import glob

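Here we use PyPDF2 to read how many pages the PDF contains. tabula cannot do this, and we need an exact count to pass to the next loop, which feeds the PDF to tabula page by page and converts each page to CSV.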

# This cell builds the list of page numbers in the PDF. We cannot rely on reading the file as a whole :(
# We will pass this list into the next cell.

infile = '../PDFs/2620961.pdf'

# Get the number of pages from the PDF; the with-block closes the file afterwards
with open(infile, 'rb') as f:
    pdf = PdfFileReader(f)
    numPages = pdf.getNumPages()

# Build the 1-based list of page numbers to pass into the reader loop
tmpPages = list(range(1, numPages + 1))

print("There are", len(tmpPages), "pages.")

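This cell loops over tabula.convert_into, passing each page number (i) to the 'pages=' parameter. I could not get tabula.read_pdf to work this way, so this seemed to be my only option. (The tabula-py documentation does list a pages argument for read_pdf as well, so it may be worth trying first.)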

# This loops over the main PDF file page by page, saving each page as a CSV in the ../pages directory
# THIS MIGHT TAKE SOME TIME IF THE FILE IS LARGE
import os

os.makedirs("../pages", exist_ok=True)  # make sure the output directory exists
print(len(tmpPages), "pages to be converted.")  # Here is our list of pages.

# This for loop takes the list of pages in the PDF from the previous cell,
# converts each page into its own CSV and saves it to ../pages
for i in tmpPages:
    print("Converting page: " + str(i))
    tb.convert_into(infile,
                    "../pages/page-" + str(i) + ".csv",
                    guess=True,
                    output_format="CSV",
                    stream=True,
                    pages=i,
                    silent=True)

print("Done!")

Finally, we use pandas to read all the CSVs created in the previous cell and build a single DataFrame from all the converted PDF pages.

# This cell takes the CSVs from the previous cell and converts them into one DataFrame
path = r'../pages/' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, names=[0,1,2,3,4,5], index_col=0, header=None)
    li.append(df)

frame = pd.concat(li, ignore_index=False)
frame
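
One thing to watch in the cell above: glob.glob returns files in arbitrary order, and a plain string sort would put page-10 before page-2. If row order matters, the file list can be sorted by page number first. The sketch below is only an illustration; the helper name page_number and the sample file list are hypothetical, and it assumes the page-N.csv naming used in the conversion cell.

```python
import re

def page_number(path):
    """Return the page number embedded in a name like 'page-12.csv'."""
    match = re.search(r"page-(\d+)\.csv$", path)
    # Files that do not follow the naming scheme sort first (assumption)
    return int(match.group(1)) if match else -1

# Hypothetical file list, in the kind of order glob might return
all_files = ["../pages/page-10.csv", "../pages/page-2.csv", "../pages/page-1.csv"]

# Sort numerically by page instead of alphabetically
ordered = sorted(all_files, key=page_number)
print(ordered)
```

In the notebook, the same key could be applied as sorted(all_files, key=page_number) before the read_csv loop.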

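From here you can clean up your DataFrame.

Here are a few rows of the DataFrame. It is messy, but I believe the numbers you are looking for are in there.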

                                        1                          2        3                        4           5
0
Product/ingredient name                 Oral (mg/                  Dermal   Inhalation               Inhalation  Inhalation
NaN                                     kg)                        (mg/kg)  (gases)                  (vapours)   (dusts
NaN                                     NaN                        NaN      (ppm)                    (mg/l)      and mists)
NaN                                     NaN                        NaN      NaN                      NaN         (mg/l)
maleic anhydride                        500                        NaN      NaN                      NaN         NaN
Phosphorodithioic acid, mixed O,O-bis   REACH #: 01-2119493628-22  ≤2.4     Skin Irrit. 2, H315      [1] [2]     NaN
(iso-bu and pentyl) esters, zinc salts  EC: 270-608-0              NaN      Eye Dam. 1, H318         NaN         NaN
NaN                                     CAS: 68457-79-4            NaN      Aquatic Chronic 2, H411  NaN         NaN
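
To get from rows like these to the CAS and REACH numbers the question asks for, one option is a pair of regular expressions run over every cell. This is a minimal sketch, not part of the original answer: the patterns are assumptions inferred from the sample rows ("CAS: 68457-79-4", "REACH #: 01-2119493628-22"), and the names extract_ids and cas_checksum_ok are hypothetical helpers. The check-digit test uses the standard CAS rule (weighted digit sum modulo 10) to filter out garbled matches.

```python
import re

# Patterns are assumptions based on the sample rows shown above
CAS_RE = re.compile(r"CAS:\s*(\d{2,7}-\d{2}-\d)")
REACH_RE = re.compile(r"REACH\s*#:\s*(\d{2}-\d{10}-\d{2})")

def extract_ids(cells):
    """Collect CAS and REACH numbers from an iterable of cell values."""
    found = {"cas": [], "reach": []}
    for cell in cells:
        text = str(cell)  # NaN and numbers become harmless strings
        found["cas"].extend(CAS_RE.findall(text))
        found["reach"].extend(REACH_RE.findall(text))
    return found

def cas_checksum_ok(cas):
    """Check the CAS check digit (the last digit) against the other digits."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weights count up from 1, starting at the rightmost body digit
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

# Cell values copied from the sample rows, plus a NaN as pandas would produce
sample_cells = [
    "REACH #: 01-2119493628-22",
    "EC: 270-608-0",
    "CAS: 68457-79-4",
    float("nan"),
]
ids = extract_ids(sample_cells)
print(ids)
print([cas_checksum_ok(c) for c in ids["cas"]])
```

With the frame built earlier, the cells could be flattened first, for example extract_ids(frame.reset_index().to_numpy().ravel()), so that the index column is searched too. Values that tabula split across two cells would not be matched by this sketch.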
