Search for specific words in PDFs and return only the links of the PDFs where the words were found (Python)

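Problem description

I am trying to search for several words across many PDFs. The links to these PDFs are stored in a data frame. The goal is for Python to return a message saying "the word was found in <PDF link>". Here is my code so far (FYI, g7 is the name of the data frame that holds the links). The problem is that the code returns the same link multiple times, once for every time the word is found.

The data frame (named g7) looks like this: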

    URL
0   https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf
1   https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.pdf
2   https://westafricatradehub.com/wp-content/uploads/2021/08/Senegal-Health-RFA-Webinar-QA.pdf
3   https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Catalytic-Business-Concepts-Round-2.pdf
4   https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Concepts-d%E2%80%99Affaires-Catalytiques-2ieme-Tour.pdf
5   https://westafricatradehub.com/wp-content/uploads/2021/06/APS-WATIH-1247_Research-Development-Round-2.pdf

The code is as follows:

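import glob
import os
import pathlib
import re

import PyPDF2
import textract  # needed for the textract.process() call below

# download_file() used below is a helper whose definition is not included here.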
for i in range(g7.shape[0]):
    pdf_link=g7.iloc[i,0]
    download_file(pdf_link, f"pdf_{i}")
    text = textract.process(f"/Users/fze/pdf_{i}.PDF")        
    # open the pdf file
    object = PyPDF2.PdfFileReader(f"/Users/fze/pdf_{i}.PDF") 
    all_files = glob.glob('/Users/fze/*.pdf') #User input: give path to your downloads folder file path
    latest_pdf_path = max(all_files, key=os.path.getctime)
    
    path = pathlib.PurePath(latest_pdf_path)
    latest_pdf_name=path.name
    print(latest_pdf_name)
    
    # get number of pages
    NumPages = object.getNumPages()    
    # define keyterms
    search_word = 'organization'
    # extract text and do the search
    for i in range(0, NumPages):
            page = object.getPage(i)
            text = page.extractText()
            search_text = text.lower().split()
            for word in search_text:
                if search_word in word:
                    print("The word '{}' was found in '{}'".format(search_word,pdf_link))

Thanks!

Tags: python, dataframe, pdf, search, hyperlink

Solution
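The repeated output most likely comes from where the print call sits: it is inside the innermost loop, so it runs once for every matching token on every page. The inner page loop also reuses the variable `i` from the outer loop, which is harmless here only because `pdf_link` is read before the inner loop starts. One possible fix is to ask a yes/no question per PDF ("does any page contain the word?") and report the link once. The sketch below is not a confirmed answer from the original thread. The names `page_contains`, `links_containing` and `get_pages` are hypothetical, and `get_pages` stands in for whatever downloads a PDF and yields the text of each page. Plain strings are used instead of real PDFs so the example runs on its own.

```python
from typing import Callable, Iterable, List


def page_contains(page_text: str, search_word: str) -> bool:
    """Return True if any whitespace-separated token contains search_word.

    This keeps the matching rule from the question: lowercase the text,
    split on whitespace, and test for a substring in each token.
    """
    needle = search_word.lower()
    return any(needle in token for token in page_text.lower().split())


def links_containing(
    links: Iterable[str],
    get_pages: Callable[[str], Iterable[str]],
    search_word: str,
) -> List[str]:
    """Return the links whose PDF has at least one matching page.

    get_pages is a hypothetical callable: given a link, it yields the
    extracted text of each page of that PDF.
    """
    found = []
    for link in links:
        # any() stops at the first matching page, so each link in the
        # input is appended at most once.
        if any(page_contains(text, search_word) for text in get_pages(link)):
            found.append(link)
    return found


# Stand-in data instead of real PDFs: link -> list of page texts.
fake_pages = {
    "a.pdf": ["Our Organization works here", "another organization page"],
    "b.pdf": ["nothing relevant on this page"],
}
search_word = "organization"
for link in links_containing(fake_pages, fake_pages.__getitem__, search_word):
    print("The word '{}' was found in '{}'".format(search_word, link))
```

In the setting of the question, `links` would be the URL column of `g7`, and `get_pages` would download the file and yield `page.extractText()` for each page, as the original loop already does. If the data frame itself contains the same URL more than once, that URL would still be reported once per row.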

