How to extract keywords from a list of PDFs

Problem description

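I have a list in Python containing many URLs, and I wrote a loop that downloads all of them into a folder on my desktop. So far, each PDF is named like this: document0, document1, ..., documentx

What I want to do is extract keywords from each PDF file, but so far I have not been able to figure out how.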

"""
Created on Tue Aug 17 11:03:34 2019

@author: xxxx
"""
# This code handles only one of the PDFs, but I want to do the same for
# each one with the characteristics described above.
import os
import re

import PyPDF2

os.chdir("//DOCUMENTS/")

reader = PyPDF2.PdfFileReader("document3.pdf")  # renamed to avoid shadowing the built-in `object`
num_pages = reader.getNumPages()
keyword = "USD"
for i in range(num_pages):  # pages are 0-indexed; range(1, n) would skip the first page
    page = reader.getPage(i)
    print("this is page " + str(i))
    text = page.extractText()
    # print(text)
    match = re.search(keyword, text)  # a match object, or None if the keyword is absent
    print(match)

Tags: python, python-3.x, pdf

Solution


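A quick way to do shell-style name matching is to use the glob module. Below, I have rewritten your code to return a generator of matches from a PDF file. We then add up the counts of all such matches across all documents.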
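
To make the shell-style matching concrete, here is a small sketch using fnmatch, the standard-library module that glob relies on for the filename part of a pattern. The file names are made up for illustration and nothing is read from disk.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, standing in for a directory listing.
names = ["document0.pdf", "document1.pdf", "document12.pdf",
         "notes.pdf", "document3.txt"]

# '*' matches any run of characters, so every documentN.pdf is selected,
# while files with another prefix or extension are left out.
matched = [name for name in names if fnmatchcase(name, "document*.pdf")]
print(matched)
```

Note that glob itself makes no promise about the order of the paths it returns, so wrap the call in sorted() if the order matters.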

from glob import glob
import re
from PyPDF2 import PdfFileReader  # legacy PyPDF2 name; PyPDF2 3.0 and later call this PdfReader

def search_page(pattern, page):
    # Yield every match of the pattern in this page's extracted text
    yield from pattern.findall(page.extractText())

def search_document(pattern, path):
    # Yield the matches from every page of the PDF at `path`
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)

pattern = re.compile(r'USD')  # Or r'\bUSD\b' if you don't want to match words containing USD

count = 0

for path in glob('//DOCUMENTS/document*.pdf'):
    matches = search_document(pattern, path)
    count += sum(1 for _ in matches)  # consume the generator, counting its items

print(f"Total count is {count}")  # Before Python 3.6: "Total count is {}".format(count)
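
If a per-document breakdown is more useful than a single total, the same counting idea can be tallied per file. The sketch below is only an illustration, not part of the original answer: to stay runnable without PyPDF2 it works on page texts that have already been extracted, and the `count_matches` helper, the `pages_by_document` dict, and its contents are all invented for the example. It also shows the effect of the `\b` word boundaries mentioned in the code comment above.

```python
import re
from collections import Counter

def count_matches(pattern, pages_by_document):
    """Return a Counter mapping document name -> number of pattern matches."""
    counts = Counter()
    for name, pages in pages_by_document.items():
        for text in pages:
            counts[name] += len(pattern.findall(text))
    return counts

# Invented sample data: document name -> list of extracted page texts.
pages_by_document = {
    "document0.pdf": ["Price: 10 USD", "Paid in USD or USDT"],
    "document1.pdf": ["No currency mentioned here"],
}

# r"USD" also matches inside "USDT"; r"\bUSD\b" only matches the whole word.
loose = count_matches(re.compile(r"USD"), pages_by_document)
strict = count_matches(re.compile(r"\bUSD\b"), pages_by_document)

for name in pages_by_document:
    print(name, loose[name], strict[name])
```

With real files, the list of page texts for each path could come from the text-extraction call used in `search_page` above.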
