
Problem description

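I'm a beginner-level Python user, and I'm trying to write a program that returns the text before and after a specific word I search for (say, the 50 words before and the 50 words after). So far I've managed to write a program that reports the PDF pages on which the word is mentioned. How can I write those 100 words to a CSV?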

import PyPDF2
import re
import os
...
for pdfName in pdffiles:
    pdfFull = pdfFolder + pdfName
    pdfFileObj = open(pdfFull, mode='rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    number_of_pages = pdfReader.numPages
    pages_text = []
    words_start_pos = {}
    words = {}

    csvFolder = newpath
    csvName = pdfName.replace('pdf', 'csv')
    csvFull = csvFolder + csvName
    with open(csvFull, 'w') as f:
        f.write('{0},{1},{2}\n'.format("Sheet Number", "Search Word", "File Name"))
        for word in searchwords:
            for page in range(number_of_pages):
                pages_text.append(pdfReader.getPage(page).extractText())
                words_start_pos[page] = [dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
                words[page] = [pages_text[page][value:value + len(word)] for value in words_start_pos[page]]
            for page in words:
                for i in range(0, len(words[page])):
                    if str(words[page][i]) != 'nan':
                        f.write('{0},{1},{2}\n'.format(page + 1, words[page][i], pdfFull))

Tags: python, pdf, pypdf2

Solution


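I don't think you need to scan the page character by character and find the index of the first letter. You can still extract the text as you already do: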

pages_text.append(pdfReader.getPage(page).extractText())

and then do something like this:

pages_text[0].split()

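This gives you every word from the extracted text, so you already have the words, instead of indexing letters and having to work out where each word starts and ends. From there, I would loop over the words, find the index of each occurrence of the search word, and then take the 50 words before and the 50 words after that index and print them out. I tried this on the first page of a PDF, as follows: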

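import PyPDF2

# PdfFileReader, numPages, getPage and extractText are the older PyPDF2 names.
# PyPDF2 3.x uses PdfReader, len(reader.pages), reader.pages[i] and extract_text().
pdfFileObj = open(r'C:\path', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

number_of_pages = pdfReader.numPages
pages_text = []
searchwords = ["pdf"]
word_pos = []
print_text = ''
line = []

# Extract the text of every page once.
for page in range(number_of_pages):
    pages_text.append(pdfReader.getPage(page).extractText())

# Only the first page is searched in this example.
text = pages_text[0].split()
for word in searchwords:
    for each_word in range(0, len(text)):
        # Compare case-insensitively so that "pdf" also matches "PDF".
        if text[each_word].lower() == word.lower():
            word_pos.append(each_word)

print(word_pos)
for each_pos in word_pos:
    # Clamp the window so it never runs past either end of the word list.
    start = max(0, each_pos - 50)
    end = min(len(text), each_pos + 51)
    for each_word in range(start, end):
        print_text = print_text + ' ' + text[each_word]
    line.append(print_text.strip())
    print_text = ''
print(line)
with open(r'C:\path', 'w') as f:
    f.write('{0},{1},{2}\n'.format("Sheet Number", "Surrounding Text", "File Name"))
    for each_line in line:
        # The sheet number is 1 because only pages_text[0] was searched.
        f.write('{0},{1},{2}\n'.format(1, each_line, r'C:\path'))
pdfFileObj.close()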
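
The windowing step can also be pulled out into a small helper that works on any string, which makes it easy to try without a PDF. This is only a sketch: the name find_contexts, its window parameter, and the choice to ignore punctuation around a word are assumptions for illustration, not part of the original answer.

```python
import string


def find_contexts(text, search_word, window=50):
    """Return one context string per occurrence of search_word in text.

    Hypothetical helper: each context holds up to `window` words before
    the match, the match itself, and up to `window` words after it.
    """
    words = text.split()
    target = search_word.lower()
    contexts = []
    for i, token in enumerate(words):
        # Ignore case and surrounding punctuation, so "PDF." matches "pdf".
        if token.strip(string.punctuation).lower() == target:
            start = max(0, i - window)               # do not go below index 0
            end = min(len(words), i + window + 1)    # do not run past the end
            contexts.append(' '.join(words[start:end]))
    return contexts


sample = "one two three PDF. four five six pdf seven"
print(find_contexts(sample, "pdf", window=2))
```

To cover a whole document, the same function could be called once per page on the text returned by extractText(), keeping the page number alongside each returned context.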

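Note: I would be wary of saving text scraped from a PDF in a CSV file, because the text will very likely contain commas, which will break the column layout of your CSV file. I hope this helps!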
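
One way around the comma problem is to let Python's standard csv module do the quoting instead of building each line with str.format. The sketch below is an assumption about how that might look: write_contexts and the sample row are made up for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_contexts(file_obj, rows):
    """Write (sheet number, context, file name) rows as quoted CSV.

    Hypothetical helper: csv.writer quotes any field that contains a
    comma, a quote character or a newline, so the columns stay intact.
    """
    writer = csv.writer(file_obj)
    writer.writerow(["Sheet Number", "Surrounding Text", "File Name"])
    writer.writerows(rows)


# An in-memory buffer stands in for a real file in this example.
buffer = io.StringIO()
write_contexts(buffer, [(1, 'before, the "pdf" word, after', "example.pdf")])
print(buffer.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with newline='' (for example open(path, 'w', newline='')) so that line endings are not translated a second time.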

