How do I get specific parts of a PDF?

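Problem description

I have a PDF file and want to extract several specific parts of its text.

For example, suppose I have the following page: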

0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed 
to teach fundamental skills in reading, writing and arithmetic. The typical age range of 
participants can be used to distinguish between detailed field 0011 ‘Basic programmes and 
qualifications’ and this detailed field. 
Programmes and qualifications with the following main content are classified here:
Basic remedial programmes for youth or adults
Literacy
Numeracy
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual’s capacity (mental, 
social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic
Programmes and qualifications with the following main content are classified here:

I want every line that contains a 4-digit number, plus the whole paragraph that follows it, up to this sentence: Programmes and qualifications with the following main content are classified here:

So the output would be:

first_list = ['0021 Literacy and numeracy', '0031 Personal skills']
second_list = ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field.', 'Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic']

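I tried to do this but could not finish it.

I tried to get the text of the PDF and find keywords that come before, or on the same line as, the text I want.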

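import re
import PyPDF2

f = open('f.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)  # legacy PyPDF2 (< 3.0) API
num_pages = pdf_reader.numPages
count = 0
text = ''
# concatenate the extracted text of every page
while count < num_pages:
    pageObj = pdf_reader.getPage(count)
    count += 1
    text += pageObj.extractText()
text_fefore = re.findall('Programmes and qualifications with the following main content are classified here', text)
# a Python name cannot start with a digit, so four_digit rather than 4_digit
four_digit = re.findall(r'\d\d\d\d', text)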

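So text_fefore is the sentence whose preceding paragraph I need, and the 4-digit match is the number whose whole line I want.

Any idea how to complete this code?

Note: the 4 digits are at the start of the line.

I should also mention that re.search('Programmes and qualifications with the following main content are classified here', text) gives me the span (start and end) of that sentence. So I know where to stop taking text, but how do I find the starting point?

The same goes for the digits: re.search(r'\d\d\d\d', text) gives me a span, and from it I need to find the end of that line. That would answer my question above.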

Tags: python, pdf, text, re, pypdf

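Solution

You can try using the nltk library to split the text into sentences: for each line that starts with a 4-digit integer, take the first sentence that follows it. Here is a code snippet: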

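import re
import nltk  # sent_tokenize needs the 'punkt' tokenizer data to be installed

lines = text.split('\n')
res = [[], []]
for idx, sent in enumerate(lines):
    ints = re.findall(r'(\d+)', sent)
    # a heading line starts with a number of exactly four digits
    if sent[:4] in ints:
        res[0].append(sent)
        # keep only the first sentence of the text after the heading
        res[1].append(nltk.sent_tokenize(' '.join(lines[idx+1:]))[0])
res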

Out:

[['0021 Literacy and numeracy', '0031 Personal skills'],
 ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed  to teach fundamental skills in reading, writing and arithmetic.',
  'Personal skills are defined by reference to the effects on the individual’s capacity (mental,  social etc.).']]
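
Note that the snippet above returns only the first sentence after each heading. If the whole paragraph up to the marker sentence is wanted, as the question asks, a regex-only approach might look like the following. This is a minimal sketch and not the answerer's method: the names extract_sections, MARKER and SECTION_RE are made up for illustration, and it assumes that text has each heading on its own line with the 4-digit code at the very start, and that every heading is followed by the marker sentence before the next heading.

```python
import re

# Sentence that ends each paragraph of interest (taken from the question).
MARKER = ("Programmes and qualifications with the following "
          "main content are classified here:")

# A heading is a whole line that starts with exactly four digits and a space.
# The lazy group then captures everything up to the next MARKER.
SECTION_RE = re.compile(
    r"^(\d{4} [^\n]*)\n(.*?)" + re.escape(MARKER),
    re.MULTILINE | re.DOTALL,
)


def extract_sections(text):
    """Return (headings, paragraphs) as two parallel lists."""
    headings, paragraphs = [], []
    for m in SECTION_RE.finditer(text):
        headings.append(m.group(1).strip())
        # Collapse line breaks and repeated spaces inside the paragraph.
        paragraphs.append(" ".join(m.group(2).split()))
    return headings, paragraphs
```

Calling headings, paragraphs = extract_sections(text) on the sample page should put '0021 Literacy and numeracy' and '0031 Personal skills' in headings; the line '003 Personal skills' is skipped because it has only three digits. The match object's start() and end() also give the spans the question asks about, if character offsets are needed instead of the text itself.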
