How do I get specific parts of a PDF?




0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed 
to teach fundamental skills in reading, writing and arithmetic. The typical age range of 
participants can be used to distinguish between detailed field 0011 ‘Basic programmes and 
qualifications’ and this detailed field. 
Programmes and qualifications with the following main content are classified here:
Basic remedial programmes for youth or adults
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual’s capacity (mental, 
social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic
Programmes and qualifications with the following main content are classified here:

I want to get every line that contains a 4-digit number, together with all of the paragraphs that follow it, up to this sentence: Programmes and qualifications with the following main content are classified here:


first_list = ['0021 Literacy and numeracy', '0031 Personal skills']
second_list = ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field.',
               'Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic']



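import re
import PyPDF2

# Read every page into one string (older PyPDF2 API, as used in the question)
text = ''
with open('f.pdf', 'rb') as f:
    pdf_reader = PyPDF2.PdfFileReader(f)
    num_pages = pdf_reader.numPages
    count = 0
    while count < num_pages:
        pageObj = pdf_reader.getPage(count)
        count += 1
        text += pageObj.extractText()
# Every occurrence of the sentence that ends each section
text_before = re.findall('Programmes and qualifications with the following main content are classified here', text)
# Every run of 4 digits (a Python name cannot start with a digit, so not 4_digit)
four_digit = re.findall(r'\d\d\d\d', text)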



Note: the 4 digits are at the start of the line.

I should also mention that re.search('Programmes and qualifications with the following main content are classified here', text) gives me the span (start and end) of that sentence. So I know where to stop collecting text, but how do I find the starting point?

Likewise, for re.search(r'\d\d\d\d', text) I need the span up to the end of that line. That end-of-line position is the starting point I asked about above.
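
One way to turn those spans into a starting point, as a minimal sketch rather than a definitive solution: match the whole heading line with re.MULTILINE so that the match's end() is the end of that line, then search for the stop sentence from that position. The names heading_re and stop_re and the shortened sample text are assumptions made for illustration; real extractText() output may break lines differently.

```python
import re

STOP = "Programmes and qualifications with the following main content are classified here:"

# Shortened, hypothetical stand-in for the text extracted from the PDF
text = """0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed
to teach fundamental skills in reading, writing and arithmetic.
Programmes and qualifications with the following main content are classified here:
Basic remedial programmes for youth or adults
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual's capacity.
Programmes and qualifications with the following main content are classified here:
"""

# A whole line that starts with exactly four digits; m.end() is the end of that line
heading_re = re.compile(r"^\d{4}\b.*$", re.MULTILINE)
stop_re = re.compile(re.escape(STOP))

first_list, second_list = [], []
for m in heading_re.finditer(text):
    # Look for the first stop sentence that comes after this heading line
    stop = stop_re.search(text, m.end())
    if stop is None:
        continue
    first_list.append(m.group().strip())
    body = text[m.end():stop.start()]
    # Collapse line breaks and repeated spaces into single spaces
    second_list.append(" ".join(body.split()))

print(first_list)
print(second_list)
```

This assumes every heading is followed by its own stop sentence. If one is missing, the body runs on to the next stop sentence, and a wrapped body line that happens to begin with four digits would be treated as a heading.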

Tags: python, pdf, text, re, pypdf



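import re
import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

# `text` is the string extracted from the PDF in the question
lines = text.split('\n')
res = [[], []]  # res[0]: 4-digit heading lines, res[1]: first sentence after each heading
for idx, sent in enumerate(lines):
    ints = re.findall(r'(\d+)', sent)
    # A heading line begins with a 4-digit number
    if sent[:4] in ints:
        res[0].append(sent)
        res[1].append(nltk.sent_tokenize(' '.join(lines[idx+1:]))[0])
print(res)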


[['0021 Literacy and numeracy', '0031 Personal skills'],
 ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed  to teach fundamental skills in reading, writing and arithmetic.',
  'Personal skills are defined by reference to the effects on the individual’s capacity (mental,  social etc.).']]
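
The output above keeps only the first sentence after each heading. If the whole passage up to the stop sentence is wanted, as the question asks, a line-by-line pass without nltk might look like the sketch below. The function name split_sections and the sample string are hypothetical, and the code assumes that each heading and each stop sentence starts on its own line.

```python
import re

STOP = "Programmes and qualifications with the following main content are classified here"

def split_sections(text):
    """Return (headings, bodies) for every line that starts with a 4-digit number."""
    headings, bodies = [], []
    collecting = False  # True while we are between a heading and its stop sentence
    for raw in text.split("\n"):
        line = raw.strip()
        if re.match(r"\d{4}\b", line):
            # New heading: start a fresh body
            headings.append(line)
            bodies.append([])
            collecting = True
        elif line.startswith(STOP):
            # Stop sentence: ignore everything until the next heading
            collecting = False
        elif collecting and line:
            bodies[-1].append(line)
    # Join the wrapped lines of each body and collapse repeated spaces
    return headings, [" ".join(" ".join(b).split()) for b in bodies]

# Hypothetical sample in the same shape as the extracted PDF text
sample = (
    "0021 Literacy and numeracy\n"
    "Literacy and numeracy are programmes, designed \n"
    "to teach fundamental skills.\n"
    "Programmes and qualifications with the following main content are classified here:\n"
    "Basic remedial programmes for youth or adults\n"
    "003 Personal skills\n"
    "0031 Personal skills\n"
    "Personal skills are defined by reference.\n"
    "Programmes and qualifications with the following main content are classified here:\n"
)
headings, bodies = split_sections(sample)
print(headings)
print(bodies)
```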
