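python - How do I get specific parts of a PDF?
Problem description
I have a PDF file and want to extract several different parts of its text.
For example, suppose I have the following page: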
0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed
to teach fundamental skills in reading, writing and arithmetic. The typical age range of
participants can be used to distinguish between detailed field 0011 ‘Basic programmes and
qualifications’ and this detailed field.
Programmes and qualifications with the following main content are classified here:
Basic remedial programmes for youth or adults
Literacy
Numeracy
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual’s capacity (mental,
social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic
Programmes and qualifications with the following main content are classified here:
I want to get every line that contains a 4-digit number, plus the whole paragraph that follows it, up to this sentence:
Programmes and qualifications with the following main content are classified here:
So the output would be:
first_list = ['0021 Literacy and numeracy', '0031 Personal skills']
second_list = ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field.', 'Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic']
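I tried to do this but could not finish it.
I tried to extract the PDF's text and look for keywords that appear before, or on the same line as, the text I want.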
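import re
import PyPDF2

f = open('f.pdf', 'rb')
# PyPDF2 < 3.0 API; newer releases use PdfReader, .pages and extract_text()
pdf_reader = PyPDF2.PdfFileReader(f)
num_pages = pdf_reader.numPages
count = 0
text = ''
while count < num_pages:
    pageObj = pdf_reader.getPage(count)
    count += 1
    text += pageObj.extractText()
f.close()
text_before = re.findall('Programmes and qualifications with the following main content are classified here', text)
four_digit = re.findall(r'\d\d\d\d', text)

So text_before is the sentence that marks the end of the paragraph I need, and four_digit holds the numbers whose whole line I want.
Any idea how to complete this code?
Note: the 4 digits are at the start of the line.
I should also mention that text_before = re.search('Programmes and qualifications with the following main content are classified here', text) gives me the span (start and end) of that sentence. So I know where to stop collecting text, but how do I find the starting point?
The same goes for four_digit = re.search(r'\d\d\d\d', text): I need the span extended to the end of the line. That would answer my question above.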
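Solution
You can try using the nltk library to split the text into sentences and, for each line that meets the integer condition, select the sentence that follows it. Here is a code snippet: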
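import re
import nltk  # sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded

lines = text.split('\n')
res = [[], []]  # res[0]: heading lines, res[1]: first sentence after each heading
for idx, sent in enumerate(lines):
    ints = re.findall(r'(\d+)', sent)
    # keep lines whose first four characters are one of the digit runs on that line
    if sent[:4] in ints:
        res[0].append(sent)
        res[1].append(nltk.sent_tokenize(' '.join(lines[idx+1:]))[0])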
Output:
[['0021 Literacy and numeracy', '0031 Personal skills'],
['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic.',
'Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.).']]
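The snippet above keeps only the first sentence after each heading, while the question asks for the whole paragraph up to the stop sentence. One possible alternative, sketched here with only the standard re module, matches each line that starts with exactly four digits and then lazily captures everything up to the next line that begins with the stop sentence. The names split_sections, STOP and sample are illustrative and do not come from the original post, and the sketch assumes that the text has already been extracted from the PDF and that every 4-digit heading is eventually followed by the stop sentence.

```python
import re

# The sentence that marks the end of each paragraph (taken from the question).
STOP = "Programmes and qualifications with the following main content are classified here:"


def split_sections(text, stop=STOP):
    """Return (headings, paragraphs) for every line that starts with a 4-digit code."""
    pattern = re.compile(
        r"^(\d{4}\b[^\n]*)\n"                # heading: line starting with exactly four digits
        r"(.*?)"                             # paragraph: as little text as possible ...
        r"(?=^" + re.escape(stop) + r")",    # ... up to a line beginning with the stop sentence
        re.MULTILINE | re.DOTALL,
    )
    headings, paragraphs = [], []
    for match in pattern.finditer(text):
        headings.append(match.group(1).strip())
        # Collapse the hard line breaks left by PDF extraction into single spaces.
        paragraphs.append(" ".join(match.group(2).split()))
    return headings, paragraphs


# Shortened sample in the same shape as the page shown in the question.
sample = """0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed
to teach fundamental skills in reading, writing and arithmetic.
Programmes and qualifications with the following main content are classified here:
Literacy
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual's capacity.
Programmes and qualifications with the following main content are classified here:
"""

first_list, second_list = split_sections(sample)
print(first_list)
print(second_list)
```

Because the heading pattern is anchored with ^ and requires a word boundary after the fourth digit, a line such as "003 Personal skills" is skipped, and a 4-digit code in the middle of a paragraph (like 0011) does not start a new section. Each match object also exposes match.span(), which gives the start and end offsets the question asks about.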