How to split PDF text by numbering

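Problem description

So my problem is not with PDF extraction. Suppose this is the text extracted from a PDF:

(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally refers to another paragraph, namely clause 945(d)

(c) This again is some third paragraph

Now I am trying to create a list of 3 values, one for each paragraph.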

import re
entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""
PDF_SUB_SECTIONS = ["(a) ", "(b) ", "(c) ", "(d) ", "(e) ", "(f) ", "(g) "]
regexPattern = '|'.join(map(re.escape,PDF_SUB_SECTIONS))
glSubSections = re.split(regexPattern, entire_text)

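What I expected is ['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text', 'This again is is some third paragraph']

What I get is ['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945', 'somewhere within this text', 'This again is is some third paragraph']

More info: 1) Clause 945(d): in references like this there will never be a gap between the "945" (or any text) and the "(d)". 2) I am using PyPDF2 to extract the text above.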
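To see where the extra split comes from, here is a minimal sketch that lists every place the unanchored pattern matches. The shortened `text` and `labels` values are illustrative stand-ins for the question's data, not the original variables.

```python
import re

# Shortened, illustrative version of paragraph (b) from the question.
text = "(b) refers to clause 945(d) somewhere within this text"
labels = ["(a) ", "(b) ", "(c) ", "(d) "]
pattern = "|".join(map(re.escape, labels))

# Each hit is (start offset, matched text). Nothing in the pattern says a label
# must begin a paragraph, so the "(d) " inside "945(d) " is matched as well.
hits = [(m.start(), m.group()) for m in re.finditer(pattern, text)]
print(hits)
```

The second hit sits directly after the digits of "945", which is why `re.split` cuts the paragraph in two at that point.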

Tags: python, regex, pdf

Solution


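There are several ways to do this with regular expressions, but real input usually gets more complicated than this sample, so a regex may not be the best approach. For example, you could use an expression like the following: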

^(?:\([^)]+\))\s*(.*)
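As a reading aid, the same pattern can be spelled out piece by piece with `re.VERBOSE`. This is only a sketch: the name `LABELLED_LINE` and the short `sample` string are made up for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can carry a comment.
# re.MULTILINE makes ^ match at the start of every line, not only at the start
# of the whole text.
LABELLED_LINE = re.compile(
    r"""
    ^                 # start of a line
    (?:\([^)]+\))     # a parenthesised label such as (a); non-capturing
    \s*               # any whitespace after the label
    (.*)              # group 1: the rest of the line
    """,
    re.VERBOSE | re.MULTILINE,
)

sample = "(a) first paragraph\n\n(b) see clause 945(d) for details"
paragraphs = LABELLED_LINE.findall(sample)
print(paragraphs)
```

Because the label has to sit at the start of a line, the "(d)" in "945(d)" is never treated as a label. Note that `.` does not match a newline by default, so group 1 stops at the end of the line.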

Test with re.findall

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

print(re.findall(regex, test_str, re.MULTILINE))

Output

['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)', 'This again is is some third paragraph']
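If you would rather keep the `re.split` approach from the question, one possible sketch is to accept a label only when nothing but whitespace (or the start of the text) precedes it. This leans on the asker's statement that a reference such as "945(d)" never has a gap before the "(d)". The names `LABEL` and `sub_sections` are illustrative, and a reference written with a space before it (for example "see (d) above") would still be split.

```python
import re

entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""

# Assumption: a real sub-section label is a single letter a-g in parentheses,
# at the start of the text or right after whitespace. "945(d)" has a digit
# directly before "(", so the lookbehind (?<!\S) rejects it.
LABEL = re.compile(r"(?<!\S)\([a-g]\)\s+")

# re.split leaves an empty string before the first label and keeps trailing
# blank lines, so strip each piece and drop the empty ones.
sub_sections = [part.strip() for part in LABEL.split(entire_text) if part.strip()]

for section in sub_sections:
    print(section)
```

Unlike the `(.*)` capture above, this keeps a paragraph together even when it spans several lines, since the text between two labels is returned as one piece.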

Test with re.sub

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

subst = "\\1"

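# Pass count and flags by keyword; passing them positionally is deprecated in recent Python versions.
print(re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE))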

Test with re.finditer

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

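The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it, and in this link you can watch step by step how it matches against some sample inputs.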

RegEx Circuit

jex.im visualizes regular expressions.

