Using pyparsing, how do I group the expressions matched by OneOrMore(expr1|expr2)?

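Question

My website lets users submit a string containing multiple questions, each followed by multiple-choice answers. An enforced style guide makes it possible to parse the submission with regular expressions; the questions and their MCQ options are then stored in a database and served later in randomized practice exams.

I want to move to pyparsing because the regular expressions are not immediately readable and I feel somewhat locked in by them. I would like to be able to extend my question parser easily, and doing that with regular expressions feels very cumbersome.

The user input has the form: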

quiz = [<question-answer>, <q-start>]
<question-answer> = <question> + <answer>
<question> = [<q-text>, \n] ?!= <a-start>
<answer> = [<answer>, <a-start>]  ?!= <q-start>
<q-start> = <nums> + "." | ")"
<a-start> = <alphas> + "." | ")" 

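The long user input string is split into question-answer blocks, each delimited by the q-start of the next block. A question is all the text between a q-start and the first a-start. The answers are the list of all text spans between an a-start and either the next a-start or the following q-start.

Sample text: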

3. A lesion that affects N. Solitarius will result in the patient having problems related to:
a. taste and blood pressure regulation
c. swallowing and respiration
b. smell and taste
d. voice quality and taste
e. whistling and chewing

4. A patient comes to your office complaining of weakness on the right side of their body. You notice that their head is
turned slightly to the left and their right shoulder droops. When asked to protrude their tongue, it deviates to the right. Eye
movements and eye-related reflexes appear to be normal. The lesion most likely is located in the:
c. left ventral medulla
a. left ventral midbrain
b. right dorsal medulla
d. left ventral pons
e. right ventral pons

5. A colleague {...}

The regular expressions I have been using:

# matches a question-answer block. Matching q-start until an empty line.
regex1 = r"(^[\t ]*[0-9]+[\)\.][\t ]+[\s\S]*?(?=^[\n\r]))" 

# Within question-answer block, matches everything that does not start with a-start
regex6 = r"(^(?!(^[a-fA-F][\)\.]\s+[\s\S]+)).*)"

# Matches all text between a-start and the following a-start, or until the question-answer substring block ends.
regex5 = r"(^[a-fA-F][\)\.]\s+[\s\S]+)"       

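Then a bit of Python and re trims the question numbers and MCQ letters, joins any lines that were broken mid-sentence, and appends the MCQ options to a list.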
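The post does not show that post-processing step. As a rough illustration only, here is a minimal sketch of what it might look like with the standard `re` module; the names `Q_START`, `A_START` and `split_block` are hypothetical, and the sketch assumes each block has already been isolated (for example by regex1).

```python
import re

# Hypothetical markers, loosely modelled on the regexes above.
Q_START = re.compile(r"^[ \t]*\d+[.)][ \t]+")        # e.g. "3. " or "4) "
A_START = re.compile(r"^[ \t]*[a-fA-F][.)][ \t]+")   # e.g. "a. " or "B) "

def split_block(block):
    """Split one question-answer block into (question, [answers])."""
    question_lines, answers = [], []
    for line in block.strip().splitlines():
        if not answers and Q_START.match(line):
            # First line of the question: drop the number marker.
            question_lines.append(Q_START.sub("", line).strip())
        elif A_START.match(line):
            # A new answer: drop the letter marker.
            answers.append(A_START.sub("", line).strip())
        elif answers:
            # Continuation of a wrapped answer line.
            answers[-1] += " " + line.strip()
        else:
            # Continuation of a wrapped question line.
            question_lines.append(line.strip())
    return " ".join(question_lines), answers

block = (
    "4. A patient comes to your office. You notice that their head is\n"
    "turned slightly to the left.\n"
    "c. left ventral medulla\n"
    "a. left ventral midbrain\n"
)
question_text, answer_list = split_block(block)
print(question_text)
print(answer_list)
```

A wrapped line that happens to begin with something like "a. " would be misread as a new answer, which is one of the fragilities that motivates moving to a grammar.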

In pyparsing I tried this:

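from pyparsing import (Char, Group, LineEnd, LineStart, OneOrMore, Optional,
                       SkipTo, StringEnd, Suppress, Word, alphas, nums, oneOf)

EOL = Suppress(LineEnd())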
delim = oneOf(". )")
q_start = LineStart() + Word(nums) + delim
a_start = LineStart() + Char(alphas) + delim

question = Optional(EOL) + Group(Suppress(q_start) + OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=a_start)).setResultsName('question', listAllMatches=True)

answer = Optional(EOL) + Group(Suppress(a_start) + OneOrMore( SkipTo(LineEnd()) + EOL, stopOn=(a_start | q_start | StringEnd()))).setResultsName('answer', listAllMatches=True)



qi = Group(OneOrMore(question|answer)).setResultsName('group', listAllMatches=True)
t = qi.parseString(test)
print(t.dump())

Result:

[[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
- group: [[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
  [0]:
    [['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
    - answer: [['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
      [0]:
        ['superior and inferior colliculi']
      [1]:
        ['reticular formation']
      [2]:
        ['internal arcuate fibers']
      [3]:
        ['cerebellar peduncles']
      [4]:
        ['pyramids']
      [5]:
        ['loss of MVP ipsilaterally below the level of the lesion']
      [6]:
        ['hypertonicity of the contralateral limbs']
      [7]:
        ['loss of pain and temperature contralaterally below the level of the lesion']
      [8]:
        ['loss of MVP contralaterally above the level of the lesion']
      [9]:
        ['loss of pain and temperature ipsilaterally above the level of the lesion']
    - question: [['The tectum of the midbrain comprises the:'], ['Damage to the dorsal columns on one side of the spinal cord would results in:']]
      [0]:
        ['The tectum of the midbrain comprises the:']
      [1]:
        ['Damage to the dorsal columns on one side of the spinal cord would results in:']

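This does match the questions and answers, and it correctly handles the line breaks that may interrupt a question or an answer. My problem is that the results are not grouped the way I expected. I was expecting something like group[0] = question, answer[1:4] group[2] = question, answer[1:4]

Does anyone have a suggestion?

Thanks!

Tags: python, regex, pyparsing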

Answer


I think you are on the right track. I took a separate pass at your parser and came up with very similar constructs, with only a few differences.

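# Reuses EOL, q_start, a_start and test from the question; Combine must also be imported from pyparsing.
question = Combine(q_start.suppress() + SkipTo(EOL + a_start))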
answer = Combine(a_start.suppress() + SkipTo(EOL + (a_start | q_start | StringEnd())))
q_a = Group(question("question") + answer[1, ...]("answers"))

for t in q_a[...].parseString(test):
    print(t.dump())

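The biggest difference is that the expression I used to parse your text does not just do OneOrMore(question | answer); instead it defines a Group(question + OneOrMore(answer)). This creates one group for each question and its associated answers. In your parser, using listAllMatches just creates one results name for all the questions and another for all the answers, but loses every association between them. By creating a "question + one or more answers" group, those associations are preserved.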
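To make the grouping idea concrete, here is a small self-contained sketch. It is not the expressions from this answer: to stay independent of the question's definitions it assumes every question and every answer fits on a single line, and it swaps in hypothetical `Regex`-based `question` and `answer` expressions plus a `strip_marker` helper. Only the `Group(question + OneOrMore(answer))` structure is the point.

```python
from pyparsing import Group, OneOrMore, Regex

sample = """\
3. A lesion that affects N. Solitarius will result in problems related to:
a. taste and blood pressure regulation
b. smell and taste

4. The lesion most likely is located in the:
a. left ventral midbrain
b. right dorsal medulla
"""

def strip_marker(tokens):
    # Drop the leading "3." or "a." marker and keep the rest of the line.
    return tokens[0].split(None, 1)[1].strip()

# Simplified stand-ins (assumption): one line per question and per answer.
question = Regex(r"\d+[.)][ \t]+[^\n]+").setParseAction(strip_marker)
answer = Regex(r"[a-zA-Z][.)][ \t]+[^\n]+").setParseAction(strip_marker)

# One Group per question, holding the question and the list of its answers.
q_a = Group(question("question") + Group(OneOrMore(answer))("answers"))
quiz = OneOrMore(q_a)

result = quiz.parseString(sample)
for item in result:
    print(item.question, "->", list(item.answers))
```

Because each `q_a` match is wrapped in its own `Group`, the results names `question` and `answers` are scoped to that group, so each question stays attached to its own answers.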

If you want to remove the '\n's, you can do that more easily with a parse action than with the EOL business.
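As a hedged illustration of that suggestion, the sketch below attaches a parse action that collapses line breaks inside a matched question. The `Regex` pattern and the `clean_text` helper are assumptions made for this example, not the expressions used above; the pattern lets a question wrap across lines until a line that starts with an answer marker.

```python
from pyparsing import Regex

def clean_text(tokens):
    # Drop the leading number marker, then collapse newlines and
    # runs of whitespace into single spaces.
    _marker, text = tokens[0].split(None, 1)
    return " ".join(text.split())

# Hypothetical expression: a numbered question whose text may wrap across
# lines, ending just before a line that begins with "a. ", "b) " and so on.
question = Regex(
    r"\d+[.)][ \t]+(?:(?!\n[ \t]*[a-zA-Z][.)][ \t])[\s\S])+"
).setParseAction(clean_text)

text = (
    "4. A patient comes to your office.\n"
    "Their head is turned slightly\n"
    "to the left. The lesion is in the:\n"
    "a. left ventral midbrain\n"
)
parsed = question.parseString(text)
print(parsed[0])
```

The same kind of function could be added to another expression with `addParseAction`, which keeps the newline cleanup out of the grammar itself.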

