Regular expression for MCQ-type strings

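Question

How can I extract multiple-choice questions and their options from a text document? Each question starts with a number followed by a dot. A question can span several lines and may or may not end with a period or question mark. I want to build a dictionary that maps each question number to the corresponding question and its options. I am using Python for this.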

17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m

Tags: python, regex, file-io

Solution


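Regex: \d+\.([^(]+)

It matches the question number (one or more digits) followed by a dot.

It then captures everything up to, but not including, the next ( character, which marks the start of the answers.

If you are not sure it is really that simple, you can check the pattern in any regex tester.

Python code: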

import re # Imports the standard regex module

text_doc = """
17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m
"""

question_getter = re.compile(r'\d+\.([^(]+)')  # raw string, same pattern as above

print(question_getter.findall(text_doc))
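
One caveat with this pattern: \d+\. is not tied to the start of a line, so a decimal inside an option (say 0.1 m) also looks like a question number. Below is a minimal sketch of a stricter variant. It assumes every question number sits on a line of its own, as in the sample above. The sample text and the names loose and anchored are made up for illustration.

```python
import re

# Hypothetical sample: the options contain decimals such as "0.1".
sample = """17.
The wavelength of the sound is
(a)
0.1 m
(b)
0.01 m
"""

# Pattern from the answer: a number, a dot, then everything up to the next "(".
loose = re.compile(r'\d+\.([^(]+)')

# Stricter sketch: the number and dot must fill a whole line (re.MULTILINE
# makes ^ match at the start of every line). Group 1 is the question number.
anchored = re.compile(r'^(\d+)\.[ \t]*\n([^(]+)', re.MULTILINE)

loose_hits = loose.findall(sample)        # also picks up "0." in the options
anchored_hits = anchored.findall(sample)  # only the real question

print(loose_hits)
print(anchored_hits)
```

The anchored version also returns the question number, which is handy if the dictionary should be keyed by number. It still stops at the first ( it meets, so a question that contains parentheses would be cut short.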

Edit: since a lot of people here are parsing things, I figured I would parse things too.

Regex to get the possible answers: \([a-zA-Z]+\)\n(.+)
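
Note that . does not match a newline by default, so (.+) keeps only the first line of an answer that wraps onto a second line, as option (b) of question 18 does. A rough sketch of a multi-line variant follows. It assumes an answer ends at a blank line, at a line starting with (, or at a bare question number such as 19. on its own line. The names single_line and multi_line are illustrative only.

```python
import re

sample = """(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
None of these
19.
The wavelength of infrasonics in air is of the order of
"""

# Original pattern: captures one line after the "(x)" label.
single_line = re.compile(r'\([a-zA-Z]+\)\n(.+)')

# Sketch: keep consuming non-empty lines until one starts with "(" or is a
# bare question number like "19." (an assumption about the layout).
multi_line = re.compile(
    r'\([a-zA-Z]+\)\n((?:(?!\(|\d+\.[ \t]*$).+\n?)+)',
    re.MULTILINE,
)

first_lines = single_line.findall(sample)
# Collapse the line breaks inside each captured option into single spaces.
full_options = [' '.join(option.split()) for option in multi_line.findall(sample)]

print(first_lines)
print(full_options)
```

An option whose continuation line happens to begin with ( or consists only of a number and a dot would still be cut off, so treat this as a starting point rather than a general parser.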

Updated Python:

import re # Imports the standard regex module


text_doc = """
17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m
"""

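question_getter = re.compile(r'\d+\.([^(]+)')
answer_getter = re.compile(r'\([a-zA-Z]+\)\n(.+)')


# Pair each question with the answers that appear between it and the next
# question. Calling answer_getter.findall(text_doc) once per question would
# attach every answer in the document to each question.
# Note: (.+) keeps only the first line of a multi-line answer.
matches = list(question_getter.finditer(text_doc))
parsed = {}
for i, match in enumerate(matches):
    # This question's answers end where the next question starts
    end = matches[i + 1].start() if i + 1 < len(matches) else len(text_doc)
    # Key: question text without the surrounding newlines
    parsed[match.group(1).strip()] = answer_getter.findall(text_doc, match.end(), end)

print(parsed)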
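
The question asked for a dictionary keyed by question number, so here is one possible sketch that puts the pieces together. It is not the only way to do it, and it rests on a few assumptions about the layout: question numbers stand alone on a line, option labels like (a) stand alone on a line, and stray text between questions is separated from the last option by a blank line. parse_mcq, squash, QUESTION_START and OPTION are names invented for this example.

```python
import re

# A question starts with a line holding only a number and a dot, e.g. "17."
QUESTION_START = re.compile(r'^(\d+)\.[ \t]*$', re.MULTILINE)

# An option is a "(x)" label line followed by one or more non-empty lines
# that neither start with "(" nor look like a bare question number.
OPTION = re.compile(
    r'^\(([a-zA-Z]+)\)[ \t]*\n((?:(?!\(|\d+\.[ \t]*$).+\n?)+)',
    re.MULTILINE,
)


def squash(text):
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return ' '.join(text.split())


def parse_mcq(text):
    """Return {number: {'question': str, 'options': {label: str}}}."""
    starts = list(QUESTION_START.finditer(text))
    parsed = {}
    for i, start in enumerate(starts):
        # A question's block runs up to the next question number (or the end).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        block = text[start.end():end]
        first_option = OPTION.search(block)
        # Everything before the first option is the question text.
        stem = block[:first_option.start()] if first_option else block
        parsed[int(start.group(1))] = {
            'question': squash(stem),
            'options': {label: squash(body) for label, body in OPTION.findall(block)},
        }
    return parsed


sample = """17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
"""

result = parse_mcq(sample)
print(result)
```

Splitting the text into per-question blocks first keeps each question's options separate, and the blank-line rule is what drops the stray text after question 17. If stray text directly follows an option without a blank line, it would be absorbed into that option.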

