Using regular expressions to match names, dialogue, and actions in a transcript

Question

Given a dialogue string like the one below, I need to find the sentence that belongs to each speaker.

# Triple quotes are needed because the string spans several lines.
text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

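For the dialogue above, I would like to return tuples with three elements:

  1. The name of the person

  2. The sentence in lower case, and

  3. The sentences within brackets

Something like this: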

('CHRIS', 'Hello, how are you...', None)

('PETER', 'Great, you?', None)

('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')

('CHRIS', 'Are you ok?', None)

etc...

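I am trying to do this with a regular expression. So far I can extract the speakers' names with the code below, but I am struggling to identify the sentence between two speakers.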

actors = re.findall(r'\w+(?=\s*:[^/])',text)

Tags: python, regex, string

Answer

You can do this with re.findall:

>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
 ('PETER', ' Great, you? ', ''),
 ('PAM',
  ' He is resting.',
  '[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
 ('CHRIS', ' Are you ok?', '')]

You will have to work out how to remove the square brackets yourself; the regex cannot do that while also trying to match everything else.
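One possible way to finish the job is a second pass over the re.findall result. The sketch below is not part of the original answer: the helper name parse_transcript is hypothetical, and it assumes that an empty third group should become None and that multiple bracketed actions should be joined with '. ', as in the question's example. Note that it keeps the sentence's own trailing punctuation.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'

def parse_transcript(text):
    """Hypothetical helper: return (name, sentence, actions) tuples."""
    rows = []
    for name, sentence, actions in re.findall(PATTERN, text):
        # Extract the contents of every [...] block in the third group.
        parts = re.findall(r'\[([^\]]*)\]', actions)
        # Assumption: no bracketed action means None, as in the question.
        joined = '. '.join(parts) if parts else None
        rows.append((name, sentence.strip(), joined))
    return rows

text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

for row in parse_transcript(text):
    print(row)
```

The stripping is done in Python rather than in the pattern because a capture group always returns a contiguous slice of the input, so it cannot skip the brackets between two actions.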

Regex breakdown

\b              # Word boundary
(\S+)           # First capture group: one or more non-whitespace characters
:               # Colon
(               # Second capture group
    [^          # Match anything that is not...
        :       #     a colon
        \[\]    #     or a square bracket
    ]+?         # One or more, non-greedy
)
\n?             # Optional newline
(               # Third capture group
    \[          # Literal opening bracket
    [^:]+?      # Similar to above: exclude colons, non-greedy
    \]          # Literal closing bracket
    \n?         # Optional newline
)?              # Third capture group is optional
(?=             # Lookahead for...
    \b          #     a word boundary, followed by
    \S+         #     one or more non-whitespace characters, and
    :           #     a colon
    |           # Or,
    $           # end of string
)
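The breakdown above is laid out for reading, not for pasting. If you want the comments to live next to the pattern in real code, a sketch using re.VERBOSE might look like the following. The name VERBOSE_PATTERN is only illustrative. Each character class is kept on a single line because re.VERBOSE does not ignore whitespace or '#' inside a character class.

```python
import re

# Same pattern as the one-liner, written with re.VERBOSE so that whitespace
# and comments outside character classes are ignored.
VERBOSE_PATTERN = re.compile(r'''
    \b
    (\S+)               # group 1: speaker name (run of non-whitespace)
    :
    ([^:\[\]]+?)        # group 2: dialogue, lazy, no colons or square brackets
    \n?
    (\[[^:]+?\]\n?)?    # group 3: optional bracketed action block(s)
    (?=\b\S+:|$)        # stop before the next "NAME:" or at end of string
''', re.VERBOSE)

COMPACT = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'

sample = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

# Both forms should yield identical matches on the sample dialogue.
same = VERBOSE_PATTERN.findall(sample) == re.findall(COMPACT, sample)
print(same)
```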
