Extracting email addresses from the academic curly-brace format

Problem description

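I have a file in which every line contains a string representing one or more email addresses. Multiple addresses can be grouped inside curly braces, like this:

{name.surname, name2.surname2}@something.edu

This means that the addresses name.surname@something.edu and name2.surname2@something.edu are both valid (this format is commonly used in scientific papers).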

A single line can also contain more than one brace group. Example:

{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com

The result should be:

a.b@uni.somewhere 
c.d@uni.somewhere 
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com

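Any suggestions on how to parse this format to extract all the email addresses? I am trying to do it with regular expressions, but I am currently struggling.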

Tags: python, parsing, email-address

Solution


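Pyparsing is a PEG parser that gives you an embedded DSL for building parsers that can read expressions like this one. The resulting code is more readable (and maintainable) than a regular expression, and flexible enough to accommodate afterthoughts (wait, can an email be quoted?).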
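
For comparison, here is roughly what a regex-only approach might look like, using just the standard library `re` module. This is a sketch, not part of the pyparsing solution: the names `EMAIL_RE` and `extract_emails_regex` are made up for illustration, and it assumes there are no nested braces and no quoted strings. An address such as "{a, b}"@domain.com would not be handled correctly, which is the kind of afterthought that is awkward to retrofit into a pattern.

```python
import re

# Matches either a compressed group "{a, b}@domain" (groups "names" and
# "domain") or a plain "local@domain" address (group "plain").
# Assumption: no nested braces and no quoted local parts.
EMAIL_RE = re.compile(
    r"""
    \{(?P<names>[^{}]*)\}@(?P<domain>[^\s,{}@]+)   # compressed form
    |
    (?P<plain>[^\s,{}@]+@[^\s,{}@]+)               # plain address
    """,
    re.VERBOSE,
)


def extract_emails_regex(line):
    """Return every address found in one line, expanding brace groups."""
    emails = []
    for match in EMAIL_RE.finditer(line):
        if match.group("plain"):
            emails.append(match.group("plain"))
            continue
        domain = match.group("domain")
        # Split the brace contents on commas and drop surrounding spaces.
        for name in match.group("names").split(","):
            name = name.strip()
            if name:
                emails.append(f"{name}@{domain}")
    return emails


if __name__ == "__main__":
    print(extract_emails_regex("{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com"))
```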

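pyparsing uses the '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how it all comes together below: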

import pyparsing as pp

LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')

# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE 
                    + pp.Group(pp.delimitedList(email_part))('names')
                    + RBRACE
                    + '@' 
                    + email_part('trailing'))

# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
    return ["{}@{}".format(name, t.trailing) for name in t.names]
compressed_email.addParseAction(expand_compressed_email)

# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '@' + email_part)

# the complete list parser looks for a comma-delimited list of compressed 
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)

pyparsing parsers come with a runTests method for testing your parser against various test strings:

tests = """\
    # original test string
    {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com

    # a tricky email containing a quoted string
    {x.y, z.k}@edu.com, "{a, b}"@domain.com

    # just a plain email
    plain_old_bob@uni.elsewhere

    # mixed list of plain and compressed emails
    {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
"""

email_list_parser.runTests(tests)

Prints:

# original test string
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com']

# a tricky email containing a quoted string
{x.y, z.k}@edu.com, "{a, b}"@domain.com
['x.y@edu.com', 'z.k@edu.com', '"{a, b}"@domain.com']

# just a plain email
plain_old_bob@uni.elsewhere
['plain_old_bob@uni.elsewhere']

# mixed list of plain and compressed emails
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com', 'plain_old_bob@uni.elsewhere']
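
runTests is convenient for experimenting, but to collect the addresses inside a program you would more likely call parseString on each line. The sketch below repeats the parser definition only so that it runs on its own. The helper name `extract_emails` and the sample lines are hypothetical, and the file-reading step from the question is replaced by a plain list. It uses the same camelCase method names as the code above; pyparsing 3 also offers snake_case equivalents such as parse_string.

```python
import pyparsing as pp

# Same grammar as above, repeated so this block is self-contained.
LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=",{}@")

compressed_email = (
    LBRACE
    + pp.Group(pp.delimitedList(email_part))("names")
    + RBRACE
    + "@"
    + email_part("trailing")
)
# Expand "{a, b}@domain" into ["a@domain", "b@domain"] at parse time.
compressed_email.addParseAction(
    lambda t: ["{}@{}".format(name, t.trailing) for name in t.names]
)
plain_email = pp.Combine(email_part + "@" + email_part)
email_list_parser = pp.delimitedList(compressed_email | plain_email)


def extract_emails(line):
    """Parse one line and return its addresses as a plain Python list.

    parseAll=True makes the parser raise pp.ParseException if anything
    on the line is left unparsed, instead of silently ignoring it.
    """
    return email_list_parser.parseString(line, parseAll=True).asList()


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the lines of the input file.
    sample_lines = [
        "{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com",
        "plain_old_bob@uni.elsewhere",
    ]
    for line in sample_lines:
        print(extract_emails(line))
```

Passing parseAll=True is a design choice: a malformed line raises an exception rather than yielding a partial list, so the caller can decide whether to skip or report it.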

Disclosure: I am the author of pyparsing.

