How to match the string after a token (using regular expressions)?

Problem description

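I am trying to extract the email addresses that follow the token "eaddr:". The expression should match every occurrence in a row entry, taking the first contiguous string without spaces after each token. I tried: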

SELECT regexp_substr(tab.entry, 'eaddr:\(.*?\)',1,1,'e',1)
from (
select 'String, email@domain.com' as entry
union
select 'eaddr:mail1@domain.com eaddr:mail2@domain.com sometext     eaddr:   mail3@domain.com some4354% text' as entry
union
select 'eaddr:mail5@domain.org' as entry
union
select 'Just a string' as entry
) tab
;

But it does not work. The correct result set is:

null
mail1@domain.com mail2@domain.com mail3@domain.com
mail5@domain.org
null
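
One plausible reason the attempt returns nothing useful is the lazy group: with nothing after it in the pattern, `.*?` is satisfied by zero characters. The sketch below illustrates this with Python's `re` module, not Snowflake's regex engine, whose escaping and lazy-quantifier behavior may differ. The alternative pattern `eaddr:\s*(\S+)` is an assumption about what the question intends (skip optional whitespace, then capture non-space characters).

```python
import re

entry = ("eaddr:mail1@domain.com eaddr:mail2@domain.com sometext     "
         "eaddr:   mail3@domain.com some4354% text")

# A lazy group at the end of the pattern stops as early as possible,
# so each capture is the empty string.
lazy = re.findall(r"eaddr:(.*?)", entry)

# Skip optional whitespace after the token, then capture the run of
# non-space characters that follows it.
after_token = re.findall(r"eaddr:\s*(\S+)", entry)

print(lazy)
print(after_token)
```

The second pattern yields one capture per token, which can then be joined with spaces to form a row of the expected result set.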

Tags: sql, regex, snowflake-cloud-data-platform

Solution


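First, I suggest using a better regular expression to validate the email format. Inspired by Gordon's SPLIT_TO_TABLE + LATERAL approach, I wrote a few sample queries that extract these emails from the entries.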

If you want all the emails together in a single row, you can use this:

with t as (
select 'String, email@domain.com' as entry
union
select 'eaddr:mail1@domain.com eaddr:mail2@domain.com sometext     eaddr:   mail3@domain.com some4354% text' as entry
union
select 'eaddr:mail5@domain.org' as entry
union
select 'Just a string' as entry
) 
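Select LISTAGG( regexp_substr( s.value, '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) emails from t,
lateral SPLIT_TO_TABLE(t.entry, 'eaddr:') s
-- index 1 is the text before the first 'eaddr:' token, so skip it
-- (seq only identifies the input row and would drop a whole entry)
where s.index > 1;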

+---------------------------------------------------------------------+
|                               EMAILS                                |
+---------------------------------------------------------------------+
| mail1@domain.com mail2@domain.com mail3@domain.com mail5@domain.org |
+---------------------------------------------------------------------+
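
The split-then-match idea behind this query can be sketched outside Snowflake. The Python below is only an illustration under stated assumptions: `emails_after_token` is a hypothetical helper, and it reuses the same email pattern (with a single backslash, since Python raw strings need no SQL-style escaping). It splits each entry on the token, discards the piece before the first token, and takes the first email found in each remaining piece.

```python
import re

# Same email pattern as in the query above.
EMAIL = re.compile(r"[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,64}")

def emails_after_token(entry, token="eaddr:"):
    """Return the first email found in each piece that follows the token."""
    pieces = entry.split(token)
    found = []
    for piece in pieces[1:]:  # pieces[0] precedes the first token
        match = EMAIL.search(piece)
        if match:
            found.append(match.group(0))
    return found

entries = [
    "String, email@domain.com",
    "eaddr:mail1@domain.com eaddr:mail2@domain.com sometext     "
    "eaddr:   mail3@domain.com some4354% text",
    "eaddr:mail5@domain.org",
    "Just a string",
]

all_emails = " ".join(
    email for entry in entries for email in emails_after_token(entry)
)
print(all_emails)
```

Because the piece before the first token is dropped, the address in "String, email@domain.com" is ignored even though it looks like a valid email.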

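To get the exact result from the question, you can use the following query. Note that it splits on spaces and uses s.seq = 1 to blank out one input row, so it relies on which sample row receives that sequence number: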

with t as (
select 'String, email@domain.com' as entry
union
select 'eaddr:mail1@domain.com eaddr:mail2@domain.com sometext     eaddr:   mail3@domain.com some4354% text' as entry
union
select 'eaddr:mail5@domain.org' as entry
union
select 'Just a string' as entry
) 
select emails from 
(
Select t.entry, s.*, 
LISTAGG( regexp_substr( IFF(s.seq = 1, '', s.value ), '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) 
OVER ( PARTITION BY s.seq ) emails
from t,
lateral SPLIT_TO_TABLE(t.entry, ' ') s ) 
where index = 1;

+----------------------------------------------------+
|                       EMAILS                       |
+----------------------------------------------------+
| NULL                                               |
| mail1@domain.com mail2@domain.com mail3@domain.com |
| NULL                                               |
| mail5@domain.org                                   |
+----------------------------------------------------+
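
For comparison, the per-row result can also be expressed by requiring the token directly in front of each address. This is a Python sketch rather than Snowflake SQL; `emails_per_entry` and the combined pattern are hypothetical names introduced here, and the pattern assumes only whitespace may separate the token from the address.

```python
import re

# Token, optional whitespace, then the email pattern as a capture group.
TOKEN_EMAIL = re.compile(
    r"eaddr:\s*([A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,64})"
)

def emails_per_entry(entries):
    """Return one space-joined string per entry, or None if nothing matched."""
    results = []
    for entry in entries:
        found = TOKEN_EMAIL.findall(entry)
        results.append(" ".join(found) if found else None)
    return results

entries = [
    "String, email@domain.com",
    "eaddr:mail1@domain.com eaddr:mail2@domain.com sometext     "
    "eaddr:   mail3@domain.com some4354% text",
    "eaddr:mail5@domain.org",
    "Just a string",
]

rows = emails_per_entry(entries)
for row in rows:
    print(row)
```

Tying the match to the token means the outcome does not depend on the order in which the entries arrive.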
