Extracting substrings with a Python regular expression

Problem description

I want a regular expression that matches any text between two strings:

   sample_string= "Message ID: SM9MatRNTnMAYaylR0QgOH///qUUveBCbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    john.s@xy.com:  
     [EVENT] 347376954900491 (john.s@xy.com) created room
    (roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
    READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
    roomType=PRIVATE conversationScope=internal owningCompany=X Y
    Bank)
    
    Message ID: nsabNaqeXfuEj9mBEhvS0n///qUUveAhbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    john.s@xy.comsays  
     [EVENT] 347376954900491 (john.s@xy.com) invited 347376954900486
    (kerren.n@xy.com) to room (CSTest|john s|16091907435583)
    
    Message ID: Nu/EYTkTQ5qdbqzZ0Rig8n///qUUvQ42dA==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    john.s@xy.comsays  
    
    Catchyou later
    
      
    
    Message ID: dy2yaByqhm+n88Gd3VQOhH///qUUrz8odA==  
    2021-07-10T20:48:23.997Z kerren n (X Y Bank) -
    nancy.n@xy.comsays  
    
    KeywordContent_ Cricket is a bat-and-ball game played between two teams of
    eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
    with a wicket at each end, each comprising two bails balanced on three stumps.
    The batting side scores runs by striking the ball bowled at the wicket with
    the bat, while the bowling and fielding side tries to prevent this and dismiss
    each player (so they are "out").
    
      
    
    * * *
    
    Generated by Content Export Service | Stream Type: SymphonyPost |
    Stream ID: ZZo5pRRPFC18uzlonFjya3///qUUveBHdA== | Room Type: Private |
    Conversation Scope: internal | Owning Company: X Y Bank | File
    Generated Date: 2021-07-10T20:48:23.997Z | Content Start Date:
    2021-07-10T20:48:23.997Z | Content Stop Date: 2021-07-10T20:48:23.997Z  
    
    * * *
    
    *** (780787) Disclaimer: 
    (incorporated in paris with Ref. No. ZC18, is authorised by Prudential Regulation
    Authority (PRA) and regulated by Financial Conduct Authority and PRA. oyp and
    its affiliates (We) monitor this confidential message meant for your
    information only. We make no recommendation or offer. You should get
    independent advice. We accept no liability for loss caused hereby. See market
    commentary disclaimers (
    http://wholesalebanking.com/en/utility/Pages/d-mkt.aspx ),
    Dodd-Frank and EMIR disclosures (
    http://wholesalebanking.com/en/capabilities/financialmarkets/Pages/default.aspx
    ) "

In this example, I want to extract everything that follows the email ID, up to the next Message ID: keyword, so the expected output is:

extracted_list =[':  
 [EVENT] 347376954900491 (john.s@xy.com) created room
(roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
roomType=PRIVATE conversationScope=internal owningCompany=X Y
Bank)','says  
 [EVENT] 347376954900491 (john.s@xy.com) invited 347376954900486
(kerren.n@xy.com) to room (CSTest|john s|16091907435583)','says Catchyou later','says 
KeywordContent_ Cricket is a bat-and-ball game played between two teams of
eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with
the bat, while the bowling and fielding side tries to prevent this and dismiss
each player (so they are "out").']

Note: everything after the final *** is not part of the text.

What I have tried so far:

import re

text = re.findall(r'\S+@\S+\s+(.*)Message ID', sample_string)
print(text)
# output: []

Tags: python-3.x, regex

Solution


You can use

(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)

See the regex demo.
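As a minimal sketch of how the pattern might be applied with re.findall, the snippet below uses a shortened stand-in for sample_string (an assumption made for brevity, not the full text from the question). The names pattern and cleaned are illustrative only.

```python
import re

# Shortened stand-in for the sample_string in the question (assumption for brevity).
sample_string = """Message ID: SM9M==
2021-07-10T20:48:23.997Z john s (X Y Bank) -
john.s@xy.com:
 [EVENT] 347376954900491 (john.s@xy.com) created room
(roomName='CSTest' owningCompany=X Y
Bank)

Message ID: Nu/E==
2021-07-10T20:48:23.997Z john s (X Y Bank) -
john.s@xy.comsays

Catchyou later


* * *

Generated by Content Export Service
"""

# The pattern from the answer; group 1 holds the text after the email address.
pattern = re.compile(
    r"(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)"
)

# With exactly one capturing group, findall returns a list of group-1 strings.
extracted_list = pattern.findall(sample_string)
for item in extracted_list:
    print(repr(item))

# Optional: collapse runs of whitespace, e.g. 'says\n\nCatchyou later'
# becomes 'says Catchyou later'.
cleaned = [" ".join(item.split()) for item in extracted_list]
print(cleaned[1])
```

The whitespace collapsing at the end is optional; it is included only because part of the expected output in the question shows the text on a single line.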

Details

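  • (?s) - the inline modifier, equivalent to re.DOTALL, which lets . match across line breaks
  • \S+ - one or more non-whitespace characters (can be replaced with [^\s@]+)
  • @ - a literal @ character
  • \S+? - one or more non-whitespace characters, as few as possible
  • ((?:says?|:)?\s.*?) - Group 1: an optional says, say, or :, then one whitespace character, then zero or more of any characters, as few as possible
  • \s+ - one or more whitespace characters
  • (?:Message ID|\* +\* +\*) - either Message ID or a * * * like substring (three asterisks separated by one or more spaces).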
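The two points that matter most in the breakdown above are the (?s) modifier and the lazy .*? quantifier. The short sketch below illustrates both on a tiny made-up string (text and the variable names are hypothetical, chosen only for this demonstration): without (?s) nothing matches, with a greedy .* everything up to the last Message ID is swallowed into one result, and with the lazy form each message is captured separately.

```python
import re

# Tiny made-up input with two messages (hypothetical data for illustration).
text = "a@b.com says\nhello\nMessage ID: 1\nc@d.com says\nbye\nMessage ID: 2"

# 1) No (?s): '.' cannot cross a newline, so "Message ID" is never reached.
no_dotall = re.findall(r"\S+@\S+\s+(.*)Message ID", text)
print(no_dotall)  # []

# 2) (?s) with a greedy '.*': one match that runs to the LAST "Message ID".
greedy = re.findall(r"(?s)\S+@\S+\s+(.*)Message ID", text)
print(len(greedy))  # 1

# 3) (?s) with a lazy '.*?': each match stops at the NEXT "Message ID".
lazy = re.findall(r"(?s)\S+@\S+\s+(.*?)\s+Message ID", text)
print(lazy)  # ['says\nhello', 'says\nbye']
```

The first case corresponds to the attempt in the question, which is why it printed an empty list.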
