Negative lookbehind without using the OR operator in a regular expression - Python

Problem description

I have two cases in which I need to extract some information from a log file with the following structure:

proc format;

2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 27         
2018-04-12T07:45:52,433 INFO  [00000009] :t707982 - 35         '0010','0019'="08"
2018-04-12T07:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"

NOTE: There were 95219365 observations read from the data set WORK.TESTE1.
2018-04-12T07:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.
2018-04-12T07:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE1 decreased size by 34.04 percent. 
2018-04-12T07:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T07:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       real time           2:07.03
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       user cpu time       1:56.98
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       system cpu time     39.22 seconds
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k

proc format;

2018-04-12T08:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T08:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE2.
2018-04-12T08:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE2 has 95219365 observations and 9 variables.
2018-04-12T08:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE2 decreased size by 34.04 percent. 
2018-04-12T08:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T08:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       real time           2:07.03
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       user cpu time       1:56.98
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       system cpu time     39.22 seconds
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k

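1) Extract everything between proc {format} and NOTE: PROCEDURE {format}.

2) If the first proc {format} has no NOTE: PROCEDURE {format}, the capture must stop when another proc {format} is found, and it must not return the NOTE: PROCEDURE {format} that belongs to the second proc {format}, as in this example: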

proc format;

2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 27         
2018-04-12T07:45:52,433 INFO  [00000009] :t707982 - 35         '0010','0019'="08"
2018-04-12T07:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"

NOTE: There were 95219365 observations read from the data set WORK.TESTE1.
2018-04-12T07:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.
2018-04-12T07:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE1 decreased size by 34.04 percent. 
2018-04-12T07:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.


proc format;

2018-04-12T08:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T08:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE2.
2018-04-12T08:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE2 has 95219365 observations and 9 variables.
2018-04-12T08:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE2 decreased size by 34.04 percent. 
2018-04-12T08:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T08:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       real time           2:07.03
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       user cpu time       1:56.98
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       system cpu time     39.22 seconds
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k

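So my problem is the second case. My regex keeps capturing the NOTE: PROCEDURE FORMAT from the second proc format, when it should skip the first block and capture only the second one: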

(?s)(?<=proc[ ])(?P<type>\w+).*?(?:(?<=NOTE:[ ]PROCEDURE[ ])|(?<!=proc[ ]))(?P=type).*?(?=memory)

I tried adding a negative lookbehind with the OR operator, |(?<!=proc[ ]), but still without success.

You can see my regex here.

Can you help me?

Tags: python, regex

Solution


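For this data structure, to get the data between proc {format} and NOTE: PROCEDURE {format}, you do not need the inline modifier (?s), which makes the dot match a newline; leaving it out avoids unnecessary backtracking.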
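To illustrate why letting the dot match newlines causes trouble here, the sketch below uses a shortened, made-up sample (the `row a` / `row b` lines are placeholders, not real log lines). With (?s), a lazy `.*?` that starts at the first proc block happily runs across the second `proc format;` until it finds a NOTE: PROCEDURE line.

```python
import re

# Hypothetical, shortened sample: the first block has no NOTE: PROCEDURE line.
text = (
    "proc format;\n"
    "row a\n"
    "proc format;\n"
    "row b\n"
    "NOTE: PROCEDURE FORMAT used\n"
)

# With (?s) the dot also matches newlines, so the lazy quantifier keeps
# expanding past the second "proc format;" until NOTE: PROCEDURE is found.
with_dotall = re.search(r"(?s)proc format;(.*?)NOTE: PROCEDURE", text)
print(repr(with_dotall.group(1)))

# Without (?s) the dot cannot cross a line break, so this pattern finds nothing.
without_dotall = re.search(r"proc format;(.*?)NOTE: PROCEDURE", text)
print(without_dotall)
```

The first search returns content from both blocks, which is the unwanted behavior described in the question. A pattern that consumes the text one line at a time can check each line before accepting it.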

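If you want the data in between, you can add a capturing group and match proc format; literally, instead of using a positive lookbehind at the start.

To get the data in between, you can match all lines that neither start with proc nor contain NOTE: PROCEDURE.

The data in between is in capturing group 2: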

^proc (?P<type>\w+);\r?\n\s*((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))
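A minimal sketch of applying this pattern in Python. It assumes the `re.MULTILINE` flag so that `^` matches at the start of every line, and it uses a shortened sample modeled on the second case (the name `PATTERN` and the trimmed log lines are only for illustration).

```python
import re

# The pattern from above, compiled with MULTILINE so ^ matches at each line start.
PATTERN = re.compile(
    r"^proc (?P<type>\w+);\r?\n\s*"
    r"((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))",
    re.MULTILINE,
)

# Shortened sample: the first proc block has no NOTE: PROCEDURE line.
sample = (
    "proc format;\n"
    "\n"
    "2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26\n"
    "NOTE: There were 95219365 observations read from the data set WORK.TESTE1.\n"
    "\n"
    "proc format;\n"
    "\n"
    "2018-04-12T08:45:52,430 INFO  [00000009] :t707982 - 26\n"
    "2018-04-12T08:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):\n"
    "2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k\n"
)

matches = list(PATTERN.finditer(sample))
for m in matches:
    print(m.group("type"))  # the word after "proc"
    print(m.group(2))       # the data in between
```

On this sample only the second block should match, because the first block reaches another `proc` line before any NOTE: PROCEDURE line.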

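Explanation

  • ^ Start of the line (in Python this needs the re.MULTILINE flag)
  • proc Match literally, including the trailing space
  • (?P<type>\w+); Named group type, match 1+ word characters, followed by a literal ;
  • \r?\n\s* Match a newline and 0+ whitespace characters
  • ( Capture group 2
    • (?: Non-capturing group
      • (?!proc |.* NOTE: PROCEDURE ) Assert that the line does not start with proc and does not contain NOTE: PROCEDURE
      • .*\r?\n Match any character except a newline 0+ times, followed by a newline
    • )* Close the group and repeat it 0+ times to match all such lines
    • .*(?= NOTE: PROCEDURE ) Match any character except a newline 0+ times, asserting that what follows is NOTE: PROCEDURE
  • ) Close group 2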
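The negative lookahead inside the repeated group can also be tried on its own. The sketch below applies it to a few individual lines (the names `keep_line` and `kept` are made up for this illustration) to show which lines the repeated group is allowed to consume.

```python
import re

# Same lookahead as in the pattern; match() anchors it at the start of each line.
keep_line = re.compile(r"(?!proc |.* NOTE: PROCEDURE ).*")

lines = [
    "2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26",
    "NOTE: There were 95219365 observations read from the data set WORK.TESTE1.",
    "proc format;",
    "2018-04-12T07:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):",
]

# A line is kept only if it neither starts with "proc " nor contains " NOTE: PROCEDURE ".
kept = [line for line in lines if keep_line.match(line)]
for line in kept:
    print(line)
```

Only the first two lines pass the check. The `proc format;` line and the NOTE: PROCEDURE line are rejected, which is what stops the repeated group at a block boundary.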

Regex demo for the first data | Regex demo for the second data

