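python - Negative lookbehind in a regex without using the OR operator
Problem description
I have two scenarios for extracting information from a log file with the following structure: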
proc format;
2018-04-12T07:45:52,430 INFO [00000009] :t707982 - 26
2018-04-12T07:45:52,430 INFO [00000009] :t707982 - 27
2018-04-12T07:45:52,433 INFO [00000009] :t707982 - 35 '0010','0019'="08"
2018-04-12T07:45:52,434 INFO [00000009] :t707982 - 36 '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE1.
2018-04-12T07:55:41,536 INFO [00000018] :t707982 - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.
2018-04-12T07:55:41,537 INFO [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE1 decreased size by 34.04 percent.
2018-04-12T07:55:41,538 INFO [00000018] :t707982 - Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T07:55:42,230 INFO [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T07:55:42,231 INFO [00000018] :t707982 - real time 2:07.03
2018-04-12T07:55:42,231 INFO [00000018] :t707982 - user cpu time 1:56.98
2018-04-12T07:55:42,231 INFO [00000018] :t707982 - system cpu time 39.22 seconds
2018-04-12T07:55:42,231 INFO [00000018] :t707982 - memory 3159502.32k
proc format;
2018-04-12T08:45:52,430 INFO [00000009] :t707982 - 26
2018-04-12T08:45:52,434 INFO [00000009] :t707982 - 36 '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE2.
2018-04-12T08:55:41,536 INFO [00000018] :t707982 - NOTE: The data set WORK.TESTE2 has 95219365 observations and 9 variables.
2018-04-12T08:55:41,537 INFO [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE2 decreased size by 34.04 percent.
2018-04-12T08:55:41,538 INFO [00000018] :t707982 - Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T08:55:42,230 INFO [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - real time 2:07.03
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - user cpu time 1:56.98
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - system cpu time 39.22 seconds
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - memory 3159502.32k
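1) Extract everything between proc {format} and note: procedure {format}.
2) If the first proc {format} has no note: procedure {format}, the capture must stop when another proc {format} is found, and it must not return the note: procedure {format} that belongs to the second proc {format}, as in this example: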
proc format;
2018-04-12T07:45:52,430 INFO [00000009] :t707982 - 26
2018-04-12T07:45:52,430 INFO [00000009] :t707982 - 27
2018-04-12T07:45:52,433 INFO [00000009] :t707982 - 35 '0010','0019'="08"
2018-04-12T07:45:52,434 INFO [00000009] :t707982 - 36 '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE1.
2018-04-12T07:55:41,536 INFO [00000018] :t707982 - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.
2018-04-12T07:55:41,537 INFO [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE1 decreased size by 34.04 percent.
2018-04-12T07:55:41,538 INFO [00000018] :t707982 - Compressed is 92230 pages; un-compressed would require 139823 pages.
proc format;
2018-04-12T08:45:52,430 INFO [00000009] :t707982 - 26
2018-04-12T08:45:52,434 INFO [00000009] :t707982 - 36 '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE2.
2018-04-12T08:55:41,536 INFO [00000018] :t707982 - NOTE: The data set WORK.TESTE2 has 95219365 observations and 9 variables.
2018-04-12T08:55:41,537 INFO [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE2 decreased size by 34.04 percent.
2018-04-12T08:55:41,538 INFO [00000018] :t707982 - Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T08:55:42,230 INFO [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - real time 2:07.03
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - user cpu time 1:56.98
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - system cpu time 39.22 seconds
2018-04-12T08:55:42,231 INFO [00000018] :t707982 - memory 3159502.32k
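So my problem is the second scenario. My regex keeps capturing the note: procedure format that belongs to the second proc format, when it should skip the first proc format and capture only the second block: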
(?s)(?<=proc[ ])(?P<type>\w+).*?(?:(?<=NOTE:[ ]PROCEDURE[ ])|(?<!=proc[ ]))(?P=type).*?(?=memory)
I tried adding a negative lookbehind with the OR operator, |(?<!=proc[ ]), but still without success.
Can you help me?
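The behaviour described above can be reproduced with a short sketch. The sample log below is a hypothetical, shortened stand-in for the second scenario, not the real file. Two details of the pattern seem relevant: negative lookbehind syntax is (?<!...), so the = inside (?<!=proc[ ]) is a literal character and the assertion only rejects positions preceded by "=proc "; and the backreference (?P=type) is case-sensitive, so without re.IGNORECASE it can only match the lowercase "format" of the next proc format; line.

```python
import re

# The pattern exactly as posted in the question.
ATTEMPT = re.compile(
    r"(?s)(?<=proc[ ])(?P<type>\w+).*?"
    r"(?:(?<=NOTE:[ ]PROCEDURE[ ])|(?<!=proc[ ]))(?P=type).*?(?=memory)"
)

# Hypothetical, shortened log: the first block has no NOTE: PROCEDURE line.
SAMPLE = (
    "proc format;\n"
    "... - 26\n"
    "proc format;\n"
    "... - 36\n"
    "... - NOTE: PROCEDURE FORMAT used (Total process time):\n"
    "... - memory 3159502.32k\n"
)

match = ATTEMPT.search(SAMPLE)

# True when the match starts in the first block and runs across the
# second "proc format;" header, which is the unwanted behaviour.
overshoots = match is not None and "proc format;" in match.group(0)
print(overshoots)
```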
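Solution
For this data structure, to get the data between proc {format} and note: procedure {format}, you do not need the inline modifier (?s), which makes the dot match newlines. Leaving it out prevents unnecessary backtracking.
Instead of using a positive lookbehind at the start, you can match proc format; directly and add a capturing group for the data in between.
To get that data, match every line that neither starts with proc nor contains NOTE: PROCEDURE.
The data in between is in capturing group 2: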
^proc (?P<type>\w+);\r?\n\s*((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))
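As a minimal sketch of how this pattern might be applied in Python: the ^ anchor has to match at every line start, so the sketch assumes re.MULTILINE. The helper name extract_blocks and the shortened sample log are hypothetical and only stand in for the real file.

```python
import re

# The pattern from the answer, split over two string literals for readability.
PATTERN = re.compile(
    r"^proc (?P<type>\w+);\r?\n\s*"
    r"((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))",
    re.MULTILINE,
)

# Hypothetical, shortened log for the second scenario:
# the first block never reaches a NOTE: PROCEDURE line.
SAMPLE = (
    "proc format;\n"
    "T1 INFO - 26\n"
    "T1 INFO - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.\n"
    "proc format;\n"
    "T2 INFO - 36\n"
    "T2 INFO - NOTE: PROCEDURE FORMAT used (Total process time):\n"
    "T2 INFO - memory 3159502.32k\n"
)

def extract_blocks(log_text):
    """Return (procedure type, captured data) for every complete block."""
    return [(m.group("type"), m.group(2)) for m in PATTERN.finditer(log_text)]

blocks = extract_blocks(SAMPLE)
for proc_type, data in blocks:
    print(proc_type)
    print(data)
```

On this sample only the second block is returned, because the first proc format; is followed by another proc line before any NOTE: PROCEDURE line. Group 2 also ends with the prefix of the NOTE: PROCEDURE line (here "T2 INFO -"), since the final .* runs up to the lookahead.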
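Explanation
- ^
  Start of a line (with multiline mode enabled)
- proc
  Match proc and a space literally
- (?P<type>\w+);
  Named group type, match 1+ word characters, followed by a semicolon
- \r?\n\s*
  Match a newline and 0+ whitespace characters
- (
  Capturing group 2
- (?:
  Non-capturing group
- (?!proc |.* NOTE: PROCEDURE )
  Negative lookahead: assert that the line does not start with proc and does not contain NOTE: PROCEDURE
- .*\r?\n
  Match any character except a newline 0+ times, followed by a newline
- )*
  Close the non-capturing group and repeat it 0+ times to match all those lines
- .*(?= NOTE: PROCEDURE )
  Match any character except a newline 0+ times, then assert that what is on the right is NOTE: PROCEDURE
- )
  Close group 2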
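The same breakdown can be kept next to the pattern itself by using re.VERBOSE. This is only a sketch: in verbose mode unescaped whitespace is ignored, so every literal space is written as [ ]. The names VERBOSE_PATTERN, COMPACT_PATTERN, and SAMPLE are hypothetical.

```python
import re

VERBOSE_PATTERN = re.compile(
    r"""
    ^proc[ ](?P<type>\w+);            # block header such as "proc format;"
    \r?\n\s*                          # line break plus optional whitespace
    (                                 # group 2: the data in between
      (?:
        (?!proc[ ]|.*[ ]NOTE:[ ]PROCEDURE[ ])  # not a new header, no end marker
        .*\r?\n                       # consume that whole line
      )*
      .*(?=[ ]NOTE:[ ]PROCEDURE[ ])   # prefix of the end-marker line
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# The compact form from the answer, for comparison.
COMPACT_PATTERN = re.compile(
    r"^proc (?P<type>\w+);\r?\n\s*"
    r"((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))",
    re.MULTILINE,
)

# Hypothetical one-block sample.
SAMPLE = (
    "proc format;\n"
    "T1 INFO - 26\n"
    "T1 INFO - NOTE: PROCEDURE FORMAT used (Total process time):\n"
)

verbose_result = [m.group(2) for m in VERBOSE_PATTERN.finditer(SAMPLE)]
compact_result = [m.group(2) for m in COMPACT_PATTERN.finditer(SAMPLE)]
print(verbose_result == compact_result)
```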