Using re.finditer to generate an iterable object, but nothing is returned; the regex works when tested separately

Problem description

Here is the regex code:

pattern="""
(?P<host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
(\ \-\ )
(?P<user_name>[a-z]{1,100}\d{4}|\-{1})
( \[)(?P<time>\d{2}\/[A-Za-z]{3}\/\d{4}\:\d{2}\:\d{2}\:\d{2}\ -\d{4})
(\] ")
(?P<request>.+)
(")
"""
for item in re.finditer(pattern,text,re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())

I run this code in Jupyter Notebook.

The test text is:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554

Tags: regex, logging, re

Solution


The main issue is that you did not escape every literal space in your pattern: ( \[) and (\] ") each contain an unescaped space. With re.X / re.VERBOSE, any unescaped whitespace outside a character class is treated as formatting whitespace and ignored when the pattern is compiled. In a Python re pattern, [ ] always matches a literal space, but this is not guaranteed in other regex flavors, so the safest way to match a space in a pattern compiled with an re.X-like flag is to escape it.
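A minimal sketch of this behaviour, using a made-up sample string (not taken from the question): the same pattern is searched with re.VERBOSE three times, with the space left bare, escaped, and wrapped in a character class.

```python
import re

sample = "foo bar"  # hypothetical sample text, chosen only for illustration

# Under re.VERBOSE the bare space is dropped, so this pattern is really "foobar".
bare = re.search(r"foo bar", sample, re.VERBOSE)

# An escaped space survives re.VERBOSE and matches a literal space.
escaped = re.search(r"foo\ bar", sample, re.VERBOSE)

# A space inside a character class is also kept by Python's re module.
in_class = re.search(r"foo[ ]bar", sample, re.VERBOSE)

print(bare)              # None
print(escaped.group())   # foo bar
print(in_class.group())  # foo bar
```

Only the first search fails, which is the same reason re.finditer produced no items in the question.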

Besides, there are other things to note:

  • {1} is always redundant; remove it.
  • Repeated patterns can be grouped in a non-capturing group and quantified with an appropriate quantifier, e.g. \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} => \d{1,3}(?:\.\d{1,3}){3}
  • There is no need to escape / and : (anywhere in the pattern) or - (when outside a character class) in a Python re pattern.
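These simplifications should not change what the pattern matches. A quick check, assuming the host and timestamp from the first log line in the question as inputs (the variable names are only illustrative, and re.VERBOSE is deliberately not used here, so the plain space in the timestamp patterns is literal):

```python
import re

host = "146.204.224.152"
stamp = "21/Jun/2019:15:45:24 -0700"

# Spelled-out host pattern vs. the grouped and quantified form.
long_host = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
short_host = r"\d{1,3}(?:\.\d{1,3}){3}"

# Timestamp with escaped / and : vs. the unescaped form.
escaped_time = r"\d{2}\/[A-Za-z]{3}\/\d{4}\:\d{2}\:\d{2}\:\d{2} -\d{4}"
plain_time = r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}"

same_host = (re.fullmatch(long_host, host).group()
             == re.fullmatch(short_host, host).group())
same_time = (re.fullmatch(escaped_time, stamp).group()
             == re.fullmatch(plain_time, stamp).group())

# \-{1} and a bare - both match a single hyphen.
same_dash = (re.fullmatch(r"\-{1}", "-") is not None
             and re.fullmatch(r"-", "-") is not None)

print(same_host, same_time, same_dash)  # True True True
```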

Thus, you can use

pattern = r'''(?P<host>\d{1,3}(?:\.\d{1,3}){3})
(\ -\ )
(?P<user_name>[a-z]{1,100}\d{4}|-)
(\ \[)(?P<time>\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}\ -\d{4})
(\]\ ")
(?P<request>.+)
(")'''

See the regex demo and the Python demo:

import re
text = '''146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554'''
pattern = r'''(?P<host>\d{1,3}(?:\.\d{1,3}){3})
(\ -\ )
(?P<user_name>[a-z]{1,100}\d{4}|-)
(\ \[)(?P<time>\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}\ -\d{4})
(\]\ ")
(?P<request>.+)
(")'''
for item in re.finditer(pattern,text,re.VERBOSE):
    print(item.groupdict()) # We can get the dictionary returned for the item with .groupdict()

Output:

{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}
{'host': '197.109.77.178', 'user_name': 'kertzmann3129', 'time': '21/Jun/2019:15:45:25 -0700', 'request': 'DELETE /virtual/solutions/target/web+services HTTP/2.0'}
