python - Parsing GET|POST paths from access logs in Python
Problem description
Suppose we have some access logs like these:
83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"
65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276
151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"
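I need to extract the part that comes right after POST or GET (the file path).
The log format may vary, but the request type and path will always be present.
I tried the following, but it does not always work because the log formats differ: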
import re

parts = [
    r'(?P<host>\S+)',        # host %h
    r'\S+',                  # ident %l (unused)
    r'(?P<user>\S+)',        # user %u
    r'\[(?P<time>.+)\]',     # time %t
    r'"(?P<request>.*)"',    # request "%r"
    r'(?P<status>[0-9]+)',   # status %>s
    r'(?P<size>\S+)',        # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',   # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',      # user agent "%{User-agent}i"
]

def get_structured_access_logs_list(access_logs):
    pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')
    # Initialize required variables
    log_data = []
    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        try:
            log_data.append(pattern.match(line).groupdict())
        except AttributeError:  # no match: pattern.match() returned None
            pass
    return log_data

def parse_path(request_string):
    rx = re.compile(r'^(?:GET|POST)\s+([^?\s]+).*$', re.M)
    return rx.findall(request_string)

def get_file_paths(access_logs_list):
    file_path_set = set()
    for entry in access_logs_list:
        if 'request' in entry:
            matches = parse_path(entry['request'])  # a single line yields at most one match
            if matches:
                file_path_set.add(matches[0])
    return file_path_set
Update: after adjusting the code, the function "get_file_paths" returns a set containing the full paths of the files accessed in the access log
import os
import re

def parse_path(request_string):
    rx = re.compile(r'"(?:GET|POST)\s+([^\s?]*)', re.M)
    return rx.findall(request_string)

def get_file_paths(access_logs):
    file_set = set()
    for line in access_logs:
        matches = parse_path(line)  # passing a single line, the list will contain at most 1 element
        if not matches:
            continue
        # root_path is not defined in the post; presumably the site's document root
        full_path = root_path + matches[0]
        if os.path.isfile(full_path):
            file_set.add(full_path)
    return file_set
Solution
You can use
(?x)^
(?P<host>\S+) \s+ # host %h
\S+ \s+ # ident %l (unused)
(?P<user>\S+) \s+ # user %u
\[(?P<time>.*?)\] \s+ # time %t
"\S+\s+(?P<request>[^"?\s]*)[^"]*" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<size>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<agent>[^"]*)" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )? )? # unused
$
See the regex demo.
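I introduced a number of small improvements (see [^"]* in place of .*). The main ones are the optional non-capturing groups, which match the referrer and agent fields that may be missing, and the request pattern, which now looks like (?P<request>[^"?\s]*) and captures only zero or more characters other than whitespace, a question mark, or a double quote, while the subsequent [^"]*" matches the rest of the field.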
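To see why [^"]* is the safer choice, here is a minimal sketch that applies both forms, in isolation, to a shortened sample line (the IP address and the truncated user agent are made up for the illustration). Inside a longer pattern, backtracking can sometimes rescue .*, but the negated class makes the field boundary explicit.

```python
import re

# Shortened, made-up log line with three quoted fields.
line = ('1.2.3.4 - - [22/Mar/2009:07:40:06 +0100] '
        '"GET /images/ht1.gif HTTP/1.1" 200 61 '
        '"http://www.facades.fr/" "Mozilla/4.0"')

# Greedy .* runs on to the LAST double quote on the line.
greedy = re.search(r'"(?P<request>.*)"', line).group("request")

# [^"]* cannot cross a double quote, so it stops where the first field ends.
bounded = re.search(r'"(?P<request>[^"]*)"', line).group("request")

print(greedy)
print(bounded)
```

The first print shows the request, status, size, referrer and agent all swallowed into one capture; the second shows only the request field.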
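Also, it makes sense to compile the pattern once rather than every time a line is processed.
The (?x) modifier enables free-spacing mode, so the pattern can be laid out over multiple lines and annotated with comments.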
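As a small illustration of free-spacing mode, the sketch below writes the same hypothetical two-field pattern (not the full log pattern) with and without (?x). Since unescaped whitespace is ignored in this mode, a literal space has to be written as [ ], an escaped space, or \s.

```python
import re

# Hypothetical two-field pattern in free-spacing mode.
verbose = re.compile(r"""(?x)
    (?P<method>[A-Z]+)  # request method
    [ ]                 # literal space: must sit in a class or be escaped
    (?P<path>\S+)       # request path
""")

# The same pattern on a single line, without (?x).
compact = re.compile(r"(?P<method>[A-Z]+) (?P<path>\S+)")

sample = "GET /css/main.css"
print(verbose.match(sample).groupdict())
print(compact.match(sample).groupdict())
```

Both patterns should produce the same groups for the sample string.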
import re
pattern = re.compile(r"""(?x)^
(?P<host>\S+) \s+ # host %h
\S+ \s+ # ident %l (unused)
(?P<user>\S+) \s+ # user %u
\[(?P<time>.*?)\] \s+ # time %t
"\S+\s+(?P<request>[^"?\s]*)[^"]*" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<size>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<agent>[^"]*)" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )?)? # optional argument (unused)
$""")
def get_structured_access_logs_list(access_logs):
    # Initialize required variables
    log_data = []
    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        match = pattern.match(line)
        if match:  # skip lines that do not fit the expected format
            log_data.append(match.groupdict())
    return log_data
lines = ['83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"',
'65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276',
'151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"',
'10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"']
for res in get_structured_access_logs_list(lines):
    print(res)
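Since the question only asks for the request paths, the structured dicts can be reduced to a set of paths. The following is a rough sketch rather than part of the original answer: get_request_paths is a hypothetical helper name, the pattern is repeated only so the block runs on its own, and the user agent in the second sample line is shortened.

```python
import re

# Same pattern as in the answer, repeated so this sketch is self-contained.
pattern = re.compile(r"""(?x)^
(?P<host>\S+) \s+
\S+ \s+
(?P<user>\S+) \s+
\[(?P<time>.*?)\] \s+
"\S+\s+(?P<request>[^"?\s]*)[^"]*" \s+
(?P<status>[0-9]+) \s+
(?P<size>\S+) (?:\s+
"(?P<referrer>[^"?\s]*[^"]*)" \s+
"(?P<agent>[^"]*)" (?:\s+
"[^"]*" )?)?
$""")

def get_request_paths(access_logs):
    """Hypothetical helper: collect request paths, query strings dropped."""
    paths = set()
    for line in access_logs:
        match = pattern.match(line.strip())
        if match:  # lines that do not fit the format are skipped
            paths.add(match.group("request"))
    return paths

sample = [
    '65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276',
    '151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0"',
    'not a log line',
]
print(sorted(get_request_paths(sample)))
```

A set is used so that a file requested many times appears only once, which matches the get_file_paths function in the question.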