python - Parsing a text log file with Python regular expressions
Problem description
My goal is to extract some values from a text file and plot them with matplotlib...
I have several large (~100 MB) text log files produced by Python scripts that call TensorFlow. I saved the terminal output of each run like this:
python my_script.py 2>&1 | tee mylog.txt
Here is a snippet of the text file I am trying to parse into a dictionary:
Epoch 00001: saving model to /root/data-cache/data/tmp/models/ota-cfo-full_20200626-173916_01_0.05056382_0.99.h5
5938/5938 [==============================] - 4312s 726ms/step - loss: 0.1190 - accuracy: 0.9583 - val_loss: 0.0506 - val_accuracy: 0.9854
Specifically, I want to extract the epoch number (0001) out of the 100 epochs, the time in seconds (4312), loss (0.1190), accuracy (0.9583), val_loss (0.0506), and val_accuracy (0.9854), so that I can plot them with matplotlib.
The log file is full of other junk that I don't want, for example:
Epoch 1/100
1/5938 [..............................] - ETA: 0s - loss: 1.7893 - accuracy: 0.31252020-06-26 17:39:45.253972: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed
2020-06-26 17:39:45.255588: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216] GpuTracer has collected 179 callback api events and 179 activity events.
2020-06-26 17:39:45.276306: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45
2020-06-26 17:39:45.284235: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.trace.json.gz
2020-06-26 17:39:45.286639: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.049 ms
2020-06-26 17:39:45.288257: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45Dumped tool data for overview_page.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.overview_page.pb
Dumped tool data for input_pipeline.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.kernel_stats.pb
2/5938 [..............................] - ETA: 6:24 - loss: 1.7824 - accuracy: 0.2656
3/5938 [..............................] - ETA: 17:03 - loss: 1.7562 - accuracy: 0.2396
4/5938 [..............................] - ETA: 22:27 - loss: 1.7368 - accuracy: 0.2344
5/5938 [..............................] - ETA: 22:55 - loss: 1.7387 - accuracy: 0.2375
6/5938 [..............................] - ETA: 24:16 - loss: 1.7175 - accuracy: 0.2656
7/5938 [..............................] - ETA: 24:34 - loss: 1.6885 - accuracy: 0.2812
Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
This almost works, but it does not capture the epoch:
import re

regular_exp = re.compile(r'(?P<loss>\d+\.\d+)\s+-\s+accuracy:\s*(?P<accuracy>\d+\.\d+)\s+-\s+val_loss:\s*(?P<val_loss>\d+\.\d+)\s*-\s*val_accuracy:\s*(?P<val_accuracy>\d+\.\d+)', re.M)
with open(log_file, 'r') as file:
    results = [match.groupdict() for match in regular_exp.finditer(file.read())]
I also tried simply reading in the file, but it is littered with these strange \x08 characters:
from pprint import pprint as pp
log_file = 'mylog.txt'
text_file = open(log_file, "r")
lines = text_file.readlines()
pp (lines)
' 291/1500 [====>.........................] - ETA: 12:19 - loss: 0.7179 - '
'accuracy: '
'0.7163\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',
' 292/1500 [====>.........................] - ETA: 12:18 - loss: 0.7164 - '
'accuracy: '
'0.7168\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',
Can someone help me construct a regular expression in Python that lets me build a dictionary of these values?
My goal is something like this:
[{'iteration': '00', 'seconds': '1802', 'loss': '0.3430', 'accuracy': '0.8753', 'val_loss': '0.1110', 'val_accuracy': '0.9670', 'epoch_num': '00002', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_02_0.069291627_0.98.h5'},
 {'iteration': '1500/1500', 'seconds': '1679', 'loss': '0.0849', 'accuracy': '0.9739', 'val_loss': '0.0693', 'val_accuracy': '0.9807', 'epoch_num': '00003', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_03_0.055876694_0.98.h5'},
 {'iteration': '1500/1500', 'seconds': '1674', 'loss': '0.0742', 'accuracy': '0.9791', 'val_loss': '0.0559', 'val_accuracy': '0.9845', 'epoch_num': '00004', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_04_0.053867317_0.99.h5'},
 {'iteration': '1500/1500', 'seconds': '1671', 'loss': '0.0565', 'accuracy': '0.9841', 'val_loss': '0.0539', 'val_accuracy': '0.9850', 'epoch_num': '00005', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_05_0.053266536_0.99.h5'}]
Solution
You can use the right regular expression together with groupdict, a list comprehension, and finditer.
First, we need a baseline, normalized text format. This matters: if your text does not match it, try replacing all \x08 bytes (and any other unwanted bytes) with spaces first. (\x08 is just a backspace character.)
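A minimal sketch of that cleanup step (the file name mylog.txt comes from the question; adjust as needed):

```python
import re

def clean_log(text):
    # Replace runs of backspace characters (\x08), which terminal
    # progress bars use to rewrite the line, with a single space so
    # they don't confuse later parsing.
    return re.sub(r'\x08+', ' ', text)

# Usage with the log from the question:
# with open('mylog.txt', 'r', errors='replace') as f:
#     clean = clean_log(f.read())
```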
data = """Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled."""
This is a one-to-one copy of the latest sample you provided. It seems the most "complete", so I will work with it.
The regex you need would be:
ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)
Now, a single line does all the magic:
info_list = [match.groupdict() for match in re.finditer(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)', data)]
For this particular case, though, I would definitely recommend compiling the pattern first:
DATA_PATTERN = re.compile(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)')
info_list = [match.groupdict() for match in DATA_PATTERN.finditer(data)]
Output:
[{'ETA': '0', 'loss': '0.0088', 'accuracy': '0.9984', 'iteration': '00054'},
{'ETA': '0', 'loss': '0.0136', 'accuracy': '0.9978', 'iteration': '00055'}]
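The pattern above only captures ETA, loss, accuracy, and the epoch number. As a sketch of one way to also pull the fields the question asks for (seconds per epoch, val_loss, val_accuracy, and the checkpoint path) from the per-epoch summary lines — the group names and the helper parse_epochs here are my own choice, not from the original answer:

```python
import re

# Matches the "Epoch NNNNN: saving model to ..." line followed by the
# end-of-epoch summary line carrying timing and validation metrics.
FULL_PATTERN = re.compile(
    r'Epoch (?P<epoch_num>\d+): saving model to (?P<epoch_file>\S+)\s+'
    r'(?P<iteration>\d+/\d+) \[=+\] - (?P<seconds>\d+)s \d+ms/step'
    r' - loss: (?P<loss>[\d.]+) - accuracy: (?P<accuracy>[\d.]+)'
    r' - val_loss: (?P<val_loss>[\d.]+) - val_accuracy: (?P<val_accuracy>[\d.]+)'
)

def parse_epochs(text):
    # One dict per completed epoch, with all fields as strings.
    return [m.groupdict() for m in FULL_PATTERN.finditer(text)]
```

On the data sample above this yields one dict per epoch (e.g. epoch_num '00054', seconds '942', val_accuracy '0.9993'), which is straightforward to feed into matplotlib.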