Python: parse CSV data and load it into a DataFrame

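Problem description

I've pasted some CSV data below. I'd like to parse it and load it into a DataFrame so that it is easier to analyze.

I want to get the values for each logStreamName group, something like this: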

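import pandas as pd

df = pd.read_csv('mydata.csv')

# One entry per distinct log stream
logs = df['logStreamName'].unique()

for i in logs:
    # Rows that belong to the current log stream
    grouped_df = df[df['logStreamName'] == i]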

But how do I then parse each subset DataFrame to extract the associated values?

CSV data:

message,logStreamName
20/10/07 17:40:42 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 247632,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 247632,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(placed_ijkl): 40944,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(placed_region_ijl): 1706,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(not_placed_region_ijl): 1706,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12258.98828125,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:24 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 323680,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 323680,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(placed_ijkl): 59280,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(placed_region_ijl): 2964,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(not_placed_region_ijl): 2964,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12313.5390625,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 301312,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:24 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 301312,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:25 - INFO - dse_run_model - len(placed_ijkl): 44128,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:25 - INFO - dse_run_model - len(placed_region_ijl): 2758,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:25 - INFO - dse_run_model - len(not_placed_region_ijl): 2758,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:41:07 - INFO - __main__ - Maximum memory usage: 12286.75,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2

Desired final output:

d = {'n_i*n_j*n_k*n_l': [247632, 323680, 301312], 'len(placed_ijkl)': [40944, 59280, 44128], 
     'len(placed_region_ijl)':[1706, 2964, 2758], 'len(not_placed_region_ijl)': [1706, 2964, 2758],
     'Maximum memory usage': [12258.98828125, 12313.5390625, 12286.75]}
df = pd.DataFrame(data=d)

Tags: python, pandas

Solution


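You can use a regular expression to capture the relevant parts of the message column (the metric name and its numeric value), then use pivot to build the final output. drop_duplicates is needed because n_i*n_j*n_k*n_l is logged twice per stream, and pivot raises an error on duplicate index/column pairs:

# Raw string so that \s and \d reach the regex engine unchanged
df[["id", "value"]] = df["message"].str.extract(r".*-\s.*-\s(?P<id>.*)(?::\s|\s=\s)(?P<value>(?:\d+|\d+\.\d+)$)")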

out = df.drop_duplicates(["logStreamName", "id"]).pivot(index="logStreamName", columns="id", values="value")

print(out)
id                   Maximum memory usage len(not_placed_region_ijl) len(placed_ijkl) len(placed_region_ijl) n_i*n_j*n_k*n_l
logStreamName                                                                                                               
data-science-dse-...        12313.5390625                 2964                  59280                 2964            323680
data-science-dse-...       12258.98828125                 1706                  40944                 1706            247632
data-science-dse-...             12286.75                 2758                  44128                 2758            301312
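
Note that str.extract returns strings, so the pivoted values above are text rather than numbers. The following is a minimal, self-contained sketch of the same approach with a numeric conversion added. It assumes a small inline sample (CSV_TEXT, a hypothetical stand-in for mydata.csv with a few rows copied from the question) and a slightly simplified pattern; the names PATTERN, parsed and out are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for mydata.csv: a few rows copied from the question.
CSV_TEXT = """message,logStreamName
20/10/07 17:40:42 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 247632,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 247632,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(placed_ijkl): 40944,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12258.98828125,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:24 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 323680,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 323680,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(placed_ijkl): 59280,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12313.5390625,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
"""

# "<timestamp> - <level> - <logger> - <id>: <value>" or "... <id> = <value>"
PATTERN = r".*-\s.*-\s(?P<id>.*)(?::\s|\s=\s)(?P<value>\d+(?:\.\d+)?)$"

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Named groups become the columns "id" and "value"; non-matching rows give NaN.
parsed = df.join(df["message"].str.extract(PATTERN)).dropna(subset=["id"])

# Convert the captured text to numbers (the mixed column becomes float64).
parsed["value"] = pd.to_numeric(parsed["value"])

# Keep one row per (stream, metric), then spread the metrics into columns.
out = (
    parsed.drop_duplicates(["logStreamName", "id"])
    .pivot(index="logStreamName", columns="id", values="value")
)

print(out)
```

Because counts and memory figures share one column before pivoting, every value ends up as a float here; individual columns can be cast back with astype(int) if integer counts are needed.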
