首页 > 解决方案 > 循环遍历Python中多列中的不同值

问题描述

我想在df下面循环查找活动的持续时间,其中:

  1. 对于相同的X,有两个Y值。
  2. 如果a轮到任何一个唯一值(例如,for 、变为 for 或变为 for ),1则标志着活动的开始;如果轮到的 两个唯一值,则标志着活动的结束(例如, for 、变为for和变为for )。YX ==18a1Y==13a1Y==14a0YX ==18a0Y==13a0Y==14
    Timestamp               X   Y   a   b       Type
0   2000-10-26 10:08:27.060 18  14  0.0 24.5    medium
1   2000-10-26 10:39:24.310 18  13  1.0 24.0    low     # Start 
2   2000-10-26 11:50:48.190 18  14  1.0 23.5    medium
3   2000-10-26 17:18:07.610 18  14  1.0 23.5    medium
4   2000-10-26 17:18:09.610 18  14  0.0 23.5    medium
5   2000-10-26 17:29:10.610 18  14  0.0 26.5    medium
6   2000-10-26 17:29:10.770 18  14  1.0 26.5    medium
7   2000-10-26 17:29:12.610 18  14  1.0 53.5    medium
8   2000-10-26 17:29:14.610 18  14  1.0 62.0    medium
9   2000-10-26 17:29:14.770 18  13  1.0 24.0    low
10  2000-10-26 17:29:16.610 18  14  1.0 64.5    medium
11  2000-10-26 17:29:18.770 18  14  0.0 64.5    medium
12  2000-10-26 17:29:18.770 18  13  0.0 24.0    low     # End 
13  2000-10-26 17:29:28.770 18  14  0.0 63.5    medium
14  2000-10-26 17:29:34.770 19  16  0.0 62.0    medium
15  2000-10-26 17:29:40.770 19  16  1.0 61.0    medium  # Start 
16  2000-10-26 17:29:46.770 19  16  1.0 60.0    medium
17  2000-10-26 17:32:01.180 19  17  1.0 25.0    low
18  2000-10-26 17:32:01.180 19  16  0.0 51.5    medium
19  2000-10-26 17:32:35.180 19  17  0.0 50.0    medium  # End 

可重现的例子:

from datetime import *
from pandas import *

df= pd.DataFrame({'Timestamp': {0: Timestamp('2000-10-26 10:08:27.060000'),
  1: Timestamp('2000-10-26 10:39:24.310000'),
  2: Timestamp('2000-10-26 11:50:48.190000'),
  3: Timestamp('2000-10-26 17:18:07.610000'),
  4: Timestamp('2000-10-26 17:18:09.610000'),
  5: Timestamp('2000-10-26 17:29:10.610000'),
  6: Timestamp('2000-10-26 17:29:10.770000'),
  7: Timestamp('2000-10-26 17:29:12.610000'),
  8: Timestamp('2000-10-26 17:29:14.610000'),
  9: Timestamp('2000-10-26 17:29:14.770000'),
  10: Timestamp('2000-10-26 17:29:16.610000'),
  11: Timestamp('2000-10-26 17:29:18.770000'),
  12: Timestamp('2000-10-26 17:29:18.770000'),
  13: Timestamp('2000-10-26 17:29:28.770000'),
  14: Timestamp('2000-10-26 17:29:34.770000'),
  15: Timestamp('2000-10-26 17:29:40.770000'),
  16: Timestamp('2000-10-26 17:29:46.770000'),
  17: Timestamp('2000-10-26 17:32:01.180000'),
  18: Timestamp('2000-10-26 17:32:01.180000'),
  19: Timestamp('2000-10-26 17:32:35.180000')},
 'X': {0: 18,
  1: 18,
  2: 18,
  3: 18,
  4: 18,
  5: 18,
  6: 18,
  7: 18,
  8: 18,
  9: 18,
  10: 18,
  11: 18,
  12: 18,
  13: 18,
  14: 19,
  15: 19,
  16: 19,
  17: 19,
  18: 19,
  19: 19},
 'Y': {0: 14,
  1: 13,
  2: 14,
  3: 14,
  4: 14,
  5: 14,
  6: 14,
  7: 14,
  8: 14,
  9: 13,
  10: 14,
  11: 14,
  12: 13,
  13: 14,
  14: 14,
  15: 14,
  16: 14,
  17: 13,
  18: 14,
  19: 13},
 'a': {0: 0.0,
  1: 1.0,
  2: 1.0,
  3: 1.0,
  4: 0.0,
  5: 0.0,
  6: 1.0,
  7: 1.0,
  8: 1.0,
  9: 1.0,
  10: 1.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 1.0,
  16: 1.0,
  17: 1.0,
  18: 0.0,
  19: 0.0},
 'b': {0: 24.5,
  1: 24.0,
  2: 23.5,
  3: 23.5,
  4: 23.5,
  5: 26.5,
  6: 26.5,
  7: 53.5,
  8: 62.0,
  9: 24.0,
  10: 64.5,
  11: 64.5,
  12: 24.0,
  13: 63.5,
  14: 62.0,
  15: 61.0,
  16: 60.0,
  17: 25.0,
  18: 51.5,
  19: 50.0},
 'Type': {0: 'medium',
  1: 'low',
  2: 'medium',
  3: 'medium',
  4: 'medium',
  5: 'medium',
  6: 'medium',
  7: 'medium',
  8: 'medium',
  9: 'low',
  10: 'medium',
  11: 'medium',
  12: 'low',
  13: 'medium',
  14: 'medium',
  15: 'medium',
  16: 'medium',
  17: 'low',
  18: 'medium',
  19: 'medium'}})

预期输出:

Start                    End                      X   
2000-10-26 10:39:24.310  2000-10-26 17:29:18.770  18
2000-10-26 17:29:40.770  2000-10-26 17:32:35.180  19

如何在 for 循环中执行此操作,以便我可以进一步操作数据框,例如计算Y活动中每个唯一值的平均值?


编辑:

对任何混淆表示歉意。活动应该通过a值的变化来确定,并且应该是连续的,即,在一个活动结束后,我们重置并计算下一个活动。因此,例如以下数据帧应生成相同的输出:


    Timestamp               X   Y   a   b       Type
0   2000-10-26 10:08:27.060 18  14  0.0 24.5    medium
1   2000-10-26 10:39:24.310 18  13  1.0 24.0    low
2   2000-10-26 11:50:48.190 18  14  1.0 23.5    medium
3   2000-10-26 17:18:07.610 18  14  1.0 23.5    medium
4   2000-10-26 17:18:09.610 18  14  0.0 23.5    medium
5   2000-10-26 17:29:10.610 18  14  0.0 26.5    medium
6   2000-10-26 17:29:10.770 18  14  1.0 26.5    medium
7   2000-10-26 17:29:12.610 18  14  1.0 53.5    medium
8   2000-10-26 17:29:14.610 18  14  1.0 62.0    medium
9   2000-10-26 17:29:14.770 18  13  1.0 24.0    low
10  2000-10-26 17:29:16.610 18  14  1.0 64.5    medium
11  2000-10-26 17:29:18.770 18  14  0.0 64.5    medium
12  2000-10-26 17:29:18.770 18  13  0.0 24.0    low
13  2000-10-26 17:29:28.770 18  14  0.0 63.5    medium
14  2000-10-26 17:29:34.770 18  14  0.0 62.0    medium
15  2000-10-26 17:29:40.770 18  14  1.0 61.0    medium
16  2000-10-26 17:29:46.770 18  14  1.0 60.0    medium
17  2000-10-26 17:32:01.180 18  13  1.0 25.0    low
18  2000-10-26 17:32:01.180 18  14  0.0 51.5    medium
19  2000-10-26 17:32:35.180 18  13  0.0 50.0    medium

预期输出:

Start                    End                      X   
2000-10-26 10:39:24.310  2000-10-26 17:29:18.770  18
2000-10-26 17:29:40.770  2000-10-26 17:32:35.180  18

标签: pythonpandasnumpyfor-loop

解决方案


这是一个可能的解决方案,问题是end根据您的约束可以是两个值

# Create dict for the solution
expected_dict = {'start': [], 'end': [], 'X': [], 'b_means': []}

# Loop through each group
for name, gp in df.groupby('X'):
  gp['y_shift'] = gp['Y'].shift(-1).ffill().astype(int)
  astart = gp[gp['a'].eq(1.0)].iloc[0]['pd.Timestamp']
  # End logic that can be updated
  aend = gp[gp['a'].eq(0.0) & gp['Y'].ne(gp['y_shift'])].iloc[-1]['pd.Timestamp']
  b_means = gp['b'].mean()

  # Collect all values
  expected_dict ['start'].append(astart)
  expected_dict ['end'].append(aend)
  expected_dict ['X'].append(name)
  expected_dict ['b_means'].append(name)

edf = pd.DataFrame(expected_dict)

                    start                     end   X    b_means
0 2000-10-26 10:39:24.310 2000-10-26 17:29:18.770  18  37.714286
1 2000-10-26 17:29:40.770 2000-10-26 17:32:01.180  19  51.583333

推荐阅读