How to iteratively sum elements of a column in a data frame

Problem description

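I'm a beginner. I have a data frame containing one year of 5-minute-resolution electricity load data for 100 consumers (each CSV has >100,000 rows). For a given customer, I need to find the rolling maximum half-hourly demand for each month. Since my data is at 5-minute intervals, I sum 6 successive entries in the load power (kW) column.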

ID  Date_time   load power (kW)
0   1/01/2018 0:00  0.191566105
1   1/01/2018 0:05  0.193145833
2   1/01/2018 0:10  0.192853152
3   1/01/2018 0:15  0.270974475
4   1/01/2018 0:20  0.290183338
5   1/01/2018 0:25  0.185485805
6   1/01/2018 0:30  0.208431765
7   1/01/2018 0:35  0.269577658
8   1/01/2018 0:40  0.268412758
9   1/01/2018 0:45  0.286095837
10  1/01/2018 0:50  0.301008341
11  1/01/2018 0:55  0.390496602
12  1/01/2018 1:00  0.406787652
13  1/01/2018 1:05  0.229862502
14  1/01/2018 1:10  0.190870833
15  1/01/2018 1:15  0.190224999

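The load power (kW) values need to be summed over 6 successive entries (5 minutes * 6 = 30 minutes), i.e. index 0-5, index 1-6, 2-7, ..., and then the maximum of the succ_6 column has to be found for each month.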

ID  Date_time   load power (kW) succ_6
0   1/01/2018 0:00  0.191566105 1.324208707
1   1/01/2018 0:05  0.193145833 1.341074367
2   1/01/2018 0:10  0.192853152 1.417506192
3   1/01/2018 0:15  0.270974475 1.493065799
4   1/01/2018 0:20  0.290183338 1.508187161
5   1/01/2018 0:25  0.185485805 1.519012164
6   1/01/2018 0:30  0.208431765 1.724022961
7   1/01/2018 0:35  0.269577658 1.922378848
8   1/01/2018 0:40  0.268412758 1.882663692
9   1/01/2018 0:45  0.286095837 1.805121767
10  1/01/2018 0:50  0.301008341 1.70925093
11  1/01/2018 0:55  0.390496602 1.604063424
12  1/01/2018 1:00  0.406787652 1.408709679
13  1/01/2018 1:05  0.229862502 1.192568766
14  1/01/2018 1:10  0.190870833 1.236928491
15  1/01/2018 1:15  0.190224999 1.321553317
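
The succ_6 column above can be reproduced without an explicit loop. The following is a minimal sketch using only the 16 sample rows shown; the variable names trailing and succ_6 are illustrative. pandas labels each rolling window by its last row, so the result is shifted up by 5 rows to match the "index 0-5, 1-6, ..." alignment in the table.

```python
import pandas as pd

# The 16 sample rows of "load power (kW)" from the question.
load = pd.Series([
    0.191566105, 0.193145833, 0.192853152, 0.270974475,
    0.290183338, 0.185485805, 0.208431765, 0.269577658,
    0.268412758, 0.286095837, 0.301008341, 0.390496602,
    0.406787652, 0.229862502, 0.190870833, 0.190224999,
], name="load power (kW)")

# rolling(6).sum() labels each window by its LAST row (rows i-5..i).
trailing = load.rolling(6).sum()

# Shifting up by 5 rows relabels each window by its FIRST row (rows i..i+5),
# which is the alignment used by succ_6 in the question.
succ_6 = trailing.shift(-5)

print(succ_6)
```

Rows 11 to 15 come out as NaN here only because this excerpt stops at row 15; in the full file those windows would have data. The shift changes which row a sum is attached to, not the set of sums, so the maximum is the same with or without it.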

This has to be done for every month.

For example 0-5, 1-6, 2-7, 3-8, ...

I wrote some inefficient code that I think does what I intend, but it takes 36 minutes for one consumer, and I have more than 100 consumers to process.

def monthly_hh_maximum (df_input, filenames,file_path):
    # to store maximum half hourly demand (m_max) for each consumer for each month of year (m_o_y)
    df_m_hh_max =pd.DataFrame(columns=['filename','m_o_y','m_max'])
    # 100 consumers, 100 filenames
    for filename in filenames: 
        print(filename)
        #finds out unique months of year from date in given time series data
        month_o_year=df_input[filename]['Date_conv'].unique()  
        #looping over months of year for one consumer
        for m_o_y in month_o_year:
            # number of rows (5-minute samples) in the given month
            df_input_m_o_y_len=((df_input[filename].loc[df_input[filename]['Date_conv']==m_o_y]))['Date_conv'].size 
            df_temp=df_input[filename].loc[df_input[filename]['Date_conv']==m_o_y]
            print(df_input_m_o_y_len)
            monthly_mm=0 
            for i in range (0,df_input_m_o_y_len-6):
                #Next line retrieves a view of dataframe which has data for that particular
                # month and iteratively sums successive 6 elements of column load power to find the maximum 
                monthly_mm_temp= ((df_input[filename].loc[df_input[filename]['Date_conv']==m_o_y]))['load power (kW)'].iloc[i:i+6].sum()
                print(i)
                print(monthly_mm_temp)
                print(m_o_y)
                if(monthly_mm>monthly_mm_temp):
                    monthly_mm=monthly_mm
                else:
                    monthly_mm=monthly_mm_temp
            df_m_hh_max['filename']=filename
            df_m_hh_max['m_o_y'] = m_o_y
            df_m_hh_max['m_max'] = monthly_mm

I am trying to reduce the computation time because my resources are limited, and I realize my code is inefficient.

Tags: python, pandas

Solution

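Using the suggestions above from @Dan and @Zipa: I have a str month column here. I group by month on that column and compute a rolling sum over each window of 6 elements. Then I group by month again, take the maximum, and print the unique values. I am generating all of these intermediate values to check the results. The final code will be much shorter.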

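def demand_a_savings (df_input, filenames,file_path):
    # file_path is not used in this function
    for filename in filenames:
        # rolling sum of 6 samples (30 minutes), computed separately within each month
        df_input[filename]['rolling_sum_b'] = df_input[filename].groupby('Month')['load power (kW)'].transform(lambda x: x.rolling(6,6).sum())
        # monthly maximum of the rolling sum, repeated on every row of that month
        df_input[filename]['max_value_b']=df_input[filename].groupby('Month')['rolling_sum_b'].transform(lambda x:x.max())
        print(df_input[filename]['max_value_b'].unique())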
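
One possible shape for a shorter version is sketched below. It is not the answerer's final code. The data is synthetic (random values over two months), and the names monthly_hh_max and brute_force_max are hypothetical; only the column names 'Month' and 'load power (kW)' come from the thread. It returns one value per month directly, and checks the result against an explicit loop over every window.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one consumer's file: 5-minute samples over two months.
# Column names follow the thread; the values are random, not real data.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", "2018-02-28 23:55", freq="5min")
df = pd.DataFrame({
    "Date_time": idx,
    "load power (kW)": rng.random(len(idx)),
})
# String month key such as "2018-01", playing the role of the 'Month' column.
df["Month"] = df["Date_time"].dt.strftime("%Y-%m")


def monthly_hh_max(frame):
    """Largest sum of 6 consecutive samples within each month."""
    return (
        frame.groupby("Month")["load power (kW)"]
        .apply(lambda x: x.rolling(6).sum().max())
    )


monthly_max = monthly_hh_max(df)
print(monthly_max)


def brute_force_max(values):
    """Reference check: explicit loop over every 6-sample window."""
    return max(values[i:i + 6].sum() for i in range(len(values) - 5))


for month, group in df.groupby("Month"):
    expected = brute_force_max(group["load power (kW)"].to_numpy())
    assert np.isclose(monthly_max[month], expected)
```

Because the rolling sum is computed inside each group, no window spans two months. The loop bound in the check is len(values) - 5 so that the last complete window is included; the loop in the question uses a bound of length minus 6, which stops one window short.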
