首页 > 解决方案 > 改进 Python 代码 - 基于某些列的行计算

问题描述

我在 Python 世界中相当新,所以我的代码还不是“pythonic”。我有一个包含几千行的数据库,其中每一行代表一个酒吧/咖啡馆,并且对于一年中的每个月,我都有一个 0-1(0-100%)的数字,它告诉我平均有多少,那个酒吧已经满了。例如,酒吧“Python Friends”从 1 月到 10 月一直客满,但在一年中的最后 2 个月内客满(示例照片)。这个任务的主要目标是计算每个酒吧的平均覆盖率,所以是逐行计算,但是问题是有些酒吧是在某个月份才开始营业的,所以我需要考虑只有那些列(例如,如果它从 6 月开始,我只计算最后 6 列)。同样,有些在某个月份关闭,所以我不考虑这一点。 这是一个示例,其中存在每个组合的几行

根据我的指示,我需要将每个月的覆盖率乘以该月的天数,将所有这些月相加,然后除以总天数。例如,如果一家酒吧只有一年中最后 3 个月的覆盖率,并且每个月的覆盖率是 100%,我将进行以下计算:(31 * 1 + 30 * 1 + 31 * 1)/92

我尝试了以下代码,但我确信它可以改进,因为即使我有一台相当不错的笔记本电脑,我也要等待很长时间才能编译它。此外,我需要对此进行改进并进行其他计算,例如获取季度数据覆盖率,因此我的代码目前还不是很好。有什么可以改进的?

#create dictionary to get months data
months = {'January' : 31,
          'February': 29,
          'March' : 31,
          'April': 30,
          'May' : 31,
          'June': 30,
          'July' : 31,
          'August': 31,
          'September' : 30,
          'October': 31,
          'November' : 30,
          'December': 31
        }

tobeappended = []
for i,j,m in zip(df['Start Month'],df['End Month'],df.index):
    if (i!='Not Available') & (j!='Not Available'): #data coverage for those rows in which both 'Active From' and 'Active To' exist
        middlemonths = 0
        totaldays = 0
        calc=0
        newdf= df.loc[m:m,i:j] # make a subset of the current dataframe view in the for loop, to be able to acces its columns
        for u in range(0,len(newdf.columns)): #get row calculation of days of coverage, then divide by total number of days
            middlemonths += newdf.iloc[:,u] * months[newdf.columns[u]]
        totaldays = np.array([months[x] for x in newdf.columns]).sum()
        calc = middlemonths / totaldays
        calc = calc.item()
        tobeappended.append(calc) #list of coverages for those rows in which active to/active from are available, from the same year as the data
    elif(i!='Not Available') & (j=='Not Available'):
        aftermonths = 0
        totaldays = 0
        calc = 0
        newdf = df.loc[m:m,i:'December']
        for u in range(0, len(newdf.columns)):
            aftermonths += newdf.iloc[:,u] * months[newdf.columns[u]]
        totaldays = np.array([months[x] for x in newdf.columns]).sum()
        calc = aftermonths / totaldays
        calc = calc.item()
        tobeappended.append(calc)
    elif(i=='Not Available') & (j!='Not Available'):
        beforemonths = 0
        totaldays = 0
        calc = 0
        newdf = df.loc[m:m,'January':j]
        for u in range(0, len(newdf.columns)):
            beforemonths += newdf.iloc[:,u] * months[newdf.columns[u]]
        totaldays = np.array([months[x] for x in newdf.columns]).sum()
        calc = beforemonths / totaldays
        calc = calc.item()
        tobeappended.append(calc)
    else:
        newdf = df.loc[m:m, 'January':'December']
        allmonths = 0
        for u in range(0, len(newdf.columns)):
            allmonths += newdf.iloc[:,u] * months[newdf.columns[u]]
        totaldays = np.array([months[x] for x in newdf.columns]).sum()
        calc = allmonths / totaldays
        calc = calc.item()
        tobeappended.append(calc)
    
tobeappended = np.array(tobeappended)
df['Data Coverage'] = tobeappended

PS:这不是全部代码,我之前需要做一些清理。正如我所说,我的主要目标是改进此代码并使其更快、更 Python 化,以便轻松添加更多部分。我知道这是一个很长的帖子,如果有任何问题,请告诉我,我会提供。感谢您的帮助和时间!

标签: pythonpandasnumpy

解决方案


可以有不同的方法来做你想做的事。一种选择可能是拥有一个融化的DataFrame。

例如

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Asset': ['Python Friends', 'Flying Dudes', 'Polterger', 'Beerzone'],
    'January': [1, 0, 1, 0],
    'February': [1, .5, 1, 0],
    'March': [np.nan, .5, 1, .5],
    'April': [1, .5, 1, 1],
    'May': [1, 1, 1, 1],
    'June': [1, 1, 1, 1],
    'July': [1, 1, 1, 1],
    'August': [1, 0, 1, 1],
    'September': [1, 0, 1, 0],
    'October': [1, 0, 1, 0],
    'November': [.5, 0, 1, 0],
    'December': [.5, 0, 1, 0],
    'Start Month': ['Not Available', 'Not Available', 'May', 'June'],
    'End Month': ['Not Available', 'April', 'Not Available', 'August'],
})

您可以创建一个融化的 DataFrame,例如

melt_df = df.melt(
    id_vars=['Asset', 'Start Month', 'End Month'], 
    var_name='Month', value_name='Coverage')

那会给

print(melt_df.head(5))

            Asset    Start Month      End Month     Month Coverage
0  Python Friends  Not Available  Not Available   January        1
1    Flying Dudes  Not Available          April   January        0
2       Polterger            May  Not Available   January        1
3        Beerzone           June         August   January        0
4  Python Friends  Not Available  Not Available  February      1.0

现在,一种解决方法可能是

  • 为每个月的名称分配一个从1到的数字12
  • Not Available在列中分配Start Date一个值0
  • Not Available在列中分配End Date一个值13

例如

month_map = {
    0: 0,
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12,
    13: 13,
}

melt_df['Month'] = melt_df.Month.map(month_map)

melt_df.loc[melt_df['Start Month']=='Not Available', 'Start Month'] = 0
melt_df.loc[melt_df['End Month']=='Not Available', 'End Month'] = 13

melt_df['Start Month'] = melt_df['Start Month'].map(month_map)
melt_df['End Month'] = melt_df['End Month'].map(month_map)

现在我们可以分配NaN给不包括在月份中的所有Start Month保险End Month

# we presume that End Month means
# that it stops at the end of the month
melt_df.loc[
    (melt_df.Month<melt_df['Start Month'])
    |
    (melt_df.Month>melt_df['End Month'] + 1),
    'Coverage'
] = np.nan

并删除所有NaNs 所以我们只需要值

melt_df.dropna(inplace=True)

要处理月份并计算平均覆盖率,最好使用内置函数to_datetime(如果您知道年份)

melt_df['year_month'] = "2020-" + melt_df.Month.astype(str)
melt_df['year_month'] = pd.to_datetime(melt_df.year_month)

然后,对于 each Asset,使用填充 s 并计算平均值的方法重新采样到每天(不包括最后一天,即下个月)ffillNaN

for asset in melt_df.Asset.unique():
    _df = melt_df[melt_df.Asset==asset].copy(deep=True)
    _df.set_index('year_month', inplace=True)
    _df = _df.resample('D', closed='left').mean()
    _df.fillna(method='ffill', inplace=True)
    print(asset, f"{_df.iloc[:-1].Coverage.mean():.1%}")
    # exclude last day  ^^^^^^^^^

Python Friends 95.5%
Flying Dudes 37.2%
Polterger 100.0%
Beerzone 100.0%

推荐阅读