iterrows error when applying a function to a GroupBy pandas DataFrame

Problem description

I am working with a pandas DataFrame like this:

ID  have        time
1   NaN     2010-07-01
1   1       2010-07-08
1   5       2011-07-08
1   NaN     2011-08-08
1   NaN     2012-05-08
1   NaN     2012-09-08
1   1       2012-10-08
2   NaN     2013-01-18
2   1       2013-02-18
2   NaN     2013-03-18

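I want to fill the missing values within each ID group (individual), using only that individual's most recent non-missing value recorded within the previous year: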

ID    have  want    time
1     NaN   NaN     2010-07-01
1     1     1       2010-07-08
1     5     5       2011-07-08
1     NaN   5       2011-08-08
1     NaN   5       2012-05-08
1     NaN   NaN     2012-09-08
1     1     1       2012-10-08
2     NaN   NaN     2013-01-18
2     1     1       2013-02-18
2     NaN   1       2013-03-18

Is there an efficient way to do this?

I am using the following code, which seems to work row by row:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, np.nan, np.nan, "2010-07-01"],
    [1.0, "1", "1", "2010-07-08"],
    [1.0, "5", "5", "2011-07-08"],
    [1.0, np.nan, "5", "2011-08-08"],
    [1.0, np.nan, "5", "2012-05-08"],
    [1.0, np.nan, np.nan, "2012-09-08"],
    [1.0, "1", "1", "2012-10-08"],
    [2.0, np.nan, np.nan, "2013-01-18"],
    [2.0, "1", "1", "2013-02-18"],
    [2.0, np.nan, "1", "2013-03-18"]
    ], columns=['ID', 'have', 'want', 'time'])
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')

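def want(df):
    # Column names adapted to the sample frame above: 'time' is the record
    # date ('edatum' in the original code) and 'have' is the value ('dosage').
    for ind, row in df.iterrows():
        # Start by treating the row's own date as the date of its last value.
        df.loc[ind, 'ewant'] = df.loc[ind, 'time']
        # For a missing value, carry over the date from the previous row
        # (this relies on the default 0..n-1 integer index).
        if ind != 0 and pd.isnull(df.loc[ind, 'have']):
            df.loc[ind, 'ewant'] = df.loc[ind - 1, 'ewant']
        # Days since the last non-missing value, and a 0/1 "within one year" flag.
        days = (df.loc[ind, 'time'] - df.loc[ind, 'ewant']).days
        df.loc[ind, 'timespan'] = days
        df.loc[ind, 'impu'] = int(0 < days <= 365)

    return df

want(df)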

But when I try to apply it at the 'ID' group level:

want(df.groupby(['ID']))

I get this iterrows error:

AttributeError: Cannot access callable attribute 'iterrows' of 'DataFrameGroupBy' objects, try using the 'apply' method

Is there a way to resolve this iterrows error? Thanks!

Tags: python, pandas, pandas-groupby

Solution


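This is a perfect use case for merge_asof: for each row it looks up the most recent row with a non-missing value for the same ID, and tolerance limits how far back that lookup may reach.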

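# Assumes df holds only the ID, have and time columns and is sorted by time.
df1 = df.dropna()  # rows that carry a non-missing value
# Recent pandas versions reject unit='M' for Timedelta, so the one-year
# window is written as 365 days here.
df = pd.merge_asof(df, df1, by='ID', on='time', tolerance=pd.Timedelta(days=365))
df  # have_y is the column you want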
   ID  have_x       time  have_y
0   1     NaN 2010-07-01     NaN
1   1     1.0 2010-07-08     1.0
2   1     5.0 2011-07-08     5.0
3   1     NaN 2011-08-08     5.0
4   1     NaN 2012-05-08     5.0
5   1     NaN 2012-09-08     NaN
6   1     1.0 2012-10-08     1.0
7   2     NaN 2013-01-18     NaN
8   2     1.0 2013-02-18     1.0
9   2     NaN 2013-03-18     1.0
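
As for the error itself: a DataFrameGroupBy is not a DataFrame, so it has no iterrows, but iterating over it yields (key, sub-DataFrame) pairs, and each sub-frame can be processed like any other frame. The sketch below shows that, and then one possible loop-free alternative to merge_asof built from grouped forward fills. It is not the answerer's method. It assumes numeric have values and a 365-day window, and the names rows_per_id, last_seen and filled are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question, with numeric values (an assumption).
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "have": [np.nan, 1, 5, np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan],
    "time": pd.to_datetime([
        "2010-07-01", "2010-07-08", "2011-07-08", "2011-08-08", "2012-05-08",
        "2012-09-08", "2012-10-08", "2013-01-18", "2013-02-18", "2013-03-18",
    ]),
})

# Iterating a GroupBy object yields (key, sub-DataFrame) pairs;
# each sub-frame is an ordinary DataFrame, so iterrows() works on it.
rows_per_id = {
    key: sum(1 for _ in group.iterrows())
    for key, group in df.groupby("ID")
}

# Loop-free alternative: forward-fill within each ID, then discard
# fills whose source row is more than one year (365 days) old.
df = df.sort_values(["ID", "time"])

# Date of the most recent row with a value, per ID (NaT before the first one).
last_seen = df["time"].where(df["have"].notna()).groupby(df["ID"]).ffill()

# Value carried forward within each ID, with no time limit yet.
filled = df.groupby("ID")["have"].ffill()

# Comparisons against NaT are False, so rows with no earlier value stay NaN.
df["want"] = filled.where(df["time"] - last_seen <= pd.Timedelta(days=365))
```

On the sample data this should produce the same want column as the merge_asof result above. The grouped forward fill avoids building a second frame, while merge_asof states the "most recent match within a tolerance" intent more directly.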
