Pandas: "returning a view versus a copy" warning when building a new DataFrame in a loop

Question

Say I have a DataFrame with two datetime columns, and I want to analyze the difference between them:

import pandas as pd

csv = [
         ['2019-08-03 00:00:00', '2019-08-01 15:00:00', 4],
         ['2019-08-03 00:00:00', '2019-08-01 10:00:00', 6],
         ['2019-08-03 00:00:00', '2019-08-01 16:00:00', 8],
         ['2019-08-04 00:00:00', '2019-08-02 19:00:00', 3],
         ['2019-08-04 00:00:00', '2019-08-02 13:00:00', 4],
         ['2019-08-04 00:00:00', '2019-08-02 11:00:00', 5]
]

df = pd.DataFrame(csv, columns=['delivery_date', 'dispatch_date', 'order_size'])
df['delivery_date'] = pd.to_datetime(df['delivery_date'])
df['dispatch_date'] = pd.to_datetime(df['dispatch_date'])
df['transit_time'] = (df['delivery_date']-df['dispatch_date'])
df = df.set_index(['delivery_date','transit_time'])

OK, now we have something like this:

                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 09:00:00 2019-08-01 15:00:00           4
              1 days 14:00:00 2019-08-01 10:00:00           6
              1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
              1 days 11:00:00 2019-08-02 13:00:00           4
              1 days 13:00:00 2019-08-02 11:00:00           5

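For each delivery date, I want to know which delivery was the fastest (shortest transit time). I want to save the result to a new DataFrame that keeps all the columns of the original one. So I iterate like this: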

delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    df_analyzed = df.loc[(date, )].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]]    
    df_result.loc[:,'delivery_date'] = date
    df_ouput = df_ouput.append(df_result)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])

The result is correct:

                 transit_time       dispatch_date  order_size
delivery_date                                                
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3

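But I get a warning:

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

And I don't know why, because I am already using the ".loc" method for the assignment: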

df_result.loc[:,'delivery_date'] = date

I could not get rid of the warning, so I ended up with this odd workaround:

delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    df_analyzed = df.loc[(date, )].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]]    
    df_result_2 = df_result.copy()
    df_result_2.loc[:,'delivery_date'] = date
    df_ouput = df_ouput.append(df_result_2)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])

If I make a copy, the warning is not shown. But why? Is there a better way to do what I want?

Tags: python, pandas

Answer


Your solution can be changed so that copy is called directly on the filtered row:

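# Written for pandas < 2.0: DataFrame.append and the method= argument of
# Index.get_loc were both removed in pandas 2.0.
delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    # Rows for this delivery date, sorted by transit_time
    df_analyzed = df.loc[date].sort_index()
    # Take the row nearest to a transit time of 0 and copy it right away,
    # so the assignment below acts on an independent DataFrame
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]].copy()
    df_result['delivery_date'] = date
    df_ouput = df_ouput.append(df_result)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])
print(df_ouput)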
                 transit_time       dispatch_date  order_size
delivery_date                                                
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
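
As for the "why": as far as I understand it, the warning is not about the assignment syntax. pandas raises it because df_result was produced by indexing another DataFrame (df_analyzed), so it cannot tell whether the assignment is meant to reach the parent as well, and using .loc on df_result does not change that. Calling .copy() makes the new object explicitly independent. The sketch below uses made-up data (the transit_hours column and its values are illustrative, not from the question) to show that an assignment on the copy leaves the source frame untouched. With Copy-on-Write enabled (an option in pandas 2.x) this warning should not appear at all.

```python
import pandas as pd

# Illustrative data only; these names are not taken from the question.
df = pd.DataFrame({"transit_hours": [33, 38, 32], "order_size": [4, 6, 8]})

# Selecting rows with a list indexer yields a new object derived from df.
subset = df.iloc[[2]]

# An explicit copy states the intent: we want an independent DataFrame.
result = subset.copy()
result["delivery_date"] = pd.Timestamp("2019-08-03")

# The new column exists only on the copy, not on the source frame.
print(list(df.columns))
print(list(result.columns))
```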

A better solution is GroupBy.apply with a custom function:

def f(x):
    x = x.sort_index(level=1)
    s = x.iloc[[x.index.get_level_values(1).get_loc(0, method='nearest')]]
    return s

df = df.groupby(level=0).apply(f).reset_index(level=0, drop=True)
print (df)
                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3

Or:

def f(x):
    x = x.sort_index(level=1)
    s = x.iloc[[x.index.get_level_values(1).get_loc(0, method='nearest')]]
    return s

df = df.groupby(level=0, group_keys=False).apply(f)
print (df)
                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
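
On pandas 2.0 and later, Index.get_loc no longer accepts method, so the function f above needs a different lookup there. One possible sketch, assuming transit times are never negative (so "nearest to 0" and "smallest" mean the same thing), keeps transit_time as an ordinary column and uses idxmin per group. The variable names fastest_labels and fastest are mine, not part of the original answer.

```python
import pandas as pd

rows = [
    ["2019-08-03 00:00:00", "2019-08-01 15:00:00", 4],
    ["2019-08-03 00:00:00", "2019-08-01 10:00:00", 6],
    ["2019-08-03 00:00:00", "2019-08-01 16:00:00", 8],
    ["2019-08-04 00:00:00", "2019-08-02 19:00:00", 3],
    ["2019-08-04 00:00:00", "2019-08-02 13:00:00", 4],
    ["2019-08-04 00:00:00", "2019-08-02 11:00:00", 5],
]
df = pd.DataFrame(rows, columns=["delivery_date", "dispatch_date", "order_size"])
df["delivery_date"] = pd.to_datetime(df["delivery_date"])
df["dispatch_date"] = pd.to_datetime(df["dispatch_date"])
df["transit_time"] = df["delivery_date"] - df["dispatch_date"]

# For each delivery date, idxmin gives the row label of the smallest transit time.
fastest_labels = df.groupby("delivery_date")["transit_time"].idxmin()

# Select those rows by label; every original column is kept.
fastest = df.loc[fastest_labels].set_index("delivery_date")
print(fastest)
```

Because no intermediate slice is modified, this version needs neither a loop nor a .copy() call.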

If I understand correctly, sorting the index and keeping the first row for each delivery date is enough:

df = df.sort_index()
df = df[~df.index.get_level_values(0).duplicated()]
print (df)
                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
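
The reason this works can be traced step by step: sort_index orders a MultiIndex by its first level and then by its second, so the first row of each delivery date has the smallest transit time, and duplicated() flags every occurrence except the first. A small sketch with simplified, made-up index values (integer transit_hours instead of timedeltas, an assumption for readability):

```python
import pandas as pd

# Simplified stand-in for the (delivery_date, transit_time) index.
idx = pd.MultiIndex.from_tuples(
    [
        ("2019-08-03", 9),
        ("2019-08-03", 14),
        ("2019-08-03", 8),
        ("2019-08-04", 5),
        ("2019-08-04", 11),
    ],
    names=["delivery_date", "transit_hours"],
)
df = pd.DataFrame({"order_size": [4, 6, 8, 3, 4]}, index=idx)

# Sort by delivery_date first, then by transit_hours within each date.
df_sorted = df.sort_index()

# True for every row whose delivery_date was already seen further up.
is_repeat = df_sorted.index.get_level_values(0).duplicated()

# Keep only the first (fastest) row for each delivery date.
fastest = df_sorted[~is_repeat]
print(fastest)
```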
