Sort multiple columns' values from min to max, and put in new columns in pandas dataframe

Problem description

I have a dataframe with datetime objects in columns 3 through 6. I want to sort these dates into new columns: P_min, P_2, P_3, P_max, from earliest ("min") to latest date ("max"). I can easily get the min and max values and put them into their own column. However, how can I get the middle values (P_2 and P_3)?

This is what I have so far:

import pandas as pd
df = pd.DataFrame(data={'Name':['a','b','c','d'],'Number':[1,2,3,4], 'Contact':['foo1','foo2','foo3','foo4'],3:[pd.to_datetime('1/1/2015'),pd.NaT,pd.NaT,pd.to_datetime('1/1/2015')],4:[pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002')], 5:[pd.NaT,pd.NaT,pd.NaT,pd.to_datetime('3/15/2015')], 6:[pd.NaT,pd.to_datetime('3/15/2015'),pd.NaT,pd.to_datetime('4/10/2007')]}); 
> df

  Name  Number Contact          3          4          5          6
0    a       1    foo1 2015-01-01 2002-02-20        NaT        NaT
1    b       2    foo2        NaT 2002-02-20        NaT 2015-03-15
2    c       3    foo3        NaT 2002-02-20        NaT        NaT
3    d       4    foo4 2015-01-01 2002-02-20 2015-03-15 2007-04-10

Then I can manually set the min and max values:

df['P_min'] = df.iloc[:,3:7].min(axis=1) # axis=1 reduces across the columns, giving one value per row
df['P_max'] = df.iloc[:,3:7].max(axis=1) # 3:7 covers columns 3 through 6; 3:6 would skip column 6

I'm trying to make something work where I replace the min/max values so I could get a new min value which would be P_2, and so forth...

df.iloc[:,3:7].replace(to_replace=df.iloc[:,3:7].min(axis=1), value=pd.NaT)

Could someone please help with a more efficient or easy method such as a for loop?

Tags: python, pandas, dataframe

Solution


Here is an elegant solution: convert the date columns to a numpy matrix of ints -> sort -> convert back to datetime.

import pandas as pd
import numpy as np


df = pd.DataFrame(data={'Name':['a','b','c','d'],'Number':[1,2,3,4], 'Contact':['foo1','foo2','foo3','foo4'],3:[pd.to_datetime('1/1/2015'),pd.NaT,pd.NaT,pd.to_datetime('1/1/2015')],4:[pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002')], 5:[pd.NaT,pd.NaT,pd.NaT,pd.to_datetime('3/15/2015')], 6:[pd.NaT,pd.to_datetime('3/15/2015'),pd.NaT,pd.to_datetime('4/10/2007')]}); 

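# Cast the date columns to int64 nanoseconds; NaT becomes the smallest int64, so it sorts first
matrix = np.array(df[df.columns[3:7]].astype('int64'))
matrix.sort(axis = 1)  # sort each row in place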


df_t = pd.DataFrame(matrix, columns = ['P_min', 'P_2', 'P_3', 'P_max'])
conc = [pd.to_datetime(df_t[x]) for x in df_t.columns]

pd.concat([df] + conc, axis = 1)

Out[1]:
  Name  Number Contact          3          4          5          6      P_min        P_2        P_3      P_max
0    a       1    foo1 2015-01-01 2002-02-20        NaT        NaT        NaT        NaT 2002-02-20 2015-01-01
1    b       2    foo2        NaT 2002-02-20        NaT 2015-03-15        NaT        NaT 2002-02-20 2015-03-15
2    c       3    foo3        NaT 2002-02-20        NaT        NaT        NaT        NaT        NaT 2002-02-20
3    d       4    foo4 2015-01-01 2002-02-20 2015-03-15 2007-04-10 2002-02-20 2007-04-10 2015-01-01 2015-03-15
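
The NaT values land in the leftmost columns because NaT is stored as the smallest 64-bit integer. As an alternative sketch (a different technique from the integer round trip above), the datetime64 values can be sorted directly with NumPy, which places NaT at the end of each row from NumPy 1.18 onward. The sample frame mirrors the question's data; `date_cols`, `new_cols` and `result` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the question, written with ISO date strings (None becomes NaT)
df = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Number': [1, 2, 3, 4],
    'Contact': ['foo1', 'foo2', 'foo3', 'foo4'],
    3: pd.to_datetime(['2015-01-01', None, None, '2015-01-01']),
    4: pd.to_datetime(['2002-02-20'] * 4),
    5: pd.to_datetime([None, None, None, '2015-03-15']),
    6: pd.to_datetime([None, '2015-03-15', None, '2007-04-10']),
})

date_cols = [3, 4, 5, 6]                      # hypothetical helper names
new_cols = ['P_min', 'P_2', 'P_3', 'P_max']

# Pull the dates out as a 2-D datetime64 array and sort each row;
# np.sort puts NaT after every real date (NumPy >= 1.18)
values = df[date_cols].to_numpy(dtype='datetime64[ns]')
sorted_values = np.sort(values, axis=1)

# Attach the sorted values as new columns, aligned on the original index
result = df.join(pd.DataFrame(sorted_values, columns=new_cols, index=df.index))
print(result[new_cols])
```

With this ordering P_min always holds the earliest real date, while P_max is NaT for any row that has a missing value. If P_max should be the latest real date instead, it can still be taken with max(axis=1) as in the question.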

A trick to make every P_min an actual date instead of NaT:

import pandas as pd
import numpy as np


df = pd.DataFrame(data={'Name':['a','b','c','d'],'Number':[1,2,3,4], 'Contact':['foo1','foo2','foo3','foo4'],3:[pd.to_datetime('1/1/2015'),pd.NaT,pd.NaT,pd.to_datetime('1/1/2015')],4:[pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002'),pd.to_datetime('2/20/2002')], 5:[pd.NaT,pd.NaT,pd.NaT,pd.to_datetime('3/15/2015')], 6:[pd.NaT,pd.to_datetime('3/15/2015'),pd.NaT,pd.to_datetime('4/10/2007')]}); 

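# Cast the date columns to int64 nanoseconds; NaT becomes -9223372036854775808
matrix = np.array(df[df.columns[3:7]].astype('int64'))

# Replace the NaT marker with a placeholder that converts back to 2100-01-01,
# so missing dates sort last; the placeholder is easy to filter out afterwards
matrix[matrix == -9223372036854775808] = 4102444800000000000
matrix.sort(axis = 1)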

df_t = pd.DataFrame(matrix, columns = ['P_min', 'P_2', 'P_3', 'P_max'])
conc = [pd.to_datetime(df_t[x]) for x in df_t.columns]
pd.concat([df] + conc, axis = 1)

Out[2]:
  Name  Number Contact          3          4          5          6      P_min        P_2        P_3      P_max
0    a       1    foo1 2015-01-01 2002-02-20        NaT        NaT 2002-02-20 2015-01-01 2100-01-01 2100-01-01
1    b       2    foo2        NaT 2002-02-20        NaT 2015-03-15 2002-02-20 2015-03-15 2100-01-01 2100-01-01
2    c       3    foo3        NaT 2002-02-20        NaT        NaT 2002-02-20 2100-01-01 2100-01-01 2100-01-01
3    d       4    foo4 2015-01-01 2002-02-20 2015-03-15 2007-04-10 2002-02-20 2007-04-10 2015-01-01 2015-03-15
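
Filtering the 2100-01-01 placeholder back out could look like the following minimal sketch. It assumes no real date in the data equals the placeholder; `SENTINEL`, `sorted_dates` and `cleaned` are illustrative names, and the small frame stands in for the first two rows of the P_ columns above.

```python
import pandas as pd

# Placeholder date produced by the 4102444800000000000 nanosecond value
SENTINEL = pd.Timestamp('2100-01-01')

# Stand-in for the sorted P_ columns (first two rows of the output above)
sorted_dates = pd.DataFrame({
    'P_min': pd.to_datetime(['2002-02-20', '2002-02-20']),
    'P_2': pd.to_datetime(['2015-01-01', '2015-03-15']),
    'P_3': pd.to_datetime(['2100-01-01', '2100-01-01']),
    'P_max': pd.to_datetime(['2100-01-01', '2100-01-01']),
})

# mask() blanks every cell where the condition is True; for datetime
# columns the blank value is NaT
cleaned = sorted_dates.mask(sorted_dates == SENTINEL)
print(cleaned)
```

After this step the real dates stay packed to the left and the missing slots read NaT again rather than a made-up date.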
