How to get column values from the idxmin of another row and column?

Problem description

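Apologies if the title does not describe the problem well.

This is very similar to a question I asked here earlier. I got a solution to that one, but now I am facing a different problem.

I have the following DataFrame: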

import pandas as pd

df = pd.DataFrame({'person':['A', 'A', 'B', 'B', 'A', 'C', 'B', 'C'],
                   'datetime':['2018-02-26 10:49:32', '2018-02-26 10:58:03', '2018-02-26 10:51:10',
                               '2018-02-26 10:58:45', '2018-02-26 10:43:34', '2018-02-26 10:49:51',
                               '2018-02-26 10:51:51', '2018-02-26 10:55:10'],
                   'location':['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

person  datetime    location
A   2018-02-26 10:49:32 a
A   2018-02-26 10:58:03 b
B   2018-02-26 10:51:10 c
B   2018-02-26 10:58:45 d
A   2018-02-26 10:43:34 e
C   2018-02-26 10:49:51 f
B   2018-02-26 10:51:51 g
C   2018-02-26 10:55:10 h

Sorting by person and time gives:

df.sort_values(by=['person', 'datetime'])

person  datetime    location
A   2018-02-26 10:43:34 e
A   2018-02-26 10:49:32 a
A   2018-02-26 10:58:03 b
B   2018-02-26 10:51:10 c
B   2018-02-26 10:51:51 g
B   2018-02-26 10:58:45 d
C   2018-02-26 10:49:51 f
C   2018-02-26 10:55:10 h

You can read this as: person A went to location 'e', then to 'a', then to 'b'. Likewise, person B went to 'c', then 'g', then 'd', and so on.

I want to turn it into this:

person  prev_datetime       prev_loc    curr_datetime       curr_loc    next_datetime   next_loc
A       2018-02-26 10:43:34     e       2018-02-26 10:49:32     a       2018-02-26 10:58:03 b
A       2018-02-26 10:49:32     a       2018-02-26 10:58:03     b       2018-02-26 10:43:34 e
B       2018-02-26 10:51:10     c       2018-02-26 10:51:51     g       2018-02-26 10:58:45 d
B       2018-02-26 10:51:51     g       2018-02-26 10:58:45     d       2018-02-26 10:51:10 c
C       2018-02-26 10:49:51     f       2018-02-26 10:55:10     h       2018-02-26 10:49:51 f

So person A goes to location 'e', then 'a', then 'b', and in the last row we send that person back to 'e' (the starting point).
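The wrap-around rule can be sketched in plain Python before involving pandas. This is only an illustration of the desired table, assuming each person's route is already sorted by time; the function name `round_trip_rows` is hypothetical and not part of the original post.

```python
def round_trip_rows(route):
    """Return (prev, curr, next) triples for a route that ends back at its start."""
    n = len(route)
    # One row per consecutive pair of stops; the "next" stop is two positions
    # ahead and wraps around to the beginning of the route.
    return [(route[i], route[i + 1], route[(i + 2) % n]) for i in range(n - 1)]


rows_a = round_trip_rows(['e', 'a', 'b'])
rows_c = round_trip_rows(['f', 'h'])
print(rows_a)  # [('e', 'a', 'b'), ('a', 'b', 'e')]
print(rows_c)  # [('f', 'h', 'f')]
```

These triples match the prev_loc, curr_loc and next_loc columns of the desired table for persons A and C.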

I applied what I learned from my previous question, with a few modifications.

df['datetime'] = pd.to_datetime(df['datetime'])
df1 = df.sort_values(by=['person', 'datetime'])

df1[['curr_datetime','curr_loc']] = df1.groupby('person')[['datetime','location']].shift(-1)

d = {'datetime':'prev_datetime','location':'prev_loc'}
df2 = df1[df1['person'].duplicated(keep='last')].rename(columns=d)

Output

person  prev_datetime   prev_loc    curr_datetime   curr_loc
A   2018-02-26 10:43:34 e   2018-02-26 10:49:32 a
A   2018-02-26 10:49:32 a   2018-02-26 10:58:03 b
B   2018-02-26 10:51:10 c   2018-02-26 10:51:51 g
B   2018-02-26 10:51:51 g   2018-02-26 10:58:45 d
C   2018-02-26 10:49:51 f   2018-02-26 10:55:10 h

Then

df2[['next_datetime','next_loc']] = df2.groupby('person')[['curr_datetime','curr_loc']].shift(-1)

Output

person  prev_datetime   prev_loc    curr_datetime   curr_loc    next_datetime   next_loc
A   2018-02-26 10:43:34 e   2018-02-26 10:49:32 a   2018-02-26 10:58:03 b
A   2018-02-26 10:49:32 a   2018-02-26 10:58:03 b   NaT nan
B   2018-02-26 10:51:10 c   2018-02-26 10:51:51 g   2018-02-26 10:58:45 d
B   2018-02-26 10:51:51 g   2018-02-26 10:58:45 d   NaT nan
C   2018-02-26 10:49:51 f   2018-02-26 10:55:10 h   NaT nan

By default some of these values are NaN, so I tried the following to fill them with the earliest stop (the min()) of each person's trip.

df2.loc[df2["next_loc"].isnull()][['next_datetime', 'next_loc']] = df2.loc[df2.groupby("person")["prev_datetime"].idxmin()][['prev_datetime', 'prev_loc']]

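This does not work. There is no error, but the DataFrame still does not show the expected data (the values are still NaN). I am confused because it fails silently. I also do not know whether there is a better approach than this. Thanks.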
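One likely reason the assignment above has no visible effect: `df2.loc[mask][[cols]] = ...` is chained indexing, so the values are written to a temporary copy instead of to df2. In addition, pandas aligns on index and column labels during assignment, so rows selected through idxmin and labelled prev_* would not line up with the next_* cells being filled. The following is a minimal sketch, assuming df2 is built as in the question and that rows are already sorted by time within each person, so that 'first' is the earliest stop.

```python
import pandas as pd

df = pd.DataFrame({
    'person': ['A', 'A', 'B', 'B', 'A', 'C', 'B', 'C'],
    'datetime': pd.to_datetime([
        '2018-02-26 10:49:32', '2018-02-26 10:58:03', '2018-02-26 10:51:10',
        '2018-02-26 10:58:45', '2018-02-26 10:43:34', '2018-02-26 10:49:51',
        '2018-02-26 10:51:51', '2018-02-26 10:55:10']),
    'location': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})
df1 = df.sort_values(by=['person', 'datetime'])

# Rebuild df2 as in the question, one column at a time.
df1['curr_datetime'] = df1.groupby('person')['datetime'].shift(-1)
df1['curr_loc'] = df1.groupby('person')['location'].shift(-1)
d = {'datetime': 'prev_datetime', 'location': 'prev_loc'}
df2 = df1[df1['person'].duplicated(keep='last')].rename(columns=d).copy()
df2['next_datetime'] = df2.groupby('person')['curr_datetime'].shift(-1)
df2['next_loc'] = df2.groupby('person')['curr_loc'].shift(-1)

# Earliest stop per person, broadcast to every row of that person.
# 'first' equals the minimum here only because df2 is sorted by time.
first_dt = df2.groupby('person')['prev_datetime'].transform('first')
first_loc = df2.groupby('person')['prev_loc'].transform('first')

# A single .loc call with a row mask and a column label writes into df2 itself.
mask = df2['next_loc'].isna()
df2.loc[mask, 'next_datetime'] = first_dt[mask]
df2.loc[mask, 'next_loc'] = first_loc[mask]

print(df2[['person', 'prev_loc', 'curr_loc', 'next_loc']])
```

Because the right-hand sides share df2's index, the values land on the intended rows, and the filled next_loc column reads b, e, d, c, f as in the desired table.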

Tags: python, pandas, pandas-groupby

Solution

Use numpy.roll together with GroupBy.transform:

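import numpy as np

df['datetime'] = pd.to_datetime(df['datetime'])
df1 = df.sort_values(by=['person', 'datetime'])

# Select the columns with a list; recent pandas versions reject the tuple form
g = df1.groupby('person')[['datetime','location']]
# np.roll shifts each group's values cyclically, so the ends wrap around
df1[['curr_datetime','curr_loc']] = g.transform(np.roll, -1)
df1[['next_datetime','next_loc']] = g.transform(np.roll, 1)
print (df1)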
  person            datetime location       curr_datetime curr_loc  \
4      A 2018-02-26 10:43:34        e 2018-02-26 10:49:32        a   
0      A 2018-02-26 10:49:32        a 2018-02-26 10:58:03        b   
1      A 2018-02-26 10:58:03        b 2018-02-26 10:43:34        e   
2      B 2018-02-26 10:51:10        c 2018-02-26 10:51:51        g   
6      B 2018-02-26 10:51:51        g 2018-02-26 10:58:45        d   
3      B 2018-02-26 10:58:45        d 2018-02-26 10:51:10        c   
5      C 2018-02-26 10:49:51        f 2018-02-26 10:55:10        h   
7      C 2018-02-26 10:55:10        h 2018-02-26 10:49:51        f   

        next_datetime next_loc  
4 2018-02-26 10:58:03        b  
0 2018-02-26 10:43:34        e  
1 2018-02-26 10:49:32        a  
2 2018-02-26 10:58:45        d  
6 2018-02-26 10:51:10        c  
3 2018-02-26 10:51:51        g  
5 2018-02-26 10:55:10        h  
7 2018-02-26 10:49:51        f  

d = {'datetime':'prev_datetime','location':'prev_loc'}
df2 = df1[df1['person'].duplicated(keep='last')].rename(columns=d)

print (df2)
  person       prev_datetime prev_loc       curr_datetime curr_loc  \
4      A 2018-02-26 10:43:34        e 2018-02-26 10:49:32        a   
0      A 2018-02-26 10:49:32        a 2018-02-26 10:58:03        b   
2      B 2018-02-26 10:51:10        c 2018-02-26 10:51:51        g   
6      B 2018-02-26 10:51:51        g 2018-02-26 10:58:45        d   
5      C 2018-02-26 10:49:51        f 2018-02-26 10:55:10        h   

        next_datetime next_loc  
4 2018-02-26 10:58:03        b  
0 2018-02-26 10:43:34        e  
2 2018-02-26 10:58:45        d  
6 2018-02-26 10:51:10        c  
5 2018-02-26 10:55:10        h  
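Note that in the output above, person C's row ends with next_loc h, while the desired table in the question shows f. np.roll(x, 1) coincides with "two stops ahead" only when a group has exactly three rows. The following is a hedged sketch, assuming the intent is that next is always two stops ahead with wrap-around; the helper `rolled` is a hypothetical name, and it transforms one column at a time rather than following the answer's two-column call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person': ['A', 'A', 'B', 'B', 'A', 'C', 'B', 'C'],
    'datetime': pd.to_datetime([
        '2018-02-26 10:49:32', '2018-02-26 10:58:03', '2018-02-26 10:51:10',
        '2018-02-26 10:58:45', '2018-02-26 10:43:34', '2018-02-26 10:49:51',
        '2018-02-26 10:51:51', '2018-02-26 10:55:10']),
    'location': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})
df1 = df.sort_values(by=['person', 'datetime'])


def rolled(col, steps):
    # Hypothetical helper: rotate each person's values `steps` positions
    # ahead, wrapping from the end of the group back to its start.
    return df1.groupby('person')[col].transform(
        lambda s: np.roll(s.to_numpy(), -steps))


df1['curr_datetime'] = rolled('datetime', 1)
df1['curr_loc'] = rolled('location', 1)
df1['next_datetime'] = rolled('datetime', 2)
df1['next_loc'] = rolled('location', 2)

# Drop each person's last row and rename, as in the answer above.
d = {'datetime': 'prev_datetime', 'location': 'prev_loc'}
df2 = df1[df1['person'].duplicated(keep='last')].rename(columns=d)
print(df2[['person', 'prev_loc', 'curr_loc', 'next_loc']])
```

With a shift of two, person C's single row gets next_loc f, and the three-stop routes of A and B are unchanged.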
