首页 > 解决方案 > 根据日期和字符串长度修改DataFrame中的数据

问题描述

我需要清理 Pandas DataFrame 中的一些数据并为此苦苦挣扎。

样本数据:

Date       | ID     | Name             | Address
-----------------------------------------------------------------------------------------------
1-4-1987   | 124578 | T.Hilpert        | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo

我想做的就是这个。按 ID 分组,从最近的日期获取名称并获取最长的地址字符串。将这些用于所有出现的 ID(在两个新列中:Name_newAddress_New)。请在下面找到所需的样本:

Date       | ID     | Name             | Address                                                | Name_New         | Address_New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo                            | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
1-4-1987   | 124578 | T. Hilpert       | 518 Hessel Plaza Lake Lonzo, AZ 11863                  | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury                | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh                    | Earline Schulist | 57367 Alfredo Vista East Bertaburgh

我已经尝试过了,但无法将它结合起来以获得所需的结果。

def f1(s):
    return max(s, key=len)

df_new = df['New_Address'] = df.groupby('ID').agg({'Address': f1})


df_new = df[df.groupby('ID').Date.transform('max') == df['Date']]

特别感谢您的帮助。

标签: pythonpython-3.xpandasdataframe

解决方案


用于与 original 具有相同大小transform的返回,然后按列创建索引并通过最大值获取值:SeriesDataFrameNameDateidxmax

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['Address_New'] = df.groupby('ID')['Address'].transform(lambda s: max(s, key=len))
df['Name_New'] = df.set_index('Name').groupby('ID')['Date'].transform('idxmax').values
print (df)
        Date      ID              Name  \
0 1987-04-01  124578         T.Hilpert   
1 1990-06-23  947383   Birdie Reynolds   
2 1960-05-12  746732  Earline Schulist   
3 2010-09-09  947383   Birdie Reynolds   
4 2017-12-27  124578  Theresia Hilpert   

                                             Address  \
0              518 Hessel Plaza Lake Lonzo, AZ 11863   
1            964 Weissnat Green Suite 568 Rennerbury   
2                57367 Alfredo Vista East Bertaburgh   
3  964 Weissnat Green Suite 568 Rennerbury, WV 16...   
4                        518 Hessel Plaza Lake Lonzo   

                                         Address_New          Name_New  
0              518 Hessel Plaza Lake Lonzo, AZ 11863  Theresia Hilpert  
1  964 Weissnat Green Suite 568 Rennerbury, WV 16...   Birdie Reynolds  
2                57367 Alfredo Vista East Bertaburgh  Earline Schulist  
3  964 Weissnat Green Suite 568 Rennerbury, WV 16...   Birdie Reynolds  
4              518 Hessel Plaza Lake Lonzo, AZ 11863  Theresia Hilpert  

推荐阅读