python - How to convert a multi-index DataFrame into a complex structure?
Problem description
My DataFrame looks like this:
I need to convert it into a structure like the following:
{1234: [[(1504010302, 45678), (1504016546, 78908)], [(1506691286,23208)]],
 4567: [[(1529577322, 789323)], [(1532173522, 1094738), (1532190922, 565980)]]}
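So basically, I need to use the first-level index ('userID') as the key for the list of all sessions of that user, and split the (timestamp, pageid) tuples into sublists according to the second-level index ('session_index'). I tried to adapt this solution: Convert dataframe to dictionary of list of tuples. But I don't know how to modify it to get the structure I need.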
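The frame itself is not reproduced above, so here is a minimal sketch of the shape being described. The `session_index` values are an assumption inferred from the target structure, not copied from the original frame.

```python
import pandas as pd

# session_index values are assumed from the target structure in the question
df = pd.DataFrame(
    {
        "userID": [1234, 1234, 1234, 4567, 4567, 4567],
        "session_index": [1, 1, 2, 1, 2, 2],
        "timestamp": [1504010302, 1504016546, 1506691286,
                      1529577322, 1532173522, 1532190922],
        "pageid": [45678, 78908, 23208, 789323, 1094738, 565980],
    }
).set_index(["userID", "session_index"])

# Each distinct (userID, session_index) pair should become one inner list
session_sizes = df.groupby(level=["userID", "session_index"]).size()
print(session_sizes.to_dict())
```

Each key of the printed dictionary is one (userID, session_index) pair, and its value is the number of tuples that inner list should hold.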
import pandas as pd
from datetime import datetime
# I'm creating the sample of different sessions
iterator = iter([{'user': 1234,
                  'timestamp': 1504010302,
                  'pageid': 45678},
                 {'user': 1234,
                  'timestamp': 1504016546,
                  'pageid': 78908},
                 {'user': 1234,
                  'timestamp': 1506691286,
                  'pageid': 23208},
                 {'user': 4567,
                  'timestamp': 1529577322,
                  'pageid': 789323},
                 {'user': 4567,
                  'timestamp': 1532173522,
                  'pageid': 1094738},
                 {'user': 4567,
                  'timestamp': 1532190922,
                  'pageid': 565980}])
# Then I'm creating an empty DataFrame
df = pd.DataFrame(columns=['userID', 'session_index', 'timestamp', 'pageid'])
# Then I'm filling the empty DataFrame based on the logic that I need to get in the final structure
# DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
for entry in iterator:
    if not (df.userID == entry['user']).any():
        # First row seen for this user: start session 1
        df = pd.concat([df, pd.DataFrame([{'userID': entry['user'], 'session_index': 1,
                                           'timestamp': entry['timestamp'], 'pageid': entry['pageid']}])],
                       ignore_index=True)
    else:
        # Rows of the same user that lie within 24 hours of the new entry
        session_numbers = df[(df.userID == entry['user'])
                             &
                             (df.timestamp.apply(lambda x: abs(datetime.fromtimestamp(x)
                                                               - datetime.fromtimestamp(entry['timestamp'])).days*24
                                                 + abs(datetime.fromtimestamp(x)
                                                       - datetime.fromtimestamp(entry['timestamp'])).seconds // 3600
                                                 ) <= 24)]
        if len(session_numbers.session_index.values) == 0:
            # No nearby row: open a new session for this user
            df = pd.concat([df, pd.DataFrame([{'userID': entry['user'], 'session_index':
                                               df.session_index[df.userID == entry['user']].max() + 1,
                                               'timestamp': entry['timestamp'], 'pageid': entry['pageid']}])],
                           ignore_index=True)
        else:
            # Otherwise reuse the session of the first nearby row
            df = pd.concat([df, pd.DataFrame([{'userID': entry['user'], 'session_index': session_numbers.session_index.values[0],
                                               'timestamp': entry['timestamp'], 'pageid': entry['pageid']}])],
                           ignore_index=True)
# Then I'm setting the Multi Index
df = df.set_index(['userID', 'session_index'])
print(df.index)
# Then I'm trying to get t
new_dict = df.apply(tuple, axis=1)\
             .groupby(level=0)\
             .agg(lambda x: list(x.values))\
             .to_dict()
Solution
Your code is hard to follow. I rewrote it in a more Pythonic way. Give it a try (it works with pandas 0.23.0):
import pandas as pd

rows = [{'user': 1234,
         'timestamp': 1504010302,
         'pageid': 45678},
        {'user': 1234,
         'timestamp': 1504016546,
         'pageid': 78908},
        {'user': 1234,
         'timestamp': 1506691286,
         'pageid': 23208},
        {'user': 4567,
         'timestamp': 1529577322,
         'pageid': 789323},
        {'user': 4567,
         'timestamp': 1532173522,
         'pageid': 1094738},
        {'user': 4567,
         'timestamp': 1532190922,
         'pageid': 565980}]
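d = pd.DataFrame(rows)
# Flag rows that come more than 24 hours after the same user's previous row.
# raw=True passes NumPy arrays, so x[0] and x[1] are positional.
# Assigning .values assumes the rows are already sorted by user and timestamp.
d["time_diff"] = d.groupby("user")["timestamp"]\
                  .rolling(2).apply(lambda x: x[1] - x[0] > 24 * 3600, raw=True)\
                  .fillna(0)\
                  .values
# The running count of flagged rows per user gives a 1-based session number
d["session_index"] = d.groupby("user")["time_diff"].cumsum()\
                      .astype(int) + 1
d.drop("time_diff", axis=1, inplace=True)
d = d.set_index(['user', 'session_index'])
# Select the columns explicitly so each pair is always [timestamp, pageid],
# whatever column order the DataFrame constructor produced
d[["timestamp", "pageid"]].apply(lambda x: list(x), axis=1)\
    .groupby(level=0)\
    .agg(lambda x: list(x.values))\
    .to_dict()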
Result:
{1234: [[1504010302, 45678], [1504016546, 78908], [1506691286, 23208]],
4567: [[1529577322, 789323], [1532173522, 1094738], [1532190922, 565980]]}
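Note that this result holds each user's rows in one flat list, without the per-session nesting shown in the question. The following is a minimal sketch of one way to get that nesting, not part of the original answer. It assumes the same rule (a new session starts when more than 24 hours pass since the user's previous row) and swaps in `groupby(...).diff()` for the rolling apply. The names `sessions_by_user` and `gap_seconds` are hypothetical.

```python
import pandas as pd

rows = [
    {"user": 1234, "timestamp": 1504010302, "pageid": 45678},
    {"user": 1234, "timestamp": 1504016546, "pageid": 78908},
    {"user": 1234, "timestamp": 1506691286, "pageid": 23208},
    {"user": 4567, "timestamp": 1529577322, "pageid": 789323},
    {"user": 4567, "timestamp": 1532173522, "pageid": 1094738},
    {"user": 4567, "timestamp": 1532190922, "pageid": 565980},
]


def sessions_by_user(rows, gap_seconds=24 * 3600):
    """Hypothetical helper: nest (timestamp, pageid) tuples by user, then by session."""
    d = pd.DataFrame(rows).sort_values(["user", "timestamp"])
    # True where the gap to the same user's previous row exceeds gap_seconds;
    # the first row of each user has a NaN gap, which compares as False.
    new_session = d.groupby("user")["timestamp"].diff() > gap_seconds
    # Running count of session starts per user, shifted to start at 1
    d["session_index"] = new_session.astype(int).groupby(d["user"]).cumsum() + 1
    nested = {}
    for (user, _session), group in d.groupby(["user", "session_index"]):
        pairs = [(int(t), int(p)) for t, p in zip(group["timestamp"], group["pageid"])]
        nested.setdefault(int(user), []).append(pairs)
    return nested


nested = sessions_by_user(rows)
print(nested)
```

Grouping by both keys and appending one list per group is what produces the inner per-session lists. Grouping on the user level alone, as in the code above, concatenates all sessions of a user.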