python - 根据用户名确定两个状态之间的时间差
问题描述
我正在尝试计算应用程序上用户名状态从原始安装/重新安装到卸载(反之亦然)之间的时间。我已经用 Python 编程了大约一年左右,所以我的函数有时速度较慢或不是最优的;我对这个任务有类似的问题。
我使用以下循环来计算状态之间的时间差。主要问题是这个函数非常慢(对于我正在处理的 700k 行数据),并且它不会返回所有时间间隔。
# Import modules
import pandas as pd
import numpy as np
# Load sample data
data = pd.read_csv('username_status_dates.csv').drop(columns='index')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%M-%d')
# Create dataframe to record time intervals
customerStatuses_df = pd.DataFrame(columns=['username', 'status1', 'status2', 'timeToUninstall', 'timeToReinstall'])
# Iterate over data, calculate time intervals (by username), and add to customerStatuses
for user in data:
# Get user sub data
user_df = data.loc[ customerStatuses_df['Username'] == user ]
dates = list(user_df['Date'])
status = list(user_df['Status'])
for val in np.arange(0, len(status) - 1):
# If status in installed (original or re)
if (status[val] == 'Original Install') or (status[val] == 'Reinstall'):
# If next status is uninstall (un)
if status[val + 1] == 'Uninstall':
# Calculate time interval between statuses
d_unin = dates[val + 1] - dates[val]
# Add to customerStatuses
customerStatuses_df = customerStatuses_df.append({'username': user, 'status1': status[val], 'status2': status[val+1], 'timeToUninstall': d_unin}, ignore_index=True)
# If status in not installed (un)
elif (status[val] == 'Uninstall'):
# If next status is installed (re)
if status[val + 1] == 'Reinstall':
# Calculate time interval between statuses
d_rein = dates[val + 1] - dates[val]
# Add to customerStatuses
customerStatuses_df = customerStatuses_df.append({'username': user, 'status1': status[val], 'status2': status[val+1], 'timeToReinstall': d_rein}, ignore_index=True)
样本数据: username_status_dates.csv
Username Status Date
0 Joey74 Original Install 2019-01-20 00:09:00
1 Sam22 Original Install 2019-01-15 00:10:00
2 Chloe35 Uninstall 2019-01-30 00:10:00
3 Sam22 Uninstall 2019-01-01 00:11:00
4 Zoe44 Reinstall 2019-01-10 00:12:00
5 Joey74 Reinstall 2019-01-20 00:12:00
6 Shelby99 Original Install 2020-01-07 00:01:00
7 Joey74 Uninstall 2020-01-05 00:01:00
8 Joey74 Reinstall 2020-01-30 00:01:00
9 Susane11 Reinstall 2020-01-04 00:02:00
函数结果: customerStatuses_df
username status1 status2 timeToUninstall timeToReinstall
0 Joey74 Reinstall Uninstall 349 days 23:49:00 NaN
1 Joey74 Uninstall Reinstall NaT 25 days 00:00:00
2 Sam22 Original Install Uninstall -14 days +00:01:00 NaN
我在这里错过了什么?
解决方案
我和一位同事意识到这个循环的核心问题是它试图迭代一个数据帧本身,与其他组合数据帧的方法相比,它的速度非常慢。这部分解释了为什么我的函数(上图)在使用更大的数据集时如此缓慢。我们还发现,使用矢量化数字比使用 Date 列的 timedelta 格式要快得多。
我们找到了一个很棒的StackOverflow 源,它展示了如何将数据帧转换(或内爆)成字典,以取代对数据帧本身进行迭代的需要。为了回答上面的问题,我按用户名组织数据,对 Date 列进行矢量化,并将数据帧转换为字典,然后计算状态变化之间的时间:
# Sort by username, then date
customerStatuses_df = customerStatuses_df.sort_values(by=['Username', 'Date'])
# Vectorize the Date column with .diff()
vectorized_time = customerStatuses_df['Date'].diff()
customerStatuses_df['vectorizedTime'] = pd.Series(vectorized_time)
# Implode the dataframe by making it a dictionary
customerStatuses_dict = customerStatuses_df.to_dict('index')
# Create lists to form a final dataframe
username = []
status_1 = []
status_2 = []
time_change = []
# Track timing of function
from tqdm import tqdm
# Determine difference between states, by user
imploded_dict = {}
for key, value in tqdm(list(customerStatuses_dict.items())):
# Grab all install Statuses and vectorizedTime attributed to the Username 'key'
try:
# Python prefers 'ask forgiveness, not permission' -> attempt to append to list
imploded_dict[value['Username']].append((value['Status'], value['vectorizedTime']))
except KeyError: # but if list does not exist, create the list
imploded_dict[value['Username']] = [(value['Status'], value['vectorizedTime'])]
# Using the now-imploded data, calculate time change between states
for user_id, status_tuple_list in tqdm(list(imploded_dict.items())):
# Iterate key value pairs in dictionary
for idx, (status, time_delta) in enumerate(status_tuple_list):
if idx == 0:
continue
# Add to lists for final dataframe
username.append(user_id)
status_1.append(status_tuple_list[idx - 1][0])
status_2.append(status_tuple_list[idx][0])
time_change.append(time_delta)
# Create dataframe to hold time changes
results_df = pd.DataFrame(list(zip(username, status_1, status_2, time_change)), columns=['username', 'status1', 'status2', 'timeToChange'])
结果_df
username status1 status2 timeToChange
0 Joey74 Original Install Reinstall 0 days 00:03:00
1 Joey74 Reinstall Uninstall 349 days 23:49:00
2 Joey74 Uninstall Reinstall 25 days 00:00:00
3 Sam22 Uninstall Original Install 13 days 23:59:00
推荐阅读
- javascript - Material UI – 全局复选框焦点样式(非本地)
- c# - Reading Json file, and putting information into an array c#
- networking - TCP如何处理三次握手期间的数据包丢失?
- sql - 如何按键提取字段中的jsonb列
- angular - 使用renderer2 Angular在按键时将字符包裹在span元素中
- excel-formula - Multiple COUNTIFS
- json - VBA compile error: Variable not defined ( JSON )
- angular - How to unit test a method that updates a filter in Angular?
- javascript - How to register Events using Firebase Analytics
- javascript - .then() 的 Javascript/ReactJS 执行执行得太早