Convert duplicates to NaN, keeping only the last occurrence

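Problem description

I'm trying to create a new column, last_visit_idx, that shows the index row number of each visitor's last visit. However, I want the value to appear only in the row where that visitor's last visit actually occurs. Here is what I have so far: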

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2017-07-02 09:00:00', '2017-07-03 15:00:00', '2018-04-05 15:00:00',
                            '2018-12-20 11:00:00', '2019-01-06 14:00:00', '2020-09-06 17:00:00']})
df['date'] = pd.to_datetime(df['date'])
df['visitor'] = ['Dave', 'Dave', 'Dave', 'Peter', 'Peter', 'rob']

# Position just past each visitor's last row (relies on visitor being sorted)
df['last_visit_idx'] = np.searchsorted(df.visitor, df.visitor, side='right')

# Blank out repeated (last_visit_idx, visitor) pairs
df.loc[df.duplicated(['last_visit_idx', 'visitor']), 'last_visit_idx'] = np.nan

# Shift the remaining values to zero-based row numbers
df['last_visit_idx'] = np.where(df['last_visit_idx'] > 0, pd.to_numeric(df['last_visit_idx']) - 1, np.nan)

The current code produces the following:

                 date visitor  last_visit_idx
0 2017-07-02 09:00:00    Dave             2.0
1 2017-07-03 15:00:00    Dave             NaN
2 2018-04-05 15:00:00    Dave             NaN
3 2018-12-20 11:00:00   Peter             4.0
4 2019-01-06 14:00:00   Peter             NaN
5 2020-09-06 17:00:00     rob             5.0

The goal is to get this instead:

                 date visitor  last_visit_idx
0 2017-07-02 09:00:00    Dave             NaN
1 2017-07-03 15:00:00    Dave             NaN
2 2018-04-05 15:00:00    Dave             2.0
3 2018-12-20 11:00:00   Peter             NaN
4 2019-01-06 14:00:00   Peter             4.0
5 2020-09-06 17:00:00     rob             5.0

Any guidance would be much appreciated.

Tags: python, indexing, duplicates

Solution
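
One likely cause of the output in the question is that df.duplicated defaults to keep='first', so the first row of each visitor keeps its value and the later rows are blanked. A minimal sketch of one possible fix is below. It assumes the rows are already sorted by date, so that the last occurrence of a visitor is also the most recent visit, and that the frame uses the default integer index, so that index labels equal row numbers. The name is_earlier_visit is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-07-02 09:00:00", "2017-07-03 15:00:00", "2018-04-05 15:00:00",
        "2018-12-20 11:00:00", "2019-01-06 14:00:00", "2020-09-06 17:00:00",
    ]),
    "visitor": ["Dave", "Dave", "Dave", "Peter", "Peter", "rob"],
})

# True for every row except the last occurrence of each visitor.
# Assumption: rows are sorted by date, so "last occurrence" is the latest visit.
is_earlier_visit = df.duplicated("visitor", keep="last")

# Put the row's own index on each visitor's last row and NaN everywhere else.
# Assumption: default RangeIndex, so the index label is the row number.
df["last_visit_idx"] = np.where(is_earlier_visit, np.nan, df.index.to_numpy())

print(df)
```

Because the column has to hold NaN, it comes out as float rather than int. If the frame is not sorted by date, sorting it first with df.sort_values('date') would be needed for keep='last' to pick the most recent visit.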

