Compare DataFrame groups by index row value against the previous row's value

Problem description

I am wondering whether there is a simple solution to the following problem. Take the following setup:

import datetime
import pandas

data = [
    {"date": datetime.date(2020, 1, 1), "ticker": "ticker-1", "internal_id": "T1", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 5), "ticker": "ticker-1", "internal_id": "T1", "score_1": 20.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 8), "ticker": "ticker-1", "internal_id": "T1", "score_1": 20.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 10), "ticker": "ticker-1", "internal_id": "T1-A", "score_1": 10.0, "score_2": 30.0},

    {"date": datetime.date(2020, 1, 2), "ticker": "ticker-2", "internal_id": "T2", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 4), "ticker": "ticker-2", "internal_id": "T2", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 9), "ticker": "ticker-2", "internal_id": "T2", "score_1": 30.0, "score_2": 20.0},
]

df = pandas.DataFrame(data)
df = df.set_index(["date", "ticker"])
df['product'] = df.index.get_level_values('ticker')
df['date'] = df.index.get_level_values('date')

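I need to compare the value of certain columns (internal_id, score_1, score_2) in each row with the value in the previous row for the same ticker. If the value differs from the previous row's value, output it; otherwise show None/NaN.

For example, following the setup above, this is the output I want: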

output = [
    {"date": datetime.date(2020, 1, 1), "ticker": "ticker-1", "internal_id": "T1", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 5), "ticker": "ticker-1", "internal_id": None, "score_1": 20.0, "score_2": None},
    {"date": datetime.date(2020, 1, 8), "ticker": "ticker-1", "internal_id": None, "score_1": None, "score_2": None},
    {"date": datetime.date(2020, 1, 10), "ticker": "ticker-1", "internal_id": "T1-A", "score_1": 10.0, "score_2": 30.0},

    {"date": datetime.date(2020, 1, 2), "ticker": "ticker-2", "internal_id": "T2", "score_1": None, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 4), "ticker": "ticker-2", "internal_id": None, "score_1": None, "score_2": None},
    {"date": datetime.date(2020, 1, 9), "ticker": "ticker-2", "internal_id": None, "score_1": 30.0, "score_2": None},
]

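As you can see, I need to group by ticker and then compare each row against the column values at the previous date. This needs to work for strings as well as integers/floats.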

Tags: python, pandas, dataframe, pandas-groupby, data-science

Solution

Use DataFrame.mask, with the comparison built from DataFrameGroupBy.shift and DataFrame.eq:

df = pandas.DataFrame(data)

df = df.mask(df.groupby('ticker').shift().eq(df))
print(df)
         date    ticker internal_id  score_1  score_2
0  2020-01-01  ticker-1          T1     10.0     20.0
1  2020-01-05  ticker-1         NaN     20.0      NaN
2  2020-01-08  ticker-1         NaN      NaN      NaN
3  2020-01-10  ticker-1        T1-A     10.0     30.0
4  2020-01-02  ticker-2          T2     10.0     20.0
5  2020-01-04  ticker-2         NaN      NaN      NaN
6  2020-01-09  ticker-2         NaN     30.0      NaN
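
The one-liner above can be unpacked into its three steps. The following is a minimal sketch, not part of the original answer. It assumes the comparison should be limited to the three columns named in the question (the `cols` list is my own choice), and that rows should be sorted by ticker and date first, so that "previous row" really means "previous date". The names `previous`, `unchanged` and `result` are illustrative only.

```python
import datetime
import pandas as pd

data = [
    {"date": datetime.date(2020, 1, 1), "ticker": "ticker-1", "internal_id": "T1", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 5), "ticker": "ticker-1", "internal_id": "T1", "score_1": 20.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 8), "ticker": "ticker-1", "internal_id": "T1", "score_1": 20.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 10), "ticker": "ticker-1", "internal_id": "T1-A", "score_1": 10.0, "score_2": 30.0},
    {"date": datetime.date(2020, 1, 2), "ticker": "ticker-2", "internal_id": "T2", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 4), "ticker": "ticker-2", "internal_id": "T2", "score_1": 10.0, "score_2": 20.0},
    {"date": datetime.date(2020, 1, 9), "ticker": "ticker-2", "internal_id": "T2", "score_1": 30.0, "score_2": 20.0},
]

df = pd.DataFrame(data)

# Assumption: only these columns are compared; date and ticker stay untouched.
cols = ["internal_id", "score_1", "score_2"]

# Assumption: sort so the previous row within a ticker is the previous date.
df = df.sort_values(["ticker", "date"]).reset_index(drop=True)

# Step 1: the previous row's values within each ticker group.
# The first row of every group has no predecessor, so it gets NaN.
previous = df.groupby("ticker")[cols].shift()

# Step 2: True where a value equals the previous row's value.
# A comparison against NaN is False, so first rows are never flagged.
unchanged = df[cols].eq(previous)

# Step 3: replace the unchanged values with NaN and keep everything else.
result = df.copy()
result[cols] = df[cols].mask(unchanged)

print(result)
```

Because each step yields an ordinary DataFrame, `previous` and `unchanged` can be printed on their own when the result looks wrong. One caveat of this approach: a value that was already missing in the input cannot be told apart from an unchanged value in the output.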
