How to tell whether a record in a pandas DataFrame has been modified or is a new record

Question

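I have a dataframe that can receive new rows, and I need to know whether each incoming row is a modification of an existing record or a brand-new record.

For example, the input dataframe: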

A   B   Population  Start                End                  timestamp
A1  B1  100         2021-05-15 00:00:00  2021-06-30 00:00:00  2021-07-06 00:00:00
A1  B1  250         2021-05-30 00:00:00  2021-06-02 00:00:00  2021-06-06 00:00:00
A2  B3  350         2021-05-10 00:00:00  2021-05-12 00:00:00  2021-07-06 00:00:00
A2  B4  125         2021-06-02 00:00:00  2021-06-04 00:00:00  2021-07-06 00:00:00

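We can see that the first two rows describe the same record (A1, B1): the row with the later timestamp is a modification of the other, with different dates and a different Population value.

Expected output: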

A   B   Population  Population_prev  Start                Start_prev           End                  End_prev             type  timestamp
A1  B1  100         250              2021-05-15 00:00:00  2021-05-30 00:00:00  2021-06-30 00:00:00  2021-06-02 00:00:00  Mod   2021-07-06 00:00:00
A2  B3  350                          2021-05-10 00:00:00                       2021-05-12 00:00:00                       New   2021-07-06 00:00:00
A2  B4  125                          2021-06-02 00:00:00                       2021-06-04 00:00:00                       New   2021-07-06 00:00:00

Thanks!

Tags: python, pandas, dataframe, date, datetime

Answer
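To follow along, the question's input can be rebuilt roughly as shown here. This is a minimal sketch: the column names `A`, `B`, `Population`, `Start`, `End` and `timestamp` are taken from the code in this answer, and parsing the dates with `pd.to_datetime` is an assumption (the question does not say whether they are stored as strings or datetimes).

```python
import pandas as pd

# Rows copied from the question. Parsing the dates is an assumption:
# the original data may hold them as plain strings instead.
df = pd.DataFrame(
    {
        "A": ["A1", "A1", "A2", "A2"],
        "B": ["B1", "B1", "B3", "B4"],
        "Population": [100, 250, 350, 125],
        "Start": pd.to_datetime(["2021-05-15", "2021-05-30", "2021-05-10", "2021-06-02"]),
        "End": pd.to_datetime(["2021-06-30", "2021-06-02", "2021-05-12", "2021-06-04"]),
        "timestamp": pd.to_datetime(["2021-07-06", "2021-06-06", "2021-07-06", "2021-07-06"]),
    }
)
```

With datetime columns, missing previous values show up as `NaT` rather than the `NaN` seen in the printed outputs.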


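So, if you sort by timestamp and use groupby on the columns that define a unique row, you can get all the information you need. Use last to get the last row of each group, and nth to get the second-to-last row: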

>>> groups = df.sort_values('timestamp').groupby(['A', 'B'])
>>> groups.last()
         Population                 Start                   End            timestamp
A   B                                                                               
A1  B1          100  2021-05-15 00:00:00   2021-06-30 00:00:00   2021-07-06 00:00:00
A2  B3          350  2021-05-10 00:00:00   2021-05-12 00:00:00   2021-07-06 00:00:00
    B4          125  2021-06-02 00:00:00   2021-06-04 00:00:00   2021-07-06 00:00:00
>>> groups.nth(-2)
         Population                 Start                   End            timestamp
A   B
A1  B1          250  2021-05-30 00:00:00   2021-06-02 00:00:00   2021-06-06 00:00:00

Now all of these dataframes are indexed on columns A and B, so you can simply join them with a suffix, reset the index, and you're done:

>>> mod = groups.last().join(groups.nth(-2), rsuffix='_prev').reset_index()
>>> mod
     A    B  Population                 Start                   End            timestamp  Population_prev            Start_prev              End_prev       timestamp_prev
0  A1   B1          100  2021-05-15 00:00:00   2021-06-30 00:00:00   2021-07-06 00:00:00            250.0  2021-05-30 00:00:00   2021-06-02 00:00:00   2021-06-06 00:00:00
1  A2   B3          350  2021-05-10 00:00:00   2021-05-12 00:00:00   2021-07-06 00:00:00              NaN                   NaN                   NaN                  NaN
2  A2   B4          125  2021-06-02 00:00:00   2021-06-04 00:00:00   2021-07-06 00:00:00              NaN                   NaN                   NaN                  NaN
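One caveat: as far as I know, since pandas 2.0 `nth` acts as a filter and returns rows under their original index instead of the group keys, so the join above may not line up on recent versions. Below is a minimal sketch of a variant that should not depend on that behaviour. It swaps `last`/`nth` for `groupby(...).shift(1)` plus `drop_duplicates`, and assumes the same column names and parsed dates:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["A1", "A1", "A2", "A2"],
        "B": ["B1", "B1", "B3", "B4"],
        "Population": [100, 250, 350, 125],
        "Start": pd.to_datetime(["2021-05-15", "2021-05-30", "2021-05-10", "2021-06-02"]),
        "End": pd.to_datetime(["2021-06-30", "2021-06-02", "2021-05-12", "2021-06-04"]),
        "timestamp": pd.to_datetime(["2021-07-06", "2021-06-06", "2021-07-06", "2021-07-06"]),
    }
)

keys = ["A", "B"]
value_cols = ["Population", "Start", "End", "timestamp"]

ordered = df.sort_values("timestamp")
# Previous version of each record: shift the values down one row within each (A, B) group.
prev = ordered.groupby(keys)[value_cols].shift(1).add_suffix("_prev")
# Attach the previous values, then keep only the newest row per key.
mod = (
    ordered.join(prev)
    .drop_duplicates(subset=keys, keep="last")
    .sort_values(keys)
    .reset_index(drop=True)
)
```

`mod` then has the same columns as the joined frame above, with missing `_prev` values for keys that only appear once.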

Then a few finishing touches to make it look like your expected output:

>>> col_order = [
...     *df.columns[:2],
...     *(new_col for col in df.columns[2:-1] for new_col in [col, f'{col}_prev']),
...     'type', 'timestamp'
... ]
>>> row_type = mod['timestamp_prev'].isna().map({True: 'New', False: 'Mod'})
>>> mod.join(row_type.rename('type')).reindex(col_order, axis='columns')
     A    B  Population  Population_prev                 Start            Start_prev                   End              End_prev type            timestamp
0  A1   B1          100            250.0  2021-05-15 00:00:00   2021-05-30 00:00:00   2021-06-30 00:00:00   2021-06-02 00:00:00   Mod  2021-07-06 00:00:00
1  A2   B3          350              NaN  2021-05-10 00:00:00                    NaN  2021-05-12 00:00:00                    NaN  New  2021-07-06 00:00:00
2  A2   B4          125              NaN  2021-06-02 00:00:00                    NaN  2021-06-04 00:00:00                    NaN  New  2021-07-06 00:00:00
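The `type` label can also be cross-checked without the join, by counting how many versions each key has. This is a small sketch of a different technique than the one used above, under the assumption that any key seen more than once counts as modified:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["A1", "A1", "A2", "A2"],
        "B": ["B1", "B1", "B3", "B4"],
        "timestamp": pd.to_datetime(["2021-07-06", "2021-06-06", "2021-07-06", "2021-07-06"]),
    }
)

# Number of stored versions per (A, B) key.
versions = df.groupby(["A", "B"]).size()
# A key seen more than once has been modified; a key seen once is new.
row_type = versions.gt(1).map({True: "Mod", False: "New"}).rename("type")
```

`row_type` is indexed by `(A, B)`, so it can be joined onto any frame that uses the same index.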

Another technique, which works for any number of repeated values, is to use pivot. Let's use the same groupby, but with cumcount() to define the order of the columns:

>>> num = df.sort_values('timestamp').groupby(['A', 'B']).cumcount().rename('num')
>>> num
1    0
0    1
2    0
3    0
Name: num, dtype: int64
>>> pvt = df.join(num).pivot(index=['A', 'B'], columns='num', values=['Population', 'Start', 'End'])
>>> pvt
      Population                     Start                                       End                     
num            0    1                    0                    1                    0                    1
A  B                                                                                                     
A1 B1        250  100  2021-05-30 00:00:00  2021-05-15 00:00:00  2021-06-02 00:00:00  2021-06-30 00:00:00
A2 B3        350  NaN  2021-05-10 00:00:00                  NaN  2021-05-12 00:00:00                  NaN
   B4        125  NaN  2021-06-02 00:00:00                  NaN  2021-06-04 00:00:00                  NaN

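As you can see, this puts all the values you need side by side, but with a MultiIndex on the columns. Let's flatten it into ordinary columns and we're done: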

>>> pvt.columns = [f'{col}_prev{n if n > 1 else ""}' if n > 0 else col for col, n in pvt.columns]
>>> pvt.reset_index()
    A   B Population Population_prev                Start           Start_prev                  End             End_prev
0  A1  B1        250             100  2021-05-30 00:00:00  2021-05-15 00:00:00  2021-06-02 00:00:00  2021-06-30 00:00:00
1  A2  B3        350             NaN  2021-05-10 00:00:00                  NaN  2021-05-12 00:00:00                  NaN
2  A2  B4        125             NaN  2021-06-02 00:00:00                  NaN  2021-06-04 00:00:00                  NaN
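One thing to watch in the pivot above: with an ascending sort, num 0 is the oldest row, so the unsuffixed columns hold the old values (Population 250) and the `_prev` columns hold the newest ones (100), the reverse of the expected output. Here is a minimal sketch, assuming the same column names and parsed dates, that sorts descending so that num 0 is the newest version:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["A1", "A1", "A2", "A2"],
        "B": ["B1", "B1", "B3", "B4"],
        "Population": [100, 250, 350, 125],
        "Start": pd.to_datetime(["2021-05-15", "2021-05-30", "2021-05-10", "2021-06-02"]),
        "End": pd.to_datetime(["2021-06-30", "2021-06-02", "2021-05-12", "2021-06-04"]),
        "timestamp": pd.to_datetime(["2021-07-06", "2021-06-06", "2021-07-06", "2021-07-06"]),
    }
)

keys = ["A", "B"]
# Number the versions of each key from newest (0) to oldest.
num = (
    df.sort_values("timestamp", ascending=False)
    .groupby(keys)
    .cumcount()
    .rename("num")
)
pvt = df.join(num).pivot(index=keys, columns="num", values=["Population", "Start", "End"])
# Flatten the (column, num) MultiIndex: num 0 keeps the plain name,
# num 1 becomes "<col>_prev", num 2 "<col>_prev2", and so on.
pvt.columns = [f"{col}_prev{n if n > 1 else ''}" if n > 0 else col for col, n in pvt.columns]
result = pvt.reset_index()
```

Passing a list as `index` to `pivot` needs a reasonably recent pandas (1.1 or later, as far as I know).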
