Comparing within groups in Python pandas

Problem description

I have a data frame like this:

import pandas as pd

df_test = pd.DataFrame({
    "ID": [912665, 455378, 938724, 557830],
    "Company Name": ["112 ", "112 ", "SSS", "SSS"],
    "Date": ['2018-09-02 00:00:00', '2019-02-27 00:00:00', '2019-05-05 00:00:00', '2018-03-21 00:00:00'],
    "Type": ['Type1', 'Type2', 'Type1', 'Type2'],
    "ngroup": [0, 0, 1, 1],
})

df_test

I need to make comparisons within each 'ngroup' (0, 1, ...) by date (or by any other column, if needed).

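In this example I have groups 0 and 1, identified by ngroup, and each group has only two rows. The company types are called Type1 and Type2. I need to check whether Type1's date is earlier than Type2's. If it is, I need to report, for example, "Type 1 joined first"; if not, "Type 2 joined first".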

After that, I also want to add this to my initial data frame as a new column, status.

UPD: so my expected result would look like this:

df_test_expected_result = pd.DataFrame({
    "ID": [912665, 455378, 938724, 557830],
    "Company Name": ["112 ", "112 ", "SSS", "SSS"],
    "Date": ['2018-09-02 00:00:00', '2019-02-27 00:00:00', '2019-05-05 00:00:00', '2018-03-21 00:00:00'],
    "Type": ['Type1', 'Type2', 'Type1', 'Type2'],
    "ngroup": [0, 0, 1, 1],
    "expected_result": ["Type 1 joined first", "Type 1 joined first", "Type 2 joined first", "Type 2 joined first"],
})
df_test_expected_result

What is the best way to achieve this result?

Tags: python, pandas

Solution


IIUC, we need a reference value for each group to test every row against; here that is the earliest date in the group.

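import numpy as np

# Parse the strings as datetimes so the comparison is chronological
df_test["Date"] = pd.to_datetime(df_test["Date"])

# Earliest date in each group, broadcast back to every row
bool_comp = df_test.groupby(['ngroup'])['Date'].transform('min')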

df_test["res"] = np.where(
    df_test["Date"] <= bool_comp,
    df_test["Type"] + " Joined First",
    df_test["Type"] + " Joined Later",
)

print(df_test)

       ID Company Name       Date   Type  ngroup                 res
0  912665         112  2018-09-02  Type1       0  Type1 Joined First
1  455378         112  2019-02-27  Type2       0  Type2 Joined Later
2  938724          SSS 2019-05-05  Type1       1  Type1 Joined Later
3  557830          SSS 2018-03-21  Type2       1  Type2 Joined First
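
To see why the comparison against bool_comp works row by row, it may help to contrast transform('min') with a plain aggregation. The following is only a minimal sketch on a trimmed-down copy of the data; the names df, per_group, per_row and is_earliest are illustrative and not part of the original answer.

```python
import pandas as pd

# Trimmed-down copy of the example data (only the columns needed here)
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2018-09-02", "2019-02-27", "2019-05-05", "2018-03-21"]
    ),
    "ngroup": [0, 0, 1, 1],
})

# agg collapses each group to a single value: one row per ngroup
per_group = df.groupby("ngroup")["Date"].agg("min")

# transform computes the same per-group value but repeats it on every
# original row, so the result shares the index of df
per_row = df.groupby("ngroup")["Date"].transform("min")

# Because the indexes match, a row-wise comparison is possible
is_earliest = df["Date"] <= per_row

print(per_group)
print(per_row)
print(is_earliest.tolist())
```

Since per_row has the same length and index as the data frame, it can be passed straight into np.where as the condition's right-hand side.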

Edit: I just saw your expected output. We can apply your first condition, then forward-fill and back-fill within each group.

df_test["res"] = np.where(
    df_test["Date"] <= bool_comp,
    df_test["Type"] + " Joined First",
    np.nan
)

# Fill in both directions inside each group so values never leak across groups
df_test["res"] = df_test.groupby("ngroup")["res"].transform(lambda s: s.ffill().bfill())

       ID Company Name       Date   Type  ngroup                 res
0  912665         112  2018-09-02  Type1       0  Type1 Joined First
1  455378         112  2019-02-27  Type2       0  Type1 Joined First
2  938724          SSS 2019-05-05  Type1       1  Type2 Joined First
3  557830          SSS 2018-03-21  Type2       1  Type2 Joined First
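
As a possible alternative to the fill-based approach, the type of the earliest row in each group can be looked up directly and mapped back onto every row. This is a sketch rather than the answer's own method: it assumes each group has at least one non-null date, and the names first_idx, first_type and the status column are chosen here for illustration.

```python
import pandas as pd

df_test = pd.DataFrame({
    "ID": [912665, 455378, 938724, 557830],
    "Company Name": ["112 ", "112 ", "SSS", "SSS"],
    "Date": ["2018-09-02 00:00:00", "2019-02-27 00:00:00",
             "2019-05-05 00:00:00", "2018-03-21 00:00:00"],
    "Type": ["Type1", "Type2", "Type1", "Type2"],
    "ngroup": [0, 0, 1, 1],
})
df_test["Date"] = pd.to_datetime(df_test["Date"])

# Index label of the row holding the earliest date in each group
first_idx = df_test.groupby("ngroup")["Date"].idxmin()

# Type of that earliest row, keyed by ngroup
first_type = df_test.loc[first_idx].set_index("ngroup")["Type"]

# Map the per-group winner back onto every row of the original frame
df_test["status"] = df_test["ngroup"].map(first_type) + " joined first"

print(df_test[["ngroup", "Type", "Date", "status"]])
```

This variant does not depend on the row order inside a group and also handles groups with more than two rows, since it only asks which row holds the minimum date.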
