Pandas - best way to derive a new column by checking a condition against other rows of the same DataFrame

Problem description

Here is my data:

import pandas as pd

df = pd.DataFrame()
df["GroupType"] = ["Group_1","Group_2","Group_3","Group_1","Group_2","Group_3",
                   "Group_1","Group_2","Group_3","Group_1","Group_4","Group_4","Group_4"]
df["ID_1"] = ["10","10","10",
              "10","10","10",
              "10","10", "10",
              "12","12","12","13"]
df["ID_2"] = [pd.NA,"100",pd.NA,
              pd.NA,"200",pd.NA,
              pd.NA,"300",pd.NA,
              "400","400","400",pd.NA]
df["Price"] = [1,2,3,4,5,6,7,8,9,10,11,12,13]
df
   GroupType ID_1  ID_2  Price
0    Group_1   10  <NA>      1
1    Group_2   10   100      2
2    Group_3   10  <NA>      3
3    Group_1   10  <NA>      4
4    Group_2   10   200      5
5    Group_3   10  <NA>      6
6    Group_1   10  <NA>      7
7    Group_2   10   300      8
8    Group_3   10  <NA>      9
9    Group_1   12   400     10
10   Group_4   12   400     11
11   Group_4   12   400     12
12   Group_4   13  <NA>     13

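The requirement is to derive an "Actual_Price" column. The logic: if 'GroupType' == 'Group_4', take the 'Price' from the matching Group_1 row that has the same 'ID_2'. For example, row 9 above is the matching "Group_1" record for rows 10 and 11. If 'GroupType' != 'Group_4', simply copy the row's own 'Price'. A Group_4 row with no matching Group_1 row gets pd.NA.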

Expected DataFrame:

   GroupType ID_1  ID_2  Price  Actual_Price
0    Group_1   10  <NA>      1             1
1    Group_2   10   100      2             2
2    Group_3   10  <NA>      3             3
3    Group_1   10  <NA>      4             4
4    Group_2   10   200      5             5
5    Group_3   10  <NA>      6             6
6    Group_1   10  <NA>      7             7
7    Group_2   10   300      8             8
8    Group_3   10  <NA>      9             9
9    Group_1   12   400     10            10
10   Group_4   12   400     11            10
11   Group_4   12   400     12            10
12   Group_4   13  <NA>     13          <NA>

My solution:

def get_linked_actual_price(
        x: pd.Series, source_df: pd.DataFrame
):
    mask = (source_df["ID_2"]== x["ID_2"]) & (source_df["GroupType"]== "Group_1")
    if not mask.any():
        return pd.NA
    source_df = source_df.loc[mask]
    if len(source_df.index) > 1:
        source_df = source_df[:1]
    result_series = source_df.squeeze()
    return result_series.get("Price", pd.NA)

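# Rows that are not Group_4 keep their own Price.
group_4_mask = df["GroupType"] == "Group_4"
df.loc[~group_4_mask, "Actual_Price"] = df.loc[~group_4_mask, "Price"]
# Group_4 rows look up the Price of the matching Group_1 row, one row at a time.
df.loc[group_4_mask, "Actual_Price"] = df.loc[group_4_mask].apply(get_linked_actual_price, axis=1, source_df=df)
print(df)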

My concern is that this is not performant, because apply runs the lookup function once per row. Is there a better solution in terms of performance?

Tags: python, pandas, dataframe

Solution


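Here is how I would approach this problem:

  • Mask the values in Price where GroupType is not equal to Group_1, then group the masked column by ID_2 and transform using first. This broadcasts the first Group_1 price of each ID_2 to every row with that ID_2.
  • Replace the Price values where GroupType is Group_4 with the values obtained in step 1.
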
m = df['GroupType'].eq('Group_1')
s = df['Price'].where(m).groupby(df['ID_2']).transform('first')
df['Actual_price'] = df['Price'].mask(df['GroupType'].eq('Group_4'), s)

   GroupType ID_1  ID_2  Price  Actual_price
0    Group_1   10  <NA>      1           1.0
1    Group_2   10   100      2           2.0
2    Group_3   10  <NA>      3           3.0
3    Group_1   10  <NA>      4           4.0
4    Group_2   10   200      5           5.0
5    Group_3   10  <NA>      6           6.0
6    Group_1   10  <NA>      7           7.0
7    Group_2   10   300      8           8.0
8    Group_3   10  <NA>      9           9.0
9    Group_1   12   400     10          10.0
10   Group_4   12   400     11          10.0
11   Group_4   12   400     12          10.0
12   Group_4   13  <NA>     13           NaN
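
For comparison, the same result can probably be reached with an explicit lookup table instead of groupby/transform. The following is only a sketch under the assumptions of the sample data: the names group_1, lookup and linked are illustrative, and when several Group_1 rows share an ID_2 the first one is kept (mirroring transform('first')). It swaps in Series.map for the groupby step and is not the approach from the answer above.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "GroupType": ["Group_1", "Group_2", "Group_3", "Group_1", "Group_2", "Group_3",
                  "Group_1", "Group_2", "Group_3", "Group_1", "Group_4", "Group_4", "Group_4"],
    "ID_1": ["10"] * 9 + ["12", "12", "12", "13"],
    "ID_2": [pd.NA, "100", pd.NA, pd.NA, "200", pd.NA,
             pd.NA, "300", pd.NA, "400", "400", "400", pd.NA],
    "Price": list(range(1, 14)),
})

# Lookup table (illustrative name): ID_2 -> Price of the first Group_1 row with that ID_2.
# Group_1 rows without an ID_2 cannot be matched, so they are left out.
group_1 = df[df["GroupType"].eq("Group_1") & df["ID_2"].notna()]
lookup = group_1.drop_duplicates("ID_2").set_index("ID_2")["Price"]

# Map every row's ID_2 through the lookup; missing or unmatched IDs become NaN.
linked = df["ID_2"].map(lookup)

# Keep Price as-is, except on Group_4 rows, where the linked Group_1 price is used.
df["Actual_price"] = df["Price"].mask(df["GroupType"].eq("Group_4"), linked)
print(df)
```

Both variants avoid the row-wise apply. Which one is faster depends on the data, so it is worth timing them on a realistic frame before choosing.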
