首页 > 解决方案 > Pandas:基于两个列值创建一个新列

问题描述

学习列比较。如何基于两列创建新列?

我可以做两个条件水果或蔬菜。但是对于第三个条件,做不到。:(

df
    basket1     basket2
0   fruit       fruit
1   vegetable   vegetable 
2   vegetable   both
3   fruit       both

结果

纽德夫

    basket1    basket2    total
0   fruit      fruit      fruit
1   vegetable  vegetable  vegetable  
2   vegetable  both       Unknown
3   fruit      both      fruit

非常感谢你的帮助!

标签: pythonpandasdataframe

解决方案


更新

重温这一点,DataFrame.apply是缓慢的自动对焦。让我们看看其他一些选项,然后进行比较。

其他选项DataFrame.apply

  1. numpy.where

当我们只有两个选项时,可以应用此方法。在您的情况下,这是正确的,因为我们返回df.awhendf.a == df.bdf.a == 'fruit' and df.b == 'both'。语法是np.where(condition, value_if_true, value_if_false).

In [42]: df['np_where'] = np.where(
    ...:     ((df.a == df.b) | ((df.a == 'fruit') & (df.b == 'both'))),
    ...:     df.a,
    ...:     'Unknown'
    ...: )
  1. numpy.select

如果您有多个条件,您将使用此选项。其语法是np.select(condition, values, default)wheredefault是一个可选参数。

In [43]: conditions = df.a == df.b, (df.a == 'fruit') & (df.b == 'both')

In [44]: choices = df['a'], df['a']

In [45]: df['np_select'] = np.select(conditions, choices, default='Unknown')

请注意,出于演示的目的,即使结果产生相同的结果,我也创建了两个条件。

比较选项

如您所见,所有三种方法都有相同的结果。

In [47]: df
Out[47]:
           a          b   np_where  np_select   df_apply
0      fruit      fruit      fruit      fruit      fruit
1  vegetable  vegetable  vegetable  vegetable  vegetable
2  vegetable       both    Unknown    Unknown    Unknown
3      fruit       both      fruit      fruit      fruit

但是它们在速度方面如何比较?为了检查这一点,让我们创建一个更新的、更大的DataFrame. 我们这样做是为了看看我们的选项如何处理大量数据。

In [48]: df_large = pd.DataFrame({
    ...:     'a': np.random.choice(['fruit', 'vegetable'], size=1_000_000),
    ...:     'b': np.random.choice(['fruit', 'vegetable', 'both'], size=1_000_000)
    ...: })

In [49]: %timeit df_large['np_where'] = np.where(((df_large.a == df_large.b) | ((df_large.a == 'fruit')
    ...:  & (df_large.b == 'both'))), df_large.a, 'Unknown')
379 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit df_large['np_select'] = np.select(((df_large.a == df_large.b), ((df_large.a == 'fruit'
    ...: ) & (df_large.b == 'both'))), (df_large.a, df_large.a), default='Unknown')
580 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit df_large['df_apply'] = df_large.apply(total, axis=1)
40.5 s ± 6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

哇!如您所见DataFrame.apply,它比我们的其他两个选项要慢得多,并且np.where脱颖而出np.select

结论

  • np.where如果您只有两个选择,请使用
  • np.select如果您有多种选择,请使用
  • 不要使用DataFrame.apply(尤其是对于大型数据集)!

资源


创建自己的函数并使用DataFrame.apply

In [104]: def total(r):
     ...:     if r.a == r.b:
     ...:         return r.a
     ...:     elif (r.a == 'fruit') and (r.b == 'both'):
     ...:         return r.a
     ...:     return 'Unknown'
     ...:

In [105]: df = pd.DataFrame({'a': ['fruit', 'vegetable', 'vegetable', 'fruit'], 'b': ['fruit', 'vegetable', 'both', 'both']})

In [106]: df
Out[106]:
           a          b
0      fruit      fruit
1  vegetable  vegetable
2  vegetable       both
3      fruit       both

In [107]: df['total'] = df.apply(total, axis=1)

In [108]: df
Out[108]:
           a          b      total
0      fruit      fruit      fruit
1  vegetable  vegetable  vegetable
2  vegetable       both    Unknown
3      fruit       both      fruit

推荐阅读