Applying a function that operates on a Pandas DataFrame to a subset of rows

Question

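I have a function that takes a DataFrame and returns a new DataFrame that is identical except that some columns have been added. For example: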

def arbitrary_function_that_adds_columns(df):
    # In this trivial example I am adding only 1 column, but this function may add an arbitrary number of columns.
    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

Applying this function to the whole DataFrame is easy:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

df = arbitrary_function_that_adds_columns(df)
print(df)

How can I apply arbitrary_function_that_adds_columns to a subset of the rows? This is what I am trying:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

rows = df['A'].isin({1,3})
df.loc[rows] = arbitrary_function_that_adds_columns(df.loc[rows])

print(df)

But I get the original DataFrame back. The result I expect is:

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625

Tags: python, pandas, dataframe

Answer


Use pandas.DataFrame.combine_first

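Note that, judging by the expected output, you want rows = [1, 3] rather than rows = df['A'].isin({1,3}). The latter selects every row whose 'A' value is 1 or 3, which here means the rows with index labels 0 and 2.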
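The difference between the two selections can be checked directly. The following is a minimal sketch using the DataFrame from the question; the variable names mask, mask_labels, label_rows and label_labels are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})

# Boolean mask: True where the value in column 'A' is 1 or 3.
mask = df['A'].isin({1, 3})
mask_labels = df.loc[mask].index.tolist()

# List of index labels: selects the rows labelled 1 and 3.
label_rows = [1, 3]
label_labels = df.loc[label_rows].index.tolist()

print(mask_labels)   # [0, 2]
print(label_labels)  # [1, 3]
```

The mask picks rows by value, the list picks rows by index label, so the two only coincide when the values in 'A' happen to equal the index labels.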

import pandas as pd 

def arbitrary_function_that_adds_columns(df):
    # make sure that the function doesn't mutate the original DataFrame
    # Otherwise, you will get a SettingWithCopyWarning 
    df = df.copy()

    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

df = pd.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

rows = [1, 3]
# the function is applied to a copy of a DataFrame slice 
>>> sub_df = arbitrary_function_that_adds_columns(df.loc[rows])
>>> sub_df

   A  B  new column
1  2  3      10.375
3  4  5      68.625

# Add the new information to the original df 
>>> df = df.combine_first(sub_df)
>>> df

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625
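For reference, combine_first keeps the caller's non-null values and fills missing values, rows and columns from the other DataFrame, aligning both on index and columns. The sketch below uses toy data that is not part of the question (left, right and combined are illustrative names) to show that behaviour in isolation.

```python
import pandas as pd

# Toy frames, made up for this illustration only.
left = pd.DataFrame({'x': [1.0, None]}, index=[0, 1])
right = pd.DataFrame({'x': [10.0, 20.0], 'y': [5.0, 6.0]}, index=[1, 2])

combined = left.combine_first(right)

# Row 0: only in left, so 'y' is missing.
# Row 1: left's 'x' is missing, so it is taken from right.
# Row 2: only in right, so both values come from right.
print(combined)
```

This is why the rows that were not passed to the function end up with NaN in the new column: sub_df has no value to contribute for them.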

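Here is another approach that avoids copying the DataFrame subset and merging it back; it assigns the new column directly on the selected rows.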

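def arbitrary_function_that_adds_columns(df, rows='all'):
    # Check the type first: comparing a boolean mask or an Index to 'all'
    # with == is elementwise and cannot be used directly in an `if`.
    if isinstance(rows, str) and rows == 'all':
        rows = df.index
    sub_df = df.loc[rows]

    # Write only to the selected rows; the other rows get NaN in the new column.
    df.loc[rows, 'new column'] = sub_df['A'] + sub_df['B'] / 8 + sub_df['A']**3

    return df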

>>> df = arbitrary_function_that_adds_columns(df, rows)
>>> df 

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625
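If the function cannot be modified, or if it adds an arbitrary number of columns as the question mentions, a small wrapper may be more convenient. The following is a sketch under assumptions, not part of the original answer: apply_to_rows is a hypothetical helper, and add_columns stands in for the function from the question. It joins only the newly added columns back onto the original DataFrame by index.

```python
import pandas as pd

def add_columns(df):
    # Stand-in for the function from the question; it may add any number of columns.
    df = df.copy()
    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

def apply_to_rows(df, func, rows):
    # Hypothetical helper: run func on the selected rows only.
    sub_df = func(df.loc[rows])
    # Keep only the columns that func added, in the order func created them.
    new_cols = [col for col in sub_df.columns if col not in df.columns]
    # Left join on the index: rows that were not selected get NaN.
    return df.join(sub_df[new_cols])

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})
result = apply_to_rows(df, add_columns, [1, 3])
print(result)
```

Because the join is index-based, rows can be a list of labels or a boolean mask, and the original DataFrame is left unchanged.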
