Proper way to update pandas dataframe column with function having other columns as arguments

Problem description

I'm pretty new to numpy and pandas, so I can't wrap my head around this yet.

I'm trying to store arrays in a pandas dataframe column. The arrays are created with a function that takes values from other columns as arguments.

EDIT (5.4.2020): This code is a simplified example used for clarity

I set up my dataframe like this:

import numpy as np
import pandas as pd

tdf = pd.DataFrame(columns=['a','b','c'])

# setting up the dataframe to hold arrays in column 'c'
dt = {'a':'int32','b':'int32','c':'object'}
tdf = tdf.astype(dt)

# inserting data to columns 'a' and 'b'
row = pd.DataFrame({'a':np.arange(1,3),'b':np.arange(3,5)})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
tdf = pd.concat([tdf, row], sort=False, ignore_index=True)

I want to accomplish something like this:

tdf.at[0,'c'] = np.arange(tdf.at[0,'a'],2*tdf.at[0,'b'])
tdf.at[1,'c'] = np.arange(tdf.at[1,'a'],2*tdf.at[1,'b'])

# Output is the desired end result:
   a  b                   c
0  1  3     [1, 2, 3, 4, 5]
1  2  4  [2, 3, 4, 5, 6, 7]

But because I need to do more complex manipulation, I planned to do it inside a function like this:

def nar(x,y):
    # more complex processing done than here
    ar = np.array(x,2*y)
    return ar

tdf['c'] = nar(tdf['a'],tdf['b'])

# Not desired end result:
   a  b  c
0  1  4  1
1  2  5  2

I have also tried:

# Raises TypeError: ('data type not understood', 'occurred at index 0')
tdf['c'] = tdf.apply(lambda x: nar(x['a'], x['b']), axis=1)

as suggested to be used for processing per row in "Apply function to pandas column having other column as argument".

Also tested:

# Raises TypeError: cannot convert the series to <class 'float'>
tdf['c'] = np.array(tdf['a'],2*tdf['b'])

# and
x = np.arange(1,3)
y = np.arange(3,5)

# Raises TypeError: cannot convert the series to <class 'float'>
z = np.arange(x,2*y)

That makes me think that the dataframe might actually work correctly with the nar function, but that the underlying numpy call might require a different approach.

Iterating through the rows with iterrows() is an option, but it is not very elegant and it also comes with a warning not to modify anything you are iterating over.

What is a proper way to accomplish this?

Thanks in advance!

Tags: python, pandas

Solution


You can do this with apply.

tdf['c'] = tdf.apply(lambda row: list(range(row["a"], 2 * row["b"])) ,axis = 1)

Output:

   a  b                   c
0  1  3     [1, 2, 3, 4, 5]
1  2  4  [2, 3, 4, 5, 6, 7]
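
The same pattern should also work with the asker's own function. The following is a minimal sketch, assuming that np.arange was the intended call inside nar. np.array treats its second positional argument as a dtype, which would explain the "data type not understood" error in the question, and np.arange expects scalar start and stop values, which is why passing whole columns fails. The dataframe here is rebuilt directly from the example values rather than with the question's setup code.

```python
import numpy as np
import pandas as pd

# Example data from the question
tdf = pd.DataFrame({'a': np.arange(1, 3), 'b': np.arange(3, 5)})

def nar(x, y):
    # Assumption: np.arange was intended here.
    # np.array(x, 2*y) would interpret 2*y as a dtype and raise TypeError.
    return np.arange(x, 2 * y)

# axis=1 hands the lambda one row at a time, so nar receives scalars
tdf['c'] = tdf.apply(lambda row: nar(row['a'], row['b']), axis=1)
print(tdf)
```

Each cell of column 'c' then holds a numpy array rather than a Python list, so the column has dtype object.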

Note, however, that apply is very slow on large datasets.
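
If that overhead matters, one common alternative is a plain list comprehension over the zipped columns, which avoids building a Series object for every row. This is only a sketch under the same assumption as the question's desired output (np.arange from a to 2*b); it is not part of the original answer, and the actual speed-up depends on the data and should be measured.

```python
import numpy as np
import pandas as pd

# Example data from the question
tdf = pd.DataFrame({'a': np.arange(1, 3), 'b': np.arange(3, 5)})

# zip iterates over the two columns in lockstep, yielding scalar pairs;
# the resulting list of arrays is stored as an object column
tdf['c'] = [np.arange(a, 2 * b) for a, b in zip(tdf['a'], tdf['b'])]
print(tdf)
```

The work is still done one row at a time in Python, because each row produces an array of a different length and so cannot be expressed as a single vectorized numpy call.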

