How to perform arithmetic in pandas when the columns involved vary by row: summing a different number of columns per row

Problem description

# Create a test data set based on the below dictionary
import pandas as pd
import math
import random
num_of_rows = 10000
data = {
    'x1' : [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
    'x2' : [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
    'x3' : [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
    'term': [random.randint(1, 3) for x in range(num_of_rows)]
}
df = pd.DataFrame(data)
df.head()
x1        x2        x3        term
0.103324  0.304647  0.979813  3
0.420082  0.416356  0.848054  2
0.722017  0.888290  0.728066  3
0.796869  0.535150  0.833837  2
0.764554  0.244415  0.479697  1
%%timeit
# The goal is to sum as many columns as indicated by the term column
def sum_rows(values, cnt=0):
    # apply(axis=1) upcasts term to float, so cast it back to int before slicing
    return sum(values[:int(cnt)])
df['sum'] = df.apply(lambda row: sum_rows([row['x1'], row['x2'], row['x3']], row['term']), axis=1)
df

Output: 217 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# This approach avoids the apply function, but is very confusing to read
df['sum'] = df['x1'] * (df['term'] >= 1) + df['x2'] * (df['term'] >= 2) + df['x3'] * (df['term'] >= 3)
df

Output: 1.89 ms ± 6.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

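The second approach is more than 100 times faster than using the apply function. Are there other ways to perform this calculation, and which would be considered better? Also, if I wanted to multiply the columns instead of adding them, the second approach would no longer work, because a term multiplied by a False mask becomes 0 and zeroes the whole product.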

The output with the sum column is:

x1        x2        x3        term  sum
0.103324  0.304647  0.979813  3     1.387784
0.420082  0.416356  0.848054  2     0.836438
0.722017  0.888290  0.728066  3     2.338373
0.796869  0.535150  0.833837  2     1.332019
0.764554  0.244415  0.479697  1     0.764554

Tags: python, pandas, apply

Solution
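
One possible approach, offered as a minimal sketch rather than a definitive answer, is to build a boolean mask with NumPy broadcasting and replace the excluded entries with the identity element of the operation (0 for a sum, 1 for a product). It assumes the columns are named x1, x2, x3 and term as in the question; the names cols, values, mask and the prod column are introduced here for illustration only. The small frame below reuses the five rows shown in the question so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Small illustrative frame built from the rows shown in the question.
df = pd.DataFrame({
    'x1': [0.103324, 0.420082, 0.722017, 0.796869, 0.764554],
    'x2': [0.304647, 0.416356, 0.888290, 0.535150, 0.244415],
    'x3': [0.979813, 0.848054, 0.728066, 0.833837, 0.479697],
    'term': [3, 2, 3, 2, 1],
})

cols = ['x1', 'x2', 'x3']          # columns in the order they should be consumed
values = df[cols].to_numpy()       # shape (n_rows, 3)

# mask[i, j] is True when column j (0-based) is among the first `term` columns of row i.
mask = np.arange(len(cols)) < df['term'].to_numpy()[:, None]

# Sum: excluded entries are replaced by 0, the additive identity.
df['sum'] = np.where(mask, values, 0.0).sum(axis=1)

# Product: excluded entries are replaced by 1, the multiplicative identity,
# so they do not zero out the result the way a boolean multiplication would.
df['prod'] = np.where(mask, values, 1.0).prod(axis=1)
```

On these five rows the sum column matches the values listed in the question. The same mask serves both operations, and extending it to more columns only requires adding names to cols. Whether this is faster than the chained comparison on a particular data set would need to be measured with %%timeit.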

