pandas - Vectorizing the application of function objects across an entire DataFrame in Pandas
Problem description
I have a base DataFrame, and I want to apply to each of its elements a specific function given by another DataFrame with the same index and columns, for example:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
df_format = pd.DataFrame([[lambda x: f'{x:.1f}', lambda x: f'{x:.2f}'],
                          [lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']])
for i in range(2):
    for j in range(2):
        df.iloc[i, j] = df_format.iloc[i, j](df.iloc[i, j])
print(df)

       0       1
0    1.0    2.00
1  3.000  4.0000
This is not vectorized, and I wonder whether there is a more efficient way to do it, especially for larger DataFrames.
Solution
You could make the code much faster by working on columns and using the vectorize function of NumPy. Indeed, direct element accesses to a Pandas DataFrame (using iloc) or to the underlying NumPy arrays (using arr[i]) are slow, and Python loops are very slow too. Moreover, the data are stored column by column internally, making column-wise operations faster than row-wise ones.
Here is a solution to vectorize your operation:
import numpy as np

def callOn(func, value):
    return func(value)

for j in range(2):
    # np.vectorize(callOn) generates a function that calls callOn(x, y)
    # for each input pair (x, y) of zip(df_format[j], df[j]).
    df[j] = np.vectorize(callOn)(df_format[j], df[j])
However, note that NumPy does not truly vectorize the calls internally, since it has to deal with Python objects and functions. This limitation inherently comes from the assumption that all the lambdas could be different and are defined as plain Python objects.
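Building on that point: if, in your use case, every per-cell format can be expressed as a plain format string rather than an arbitrary lambda, the function-object dispatch disappears entirely. A minimal sketch of that variant (the fmt/out names are just for illustration, not part of the original code):

```python
import pandas as pd

# Sketch: represent each cell's format as a string instead of a lambda,
# then format column by column with a comprehension (no function objects).
df = pd.DataFrame([[1, 2], [3, 4]])
fmt = pd.DataFrame([['{:.1f}', '{:.2f}'],
                    ['{:.3f}', '{:.4f}']])

out = pd.DataFrame({j: [f.format(v) for f, v in zip(fmt[j], df[j])]
                    for j in df.columns})
print(out)
```

This only applies when the formats are data rather than code, but in that case it also sidesteps the Python-object overhead mentioned above.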
On my machine this code is about 200 times faster than the initial one using the following setup:
import numpy as np
import pandas as pd

nRows, nCols = 1000, 20
fList = [lambda x: f'{x:.1f}', lambda x: f'{x:.2f}',
         lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']
df = pd.DataFrame(np.random.randint(10, size=(nRows, nCols)))
df_format = pd.DataFrame(np.random.choice(fList, size=(nRows, nCols)))
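To reproduce the comparison on your own machine, one possible timing harness is sketched below (the ~200x figure will vary with hardware and sizes; the smaller dimensions here just keep the run short, and the object dtype allows assigning strings cell by cell):

```python
import timeit

import numpy as np
import pandas as pd

nRows, nCols = 100, 20  # smaller than the original setup for a quick run
fList = [lambda x: f'{x:.1f}', lambda x: f'{x:.2f}',
         lambda x: f'{x:.3f}', lambda x: f'{x:.4f}']
# object dtype so that writing strings into the cells is allowed
df0 = pd.DataFrame(np.random.randint(10, size=(nRows, nCols))).astype(object)
df_format = pd.DataFrame(np.random.choice(fList, size=(nRows, nCols)))

def callOn(func, value):
    return func(value)

def slow():
    # original element-wise double loop
    df = df0.copy()
    for i in range(nRows):
        for j in range(nCols):
            df.iloc[i, j] = df_format.iloc[i, j](df.iloc[i, j])
    return df

def fast():
    # column-wise np.vectorize version
    df = df0.copy()
    for j in range(nCols):
        df[j] = np.vectorize(callOn)(df_format[j], df[j])
    return df

print('loop:     ', timeit.timeit(slow, number=1))
print('vectorize:', timeit.timeit(fast, number=1))
```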