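python - Python performance with class instance vs. local (numpy) variables
Question
I have read other posts about how Python speed/performance should be relatively unaffected by whether the code being run is just in main, in a function, or defined as a class attribute, but these do not explain the very large performance differences I see when using class versus local variables, especially with the numpy library. To make this clearer, I wrote the example script below.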
import numpy as np
import copy

class Test:
    def __init__(self, n, m):
        self.X = np.random.rand(n,n,m)
        self.Y = np.random.rand(n,n,m)
        self.Z = np.random.rand(n,n,m)

    def matmul1(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            self.A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
        return

    def matmul2(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            x = copy.deepcopy(self.X[:,:,i])
            y = copy.deepcopy(self.Y[:,:,i])
            z = copy.deepcopy(self.Z[:,:,i])
            self.A[:,:,i] = x @ y @ z
        return
t1 = Test(300,100)
%%timeit
t1.matmul1()
#OUTPUT: 20.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
t1.matmul2()
#OUTPUT: 516 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
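In this script I define a class whose attributes X, Y, and Z are 3-way arrays. It has two methods (matmul1 and matmul2) that loop over the third index of the arrays and matrix-multiply the three slices to fill an array, A. matmul1 simply loops over the class variables and multiplies them, whereas matmul2 creates local copies for each matrix multiplication inside the loop. matmul1 is about 40x slower than matmul2. Can someone explain why this happens? Maybe I am thinking about how to use classes incorrectly, but I also would not expect that variables need to be deep-copied all the time. Basically, what is it about deep copying that affects my performance so significantly, and is this unavoidable when using class attributes/variables? It seems to be more than just the overhead of accessing class attributes, as discussed here. Any input is appreciated, thanks!
Edit: My real question is why copies, rather than views of subarrays of the class instance variables, give better performance for these kinds of methods.
Answer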
If you put the m dimension first, you could do this product without iteration:
In [146]: X1,Y1,Z1 = X.transpose(2,0,1), Y.transpose(2,0,1), Z.transpose(2,0,1)
In [147]: A1 = X1@Y1@Z1
In [148]: np.allclose(A, A1.transpose(1,2,0))
Out[148]: True
However sometimes, working with very large arrays is slower, due to memory management complexities.
It might be worth testing
A1[i] = X1[i] @ Y1[i] @ Z1[i]
where the iteration is on the outermost dimension.
My computer is too small to do good timings on these array sizes.
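The batched product and the outer-dimension loop suggested above can be checked against the original last-axis loop in a self-contained script. This is only a sketch: the sizes n=20, m=5 and the names A2, batched_ok, and loop_ok are chosen for illustration and are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5  # small illustrative sizes, not the question's 300 x 300 x 100
X, Y, Z = (rng.random((n, n, m)) for _ in range(3))

# Reference: loop over the last axis, as in the question's matmul1
A = np.zeros(X.shape)
for i in range(m):
    A[:, :, i] = X[:, :, i] @ Y[:, :, i] @ Z[:, :, i]

# Move the m axis to the front; matmul then treats it as a batch axis
X1, Y1, Z1 = (a.transpose(2, 0, 1) for a in (X, Y, Z))
A1 = X1 @ Y1 @ Z1

# Same product, but iterating explicitly on the outermost dimension
A2 = np.zeros((m, n, n))
for i in range(m):
    A2[i] = X1[i] @ Y1[i] @ Z1[i]

batched_ok = np.allclose(A, A1.transpose(1, 2, 0))
loop_ok = np.allclose(A, A2.transpose(1, 2, 0))
print(batched_ok, loop_ok)
```

Note that transpose alone returns views here, so this checks correctness only; it says nothing about which variant is faster.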
edit
I added these alternatives to your class, and tested with a smaller case:
In [67]: class Test:
    ...:     def __init__(self, n, m):
    ...:         self.X = np.random.rand(n,n,m)
    ...:         self.Y = np.random.rand(n,n,m)
    ...:         self.Z = np.random.rand(n,n,m)
    ...:     def matmul1(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
    ...:         return A
    ...:     def matmul2(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             x = self.X[:,:,i].copy()
    ...:             y = self.Y[:,:,i].copy()
    ...:             z = self.Z[:,:,i].copy()
    ...:             A[:,:,i] = x @ y @ z
    ...:         return A
    ...:     def matmul3(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         return (x@y@z).transpose(1,2,0)
    ...:     def matmul4(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         A = np.zeros(x.shape)
    ...:         for i in range(x.shape[0]):
    ...:             A[i] = x[i]@y[i]@z[i]
    ...:         return A.transpose(1,2,0)
In [68]: t1=Test(100,50)
In [69]: np.max(np.abs(t1.matmul2()-t1.matmul4()))
Out[69]: 0.0
In [70]: np.allclose(t1.matmul3(),t1.matmul2())
Out[70]: True
The view iteration is 10x slower:
In [71]: timeit t1.matmul1()
252 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: timeit t1.matmul2()
26 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The added methods (matmul3 and matmul4) take about the same time as matmul2:
In [73]: timeit t1.matmul3()
30.8 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: timeit t1.matmul4()
27.3 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Without the copy(), the transpose produces a view, and times are similar to matmul1 (250ms).
My guess is that with "fresh" copies, matmul is able to pass them to the best BLAS function by reference. With views, as in matmul1, it has to take some sort of slower route.
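The layout difference between a view and a fresh copy can be inspected directly. The snippet below is only a sketch with small illustrative sizes (n=4, m=3); it shows that a slice along the last axis is a strided, non-contiguous view while its copy is C-contiguous. Whether this is what sends matmul down a slower path is the guess above, not something the snippet proves.

```python
import numpy as np

n, m = 4, 3  # small illustrative sizes
X = np.random.rand(n, n, m)

view = X[:, :, 0]    # slice along the last axis: a view into X
fresh = view.copy()  # an independent array in default C order

# The view steps over m elements between neighbours, so it is neither
# C- nor F-contiguous; the copy is packed densely in C order.
print(view.flags['C_CONTIGUOUS'], view.flags['F_CONTIGUOUS'])
print(fresh.flags['C_CONTIGUOUS'])

# Strides in bytes: the view keeps X's layout, the copy gets its own.
print(view.strides, fresh.strides)

# The view shares memory with X, the copy does not.
print(np.shares_memory(view, X), np.shares_memory(fresh, X))
```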
But if I use dot instead of matmul, I get the faster time, even with the matmul1 iteration.
In [77]: %%timeit
    ...: A = np.zeros(X.shape)
    ...: for i in range(X.shape[2]):
    ...:     A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])
25.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It sure looks like matmul with views is taking some suboptimal calculation choice.
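To repeat the comparison outside IPython, a small harness along these lines could be used. It is a sketch only: the function names, the sizes n=50, m=10, and the repeat counts are illustrative assumptions, and the measured times will depend on the NumPy version and the BLAS library it is linked against, so no expected output is shown.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10  # illustrative sizes; increase them to approach the question's case
X, Y, Z = (rng.random((n, n, m)) for _ in range(3))

def matmul_views():
    # @ applied directly to non-contiguous slices (as in matmul1)
    A = np.zeros(X.shape)
    for i in range(m):
        A[:, :, i] = X[:, :, i] @ Y[:, :, i] @ Z[:, :, i]
    return A

def matmul_copies():
    # @ applied to contiguous copies of each slice (as in matmul2)
    A = np.zeros(X.shape)
    for i in range(m):
        x, y, z = X[:, :, i].copy(), Y[:, :, i].copy(), Z[:, :, i].copy()
        A[:, :, i] = x @ y @ z
    return A

def dot_views():
    # dot applied directly to the same non-contiguous slices
    A = np.zeros(X.shape)
    for i in range(m):
        A[:, :, i] = X[:, :, i].dot(Y[:, :, i]).dot(Z[:, :, i])
    return A

variants = {
    "matmul on views": matmul_views,
    "matmul on copies": matmul_copies,
    "dot on views": dot_views,
}

# All three variants should compute the same array
reference = matmul_views()
results_agree = all(np.allclose(f(), reference) for f in variants.values())

# Best-of-3 average time per call, in seconds
timings = {
    name: min(timeit.repeat(f, number=5, repeat=3)) / 5
    for name, f in variants.items()
}
for name, t in timings.items():
    print(f"{name}: {t * 1e3:.2f} ms per call")
```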