Python performance on class instance vs. local (numpy) variables

Problem Description

I have read other posts about how Python speed/performance should be relatively unaffected by whether the code being run lives in main, inside a function, or as part of a class, but these don't explain the very large performance difference I see between class attributes and local variables, especially when using the numpy library. To make this concrete, I put together the example script below.

import numpy as np
import copy 

class Test:
    def __init__(self, n, m):
        self.X = np.random.rand(n,n,m)
        self.Y = np.random.rand(n,n,m)
        self.Z = np.random.rand(n,n,m)
    def matmul1(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            self.A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
        return
    def matmul2(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            x = copy.deepcopy(self.X[:,:,i])
            y = copy.deepcopy(self.Y[:,:,i])
            z = copy.deepcopy(self.Z[:,:,i])
            self.A[:,:,i] = x @ y @ z
        return

t1 = Test(300,100) 
%%timeit   
t1.matmul1()
#OUTPUT: 20.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
t1.matmul2()
#OUTPUT: 516 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this script I define a class whose attributes X, Y and Z are 3-way arrays. I also give it two methods (matmul1 and matmul2) that loop over the third index of the arrays and matrix-multiply the three slices to fill an array, A. matmul1 simply loops over the class attributes and multiplies them, while matmul2 makes a local deep copy of each slice before multiplying inside the loop. matmul1 is roughly 40x slower than matmul2. Can someone explain why? Maybe I'm thinking about how to use classes incorrectly, but I also wouldn't expect variables to need deep copying all the time. Basically, why does deep copying make such a dramatic difference to my performance, and is it unavoidable when working with class attributes/variables? It seems to be more than just the overhead of attribute lookup discussed here. Any input is appreciated, thanks!

Edit: My real question is why copies, rather than views, of sub-arrays of the class instance variables give much better performance for these kinds of methods.
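
For what it's worth, the deep copy itself is not the key ingredient: any plain, contiguous copy of each slice behaves the same way (the answer below switches to .copy()). A minimal sketch of that variant, where the subclass Test2 and method matmul2c are hypothetical names of mine and Test is the class defined above:

import numpy as np

class Test2(Test):
    def matmul2c(self):
        # Same loop as matmul1, but force each (n, n) slice into a fresh
        # contiguous buffer before multiplying. np.ascontiguousarray copies
        # only when its input is not already C-contiguous, which these
        # strided slices are not.
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            x = np.ascontiguousarray(self.X[:, :, i])
            y = np.ascontiguousarray(self.Y[:, :, i])
            z = np.ascontiguousarray(self.Z[:, :, i])
            self.A[:, :, i] = x @ y @ z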

Tags: python, performance, numpy, class, matrix-multiplication

Solution


If you put the m dimension first, you could do this product without iteration:

In [146]: X1,Y1,Z1 = X.transpose(2,0,1), Y.transpose(2,0,1), Z.transpose(2,0,1)
In [147]: A1 = X1@Y1@Z1
In [148]: np.allclose(A, A1.transpose(1,2,0))
Out[148]: True

However, working with very large arrays in one step can sometimes be slower than iterating, due to memory-management complexities.

It might be worth testing

 A1[i] = X1[i] @ Y1[i] @ Z1[i]

where the iteration is on the outermost dimension.
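
Spelled out with the transposed arrays from above, that test would look something like this (a sketch only; the preallocation of A1 is my own addition):

# Iterate over the leading (m) axis, multiplying (n, n) slices at each step.
A1 = np.zeros(X1.shape)
for i in range(X1.shape[0]):
    A1[i] = X1[i] @ Y1[i] @ Z1[i]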

My computer is too small to do good timings on these array sizes.

edit

I added these alternatives to your class, and tested with a smaller case:

In [67]: class Test:
    ...:     def __init__(self, n, m):
    ...:         self.X = np.random.rand(n,n,m)
    ...:         self.Y = np.random.rand(n,n,m)
    ...:         self.Z = np.random.rand(n,n,m)
    ...:     def matmul1(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
    ...:         return A
    ...:     def matmul2(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             x = self.X[:,:,i].copy()
    ...:             y = self.Y[:,:,i].copy()
    ...:             z = self.Z[:,:,i].copy()
    ...:             A[:,:,i] = x @ y @ z
    ...:         return A
    ...:     def matmul3(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         return (x@y@z).transpose(1,2,0)
    ...:     def matmul4(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         A = np.zeros(x.shape)
    ...:         for i in range(x.shape[0]):
    ...:             A[i] = x[i]@y[i]@z[i]
    ...:         return A.transpose(1,2,0)

In [68]: t1=Test(100,50)
In [69]: np.max(np.abs(t1.matmul2()-t1.matmul4()))
Out[69]: 0.0
In [70]: np.allclose(t1.matmul3(),t1.matmul2())
Out[70]: True

The view iteration is 10x slower:

In [71]: timeit t1.matmul1()
252 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: timeit t1.matmul2()
26 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The added methods (matmul3 and matmul4) are about the same:

In [73]: timeit t1.matmul3()
30.8 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: timeit t1.matmul4()
27.3 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Without the copy(), the transpose produces a view, and times are similar to matmul1 (250ms).

My guess is that with "fresh" copies, matmul is able to pass them to the best BLAS function by reference. With views, as in matmul1, it has to take some sort of slower route.
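
One quick way to see what that guess is pointing at (an illustrative check of my own, not part of the original answer): slices taken along the last axis are non-contiguous views, while .copy() hands the BLAS routine a compact buffer.

import numpy as np

X = np.random.rand(300, 300, 100)
view = X[:, :, 0]          # strided view into X, no data copied
cpy  = X[:, :, 0].copy()   # fresh contiguous (300, 300) buffer

print(view.base is X)               # True  -> it is a view
print(view.flags['C_CONTIGUOUS'])   # False -> consecutive row elements sit 800 bytes apart
print(cpy.flags['C_CONTIGUOUS'])    # True  -> the layout a fast gemm expects
print(view.strides, cpy.strides)    # (240000, 800) vs (2400, 8)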

But if I use dot instead of matmul, I get the faster time, even with the matmul1 iteration.

In [77]: %%timeit
    ...: A = np.zeros(X.shape)
    ...: for i in range(X.shape[2]):
    ...:     A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])
25.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It sure looks like matmul with views is taking some suboptimal calculation choice.

