首页 > 解决方案 > 在 Tensorflow GPU 中训练一个简单模型比 CPU 慢

问题描述

我在 Tensorflow 中设置了一个简单的线性回归问题,并在 1.13.1 中使用 Tensorflow CPU 和 GPU 创建了简单的 conda 环境(在 NVIDIA Quadro P600 的后端使用 CUDA 10.0)。

但是,看起来 GPU 环境总是比 CPU 环境花费更长的时间。我正在运行的代码如下。

import time
import warnings
import numpy as np
import scipy

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow_probability import edward2 as ed
from tensorflow.python.ops import control_flow_ops
from tensorflow_probability import distributions as tfd



# Handy snippet to reset the global graph and global session.
def reset_g():
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        tf.reset_default_graph()
        try:
            sess.close()
        except:
            pass


N = 35000
inttest = np.ones(N).reshape(N, 1)
stddev_raw = 0.09

true_int = 1.
true_b1 = 0.15
true_b2 = 0.7

np.random.seed(69)

X1 = (np.atleast_2d(np.linspace(
    0., 2., num=N)).T).astype(np.float64)
X2 = (np.atleast_2d(np.linspace(
    2., 1., num=N)).T).astype(np.float64)
Ytest = true_int + (true_b1*X1) + (true_b2*X2) + \
    np.random.normal(size=N, scale=stddev_raw).reshape(N, 1)

Ytest = Ytest.reshape(N, )
X1 = X1.reshape(N, )
X2 = X2.reshape(N, )

reset_g()

# Create data and param
model_X1 = tf.placeholder(dtype=tf.float64, shape=[N, ])
model_X2 = tf.placeholder(dtype=tf.float64, shape=[N, ])
model_Y = tf.placeholder(dtype=tf.float64, shape=[N, ])

alpha = tf.get_variable(shape=[1], name='alpha', dtype=tf.float64)
# these two params need shape of one if using trainable distro
beta1 = tf.get_variable(shape=[1], name='beta1', dtype=tf.float64)
beta2 = tf.get_variable(shape=[1], name='beta2', dtype=tf.float64)

# Yhat
tf_pred = (tf.multiply(model_X1, beta1) + tf.multiply(model_X2, beta2) + alpha)


# # Make difference of squares
# resid = tf.square(model_Y - tf_pred)
# loss = tf.reduce_sum(resid)

# # Make a Likelihood function based on simple stuff
stddev = tf.square(tf.get_variable(shape=[1],
                                    name='stddev', dtype=tf.float64))
covar = tfd.Normal(loc=model_Y, scale=stddev)
loss = -1.0*tf.reduce_sum(covar.log_prob(tf_pred))



# Trainer
lr=0.005
N_ITER = 20000

opt = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95)
train = opt.minimize(loss)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start = time.time()
    for step in range(N_ITER):
        out_l, out_b1, out_b2, out_a, laws = sess.run([train, beta1, beta2, alpha, loss],
                                                  feed_dict={model_X1: X1,
                                                             model_X2: X2,
                                                             model_Y: Ytest})

        if step % 500 == 0:
            print('Step: {s}, loss = {l}, alpha = {a:.3f}, beta1 = {b1:.3f}, beta2 = {b2:.3f}'.format(
                s=step, l=laws, a=out_a[0], b1=out_b1[0], b2=out_b2[0]))
    print(f"True: alpha = {true_int}, beta1 = {true_b1}, beta2 = {true_b2}")
    end = time.time()
    print(end-start)

以下是打印的一些输出,如果它们表明正在发生的事情:

对于 CPU 运行:

Colocations handled automatically by placer.
2019-04-18 09:00:56.329669: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-18 09:00:56.351151: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-04-18 09:00:56.351672: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x558fefe604c0 executing computations on platform Host. Devices:
2019-04-18 09:00:56.351698: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>

对于 GPU 运行:

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0418 09:03:21.674947 139956864096064 deprecation.py:506] From /home/sadatnfs/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:187: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2019-04-18 09:03:21.712913: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-18 09:03:21.717598: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-04-18 09:03:21.951277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1009] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-18 09:03:21.952212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e583bc4480 executing computations on platform CUDA. Devices:
2019-04-18 09:03:21.952225: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Quadro P600, Compute Capability 6.1
2019-04-18 09:03:21.971218: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-04-18 09:03:21.971816: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e58577f290 executing computations on platform Host. Devices:
2019-04-18 09:03:21.971842: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-18 09:03:21.972102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: Quadro P600 major: 6 minor: 1 memoryClockRate(GHz): 1.5565
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.91GiB
2019-04-18 09:03:21.972147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-04-18 09:03:21.972248: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-04-18 09:03:21.973094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-18 09:03:21.973105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088]      0
2019-04-18 09:03:21.973110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0:   N
2019-04-18 09:03:21.973279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1735 MB memory) -> physical GPU (device: 0, name: Quadro P600, pci bus id: 0000:01:00.0, compute capability: 6.1)

任何帮助,将不胜感激!我即将发布另一个关于在 R 中实现 CUBLAS 的问题,因为与英特尔 MKL 相比,这给了我较慢的速度时间,但我希望也许有一个明确的原因,即使是像 TF 一样构建良好的东西(与hacky R 和 CUBLAS 补丁)在 GPU 上运行缓慢。

谢谢!

编辑:按照弗拉德的建议,我编写了以下脚本来尝试抛出一些大型对象并对其进行训练,但我认为我可能没有正确设置它,因为在这种情况下,即使矩阵的大小是 CPU 也是增加。有什么建议吗?

import time
import warnings
import numpy as np
import scipy

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow_probability import edward2 as ed
from tensorflow.python.ops import control_flow_ops
from tensorflow_probability import distributions as tfd

np.random.seed(69)

# Handy snippet to reset the global graph and global session.
def reset_g():
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        tf.reset_default_graph()
        try:
            sess.close()
        except:
            pass


# Loop over the different number of feature columns
for x_feat in [30, 50, 100, 1000, 10000]:

    y_feat=10;
    # Simulate data
    N = 5000
    inttest = np.ones(N).reshape(N, 1)
    stddev_raw = np.random.uniform(0.01, 0.25, size=y_feat)

    true_int = np.linspace(0.1 ,1., num=y_feat)
    xcols = x_feat
    true_bw = np.random.randn(xcols, y_feat)
    true_X = np.random.randn(N, xcols)
    true_errorcov = np.eye(y_feat)
    np.fill_diagonal(true_errorcov, stddev_raw)

    true_Y = true_int + np.matmul(true_X, true_bw) + \
        np.random.multivariate_normal(mean=np.array([0 for i in range(y_feat)]),
                                      cov=true_errorcov,
                                      size=N)


    ## Our model is:
    ## Y = a + b*X + error where, for N=5000 observations:
    ## Y : 10 outputs;
    ## X : 30,50,100,1000,10000 features
    ## a, b = bias and weights
    ## error: just... error

    # Number of iterations
    N_ITER = 1001

    # Training rate
    lr=0.005

    with tf.device('gpu'):

        # Create data and weights
        model_X = tf.placeholder(dtype=tf.float64, shape=[N, xcols])
        model_Y = tf.placeholder(dtype=tf.float64, shape=[N, y_feat])

        alpha = tf.get_variable(shape=[y_feat], name='alpha', dtype=tf.float64)
        # these two params need shape of one if using trainable distro
        betas = tf.get_variable(shape=[xcols, y_feat], name='beta1', dtype=tf.float64)


        # Yhat
        tf_pred = alpha + tf.matmul(model_X, betas)

        # Make difference of squares (loss fn) [CONVERGES TO TRUTH]
        resid = tf.square(model_Y - tf_pred)
        loss = tf.reduce_sum(resid)

        # Trainer
        opt = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95)
        train = opt.minimize(loss)


    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    start = time.time()
    for step in range(N_ITER):
        out_l, laws = sess.run([train, loss], feed_dict={model_X: true_X, model_Y: true_Y})

        if step % 500 == 0:
            print('Step: {s}, loss = {l}'.format(
                s=step, l=laws))
    end = time.time()
    print("y_feat: {n}, x_feat: {x2}, Time elapsed: {te}".format(n = y_feat, x2 = x_feat, te = end-start))

    reset_g()

标签: pythontensorflowgpu

解决方案


正如我在评论中所说,调用 GPU 内核以及将数据复制到 GPU 和从 GPU 复制数据的开销非常高。对于参数很少的模型上的操作,使用 GPU 是不值得的,因为 CPU 内核的频率要高得多。如果您比较矩阵乘法(这是 DL 最常做的事情),您会发现对于大型矩阵,GPU 的性能明显优于 CPU。

看看这个情节。X 轴是两个正方形矩阵的大小,y 轴是在 GPU 和 CPU 上将这些矩阵相乘所需的时间。正如您在一开始所看到的,对于小矩阵,蓝线更高,这意味着它在 CPU 上更快。但是随着我们增加矩阵的大小,使用 GPU 的好处会显着增加。

在此处输入图像描述

要重现的代码:

import tensorflow as tf
import time
cpu_times = []
sizes = [1, 10, 100, 500, 1000, 2000, 3000, 4000, 5000, 8000, 10000]
for size in sizes:
    tf.reset_default_graph()
    start = time.time()
    with tf.device('cpu:0'):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)
    cpu_times.append(time.time() - start)
    print('cpu time took: {0:.4f}'.format(time.time() - start))

import tensorflow as tf
import time

gpu_times = []
for size in sizes:
    tf.reset_default_graph()
    start = time.time()
    with tf.device('gpu:0'):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)
    gpu_times.append(time.time() - start)
    print('gpu time took: {0:.4f}'.format(time.time() - start))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(sizes, gpu_times, label='GPU')
ax.plot(sizes, cpu_times, label='CPU')
plt.xlabel('MATRIX SIZE')
plt.ylabel('TIME (sec)')
plt.legend()
plt.show()

推荐阅读