Computing gradients in a custom training loop: performance difference between TF and Torch

Problem description

I am trying to convert a PyTorch implementation of an NN model, which computes the forces and energies of molecular structures, to TensorFlow. This requires a custom training loop and a custom loss function, so I implemented the two alternative one-step training functions below.

  1. First, using nested gradient tapes:
def calc_gradients(D_train_batch, E_train_batch, F_train_batch, opt):
    # set up nested gradient tape scopes in order to track gradients of both
    # d(Loss)/d(Weights) and d(output)/d(input)
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            # set gradient tape to watch the input Tensor
            tape2.watch(D_train_batch)
            # pass D through the model to get predicted energy values
            E_pred = model(D_train_batch, training=True)

        df_dD_train_batch = tape2.gradient(E_pred, D_train_batch)
        # matrix mult of -Grad_D(f) x Grad_r(D); dD_dr_train_batch is defined
        # outside this function
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
        # calculate loss value; computing F_pred and the loss inside tape1's
        # scope plays the role of create_graph=True in the PyTorch version
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)

    grads = tape1.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grads, model.trainable_weights))
  2. Another attempt, using a single gradient tape with persistent=True:
def calc_gradients_persistent(D_train_batch, E_train_batch, F_train_batch, opt):
    # set up a persistent gradient tape scope in order to track gradients of
    # both d(Loss)/d(Weights) and d(output)/d(input)
    with tf.GradientTape(persistent=True) as outer:
        # set gradient tape to watch the input Tensor
        outer.watch(D_train_batch)

        # output values from the model; set training=True so that
        # model.trainable_weights is populated
        E_pred = model(D_train_batch, training=True)

        # set gradient tape to watch the trainable weights
        outer.watch(model.trainable_weights)

        # get gradient of output (f/E_pred) w.r.t. input (D/D_train_batch)
        df_dD_train_batch = outer.gradient(E_pred, D_train_batch)

        # matrix mult of -Grad_D(f) x Grad_r(D)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)

        # calculate loss value
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)

    # get gradient of loss w.r.t. the trainable weights for backpropagation
    grads = outer.gradient(loss, model.trainable_weights)

    # update weights using the optimizer and the gradients (grads)
    opt.apply_gradients(zip(grads, model.trainable_weights))

These are attempted translations of the following PyTorch code:

# Forward pass: Predict energies from the descriptor input
E_train_pred_batch = model(D_train_batch)

# Get derivatives of model output with respect to input variables. The
# torch.autograd.grad function can be used for this, as it returns the
# gradients of the outputs with respect to the inputs. It is very important
# to set create_graph=True in this case. Without it the derivatives of the
# NN parameters with respect to the loss from the force error will not be
# populated (=the force error will not affect the training), but the model
# will still run fine without errors.
df_dD_train_batch = torch.autograd.grad(
    outputs=E_train_pred_batch,
    inputs=D_train_batch,
    grad_outputs=torch.ones_like(E_train_pred_batch),
    create_graph=True,
)[0]

# Get derivatives of input variables (=descriptor) with respect to atom
# positions = forces
F_train_pred_batch = -torch.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)

# Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss = energy_force_loss(E_train_pred_batch, E_train_batch, F_train_pred_batch, F_train_batch)
loss.backward()
optimizer.step()
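
Note that torch.autograd.grad with inputs=D_train_batch only works if the descriptor batch is tracked by autograd. That setup is not shown in the excerpt above, but (as a minimal sketch, assuming D_train_batch is a leaf tensor) it amounts to:

# enable gradient tracking on the descriptor input before the forward pass,
# otherwise torch.autograd.grad(..., inputs=D_train_batch) raises an error
D_train_batch.requires_grad_(True)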

From the DScribe library tutorial at https://singroup.github.io/dscribe/latest/tutorials/machine_learning/forces_and_energies.html

Question

Compared to running the PyTorch version, either TF implementation results in a huge loss of prediction accuracy. I am wondering whether I may have misunderstood the PyTorch code and translated it incorrectly, and if so, where is the discrepancy?

PS: The model directly computes the energies E, from which we compute the forces F via the gradient of E w.r.t. D. The loss function is a weighted sum of the MSEs of the forces and the energies.
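
The loss function itself is not shown in the question. A minimal sketch of such a weighted MSE sum (the weight arguments and their defaults are hypothetical, not taken from the original code) could look like:

import tensorflow as tf

def force_energy_loss(E_pred, F_pred, E_true, F_true,
                      energy_weight=1.0, force_weight=1.0):
    # mean squared error on the predicted energies
    energy_mse = tf.reduce_mean(tf.square(E_pred - E_true))
    # mean squared error on the predicted forces
    force_mse = tf.reduce_mean(tf.square(F_pred - F_true))
    # weighted sum of the two terms
    return energy_weight * energy_mse + force_weight * force_mse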

Tags: tensorflow, pytorch, autograd, gradienttape

Solution


The approaches are in fact equivalent; my error producing the differing results was elsewhere. For anyone trying to implement the TensorFlow version: the nested gradient tapes are roughly 2x faster, at least in this scenario, and also make sure to wrap the training function in a @tf.function so that it runs as a graph rather than eagerly, which is roughly 10x faster.
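
Putting both recommendations together, a minimal sketch of the wrapped one-step training function might look like the following (passing dD_dr_train_batch explicitly is an assumption; the question's code reads it from an enclosing scope):

@tf.function
def train_step(D_train_batch, E_train_batch, F_train_batch, dD_dr_train_batch, opt):
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            tape2.watch(D_train_batch)
            E_pred = model(D_train_batch, training=True)
        # d(E_pred)/d(D), recorded on tape1 so the force error backpropagates
        df_dD = tape2.gradient(E_pred, D_train_batch)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD)
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
    grads = tape1.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grads, model.trainable_weights))
    return loss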
