Warning: tensorflow: Efficient allreduce is not supported for 1 IndexedSlices

Problem description

This warning appears when running a multi-input Keras model built with the functional API. The model runs fine, with no warnings, on a single GPU. When I use tf.distribute.MirroredStrategy with two GPUs, the final results are fine, but I get this warning. I suspect it causes a performance problem?

tf.__version__ : 2.2.0
tf.keras.__version__ : 2.3.0-tf
NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.1

The model I am building is:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense, Concatenate)
from tensorflow.keras.models import Model

def build_model_():

    input_a_size = 200
    input_b_size = 4
    num_classes = 2
    len_embedding = 100

    mirrored_strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])

    with mirrored_strategy.scope():

        input_a = Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
        input_b = Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

        # Branch A: embedding + convolution over the token sequence
        # (128 filters and MaxPooling1D(4), to match the summary below)
        x = Embedding(len_embedding, 100)(input_a)
        x = Conv1D(128, 4, activation='relu')(x)
        x = MaxPooling1D(4)(x)
        x = Flatten()(x)
        branch_a = Dense(64, activation='relu')(x)

        # Branch B: small dense stack over the auxiliary features
        x = Dense(32, activation='relu')(input_b)
        branch_b = Dense(32, activation='relu')(x)

        concat = Concatenate()([branch_a, branch_b])

        x = Dense(256, activation='relu')(concat)
        output = Dense(num_classes, activation='softmax')(x)

        model = Model(inputs=[input_a, input_b], outputs=[output])

        model.compile(loss='binary_crossentropy', optimizer='adam',
                      metrics=['accuracy'])

    model.summary()

    return model

Model summary:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_a (InputLayer)            [(None, 200)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 200, 100)     10000       input_a[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 197, 128)     51328       embedding[0][0]                  
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 49, 128)      0           conv1d[0][0]                     
__________________________________________________________________________________________________
input_b (InputLayer)            [(None, 4)]          0                                            
__________________________________________________________________________________________________
flatten (Flatten)               (None, 6272)         0           max_pooling1d[0][0]              
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 32)           160         input_b[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 64)           401472      flatten[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 32)           1056        dense_1[0][0]                    
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 96)           0           dense[0][0]                      
                                                                 dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 256)          24832       concatenate[0][0]                
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 2)            514         dense_3[0][0]                    
==================================================================================================
Total params: 489,362
Trainable params: 489,362
Non-trainable params: 0
__________________________________________________________________________________________________

How I generate the inputs:

input_a_train.shape: (35000, 200)
input_b_train.shape: (35000, 4)
y_train.shape: (35000, 2)

train_dataset = tf.data.Dataset.from_tensor_slices(({
                                                     "input_a": input_a_train,
                                                     "input_b": input_b_train,
                                                     }, y_train))
<TensorSliceDataset shapes: ({input_a: (200,), input_b: (4,)}, (2,)), types: ({input_a: tf.uint8, input_b: tf.float64}, tf.float32)>

val_dataset = tf.data.Dataset.from_tensor_slices(({
                                                     "input_a": input_a_val,
                                                     "input_b": input_b_val,
                                                     }, y_val))
<TensorSliceDataset shapes: ({input_a: (200,), input_b: (4,)}, (2,)), types: ({input_a: tf.uint8, input_b: tf.float64}, tf.float32)>

train_batches = train_dataset.padded_batch(128)
val_batches = val_dataset.padded_batch(128)
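
Unrelated to the warning itself, but worth adding once several GPUs consume the same pipeline: prefetching overlaps batch preparation on the host with device compute. A minimal sketch, using TF 2.2's tf.data.experimental.AUTOTUNE:

# Optional: overlap input preparation with GPU compute.
AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in newer TF versions
train_batches = train_dataset.padded_batch(128).prefetch(AUTOTUNE)
val_batches = val_dataset.padded_batch(128).prefetch(AUTOTUNE)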

I get the warning during the training phase,

history = my_model.fit(
    x=train_batches,
    epochs=3,
    verbose=1,
    validation_data=val_batches,
)

This is the output:

Epoch 1/3
INFO:tensorflow:batch_all_reduce: 12 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 12 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
274/274 [==============================] - ETA: 0s - loss: 0.1857 - accuracy: 0.9324
...

There is a similar question here: Efficient allreduce is not supported for 2 IndexedSlices, but it has no answer.
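
For context on where the IndexedSlices come from: the gradient of an embedding lookup is sparse (only the rows that were looked up get a gradient), and TensorFlow represents it as a tf.IndexedSlices rather than a dense tensor, which is exactly what the packed NCCL batch all-reduce declines to handle. The model above has one Embedding layer, hence "1 IndexedSlices". A standalone check, independent of the model:

import tensorflow as tf

params = tf.Variable(tf.random.normal([100, 8]))
with tf.GradientTape() as tape:
    looked_up = tf.nn.embedding_lookup(params, tf.constant([3, 7, 3]))
    loss = tf.reduce_sum(looked_up)

grad = tape.gradient(loss, params)
print(type(grad))  # tf.IndexedSlices: a sparse gradient, not a dense tensor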

Edit 1 (July 31, 2020)

I implemented a custom training loop, as described in these guides:

https://www.tensorflow.org/tutorials/distribute/custom_training

https://www.tensorflow.org/guide/distributed_training#using_tfdistributetestrategy_with_custom_training_loops

https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough

https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch

Same warning and same behavior on multiple GPUs. Performance drops as I add GPUs: training on 1 GPU is faster than on 2 GPUs, and the worst case is with 8 GPUs. I thought the problem might be in the keras model.fit method, but it is not. My guess was that something is wrong with the input data format for a multi-input Keras model built with the functional API.
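
One way to force the dense all-reduce path in such a custom training loop is to densify the sparse gradients before apply_gradients; this trades memory (the dense gradient has the full embedding-table shape) for the efficient packed all-reduce. A minimal sketch, assuming strategy, model, loss_fn and optimizer were created under the strategy scope as in the guides above:

@tf.function
def train_step(inputs, labels):
    def step_fn(inputs, labels):
        with tf.GradientTape() as tape:
            preds = model(inputs, training=True)
            loss = loss_fn(labels, preds)
        grads = tape.gradient(loss, model.trainable_variables)
        # Densify sparse Embedding gradients so the cross-replica
        # all-reduce runs on dense tensors (the efficient path).
        grads = [tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g
                 for g in grads]
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_loss = strategy.run(step_fn, args=(inputs, labels))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)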

Tags: python, tensorflow, keras, gpu

Solution


I used tf.distribute.experimental.MultiWorkerMirroredStrategy(). This strategy has better support for handling IndexedSlices; see https://github.com/tensorflow/tensorflow/issues/41898#issuecomment-668786507. Also, increase the batch size according to the number of GPUs used. This issue covers the problem: https://github.com/tensorflow/tensorflow/issues/41898

physical_devices = tf.config.list_physical_devices('GPU')  # 8 GPUs in my setup
tf.config.set_visible_devices(physical_devices[0:8], 'GPU')  # use all GPUs (the default behaviour)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Scale the global batch size with the number of replicas.
BATCH_SIZE_PER_REPLICA = 1024
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

train_batches = train_dataset.batch(GLOBAL_BATCH_SIZE)

# Note: build_model_() must no longer create its own MirroredStrategy scope
# or compile the model; it is built and compiled under this scope instead.
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(
    x=train_batches,
    epochs=10,
    verbose=1,
)
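
If you want to pin the collective implementation rather than rely on the default AUTO choice, the TF 2.2 constructor also takes a communication argument. A sketch (NCCL assumes an all-GPU setup):

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)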

