首页 > 解决方案 > Tensorflow2.0 MultiWorkerMirroredStrategy 示例挂起

问题描述

我遵循了官方 tensorflow 网站上的示例。
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras

这是我的规格
WSL
Ubuntu 16.04.6 LTS
Tensorflow2.0
无 GPU 可用

我有一个名为“tfexample.py”的文件,看起来像这样

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os

tfds.disable_progress_bar()

os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 0},
    }
)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
BUFFER_SIZE = 10000
BATCH_SIZE = 64


def make_datasets_unbatched():
    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)

    return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)


train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)


def build_and_compile_cnn_model():
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model


# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)


NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
with strategy.scope():
    # Creation of dataset, and model building/compiling need to be within
    # `strategy.scope()`.
    train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
    multi_worker_model = build_and_compile_cnn_model()

# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)

当我运行这个文件时

python tfexample.py

终端就像下面一样挂起

2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800   24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500   24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345 

任何帮助将不胜感激!

标签: tensorflow2.0

解决方案


您是否使用正确的 TFconfig 在两个会话上运行 tfexample.py。我没有在同一台机器上尝试过两个实例


推荐阅读