tensorflow2.0 - Tensorflow2.0 MultiWorkerMirroredStrategy 示例挂起
问题描述
我遵循了官方 tensorflow 网站上的示例。
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
这是我的规格
WSL
Ubuntu 16.04.6 LTS
Tensorflow2.0
无 GPU 可用
我有一个名为“tfexample.py”的文件,看起来像这样
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os
tfds.disable_progress_bar()
os.environ["TF_CONFIG"] = json.dumps(
{
"cluster": {"worker": ["localhost:12345", "localhost:23456"]},
"task": {"type": "worker", "index": 0},
}
)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
BUFFER_SIZE = 10000
BATCH_SIZE = 64
def make_datasets_unbatched():
# Scaling MNIST data from (0, 255] to (0., 1.]
def scale(image, label):
image = tf.cast(image, tf.float32)
image /= 255
return image, label
datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)
train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)
def build_and_compile_cnn_model():
model = tf.keras.Sequential(
[
tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(64, activation="relu"),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
model.compile(
loss=tf.keras.losses.sparse_categorical_crossentropy,
optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
metrics=["accuracy"],
)
return model
# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
with strategy.scope():
# Creation of dataset, and model building/compiling need to be within
# `strategy.scope()`.
train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
multi_worker_model = build_and_compile_cnn_model()
# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
当我运行这个文件时
python tfexample.py
终端就像下面一样挂起
2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800 24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500 24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
任何帮助将不胜感激!
解决方案
您是否使用正确的 TFconfig 在两个会话上运行 tfexample.py。我没有在同一台机器上尝试过两个实例
推荐阅读
- regex - 使用 RegEx 解析地址字段中的合法描述
- youtube - Youtube 视频嵌入缩略图
- kubernetes - 我可以让 NGinX 入口控制器代理外部(不在我的云中)资源吗?
- python - 我安装了 bs4,但随后得到“没有名为 'bs4' 的模块”
- azure - Azure Web Job App.Config 转换在 TFS 发布管道中不起作用
- python - 如何克服 Gnu Radio 3.8 版中的 AttributeError
- java - 某些 JUnit 测试在 Ant 中花费大量时间
- java - 从不同的类调用String方法到主类
- python - 在python中读取数组文件
- google-cloud-platform - 通过 REST 添加新版本的秘密