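TensorFlow distributed training w/ Estimator + Experiment framework
Hi, I have a weird situation when trying to use the Estimator + Experiment classes for distributed training.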
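Here is an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
This is a simple case that uses:
- DNNClassifier from the official TF tutorial
- the Experiment framework
- 1 worker and 1 PS on the same host, on different ports.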
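For reference, the cluster layout described above is usually passed to each process through the TF_CONFIG environment variable. The following is only a minimal sketch of what that JSON might look like for this setup; the helper name make_tf_config is hypothetical, the addresses are taken from the PS log below, and the gist may build its configuration differently.

```python
import json
import os

# Addresses taken from the PS log in this post; adjust to your own setup.
CLUSTER = {
    "ps": ["localhost:9000"],
    "worker": ["127.0.0.1:9001"],
}


def make_tf_config(task_type, task_index=0):
    """Hypothetical helper: build the TF_CONFIG JSON string for one task."""
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })


# Each process sets its own TF_CONFIG before the RunConfig is constructed.
ps_config = make_tf_config("ps")
worker_config = make_tf_config("worker")

# Example: what the worker process would export.
os.environ["TF_CONFIG"] = worker_config
print(os.environ["TF_CONFIG"])
```

Both processes share the same "cluster" entry and differ only in the "task" entry, so a mismatch there is one of the first things worth checking.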
What happens is:
1) When I start the PS job, it looks good:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) When I start the worker job, it silently exits, leaving no log at all.
Eagerly seeking help.