2017-03-27

Hi, I'm running into a problem when trying to do distributed training with the Estimator + Experiment classes. (TensorFlow distributed training w/ Estimator + Experiment framework)

Here is an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829

This uses, from the official TF tutorial:

  1. A simple DNNClassifier example
  2. The Experiment framework
  3. 1 worker and 1 PS on the same host, on different ports.
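For reference, a setup like this is described to TensorFlow through the `TF_CONFIG` environment variable. The sketch below (standard library only, ports matching the gist and the logs further down; any free ports would do) shows the shape of that dict for the PS process:

    import json
    import os

    # Sketch of the cluster layout from the question, expressed as the
    # TF_CONFIG dict that tf.contrib.learn reads from the environment.
    # Ports 9000/9001 mirror the gist; adjust to any free ports.
    cluster = {
        "cluster": {
            "ps": ["localhost:9000"],      # 1 parameter server
            "worker": ["127.0.0.1:9001"],  # 1 worker, same host, other port
        },
        "task": {"type": "ps", "index": 0},  # this process's role
    }

    os.environ["TF_CONFIG"] = json.dumps(cluster)

    # Each process (ps and worker) sets its own "task" entry before
    # constructing the estimator's RunConfig.
    parsed = json.loads(os.environ["TF_CONFIG"])
    print(parsed["task"]["type"])  # -> ps

The worker process would use the same `"cluster"` block but `"task": {"type": "worker", "index": 0}`.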

What happens is:

1) When I start the PS job, it looks fine:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000} 
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001} 
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000 

2) When I start the worker job, it exits by itself, leaving no log at all.

Eagerly seeking help.

Answers


I had the same problem and I finally found the solution.

The problem is in config._environment:

import json
import os

from tensorflow.contrib.learn.python.learn.estimators import run_config

config = {"cluster": {'ps':  ['127.0.0.1:9000'],
         'worker': ['127.0.0.1:9001']}}

if args.type == "worker":
    config["task"] = {'type': 'worker', 'index': 0}
else:
    config["task"] = {'type': 'ps', 'index': 0}

# TF_CONFIG must be set before RunConfig is constructed,
# since RunConfig reads it at construction time.
os.environ['TF_CONFIG'] = json.dumps(config)

config = run_config.RunConfig()

# Override the private _environment attribute so the cluster is not
# treated as LOCAL, which would prevent the server from starting.
config._environment = run_config.Environment.CLOUD

Set config._environment to Environment.CLOUD.

Then you can have a distributed training system.

I hope it makes you happy :)


I had the same problem. It is due to some internal TensorFlow code, I guess; I have already opened a question on SO for this: TensorFlow: minimalist program fails on distributed mode

I have also opened an issue: https://github.com/tensorflow/tensorflow/issues/8796

There are two ways to work around your problem. Since it is caused by your ClusterSpec having an implicit local environment, you can try setting another one (google or cloud), but I cannot assure you that the rest of your work won't be affected. So I preferred to have a look at the code and try to fix local mode myself, which is why I explain below.

You will find in those posts a more precise explanation of why it fails; the fact is that Google has been quite silent so far. What I did was modify their source code (in tensorflow/contrib/learn/python/learn/experiment.py):

# Start the server, if needed. It's important to start the server before 
# we (optionally) sleep for the case where no device_filters are set. 
# Otherwise, the servers will wait to connect to each other before starting 
# to train. We might as well start as soon as we can. 
config = self._estimator.config 
if (config.environment != run_config.Environment.LOCAL and 
    config.environment != run_config.Environment.GOOGLE and 
    config.cluster_spec and config.master): 
  self._start_server() 

(This part prevents the server from starting in local mode, which is what you get if you set none in your cluster spec. So you should simply comment out config.environment != run_config.Environment.LOCAL and, and it should work.)
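To see why the patch helps, the guard can be restated as a plain predicate. The sketch below (hypothetical names, standard library only) mirrors the condition above: a LOCAL environment never starts the server, while commenting out the LOCAL check (modeled here by a flag) does:

    # Hypothetical restatement of the guard in experiment.py: the server is
    # started only when the environment is neither LOCAL nor GOOGLE and a
    # cluster spec plus master address are present.
    LOCAL, GOOGLE, CLOUD = "local", "google", "cloud"

    def should_start_server(environment, cluster_spec, master,
                            skip_local_check=False):
        """Mirror of the condition; skip_local_check=True corresponds
        to commenting out the LOCAL comparison."""
        local_ok = skip_local_check or environment != LOCAL
        return bool(local_ok and environment != GOOGLE
                    and cluster_spec and master)

    # Unpatched: a local-mode cluster never starts its server ...
    print(should_start_server(LOCAL, {"ps": ["localhost:9000"]},
                              "grpc://localhost:9000"))  # False
    # ... while with the LOCAL check commented out, it does.
    print(should_start_server(LOCAL, {"ps": ["localhost:9000"]},
                              "grpc://localhost:9000",
                              skip_local_check=True))  # True

This also makes it clear why answer 1's `config._environment = Environment.CLOUD` override works: it simply moves the environment out of LOCAL so the original condition passes.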