Your problem is that, in addition to the global variables handled by the chief worker, the non-chief workers have their own sets of local variables, and those must be initialized when a worker restarts.
Take a look at this example by abenmao. You can create session-run hooks that initialize the local or global variables, then create the MonitoredTrainingSession with the correct hook depending on whether the worker is the chief:
```python
ma_hook = ma.make_ma_run_hook()
# And also, create the hook which handles initialization and queues.
ma_replicas_hook = ma.make_session_run_hook(is_chief)
```
In the training program, every worker runs the train_op as if there were no model averaging or synchronization. Note that if you want to run other ops, such as a test op, you should use a common session instead of the MonitoredSession:
```python
with training.MonitoredTrainingSession(
        master=workers[worker_id].target, is_chief=is_chief,
        hooks=[ma_replicas_hook, ma_hook]) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(training_op)
    ...
```

For reference, `make_session_run_hook` picks the initialization ops according to whether the worker is the chief:

```python
def make_session_run_hook(self, is_chief, num_tokens=0):
    """Creates a hook to handle ReplicasHook ops such as initialization."""
    if self._ma_run_hook is False:
        raise ValueError("make_session_run_hook should be "
                         "called after make_ma_run_hook.")
    if is_chief:
        return self._ReplicasHook(self.chief_init_op,
                                  self.ready_for_local_init_op,
                                  self.get_chief_queue_runner(),
                                  self.get_init_tokens_op(num_tokens))
    return self._ReplicasHook(self.local_step_init_op,
                              self.ready_for_local_init_op, None, None)
```
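The chief/non-chief dispatch above can be illustrated with a minimal plain-Python sketch (no TensorFlow dependency): a hook runs its init op each time a session is (re)created, so a restarted worker re-initializes its local variables while only the chief runs the global initialization. All class and op names below are illustrative stand-ins, not the real TF API.

```python
class ReplicasHook:
    """Mimics a SessionRunHook whose after_create_session runs an init op."""
    def __init__(self, init_op):
        self._init_op = init_op

    def after_create_session(self, session):
        # Called once per (re)created session, so a restarted worker
        # re-initializes its local variables here.
        session.run(self._init_op)


def make_session_run_hook(is_chief):
    # Chief initializes global (shared) state; other workers only local state.
    return ReplicasHook("chief_init_op" if is_chief else "local_init_op")


class FakeSession:
    """Stand-in for a TF session that just records what it ran."""
    def __init__(self):
        self.ran = []

    def run(self, op):
        self.ran.append(op)


chief_sess, worker_sess = FakeSession(), FakeSession()
make_session_run_hook(True).after_create_session(chief_sess)
make_session_run_hook(False).after_create_session(worker_sess)
print(chief_sess.ran)   # ['chief_init_op']
print(worker_sess.ran)  # ['local_init_op']
```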
Thanks for your reply! The answer seems to solve my other question ;) And the example link is good: `model_average_device_setter()` can place the global model variables on the PS while each worker also keeps a local replica. Because current TF only supports `tf.device(tf.train.replica_device_setter())`, all variable tensors are placed on the PS and workers have no local replicas of them, which leads to the heavy communication cost described above. –
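To make the commenter's point concrete, here is a minimal plain-Python mimic (a hypothetical sketch, not the real TF implementation) of what `tf.train.replica_device_setter` does: variable ops are assigned round-robin across the PS tasks while other ops stay on the worker, so workers hold no local copies of the variables and every read/update crosses the network.

```python
def mock_replica_device_setter(ps_tasks):
    """Return a device-choosing function mimicking round-robin PS placement."""
    state = {"next_ps": 0}

    def device_for(op_type):
        if op_type in ("Variable", "VariableV2"):
            device = "/job:ps/task:%d" % (state["next_ps"] % ps_tasks)
            state["next_ps"] += 1
            return device
        return "/job:worker"  # compute ops run on the worker itself

    return device_for


setter = mock_replica_device_setter(ps_tasks=2)
print(setter("Variable"))  # /job:ps/task:0
print(setter("Variable"))  # /job:ps/task:1
print(setter("Variable"))  # /job:ps/task:0  (round-robin wraps around)
print(setter("MatMul"))    # /job:worker
```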