当任务管理器失败时，Flink预期HA行为

我创建了一个由1个JobManager和2个任务管理器组成的HA Flink v1.2群集，每个群集都在其自己的VM（不使用YARN或hdfs）中。在JobManager节点上启动作业后，我终止了一个TaskManager实例。立即在Web仪表板中，我可以看到该作业被取消，然后失败。如果我检查日志：当任务管理器失败时，Flink预期HA行为

03/06/2017 16:23:50 Flat Map(1/2) switched to DEPLOYING 
03/06/2017 16:23:50 Flat Map(2/2) switched to SCHEDULED 
03/06/2017 16:23:50 Flat Map(2/2) switched to DEPLOYING 
03/06/2017 16:23:50 Flat Map(1/2) switched to RUNNING 
03/06/2017 16:23:50 Source: Custom Source -> Flat Map(1/2) switched to RUNNING 
03/06/2017 16:23:50 Flat Map(2/2) switched to RUNNING 
03/06/2017 16:23:50 Source: Custom Source -> Flat Map(2/2) switched to RUNNING 
03/06/2017 16:25:38 Flat Map(1/2) switched to FAILED 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-106-0-238/10.106.0.238:40578'. This might indicate that the remote task manager was lost. 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:118) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223) 
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223) 
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223) 
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829) 
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610) 
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) 
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) 
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
    at java.lang.Thread.run(Thread.java:745) 

03/06/2017 16:25:38 Job execution switched to status FAILING. 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-106-0-238/10.106.0.238:40578'. This might indicate that the remote task manager was lost. 
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:118) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223) 
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223) 
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223) 
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829) 
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610) 
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) 
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) 
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
    at java.lang.Thread.run(Thread.java:745) 
03/06/2017 16:25:38 Source: Custom Source -> Flat Map(1/2) switched to CANCELING 
03/06/2017 16:25:38 Source: Custom Source -> Flat Map(2/2) switched to CANCELING 
03/06/2017 16:25:38 Flat Map(2/2) switched to CANCELING 
03/06/2017 16:25:38 Source: Custom Source -> Flat Map(1/2) switched to CANCELED 
03/06/2017 16:26:18 Source: Custom Source -> Flat Map(2/2) switched to CANCELED 
03/06/2017 16:26:18 Flat Map(2/2) switched to CANCELED

在作业执行我

env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, // number 
                   // of 
                   // restart 
                   // attempts 
     Time.of(10, TimeUnit.SECONDS) // delay 
));

我的问题是不应该的JobManager自动重定向到其余/运行任务管理器的所有要求？同样，如果我启动JobManager和1个TaskManager实例并运行作业，当我启动第2个TaskManager实例时它是否也有助于解决正在运行的作业？

谢谢！

来源

2017-03-06 razvan

首先RestartStrategy与HA模式无关。高可用性涉及JobManager的可用性。无论如何，医管局工作至少需要两个JobManager的实例（你说你只是开始一个）。

至于何时指定fixedDelayRestart战略后失效（如你的情况时，例如杀任务管理器），作业将尝试再次运行（在你的情况下10秒后）的RestartStrategy。如果在安装中不是这种情况，那么可能缺少可用资源来运行作业（假设每个TaskManager有1个任务插槽，因此当剩下一个时，您无法运行并行度为2或更高的作业）。

对于最后一个问题，添加一个TaskManager不会影响正在运行的作业。以某种方式连接的行为称为动态缩放。您可以通过获取保存点然后使用更多资源重新运行保存点来完成此操作。看看here。自动重新调整正在进行中。

来源

2017-03-08 09:27:41

嗨Dawid，谢谢你的答案，它为我清理了一些事情。我做了一个新的测试，并行度设置为1，每个TaskManager设置为1个插槽。你是对的，工作在剩余的TaskManager上被重试，但是我得到了一个错误：java.io.FileNotFoundException：/ home/ubuntu/Prototype/flink/flink-checkpoints/6fc6168a1e5a6a27f58f6d57deeacb65/chk-37/31c325f7-2b57-4e6b -bc20-3f6e9390a724（没有这样的文件或目录）。似乎检查点在第二个TaskManager上不可用。这会导致作业失败。你知道在TaskManager之间是否同步检查点？ – razvan

检查点的存储位置取决于使用的StateBackend。有关更多信息，请参阅：https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#state-backends –

当然，很明显，我使用文件系统作为后端状态。但是我在每个TaskManager上设置了本地路径（它们在不同的虚拟机中），并期望框架在JobManager中保持状态同步。显然不是，如果TaskManager出现故障并且本地保存了检查点，则作业将失败。你知道是否所有的TaskManagers后端状态localtion应该指向相同的路径吗？它生成名称为UUID的文件夹，希望确保不会有冲突。 – razvan

当任务管理器失败时，Flink预期HA行为

回答

相关问题