Spark on YARN: Container exited with a non-zero exit code 143

I am using HDP 2.5 and running spark-submit in yarn cluster mode.

I am trying to generate data using a DataFrame cross join, i.e.

val generatedData = df1.join(df2).join(df3).join(df4) 
generatedData.saveAsTable(...)....

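df1 has storage level MEMORY_AND_DISK

df2, df3 and df4 have storage level MEMORY_ONLY

df1 has far more records (5 million), while df2 to df4 have at most 100 records each. With this setup the explain plan shows a BroadcastNestedLoopJoin, which as I understand it should perform better.

For some reason it always fails. I don't know how to debug it or where the memory is blowing up.

Error log output: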

16/12/06 19:44:08 WARN YarnAllocator: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143 
Killed by external signal 

16/12/06 19:44:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143 
Killed by external signal 

16/12/06 19:44:08 ERROR YarnClusterScheduler: Lost executor 1 on hdp4: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143 
Killed by external signal 

16/12/06 19:44:08 WARN TaskSetManager: Lost task 1.0 in stage 12.0 (TID 19, hdp4): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143 
Killed by external signal
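For reference, exit status 143 follows the usual shell convention of 128 plus the signal number: 143 - 128 = 15, which is SIGTERM. That matches the "Killed by external signal" line in the log: the container was terminated from outside rather than exiting on its own. A small sketch that reproduces the same status locally (it only illustrates the convention and does not involve YARN or Spark):

```shell
# Start a child shell that sends SIGTERM to itself and capture its exit status.
status=0
sh -c 'kill -TERM $$' || status=$?

# By convention, a process killed by signal N is reported as 128 + N.
signal_number=$((status - 128))
signal_name=$(kill -l "$status")   # maps the exit status back to a signal name

echo "exit status: $status, signal: $signal_number (SIG$signal_name)"
```

The status alone does not say why the container was killed, so the container log is still the place to look for the actual cause.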

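I did not see any WARN or ERROR logs before this error. What is the problem? Where should I look for the memory consumption? I cannot see anything on the Storage tab of the Spark UI. The log was taken from the YARN ResourceManager UI on HDP 2.5.

Edit: looking at the container log, it seems to be a java.lang.OutOfMemoryError: GC overhead limit exceeded

I know how to increase the memory, but I don't have any more memory. How can I do a Cartesian / product join of 4 DataFrames without getting this error?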

Source

2016-12-06 David H

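If the sizes of the DataFrames are as you suggest (5e6, 100, 100, 100), the Cartesian product will have about 5e12 records, i.e. 5 trillion. You didn't mention the number of columns, but if you have a single integer column, this will require terabytes of storage. If you have multiple columns, the joined dataset could require hundreds or thousands of terabytes. Is this really what you want? – abeboparebop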
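To make the arithmetic in this comment concrete, here is a rough back-of-the-envelope calculation. It assumes a single 4-byte integer per output row and ignores all serialization and per-object overhead; both are simplifying assumptions, not figures from the question, so the real footprint would be larger.

```shell
# Row counts from the question: 5 million rows, then three tables of at most 100 rows each.
rows_df1=5000000
rows_small=100

# A cross join multiplies the row counts of its inputs.
total_rows=$((rows_df1 * rows_small * rows_small * rows_small))

# Assumption: one 4-byte integer per row, no serialization or object overhead.
bytes_per_row=4
total_bytes=$((total_rows * bytes_per_row))
total_tb=$((total_bytes / 1000000000000))

echo "rows:  $total_rows"
echo "bytes: $total_bytes (about $total_tb TB)"
```

Under these assumptions the result is 5e12 rows and about 20 TB for the raw values alone.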

1 column. This is a data generator tool, and it is blowing up the memory. –

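The log files of all containers are available with

yarn logs -applicationId application_1480922439133_0845

If you only want the AM logs,

yarn logs -am -applicationId application_1480922439133_0845

If you want to find the containers that ran for this job,

yarn logs -applicationId application_1480922439133_0845 | grep container_e33_1480922439133_0845_02

If you only want the log of a single container,

yarn logs -containerId container_e33_1480922439133_0845_02_000002

For these commands to work, log aggregation must be set to true; otherwise you have to collect the logs from the individual servers' log directories.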

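Update: there is nothing you can do apart from trying swap, and that will degrade performance.

"GC overhead limit exceeded" means that the GC has been running almost continuously but has not been able to recover much memory. The only reasons for that are either that the code is poorly written and holds a lot of back references (which is doubtful, since you are doing a simple join), or that the memory capacity has been reached.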

Source

2016-12-06 18:36:41

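Thanks for the help, I have already figured out what the problem is. If you know how to solve it, I would be very grateful (I am updating the question). –

I also ran into this problem and tried to solve it by referring to some blogs.

1. Run Spark with the conf below:

--conf 'spark.driver.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' \
--conf 'spark.executor.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC' \

2. When the JVM runs a GC, you will get a message like the following: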

Heap after GC invocations=157 (full 98): 
PSYoungGen  total 940544K, used 853456K [0x0000000781800000, 0x00000007c0000000, 0x00000007c0000000) 
    eden space 860160K, 99% used [0x0000000781800000,0x00000007b5974118,0x00000007b6000000) 
    from space 80384K, 0% used [0x00000007b6000000,0x00000007b6000000,0x00000007bae80000) 
    to space 77824K, 0% used [0x00000007bb400000,0x00000007bb400000,0x00000007c0000000) 
ParOldGen  total 2048000K, used 2047964K [0x0000000704800000, 0x0000000781800000, 0x0000000781800000) 
    object space 2048000K, 99% used [0x0000000704800000,0x00000007817f7148,0x0000000781800000) 
Metaspace  used 43044K, capacity 43310K, committed 44288K, reserved 1087488K 
    class space used 6618K, capacity 6701K, committed 6912K, reserved 1048576K 
}

3. When both PSYoungGen (here, its eden space) and ParOldGen are 99% full, you will get java.lang.OutOfMemoryError: GC overhead limit exceeded as soon as more objects are created.
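To spot this condition in a long GC log, the used/total ratio of each generation can be pulled out of the PrintHeapAtGC output. The following is a minimal sketch that assumes the line layout of the sample in step 2; gc_usage is a hypothetical helper name, not a JVM or Spark tool.

```shell
# Usage: gc_usage THRESHOLD < gc.log
# Prints each PSYoungGen/ParOldGen line whose usage is at or above THRESHOLD percent.
gc_usage() {
  awk -v threshold="$1" '
    $1 == "PSYoungGen" || $1 == "ParOldGen" {
      total = $3; used = $5          # e.g. "940544K," and "853456K"
      gsub(/[^0-9]/, "", total)      # strip the K suffix and the comma
      gsub(/[^0-9]/, "", used)
      total += 0; used += 0          # force numeric values
      if (total > 0) {
        pct = int(used * 100 / total)
        if (pct >= threshold) print $1, pct "%"
      }
    }'
}
```

For example, `gc_usage 95 < gc.log` lists only the generations that are at least 95% full. On the sample from step 2 it reports ParOldGen at 99%; PSYoungGen as a whole is at about 90%, because only its eden space is at 99%.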

4. Try adding more memory for your executors or your driver, if more memory resources are available:

--executor-memory 10000m \
--driver-memory 10000m \

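5. In my case, the memory for PSYoungGen was smaller than for ParOldGen, which causes many young objects to be promoted into the ParOldGen area until ParOldGen finally has no space left. That is when the java.lang.OutOfMemoryError: Java heap space error appears.

6. Add this conf for the executor: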

'spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps '

-XX:NewRatio=rate, where rate = ParOldGen/PSYoungGen
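As a rough illustration of what this ratio does, the sketch below splits a fixed heap for two NewRatio values. The 3072000K heap is inferred from the GC log in step 2 (2048000K for ParOldGen plus about 1024000K reserved for the young generation, judging by its address range); that inference and the split_heap helper are assumptions made for this example. The parallel collector may also resize generations adaptively, so real numbers can differ.

```shell
# Heap size inferred from the GC log: 2048000K old + about 1024000K young.
heap_kb=3072000

# -XX:NewRatio=N means old generation : young generation = N : 1,
# so young = heap / (N + 1) and old = heap - young.
split_heap() {
  new_ratio=$1
  young_kb=$((heap_kb / (new_ratio + 1)))
  old_kb=$((heap_kb - young_kb))
  echo "NewRatio=$new_ratio young=${young_kb}K old=${old_kb}K"
}

split_heap 2   # the split seen in the log
split_heap 1   # the setting suggested in step 6
```

With NewRatio=1 the young generation gets half of the heap instead of a third, so fewer short-lived objects should be promoted into ParOldGen early.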

7. It depends. You can try a different GC strategy, such as

-XX:+UseSerialGC :Serial Collector
-XX:+UseParallelGC :Parallel Collector
-XX:+UseParallelOldGC :Parallel Old collector
-XX:+UseConcMarkSweepGC :Concurrent Mark Sweep

Java Concurrent and Parallel GC

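8. If steps 4 and 6 are both done but you still get the error, you should consider changing your code. For example, reduce the number of iterations in the ML model.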

Source

2017-05-27 10:13:25 Matiji66
