
I am trying to resolve an out-of-memory issue I am seeing in my Spark setup, and at this point I cannot offer a concrete analysis of why it happens. I always see the issue when writing a dataframe to Parquet or Kafka: the write fails with Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError. My dataframe has 5000 rows, and its schema is:

root 

    |-- A: string (nullable = true) 
    |-- B: string (nullable = true) 
    |-- C: string (nullable = true) 
    |-- D: array (nullable = true) 
    | |-- element: string (containsNull = true) 
    |-- E: array (nullable = true) 
    | |-- element: string (containsNull = true) 
    |-- F: double (nullable = true) 
    |-- G: array (nullable = true) 
    | |-- element: double (containsNull = true) 
    |-- H: integer (nullable = true) 
    |-- I: double (nullable = true) 
    |-- J: double (nullable = true) 
    |-- K: array (nullable = true) 
    | |-- element: double (containsNull = false) 

Of these columns, G can hold cells up to 16MB in size. The total size of my dataframe is about 10GB, split into 12 partitions. Before writing I tried to create 48 partitions with repartition(), but the problem occurs even when I write without repartitioning. At the time of the exception I have only one dataframe cached, about 10GB in size. My driver has 19GB of free memory and the 2 executors have 8GB of free memory each. The Spark version is 2.1.0.cloudera1 and the Scala version is 2.11.8.
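
Roughly, the write path looks like the following sketch (the SparkSession setup, dataframe source, and output path are simplified placeholders; the actual job is larger):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("parquet-write-sketch").getOrCreate()

    // Stands in for the ~10GB, 12-partition dataframe with the schema above.
    val df: DataFrame = spark.read.parquet("/tmp/input")

    // The OutOfMemoryError occurs both with and without this explicit repartition:
    df.repartition(48)
      .write
      .mode("overwrite")
      .parquet("/tmp/output")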

I have the following settings:

spark.driver.memory  35G 
spark.executor.memory 25G 
spark.executor.instances 2 
spark.executor.cores 3 
spark.driver.maxResultSize  30g 
spark.serializer  org.apache.spark.serializer.KryoSerializer 
spark.kryoserializer.buffer.max 1g 
spark.rdd.compress  true 
spark.rpc.message.maxSize  2046 
spark.yarn.executor.memoryOverhead  4096 
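
For illustration, the same values could also be applied programmatically through SparkConf, as in the sketch below (in practice spark.driver.memory usually has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory, to take effect):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Mirrors the configuration listed above; values are copied verbatim.
    val conf = new SparkConf()
      .set("spark.driver.memory", "35g")      // shown for completeness; normally set on spark-submit
      .set("spark.executor.memory", "25g")
      .set("spark.executor.instances", "2")
      .set("spark.executor.cores", "3")
      .set("spark.driver.maxResultSize", "30g")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "1g")
      .set("spark.rdd.compress", "true")
      .set("spark.rpc.message.maxSize", "2046")
      .set("spark.yarn.executor.memoryOverhead", "4096")

    val spark = SparkSession.builder().config(conf).getOrCreate()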

The exception trace is:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError 
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) 
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) 
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) 
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) 
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) 
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) 
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) 
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) 
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) 
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) 
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:991) 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:765) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:764) 
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) 
    at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:764) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1228) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 

Any insights?


Had a similar issue. Increasing the Java heap size solved it. See https://stackoverflow.com/q/1565388/5039312 – Marco

Answer


We finally found the issue. We were running k-fold logistic regression in Scala with k = 4 on the 5000-row dataframe. After classification we essentially ended up with 4 test-output dataframes of 1250 rows each, and each of them was partitioned into at least 200 partitions. So we had more than 800 partitions over 5000 rows of data. The code then repartitioned all of this down to 48 partitions. Our system apparently could not handle that repartitioning, probably because of the shuffle. To fix it, we repartitioned each fold's output dataframe to a smaller number (instead of repartitioning the combined dataframe), and that resolved the issue.
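
A sketch of the shape of that fix, with the fold dataframes and the target partition count as placeholders:

    import org.apache.spark.sql.DataFrame

    // `foldOutputs` stands in for the 4 test-output dataframes (~1250 rows each),
    // each of which came back with 200+ partitions after classification.
    def combineFolds(foldOutputs: Seq[DataFrame]): DataFrame = {
      foldOutputs
        .map(_.repartition(12))   // shrink each fold individually; 12 is a placeholder value
        .reduce(_ union _)        // instead of unioning first and repartitioning 800+ partitions at once
    }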