2017-07-26 167 views

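H2O server crash

I have been working with H2O for the past year, and I am getting very tired of server crashes. I have given up on the "nightly releases" because my datasets crash them so easily. Please tell me where I can download a stable release.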

Charles

My environment is:

  • Windows 10 Enterprise, build 1607, 64 GB of RAM.
  • Java SE Development Kit 8 Update 77 (64-bit).
  • Anaconda Python 3.6.2-0.

I started the server with:

import h2o

# Start (or connect to) a local H2O server with a 12 GB heap and 4 threads.
localH2O = h2o.init(ip="localhost",
                    port=54321,
                    max_mem_size="12G",
                    nthreads=4)

The H2O initialization output is:

H2O cluster uptime:   12 hours 12 mins 
H2O cluster version:  3.10.5.2 
H2O cluster version age: 1 month and 6 days 
H2O cluster name:   H2O_from_python_Charles_ji1ndk 
H2O cluster total nodes: 1 
H2O cluster free memory: 6.994 Gb 
H2O cluster total cores: 8 
H2O cluster allowed cores: 4 
H2O cluster status:   locked, healthy 
H2O connection url:   http://localhost:54321 
H2O connection proxy: 
H2O internal security:  False 
Python version:    3.6.2 final 

The crash message is:

OSError: Job with key $03017f00000132d4ffffffff$_a0ce9b2c855ea5cff1aa58d65c2a4e7c failed with an exception: java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352 
stacktrace: 
java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352 
    at water.MemoryManager.set_goals(MemoryManager.java:97) 
    at water.MemoryManager.malloc(MemoryManager.java:265) 
    at water.MemoryManager.malloc(MemoryManager.java:222) 
    at water.MemoryManager.arrayCopyOfRange(MemoryManager.java:291) 
    at water.AutoBuffer.expandByteBuffer(AutoBuffer.java:719) 
    at water.AutoBuffer.putA4f(AutoBuffer.java:1355) 
    at hex.deeplearning.Storage$DenseRowMatrix$Icer.write129(Storage$DenseRowMatrix$Icer.java) 
    at hex.deeplearning.Storage$DenseRowMatrix$Icer.write(Storage$DenseRowMatrix$Icer.java) 
    at water.Iced.write(Iced.java:61) 
    at water.AutoBuffer.put(AutoBuffer.java:771) 
    at water.AutoBuffer.putA(AutoBuffer.java:883) 
    at hex.deeplearning.DeepLearningModelInfo$Icer.write128(DeepLearningModelInfo$Icer.java) 
    at hex.deeplearning.DeepLearningModelInfo$Icer.write(DeepLearningModelInfo$Icer.java) 
    at water.Iced.write(Iced.java:61) 
    at water.AutoBuffer.put(AutoBuffer.java:771) 
    at hex.deeplearning.DeepLearningModel$Icer.write105(DeepLearningModel$Icer.java) 
    at hex.deeplearning.DeepLearningModel$Icer.write(DeepLearningModel$Icer.java) 
    at water.Iced.write(Iced.java:61) 
    at water.Iced.asBytes(Iced.java:42) 
    at water.Value.<init>(Value.java:348) 
    at water.TAtomic.atomic(TAtomic.java:22) 
    at water.Atomic.compute2(Atomic.java:56) 
    at water.Atomic.fork(Atomic.java:39) 
    at water.Atomic.invoke(Atomic.java:31) 
    at water.Lockable.unlock(Lockable.java:181) 
    at water.Lockable.unlock(Lockable.java:176) 
    at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:491) 
    at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:311) 
    at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216) 
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173) 
    at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209) 
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349) 
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) 
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) 
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974) 
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477) 
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) 

Answers

1

You need a bigger boat.

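The error message says "heapUsedGC=11482667352", which is higher than MEM_MAX. Why not give it more of your 64 GB, instead of max_mem_size="12G"? Or build a less ambitious model (fewer hidden nodes, less training data, that kind of thing). (Obviously, ideally, h2o should not crash, and should instead abort gracefully when it gets close to using all available memory. If you are able to share your data and code with H2O, it might be worth opening a bug report on their JIRA.)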
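To make the comparison in the error message concrete, here is a minimal sketch that pulls the two numbers out of the assertion text from the question and shows how far the used heap exceeds the limit. The helper name parse_heap_error is made up for illustration; it is not part of h2o.

```python
import re

# Error text copied from the stack trace in the question.
ERROR = ("java.lang.AssertionError: I am really confused about the heap usage; "
         "MEM_MAX=11453595648 heapUsedGC=11482667352")


def parse_heap_error(message):
    """Return (mem_max, heap_used) in bytes, or None if the text does not match.

    Hypothetical helper for illustration only; not an h2o function.
    """
    match = re.search(r"MEM_MAX=(\d+)\s+heapUsedGC=(\d+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


mem_max, heap_used = parse_heap_error(ERROR)
overshoot = heap_used - mem_max  # bytes used beyond the reported limit

print(f"heap limit: {mem_max / 2**30:.2f} GiB")
print(f"heap used : {heap_used / 2**30:.2f} GiB")
print(f"overshoot : {overshoot / 2**20:.1f} MiB")
```

The overshoot is small compared with the heap itself, which fits the reading that the model simply ran out of room rather than that something leaked wildly.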

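By the way, I have been running h2o 3.10.x.x as the back end of a web server process for about 9 months, restarting it automatically at weekends, and it has not crashed once. Well, it did once: after I left it running for 3 weeks, it filled up all its memory with more and more data and models. That is why I switched to restarting it weekly and keeping only the models I need. (Incidentally, this is an AWS instance with 4 GB of memory; it is restarted by a cron job and a bash command.)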
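The "only keep the models I need" housekeeping could be sketched roughly as below. The function keys_to_remove and the example keys are invented for illustration. The commented-out part assumes the Python client's h2o.ls() (a table of object keys) and h2o.remove(); check both against the h2o version you have installed.

```python
def keys_to_remove(all_keys, keep):
    """Return the keys that are not in the keep list, preserving order.

    Hypothetical helper for illustration only; not an h2o function.
    """
    keep = set(keep)
    return [key for key in all_keys if key not in keep]


# Made-up keys standing in for what a long-running cluster might hold.
all_keys = ["train.hex", "dl_model_week1", "dl_model_week2", "dl_model_week3"]
stale = keys_to_remove(all_keys, keep=["train.hex", "dl_model_week3"])
print(stale)

# Against a live cluster (not run here; the API usage is an assumption):
#   import h2o
#   wanted = ["train.hex", "dl_model_week3"]
#   for key in keys_to_remove(list(h2o.ls()["key"]), keep=wanted):
#       h2o.remove(key)
```

Running something like this on a schedule is an alternative to, or a complement of, the weekly restart described above.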

Thank you for the excellent comments, Darren. I will increase the memory size and see what happens. – CBrauer

0

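You can always download the latest stable release from https://www.h2o.ai/download (there is a link labeled "latest stable release"). The latest stable Python package can be downloaded via PyPI and Anaconda; the latest stable R package is available on CRAN.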

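I agree with Darren that you probably need more memory. If your H2O cluster has enough memory, H2O should not crash. We usually say that, to train one model, you should have a cluster at least 3-4 times the size of your training set on disk. However, if you are building a grid of models, or many models, you will need more memory so that there is enough room to store all of those models as well.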
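As a rough illustration of the 3-4x rule of thumb, the sketch below turns a training-set size in bytes into a value that could be passed as max_mem_size. The function name recommended_max_mem_size, the default factor of 4, and the 5 GiB example file are assumptions chosen for the example; the result covers a single model only.

```python
import math


def recommended_max_mem_size(train_bytes, factor=4):
    """Return a max_mem_size string in whole gigabytes, rounded up.

    Hypothetical helper: applies the "3-4x the on-disk training set"
    rule of thumb for ONE model. Grids or many models need more.
    """
    if train_bytes < 0 or factor <= 0:
        raise ValueError("train_bytes must be >= 0 and factor must be > 0")
    gigabytes = math.ceil(train_bytes * factor / 2**30)
    return f"{max(gigabytes, 1)}G"


# A hypothetical 5 GiB training file with the conservative factor of 4.
print(recommended_max_mem_size(5 * 2**30))            # 20G
# The same file with the lower end of the rule.
print(recommended_max_mem_size(5 * 2**30, factor=3))  # 15G
```

In practice the byte count could come from os.path.getsize on the training file, and the returned string would go into h2o.init(max_mem_size=...) in place of "12G".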