2015-08-31 50 views

How to save an IDFModel with PySpark

I have built an IDFModel with PySpark in an IPython notebook as follows:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

hashingTF = HashingTF()  # used to hash terms into term-frequency vectors

# returns an RDD of (filename, string) pairs, one per file in the directory
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey()

split_data_train = txtdata_train.map(parse)  # my parse function puts the RDD in the form I want

tf_train = hashingTF.transform(split_data_train)  # term-frequency sparse vectors for the training set
tf_train.cache()

idf_train = IDF().fit(tf_train)  # makes the IDFModel, THIS IS WHAT I WANT TO SAVE!!!

tfidf_train = idf_train.transform(tf_train)

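This is based on the guide at https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so that I can load it again later in a different notebook. However, I cannot find any information on how to do this; the closest I have found is: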

Save Apache Spark mllib model in python

But when I tried the suggestion from the answer there

idf_train.save(sc, "/home/ubuntu/newfolder") 

I get the error

AttributeError: 'IDFModel' object has no attribute 'save' 

Is there something I am missing, or is it not possible to save an IDFModel object? Thanks!
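
The AttributeError above can be avoided by checking for the method before calling it. A minimal sketch, assuming hypothetical stand-in classes in place of real Spark models; `save_if_supported`, `ModelWithSave`, and `ModelWithoutSave` are illustrative names and not part of PySpark:

```python
def save_if_supported(model, sc, path):
    """Call model.save(sc, path) if the method exists; return whether it did."""
    save = getattr(model, "save", None)
    if not callable(save):
        return False
    save(sc, path)
    return True


# Hypothetical stand-ins so the sketch runs without a Spark installation.
class ModelWithSave(object):
    def __init__(self):
        self.saved_to = None

    def save(self, sc, path):
        self.saved_to = path


class ModelWithoutSave(object):
    pass


with_save = ModelWithSave()
ok = save_if_supported(with_save, None, "/tmp/model")
not_ok = save_if_supported(ModelWithoutSave(), None, "/tmp/model")
```

With a real model, the return value would simply tell you whether that model class offers `save` in the Spark version you are running.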

I am using Spark 1.2.0 with Hadoop 2.4.0 – Matt

Have a look at the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html). 'IDFModel' has no 'save' method, whereas the model in the other SO question, 'RandomForestModel', does have one... – lrnzcig

You are right, thanks. This would be a worthwhile addition – Matt

Answers


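I did something similar in Scala/Java. It seems to work, but it may not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)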

import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}

// `idf` is the fitted IDFModel and `savePath` an existing local directory, both defined elsewhere.
try {
    val fileOut = new FileOutputStream(savePath + "/idf.jserialized")
    val out = new ObjectOutputStream(fileOut)
    out.writeObject(idf)  // Java serialization of the model object
    out.close()           // also closes the underlying file stream
    println("\nSerialization successful. Check your specified output file.\n")
} catch {
    case foe: FileNotFoundException => foe.printStackTrace()
    case ioe: IOException => ioe.printStackTrace()
}
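
In Python, the same write-then-read-back idea might look like the sketch below. It does not touch the PySpark IDFModel itself (that object wraps a JVM model, so pickling it directly may well fail). Instead it assumes a hypothetical plain-Python stand-in, `fit_idf`, which computes the smoothed weights log((m + 1) / (df + 1)) described in the MLlib feature-extraction guide, keyed by term rather than by hashed index. The function names, toy corpus, and file name are all assumptions made for illustration.

```python
import math
import os
import pickle
import tempfile


def fit_idf(docs):
    """Return {term: idf} using the smoothed formula log((m + 1) / (df + 1))."""
    m = len(docs)
    doc_freq = {}
    for terms in docs:
        for term in set(terms):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((m + 1.0) / (df + 1.0)) for t, df in doc_freq.items()}


def save_weights(weights, path):
    """Serialize the weights to a local file."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)


def load_weights(path):
    """Read the weights back, e.g. from a different notebook."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical toy corpus standing in for the parsed training documents.
docs = [["spark", "idf"], ["spark", "model"], ["spark"]]
weights = fit_idf(docs)

path = os.path.join(tempfile.mkdtemp(), "idf.pickle")
save_weights(weights, path)
restored = load_weights(path)
```

Only plain Python data is serialized here, so the file can be reloaded without a running SparkContext; applying the weights to hashed term-frequency vectors would still need extra code.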