How to save an IDFModel with PySpark

I have produced an IDFModel with PySpark in an IPython notebook as follows:

from pyspark import SparkContext 
from pyspark.mllib.feature import HashingTF 
from pyspark.mllib.feature import IDF 

hashingTF = HashingTF() #this will be used with hashing later 

txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory 

split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want 

tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set 

tf_train.cache() 

idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!! 

tfidf_train = idf_train.transform(tf_train)

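This is based on the guide at https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I want to save this model so that I can load it again later in a different notebook. However, I cannot find any information on how to do this. The closest I have found is: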

Save Apache Spark mllib model in python

However, when I tried the suggestion from that answer

idf_train.save(sc, "/home/ubuntu/newfolder")

I get this error:

AttributeError: 'IDFModel' object has no attribute 'save'

Is there something I am missing, or is it not possible to save an IDFModel object? Thanks!

Source

2015-08-31 Matt

I am using Spark 1.2.0 with Hadoop 2.4.0 – Matt

Take a look at the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html). 'IDFModel' does not have a 'save' method, whereas the model in the other SO question, 'RandomForestModel', does have one... – lrnzcig

You are right, thanks. That would be a worthwhile addition – Matt
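The point made in the comments, that some models expose `save` and others do not, can be checked at runtime before calling it. Below is a minimal sketch. It uses two hypothetical stand-in classes (`ModelWithSave`, `ModelWithoutSave`) rather than real Spark models so that it runs without a cluster, and `try_save` is an illustrative helper, not part of PySpark.

```python
class ModelWithSave(object):
    """Hypothetical stand-in for a model that exposes save(sc, path)."""

    def __init__(self):
        self.saved_to = None

    def save(self, sc, path):
        # A real model would write to storage; the stand-in only records the path.
        self.saved_to = path


class ModelWithoutSave(object):
    """Hypothetical stand-in for a model that has no save method."""


def try_save(model, sc, path):
    """Call model.save(sc, path) if the method exists; return True if it was called."""
    save = getattr(model, "save", None)
    if callable(save):
        save(sc, path)
        return True
    return False


with_save = ModelWithSave()
saved = try_save(with_save, None, "/tmp/model")            # sc is unused by the stand-in
skipped = try_save(ModelWithoutSave(), None, "/tmp/model")
print(saved, skipped)
```

Checking with `getattr` avoids the `AttributeError` and leaves room for a fallback when the method is missing.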

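I did something similar in Scala/Java. It seems to work, but it may not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)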

import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}

// idf is the fitted IDFModel; savePath is an existing local directory.
try {
    val fileOut = new FileOutputStream(savePath + "/idf.jserialized")
    val out = new ObjectOutputStream(fileOut)
    out.writeObject(idf) // write the model using Java serialization
    out.close()
    fileOut.close()
    println("\nSerialization successful. Check your specified output file.\n")
} catch {
    case foe: FileNotFoundException => foe.printStackTrace()
    case ioe: IOException => ioe.printStackTrace()
}
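The Scala snippet shows only the write side. In Python, the same write-then-read-back idea might look like the sketch below. It assumes the IDF weights are already available as a plain list of floats (`idf_weights` here is made-up example data). A PySpark IDFModel wraps a JVM object, so pickling the model object itself is unlikely to work. `save_weights`, `load_weights`, and `apply_idf` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_weights(weights, path):
    """Serialize a plain list of IDF weights to a file."""
    with open(path, "wb") as f:
        pickle.dump(list(weights), f)


def load_weights(path):
    """Read the weights back, for example in a different notebook."""
    with open(path, "rb") as f:
        return pickle.load(f)


def apply_idf(tf, weights):
    """Scale a sparse term-frequency dict {index: count} by the IDF weights."""
    return {i: count * weights[i] for i, count in tf.items()}


idf_weights = [0.0, 0.5, 1.25]                  # made-up example data
path = os.path.join(tempfile.mkdtemp(), "idf.pkl")
save_weights(idf_weights, path)
restored = load_weights(path)
tfidf = apply_idf({1: 2.0, 2: 4.0}, restored)   # {1: 1.0, 2: 5.0}
```

Pickle files should only be loaded from trusted sources. For a flat list of floats, JSON would work equally well.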

Source

2015-12-09 22:37:41 jarasss
