How to save an IDFModel with PySpark
I have created an IDFModel with PySpark in an IPython notebook as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
hashingTF = HashingTF() #this will be used with hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory
split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want
tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!
tfidf_train = idf_train.transform(tf_train)
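This is based on the guide at https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so that I can load it again later in a different notebook. However, I could not find any information on how to do this. The closest thing I found is: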
Save Apache Spark mllib model in python
However, when I tried the suggestion from the answer there:
idf_train.save(sc, "/home/ubuntu/newfolder")
I get the error:
AttributeError: 'IDFModel' object has no attribute 'save'
Is there something I am missing, or is it not possible to save an IDFModel object? Thanks!
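One quick way to see whether a given model object supports persistence in the installed version is to inspect it before calling `save`. The following is only a minimal sketch: `ModelWithSave` and `ModelWithoutSave` are hypothetical stand-in classes (not real MLlib models) so that it runs without Spark, and `supports_save` is an illustrative helper name.

```python
class ModelWithSave(object):
    """Hypothetical stand-in for a model class that exposes save()."""

    def save(self, sc, path):
        # A real model would write itself to `path`; this stub only echoes it.
        return path


class ModelWithoutSave(object):
    """Hypothetical stand-in for a model class with no save() method."""
    pass


def supports_save(model):
    """Return True if `model` has a callable attribute named 'save'."""
    return callable(getattr(model, "save", None))


for model in (ModelWithSave(), ModelWithoutSave()):
    print(type(model).__name__, supports_save(model))
```

In the notebook, the same check could be applied to `idf_train`; a result of `False` would be consistent with the `AttributeError` above.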
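I am using Spark 1.2.0 with Hadoop 2.4.0. – Matt
Take a look at the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html). `IDFModel` has no `save` method, whereas `RandomForestModel`, the model in the other SO question, does have one... – lrnzcig
You are right, thanks. That would be a worthwhile addition. – Matt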
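Until such a method is available, one possible workaround is to compute and persist the IDF weights yourself. The sketch below is plain Python, not the MLlib API: it assumes each document's term frequencies are available as a dict of `{feature_index: count}`, and it uses the formula given in the MLlib feature extraction guide, idf(t) = log((|D| + 1) / (DF(t) + 1)). The names `fit_idf`, `save_idf`, `load_idf`, and `apply_idf` are hypothetical helpers introduced for illustration.

```python
import json
import math
import os
import tempfile
from collections import Counter


def fit_idf(term_freqs):
    """Compute IDF weights from a list of {feature_index: count} dicts."""
    num_docs = len(term_freqs)
    doc_freq = Counter()
    for tf in term_freqs:
        # Each document contributes at most 1 to a term's document frequency.
        doc_freq.update(tf.keys())
    return {idx: math.log((num_docs + 1.0) / (df + 1.0))
            for idx, df in doc_freq.items()}


def save_idf(weights, path):
    """Write the weights to a JSON file (JSON keys must be strings)."""
    with open(path, "w") as f:
        json.dump({str(idx): w for idx, w in weights.items()}, f)


def load_idf(path):
    """Read the weights back, restoring integer feature indices."""
    with open(path) as f:
        return {int(idx): w for idx, w in json.load(f).items()}


def apply_idf(tf, weights):
    """Scale one document's term frequencies by the IDF weights."""
    return {idx: count * weights.get(idx, 0.0) for idx, count in tf.items()}


# Usage: fit in one session, save, then reload elsewhere.
docs = [{0: 1, 1: 2}, {1: 1}]
weights = fit_idf(docs)
path = os.path.join(tempfile.mkdtemp(), "idf_weights.json")
save_idf(weights, path)
restored = load_idf(path)
tfidf = apply_idf({0: 2, 1: 1}, restored)
```

Storing only the weight per feature index keeps the file small, but it relies on the second notebook using the same hashing configuration so that indices refer to the same terms.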