Is there any relationship between numFeatures in Spark MLlib's HashingTF and the actual number of terms in a document (here, a sentence)?
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<Row> data = Arrays.asList(
    RowFactory.create(0.0, "Hi I heard about Spark"),
    RowFactory.create(0.0, "I wish Java could use case classes"),
    RowFactory.create(1.0, "Logistic regression models are neat")
);
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

// Split each sentence into words.
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> wordsData = tokenizer.transform(sentenceData);

// Hash each word list into a fixed-length term-frequency vector.
int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
Dataset<Row> featurizedData = hashingTF.transform(wordsData);
As the Spark MLlib documentation says, HashingTF transforms each sentence into a feature vector of length numFeatures. What happens if a document (here, a sentence) contains thousands of terms? What should the value of numFeatures be, and how should it be calculated?
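To make the question concrete, here is a minimal, self-contained sketch of the hashing trick that HashingTF is based on: each term is hashed and mapped to a bucket index via a non-negative modulo of numFeatures, so terms that land in the same bucket share one count. This sketch uses Java's built-in `String.hashCode()` as a stand-in hash function, which is an assumption for illustration only (Spark's actual implementation uses MurmurHash3); the class and method names are hypothetical, not Spark APIs.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the hashing trick: map a bag of words to a fixed-length
// term-frequency vector without building a vocabulary.
// NOTE: String.hashCode() is a stand-in; Spark itself uses MurmurHash3.
public class HashingTrickSketch {

    // Bucket index for a term: non-negative modulo, as the hashing trick requires.
    static int indexOf(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Count term occurrences into a vector of length numFeatures.
    // If two distinct terms hash to the same bucket, their counts merge.
    static double[] termFrequencies(List<String> terms, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String term : terms) {
            tf[indexOf(term, numFeatures)] += 1.0;
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("hi", "i", "heard", "about", "spark");
        // The vector length is always numFeatures, regardless of how many
        // terms the document holds. If numFeatures is smaller than the number
        // of distinct terms, collisions are guaranteed by the pigeonhole
        // principle; if it is larger, they are merely less likely.
        System.out.println(Arrays.toString(termFrequencies(doc, 20)));
        System.out.println(Arrays.toString(termFrequencies(doc, 3)));
    }
}
```

The sketch shows why numFeatures need not equal the number of terms: it only fixes the output dimensionality, and a document with thousands of terms still produces a vector of length numFeatures, at the cost of more collisions when numFeatures is small.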