2013-06-22

Answers

1

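The missing piece you are looking for is called a "word vector". Basically, you have to create a new example set in which each attribute represents a word. For a given example (i.e. a document), the (numerical) value of that attribute shows the "importance" of that word for that document.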

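A naive approach would be to use the count of the word within the document, but typically you should use TF-IDF (term frequency-inverse document frequency), which also takes the whole document corpus into account.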
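To make the idea of a word vector concrete, here is a minimal sketch in plain Python, outside RapidMiner. It assumes whitespace tokenization and the textbook weighting tf × ln(N / df); the function name `tf_idf_vectors` is made up for this illustration, and RapidMiner's own TF-IDF may normalize differently.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF weight per word per document."""
    # Assumption: lower-casing plus whitespace splitting is enough for this toy example.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word]) if total else 0.0
            for word in vocabulary
        ])
    return vocabulary, vectors


docs = ["svm needs numeric data", "text data becomes numeric word vectors"]
vocab, vectors = tf_idf_vectors(docs)
```

Each row of `vectors` is one document and each column is one word from `vocab`. A word that occurs in every document (here "data") gets weight 0, which is the corpus-wide effect that a plain word count lacks.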

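To do this in RapidMiner you have to install the Text Mining Extension and use operators like "Process Documents from Data" or "Process Documents from Files". Keep in mind that text mining needs further preprocessing steps, such as creating tokens, removing stop words (common words that occur in nearly all documents and are therefore not very helpful), and stemming the words (so that "word" and "words" are treated equally).

Here is a small example: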

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<process version="5.3.009"> 
    <context> 
    <input/> 
    <output/> 
    <macros/> 
    </context> 
    <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process"> 
    <process expanded="true"> 
     <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="75"> 
     <parameter key="text" value="I want to classify text data using classifier model SVM with Rapidminer tool. Classification would be of multilable type. Since my data is of text type, how SVM can be used for this classification. I know that SVM works with numeric data only."/> 
     </operator> 
     <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="165"> 
     <parameter key="text" value="The missing piece you are looking for is called &quot;word vector&quot;. Basically you have to create a new example set for which the attributes will represent the words. For a given example (i.e. a document) the (numerical) value for this attribute will show the &quot;importance&quot; of this word for this document. &#10;&#10;A naive approach would be to use the count of the word within the document, but typically you should use TD-IDF (term frequency–inverse document frequency) which will take the whole document corpus into account as well.&#10;&#10;To do this in RapidMiner you have to install the text mining extension and use operators like &quot;Process Documents from Data&quot; or &quot;Process Documents from Files&quot;. Keep in mind that for text mining you will need to conduct more preprocessing steps like creating tokens, removing stop words (common words which you can find in nearly all documents and which are therefore not very helpful) and use the stem of the words (so &quot;word&quot; and &quot;words&quot; will be treated equally).&#10;&#10;Here is a small example:"/> 
     </operator> 
     <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="75"> 
     <process expanded="true"> 
      <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/> 
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/> 
      <operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/> 
      <connect from_port="document" to_op="Tokenize" to_port="document"/> 
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/> 
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/> 
      <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/> 
      <portSpacing port="source_document" spacing="0"/> 
      <portSpacing port="sink_document 1" spacing="0"/> 
      <portSpacing port="sink_document 2" spacing="0"/> 
     </process> 
     </operator> 
     <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/> 
     <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/> 
     <connect from_op="Process Documents" from_port="example set" to_port="result 1"/> 
     <portSpacing port="source_input 1" spacing="0"/> 
     <portSpacing port="sink_result 1" spacing="0"/> 
     <portSpacing port="sink_result 2" spacing="0"/> 
    </process> 
    </operator> 
</process> 
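For readers who want to see what the three inner operators (Tokenize, Filter Stopwords, Stem) do to a text, here is a rough plain-Python sketch of the same order of steps. The stop word list and the `crude_stem` helper are hypothetical stand-ins: the real Porter stemmer and RapidMiner's English stop word list are far more extensive.

```python
import re

# Assumption: a tiny illustrative stop word list, not the one RapidMiner ships.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def crude_stem(token):
    # Strips a trailing "s" only; the Porter stemmer applies many more rules.
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def preprocess(text):
    # 1. Tokenize: split on anything that is not a letter.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Drop stop words, then 3. reduce each remaining token to its stem.
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The words and the word")
```

After these steps, "words" and "word" end up as the same token, so they share a single attribute in the word vector.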

BTW: there are also several quite good RapidMiner text mining tutorials on YouTube.

+0

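Thanks for your reply. Could you please point me to some good text mining tutorials where I can find an example of how to use SVM for multi-label/multi-class classification with word vectors? – kailash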

+0

Since SVM gives a binominal output (2 values), how can it provide multiple class values? – kailash

1

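This question may be rather old, but maybe there are more people like me out there who are just experimenting with RapidMiner and hoping to solve exactly the same problem.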

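I guess the first part, about processing text in general with RapidMiner's "Text Mining Extension" plug-in, was explained correctly by maerch some time ago. But considering kailash's comments, the main problem seems to be the incompatibility between the binominal SVM model and the polynominal input/label set.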

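The practical way to feed the SVM model is to add the meta operator "Polynominal by Binominal Classification" as a wrapper around the SVM. It repeatedly merges the input classes (in a way that can be chosen with the "classification strategies" parameter) so that there are always two input groups, and feeds these to the SVM until a combined result can be derived. The final model can then handle multiple classes.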
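The "1 against all" strategy can be sketched in plain Python as follows. This illustrates the general one-vs-rest idea, not RapidMiner's actual implementation: a toy perceptron (`train_binary`) stands in for the binary SVM, and all function names are hypothetical.

```python
def train_binary(rows, labels, epochs=20):
    """Toy linear perceptron standing in for the binary SVM; labels are +1 or -1."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge the hyperplane towards x
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
    return weights, bias


def train_one_against_all(rows, labels):
    """Train one binary model per class: that class (+1) versus all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        binary_labels = [1 if label == cls else -1 for label in labels]
        models[cls] = train_binary(rows, binary_labels)
    return models


def predict(models, x):
    """Pick the class whose binary model gives the highest score."""
    def score(model):
        weights, bias = model
        return sum(w * v for w, v in zip(weights, x)) + bias
    return max(models, key=lambda cls: score(models[cls]))


# Three tiny "word vectors", one per class.
rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["sports", "politics", "tech"]
models = train_one_against_all(rows, labels)
```

Each binary learner only ever sees two groups, yet the combined model returns one of the three original class values, which is how a two-class learner ends up handling a polynominal label.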

The following process snippet shows an SVM (default parameters) with its Poly2Bi wrapper:

<process expanded="true"> 
    <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.3.015" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="112" y="120"> 
     <parameter key="classification_strategies" value="1 against all"/> 
     <parameter key="random_code_multiplicator" value="2.0"/> 
     <parameter key="use_local_random_seed" value="false"/> 
     <parameter key="local_random_seed" value="1992"/> 
     <process expanded="true"> 
      <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.015" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210"> 
       <parameter key="kernel_cache" value="200"/> 
       <parameter key="C" value="0.0"/> 
       <parameter key="convergence_epsilon" value="0.001"/> 
       <parameter key="max_iterations" value="100000"/> 
       <parameter key="scale" value="true"/> 
       <parameter key="L_pos" value="1.0"/> 
       <parameter key="L_neg" value="1.0"/> 
       <parameter key="epsilon" value="0.0"/> 
       <parameter key="epsilon_plus" value="0.0"/> 
       <parameter key="epsilon_minus" value="0.0"/> 
       <parameter key="balance_cost" value="false"/> 
       <parameter key="quadratic_loss_pos" value="false"/> 
       <parameter key="quadratic_loss_neg" value="false"/> 
      </operator> 
      <connect from_port="training set" to_op="SVM (Linear)" to_port="training set"/> 
      <connect from_op="SVM (Linear)" from_port="model" to_port="model"/> 
      <portSpacing port="source_training set" spacing="0"/> 
      <portSpacing port="sink_model" spacing="0"/> 
     </process> 
    </operator> 
    <connect from_port="training" to_op="Polynominal by Binominal Classification" to_port="training set"/> 
    <connect from_op="Polynominal by Binominal Classification" from_port="model" to_port="model"/> 
    <portSpacing port="source_training" spacing="0"/> 
    <portSpacing port="sink_model" spacing="0"/> 
    <portSpacing port="sink_through 1" spacing="0"/> 
</process> 

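Note that when the Poly2Bi operator is used this way inside the training area of a Validation operator, with a Performance operator in the testing area, RapidMiner (at least version 5.3.015) complains. The Performance operator shows this error message: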

Label and prediction must be of the same type, but are polynominal and nominal, respectively.

But in the RapidMiner forum they point out that this appears to be a spurious warning which you can ignore. In my case, the process also ran fine.