This is a relatively involved process, but here is a working example that you can copy and reproduce.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.0.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.0.000" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
<parameter key="k" value="10"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
<list key="function_descriptions">
<parameter key="category" value="if(rand()&gt;0.5, &quot;state&quot;, &quot;notstate&quot;)"/>
<parameter key="categoryNumeric" value="if(category==&quot;state&quot;, 1, 0)"/>
</list>
</operator>
<operator activated="true" class="aggregate" compatibility="7.0.000" expanded="true" height="82" name="Aggregate" width="90" x="246" y="238">
<list key="aggregation_attributes">
<parameter key="categoryNumeric" value="average"/>
</list>
<parameter key="group_by_attributes" value="cluster"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="380" y="340">
<list key="function_descriptions">
<parameter key="description" value="if ([average(categoryNumeric)]&gt;0.5, &quot;state&quot;,&quot;private&quot;)"/>
</list>
</operator>
<operator activated="true" class="join" compatibility="7.0.000" expanded="true" height="82" name="Join" width="90" x="514" y="238">
<parameter key="join_type" value="left"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="cluster" value="cluster"/>
</list>
</operator>
<operator activated="true" class="jdbc_connectors:write_database" compatibility="7.0.000" expanded="true" height="68" name="Write Database" width="90" x="715" y="238">
<parameter key="connection" value="LocalMYSQL"/>
<parameter key="schema_name" value="ascom"/>
<parameter key="table_name" value="joinresult"/>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
<connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Write Database" to_port="input"/>
<connect from_op="Write Database" from_port="through" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
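The key points are:
- Create a new attribute called 'categoryNumeric' that corresponds to 'category': it is set to 1 if 'category' is 'state' and to 0 otherwise.
- Aggregate by cluster, taking the average of 'categoryNumeric'. If the aggregated value is greater than 0.5, the majority of the examples in that cluster have 'category' equal to 'state'.
- Based on that majority decision, create a new attribute called 'description' in the aggregated result.
- Each cluster now carries this additional data, which can be joined back to the original data using the cluster identifier as the key.
- Write the result to a database (I used MySQL).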
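For readers who want to check the majority logic outside RapidMiner, the steps above can be sketched in plain Python. This is only an illustration under assumptions: the function name `describe_clusters`, the dictionary keys and the sample rows are hypothetical and not part of the process, and the database write is left out.

```python
from collections import defaultdict


def describe_clusters(rows):
    """Attach a per-cluster majority label to every row.

    rows: list of dicts with the (hypothetical) keys "cluster" and "category".
    """
    # Encode category as 1 ("state") or 0 (anything else), grouped by cluster.
    flags = defaultdict(list)
    for row in rows:
        flags[row["cluster"]].append(1 if row["category"] == "state" else 0)

    # Average per cluster; a value strictly above 0.5 means a "state" majority.
    description = {
        cluster: "state" if sum(values) / len(values) > 0.5 else "private"
        for cluster, values in flags.items()
    }

    # Join the description back onto the original rows, keyed by cluster.
    return [dict(row, description=description[row["cluster"]]) for row in rows]


# Made-up sample data, only to exercise the function.
rows = [
    {"cluster": "cluster_0", "category": "state"},
    {"cluster": "cluster_0", "category": "state"},
    {"cluster": "cluster_0", "category": "notstate"},
    {"cluster": "cluster_1", "category": "notstate"},
    {"cluster": "cluster_1", "category": "state"},
]
result = describe_clusters(rows)
```

Note that an exact tie (average equal to 0.5) falls to 'private' here, mirroring the strict `>` comparison used in the process.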
Hopefully this helps as a starting point.
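This should be possible, but to be clear, do you mean the following? If cluster1 contains 100 examples and 51 of them have another attribute called 'category' set to 'state', then set a further attribute called 'description' to 'state'; otherwise set 'description' to 'private'. Repeat this for the other clusters, using the counts within each cluster. Finally, save the result to a database. – awchisholm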
Exactly. So the final result to be saved in the database (if, for example, the majority is 'state') would be: [centroid of cluster 1] [desc = 'state'] –