2013-09-27 49 views
0

How do I form clusters based on a condition?

Sample input file (the actual input file contains about 50,000 entries):

615 146 
615 180 
615 53 
615 42 
615 52 
615 52 
615 51 
615 45 
615 49 
616 34 
616 44 
616 42 
616 41 
616 42 
617 42 
617 43 
617 42 
685 33 
685 33 
685 33 
686 33 
686 33 
687 47 
687 68 
737 449 
737 41 
737 1138 
738 46 
738 53 

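I have to compare each value in column 0 with the values around it. Rows with the same value, such as 615, 615, 615, must be grouped together into one cluster, and the cluster must contain the corresponding column 1 values, such as 146, 180, ..., 45, 49. The cluster must then break, and another cluster must form for the next run of identical values (616, 616, 616, ...), and so on.

The code I wrote is: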

from __future__ import division
from sys import exit
h = 0
historyjobs = []
targetjobs = []


def quickzh(zhlistsub,
            targetjobs=targetjobs, num=0, denom=0):

    li = []; ji = []
    j = 0
    for i in zhlistsub:
        x1 = targetjobs[j][0]

        x = targetjobs[i][0]

        num += x
        denom += 1
        if x1 >= 0.9 * (num/denom):  # to group all items with same value in column 0
            li.append(targetjobs[i][1])
        else:
            break
    return li


def filewr(listli):
    global h
    s = open("newout1", "a")
    if(len(listli) != 0):
        h += 1
        s.write("cluster: %d" % h)
        s.write("\n")
        s.write(str(listli))
        s.write("\n\n")
    else:
        print "0"


def new(inputfile,
        historyjobs=historyjobs, targetjobs=targetjobs):
    zhlistsub = []; zhlist = []
    k = 0

    with open(inputfile, 'r') as f:
        for line in f:
            job = map(int, line.split())
            targetjobs.append(job)
    while True:
        if len(targetjobs) != 0:

            zhlistsub = [i for i, element in enumerate(targetjobs)]

            if zhlistsub:
                listrun = quickzh(zhlistsub)
                filewr(listrun)
            historyjobs.append(targetjobs.pop(0))
            k += 1
        else:
            break

new('newfinal1')

The output I get is:

cluster: 1 
[146, 180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53] 

cluster: 2 
[180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53] 

cluster: 3 
[53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53] 
... and so on

However, I need the output to be:

cluster: 1 
[146, 180, 53, 42, 52, 52, 51, 45, 49] 

cluster: 2 
[34, 44, 42, 41, 42] 

cluster: 3 
[42, 43, 42] 
... and so on 

Can anyone suggest what changes I should make to the condition to get the desired result? It would be really helpful.

+3

I'm having a really hard time understanding what you need... but for grouping in general, 'itertools.groupby' or 'collections.defaultdict' is the way to go... – mgilson

Answers

1

Try this. groupby takes care of creating the groups, so all that is left to do is build the lists:

import itertools as it 
[[y[1] for y in x[1]] for x in it.groupby(data, key=lambda x:x[0])] 

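The above assumes that data holds your input and that it has already been filtered and sorted by the first column. For the example in the question, it looks like this: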

data = [[615, 146], [615, 180], [615, 53] ... ] 
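
As a fuller illustration of the one-liner above, here is a minimal sketch that runs it on the first rows of the sample from the question and prints the clusters in the requested format. The function name build_clusters is only an illustrative choice, and the sketch assumes the rows are already ordered so that equal column 0 values are adjacent.

```python
import itertools as it

# First rows of the sample input from the question: [column 0, column 1].
data = [
    [615, 146], [615, 180], [615, 53], [615, 42], [615, 52],
    [615, 52], [615, 51], [615, 45], [615, 49],
    [616, 34], [616, 44], [616, 42], [616, 41], [616, 42],
    [617, 42], [617, 43], [617, 42],
]


def build_clusters(rows):
    """Group consecutive rows that share column 0 and keep only column 1."""
    return [[row[1] for row in group]
            for _key, group in it.groupby(rows, key=lambda row: row[0])]


clusters = build_clusters(data)

# Number the clusters from 1, mirroring the "cluster: N" layout in the question.
for number, cluster in enumerate(clusters, start=1):
    print("cluster: %d" % number)
    print(cluster)
    print("")
```

Note that groupby starts a new group every time the key changes, so unsorted input would produce several clusters for the same column 0 value.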
+0

Can you suggest something for the condition in my if statement, ''if x1 >= 0.9 * (num/denom):'', so that it gives the result? –

+0

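My answer helps to build the clusters, but it is not clear how to filter the values using that condition. I can only suggest that you split the problem into two parts: first filter the input, building a list like 'data' in my example, and then build the clusters with the list comprehension above –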
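
The two-part split suggested in this comment could be sketched as follows. The thread never states what the filter should be, so keep_row is a hypothetical placeholder that keeps every row; parse_rows and cluster_rows are likewise illustrative names, not part of the original answer.

```python
import itertools as it


def parse_rows(lines):
    """Turn 'key value' text lines into [key, value] integer pairs."""
    return [[int(field) for field in line.split()]
            for line in lines if line.strip()]


def keep_row(row):
    """Hypothetical filter: replace the body with the real condition.

    As written it keeps every row, because the thread does not
    specify which rows should be dropped.
    """
    return True


def cluster_rows(rows):
    """Collect the column 1 values of each run of equal column 0 values."""
    return [[row[1] for row in group]
            for _key, group in it.groupby(rows, key=lambda row: row[0])]


# A few lines taken from the sample input in the question.
sample_lines = ["685 33", "685 33", "685 33", "686 33", "686 33",
                "687 47", "687 68"]

# Part 1: parse and filter. Part 2: cluster.
data = [row for row in parse_rows(sample_lines) if keep_row(row)]
clusters = cluster_rows(data)
print(clusters)
```

For a real file, the list of sample lines can be replaced by an open file object, since parse_rows only needs an iterable of text lines.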

1

This answer is untested, but it follows this idea:

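from collections import defaultdict

inputfile = 'newfinal1'  # file name used in the question
cluster = defaultdict(list)

with open(inputfile, 'r') as f:
    for line in f:
        # each line holds two integers: the cluster key, then the value
        clus, val = map(int, line.split())
        cluster[clus].append(val)

# iterate over (key, list) pairs in key order
for clus, val in sorted(cluster.items()):
    print("cluster " + str(clus))
    print(str(val) + "\n")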
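
One practical difference between the defaultdict approach and itertools.groupby is worth illustrating: a defaultdict collects every row with the same key no matter where it appears, while groupby only merges adjacent rows. The following small sketch uses a deliberately unsorted, made-up list of rows to show this.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical rows, deliberately unsorted: key 615 appears twice, not adjacent.
rows = [(615, 146), (616, 34), (615, 180)]

# defaultdict: one list per key, regardless of position in the input.
by_key = defaultdict(list)
for key, value in rows:
    by_key[key].append(value)

# groupby: a new group starts whenever the key differs from the previous row.
by_run = [(key, [value for _key, value in group])
          for key, group in groupby(rows, key=lambda row: row[0])]

print(dict(by_key))
print(by_run)
```

If the input file is already sorted by the first column, as the sample appears to be, both approaches yield the same clusters.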