Data structure: top K ordered dictionary keys by value

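I have a very large dictionary of the form {(Tuple) : [int, int]}, for example dict = {(1.0, 2.1):[2,3], (2.0, 3.1):[1,4],...}, which cannot fit in memory.

I am only interested in the top K entries of this dictionary, ranked by the first element of each key's value. Is there a data structure that lets me keep only the K largest key-value pairs? As an example, suppose I want only 3 entries in my dictionary. I can put in the following key-value pairs: (1.0, 2.1):[2,3], (2.0, 3.1):[1,4], (3.1, 4.2):[8,0], (4.3, 4.1):[1,1], and my dictionary would be: (3.1, 4.2):[8,0], (1.0, 2.1):[2,3], (2.0, 3.1):[1,4] (if two key-value pairs have the same first element, the second element is checked, and the pair with the larger second element is kept).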

Source

2017-03-04 Black

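How is this dictionary created? Do you want to do this while the dictionary is being built or after it has been created? – Kasramvd

If you're not opposed to using numpy, it has `partition` and `argpartition`, which can find the top or bottom k in O(n). –

Sorry, I should have explained: I can't hold my dictionary in memory. – Black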

import heapq


class OnlyKDict(object):

    def __init__(self, K, key=lambda x: x):
        self.data = []          # min-heap of (sort_key, item) pairs
        self.dictionary = {}
        self.key = key          # function that computes the sort key of an item
        self.K = K              # how many entries to keep in the dictionary

    def push(self, item):
        heapq.heappush(self.data, (self.key(item), item))
        self.dictionary[item[0]] = item[1]
        if len(self.data) > self.K:  # more than K entries? pop the minimum from heap and dict
            item = self.pop()        # this ensures only the K largest remain
            self.dictionary.pop(item[0], None)

    def pop(self):
        return heapq.heappop(self.data)[1]

    def __getitem__(self, key):
        return self.dictionary[key]

    def __setitem__(self, key, value):
        if key in self.dictionary:
            # Key present: update the value (the heap entry is not re-ranked)
            self.dictionary[key] = value
        else:
            # Otherwise push key and value as a tuple
            self.push((key, value))


# Rank by the value list: first element, then second element on ties.
h = OnlyKDict(8, lambda item: tuple(item[1]))

for i in range(10):
    h[(i, i)] = [i, i]

print(h.dictionary)

Output (Python 3.7+, where dicts preserve insertion order): {(2, 2): [2, 2], (3, 3): [3, 3], (4, 4): [4, 4], (5, 5): [5, 5], (6, 6): [6, 6], (7, 7): [7, 7], (8, 8): [8, 8], (9, 9): [9, 9]}

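You can see that only the top 8 values are stored here.

The main idea is taken from "heapq with custom compare predicate".

What we do is create a custom heap class that takes a key argument, with which we specify the value to sort on.

Then, whenever the size grows beyond 8, we pop the smallest item. This ensures we always hold at most 8 values.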
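The same bounded-heap idea can be sketched as a plain function that consumes a stream of pairs, which may suit data that does not fit in memory. This is only a minimal sketch: the name `top_k_items` and the assumption that the data arrives as an iterable of `(key, value)` pairs are illustrative and not part of the answer above.

```python
import heapq
from itertools import count


def top_k_items(pairs, k):
    """Return the k (key, value) pairs with the largest values from a stream.

    `pairs` may be any iterable (for example a generator reading from disk),
    so at most k entries are held in memory at a time.
    """
    if k <= 0:
        return []
    heap = []        # min-heap of (rank, tie_breaker, key, value)
    order = count()  # tie-breaker so keys and value lists are never compared
    for key, value in pairs:
        # Rank by the value list: first element, then second on ties.
        entry = (tuple(value), next(order), key, value)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new pair beats the current minimum: replace it.
            heapq.heapreplace(heap, entry)
    # Largest first.
    return [(key, value) for _, _, key, value in sorted(heap, reverse=True)]


stream = [((1.0, 2.1), [2, 3]), ((2.0, 3.1), [1, 4]),
          ((3.1, 4.2), [8, 0]), ((4.3, 4.1), [1, 1])]
print(top_k_items(iter(stream), 3))
```

With the four pairs from the question and k = 3, this keeps (3.1, 4.2), (1.0, 2.1) and (2.0, 3.1), in that order. Each incoming pair costs O(log k), and `dict(...)` on the result gives a dictionary back if one is needed.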

Source

2017-03-04 06:46:38

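Why not use [`heapq.nlargest`](https://docs.python.org/3/library/heapq.html#heapq.nlargest) with `key=...`? –

No, we keep only 8 values because that is the requirement. He also wants a dictionary returned. That's why the make_dict function.. –

But what you say is right –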
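For reference, the `heapq.nlargest` suggestion from the comment might look roughly like the sketch below. The generator `iter_pairs` is a hypothetical stand-in for whatever reads the real data lazily; it is not part of the original thread.

```python
import heapq


def iter_pairs():
    """Hypothetical stand-in for a reader that yields (key, value) pairs lazily."""
    yield (1.0, 2.1), [2, 3]
    yield (2.0, 3.1), [1, 4]
    yield (3.1, 4.2), [8, 0]
    yield (4.3, 4.1), [1, 1]


# Rank by the value list: first element, then second element on ties.
top3 = heapq.nlargest(3, iter_pairs(), key=lambda pair: tuple(pair[1]))
result = dict(top3)  # turn the list of pairs back into a dictionary
print(result)
```

`heapq.nlargest` accepts any iterable, so the full data set never has to be materialized as a dictionary first. Unlike the class above, it produces a one-off result rather than a structure that can keep receiving new items.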

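If your data doesn't fit in memory, you need to pay particular attention to how it is stored. Is it in a database, a flat file, a CSV file, JSON, or something else?

If it is a "rectangular" file format, then simply use the standard *nix sort utility and read in the first k lines.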
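As a rough sketch of that approach, suppose (purely as an assumption) that the data lives in a CSV file with four columns `x,y,first,second`, and that it has already been sorted in descending order outside Python, for example with `sort -t, -k3,3nr -k4,4nr data.csv > sorted.csv`. The file names, the column layout and the helper name `read_top_k` are all hypothetical. Reading the first k rows could then look like this:

```python
import csv
from itertools import islice


def read_top_k(path, k):
    """Read the first k rows of a CSV file that is already sorted, best first."""
    top = {}
    with open(path, newline="") as handle:
        # islice stops after k rows, so the rest of the file is never read.
        for x, y, first, second in islice(csv.reader(handle), k):
            top[(float(x), float(y))] = [int(first), int(second)]
    return top
```

Calling `read_top_k("sorted.csv", 3)` would return a dictionary with the three best rows. The sorting itself is left to the external utility, which can work on files larger than memory.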

Source

2017-03-04 06:25:51

Here is a custom OrderedDict that keeps the N largest keys for you:

from collections import OrderedDict
from operator import itemgetter


class LimitedSizeOrderedDict(OrderedDict):
    def __init__(self, *args, **kwds):
        self.maxlen = kwds.pop("maxlen", None)
        if args:
            try:
                # Sort the initial items by key and keep only the maxlen largest.
                top_n = sorted(*args, key=itemgetter(0))
                if self.maxlen is not None:
                    top_n = top_n[-self.maxlen:]
                self.min_key = top_n[0][0]
            except TypeError:
                raise Exception("keys should be in tuple format")
        else:
            top_n = []  # no initial items; the first insert sets min_key
            self.min_key = (float("inf"), 0)
        super(LimitedSizeOrderedDict, self).__init__(top_n, **kwds)

    def __setitem__(self, key, value):
        if self._check_size():
            # Still below maxlen: insert and track the smallest key.
            OrderedDict.__setitem__(self, key, value)
            if key[0] < self.min_key[0]:
                self.min_key = key
        elif key[0] > self.min_key[0]:
            # Full: evict the smallest key, insert, then recompute the minimum.
            self.pop(self.min_key)
            OrderedDict.__setitem__(self, key, value)
            self.min_key = min(self, key=itemgetter(0))

    def _check_size(self):
        # True while there is still room for another item.
        if self.maxlen is not None:
            return len(self) < self.maxlen
        return True

Demo:

In [2]: a = LimitedSizeOrderedDict([((7,2),3), ((2, 5), 3), ((6, 0), 1)], maxlen= 2) 

In [3]: a 
Out[3]: LimitedSizeOrderedDict([((6, 0), 1), ((7, 2), 3)]) 

In [4]: a[(12, 5)] = 10 

In [5]: a 
Out[5]: LimitedSizeOrderedDict([((7, 2), 3), ((12, 5), 10)]) 

In [6]: a[(10, 5)] = 9 

In [7]: a 
Out[7]: LimitedSizeOrderedDict([((12, 5), 10), ((10, 5), 9)]) 

In [8]: a[(0, 5)] = 9 

In [9]: a 
Out[9]: LimitedSizeOrderedDict([((12, 5), 10), ((10, 5), 9)])

Source

2017-03-04 07:11:11 Kasramvd

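Does `top_n = sorted(args, itemgetter(0))[:self.maxlen]` mean I have to read all of my data? – Black

@Black No. If you pass any items to the dictionary at creation time, it keeps only the top N of them at initialization. – Kasramvd

@Black Check out the update for a more comprehensive answer. – Kasramvd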
