MemoryError for defaultdict(int)

2013-03-14

I am using a defaultdict(int) to count how often each word occurs across a set of books.

Python was consuming about 1.5 GB of RAM when I got the memory exception:

File "C:\Python32\lib\collections.py", line 540, in update 
    _count_elements(self, iterable) 
MemoryError 

and my counter held more than 8,000,000 entries at that point.

There are at least 20,000,000 unique words to count. What can I do to avoid the memory exception?
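Roughly the pattern I am using (file handling and tokenization simplified):

from collections import defaultdict

# one dict entry per unique word; with ~20,000,000 unique words
# the dict alone grows into the gigabytes
counts = defaultdict(int)

def count_file(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                counts[word] += 1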


How many unique words are there in the data set? – NPE 2013-03-14 15:33:28


@NPE 20,000,000 – Baz 2013-03-14 15:35:34


I see. What is the average length of a word? – NPE 2013-03-14 15:36:25

Answer


Even if you have a 64-bit system with a load of memory, I don't think keeping track of them all with a dict is feasible. You should use a database. Here is what CPython's dict implementation (dictobject.c) says about resizing:

/* If we added a key, we can safely resize. Otherwise just return! 
* If fill >= 2/3 size, adjust size. Normally, this doubles or 
* quadruples the size, but it's also possible for the dict to shrink 
* (if ma_fill is much larger than ma_used, meaning a lot of dict 
* keys have been deleted). 
* 
* Quadrupling the size improves average dictionary sparseness 
* (reducing collisions) at the cost of some memory and iteration 
* speed (which loops over every possible entry). It also halves 
* the number of expensive resize operations in a growing dictionary. 
* 
* Very large dictionaries (over 50K items) use doubling instead. 
* This may help applications with severe memory constraints. 
*/ 
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2)) 
    return 0; 
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used); 

From that code: a dict has to grow when too many items are inserted, not only to hold the items it already contains, but also to keep free slots open for new ones. If more than 2/3 of a dict is filled, it is resized: quadrupled when it holds fewer than 50,000 items, doubled otherwise. The dicts I personally work with hold fewer than a few hundred thousand items; even with fewer than 1,000,000 items, one consumed several gigabytes and nearly froze my 8 GB Win7 machine.
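You can watch this over-allocation happen yourself with a small sketch (the exact growth steps vary by CPython version, and sys.getsizeof reports only the dict's own table, not the keys and values it refers to):

import sys

d = {}
last = sys.getsizeof(d)
for i in range(1000000):
    d[i] = 0
    size = sys.getsizeof(d)
    if size != last:
        # the dict just resized: print the new footprint of the table itself
        print("%9d items -> %12d bytes" % (len(d), size))
        last = size

Each printed line marks a resize; the footprint jumps in large steps rather than growing one slot at a time.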

If you simply want to count the items, you could:

split the words into chunks 
count the words in each chunk 
update the database 

With a reasonable chunk size, performing just a few db queries (assuming database access is the bottleneck) would be much better, IMO.
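A minimal sketch of that chunk-and-flush idea, assuming SQLite is an acceptable database (words can be any iterable of strings, and chunk_size is a tuning knob; both are placeholders for your setup):

import sqlite3
from collections import defaultdict

def count_words_to_db(words, db_path="wordcounts.db", chunk_size=1000000):
    # words: any iterable of strings; only one chunk is ever held in RAM
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)"
    )
    chunk = defaultdict(int)
    for word in words:
        chunk[word] += 1
        if len(chunk) >= chunk_size:   # flush before the dict gets big
            flush(conn, chunk)
            chunk.clear()
    flush(conn, chunk)                 # final, partial chunk
    conn.close()

def flush(conn, chunk):
    # one transaction per chunk; the UPSERT syntax needs SQLite >= 3.24
    # (on older versions, do INSERT OR IGNORE followed by an UPDATE)
    with conn:
        conn.executemany(
            "INSERT INTO counts (word, n) VALUES (?, ?) "
            "ON CONFLICT(word) DO UPDATE SET n = n + excluded.n",
            chunk.items(),
        )

Memory stays bounded by chunk_size no matter how many unique words there are in total; the database merges the partial counts.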