MemoryError for defaultdict(int)

2013-03-14

I am using a defaultdict(int) to count how often each word occurs across a set of books.

Python was consuming about 1.5 GB of RAM when I got the memory exception:

File "C:\Python32\lib\collections.py", line 540, in update 
    _count_elements(self, iterable) 
MemoryError 

and my counter held more than 8,000,000 entries at that point.

There are at least 20,000,000 unique words to count. What can I do to avoid the memory exception?
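Roughly the pattern I am using (file handling and tokenization simplified):

from collections import defaultdict

# one dict entry per unique word; with ~20,000,000 unique words
# the dict alone grows into the gigabytes
counts = defaultdict(int)

def count_file(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                counts[word] += 1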


How many unique words are there in the data set? – NPE 2013-03-14 15:33:28


@NPE 20,000,000 – Baz 2013-03-14 15:35:34


I see. What is the average length of a word? – NPE 2013-03-14 15:36:25

Answer


Even if you have a 64-bit system with a load of memory, I don't think keeping track of them all with a dict is feasible. You should use a database. Here is what CPython's dict implementation (dictobject.c) says about resizing:

/* If we added a key, we can safely resize. Otherwise just return! 
* If fill >= 2/3 size, adjust size. Normally, this doubles or 
* quadruples the size, but it's also possible for the dict to shrink 
* (if ma_fill is much larger than ma_used, meaning a lot of dict 
* keys have been deleted). 
* 
* Quadrupling the size improves average dictionary sparseness 
* (reducing collisions) at the cost of some memory and iteration 
* speed (which loops over every possible entry). It also halves 
* the number of expensive resize operations in a growing dictionary. 
* 
* Very large dictionaries (over 50K items) use doubling instead. 
* This may help applications with severe memory constraints. 
*/ 
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2)) 
    return 0; 
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used); 

From that code: a dict has to grow when too many items are inserted, not only to hold the items it already contains, but also to keep free slots open for new ones. If more than 2/3 of a dict is filled, it is resized: quadrupled when it holds fewer than 50,000 items, doubled otherwise. The dicts I personally work with hold fewer than a few hundred thousand items; even with fewer than 1,000,000 items, one consumed several gigabytes and nearly froze my 8 GB Win7 machine.
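You can watch this over-allocation happen yourself with a small sketch (the exact growth steps vary by CPython version, and sys.getsizeof reports only the dict's own table, not the keys and values it refers to):

import sys

d = {}
last = sys.getsizeof(d)
for i in range(1000000):
    d[i] = 0
    size = sys.getsizeof(d)
    if size != last:
        # the dict just resized: print the new footprint of the table itself
        print("%9d items -> %12d bytes" % (len(d), size))
        last = size

Each printed line marks a resize; the footprint jumps in large steps rather than growing one slot at a time.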

If you simply want to count the items, you could:

split the words into chunks 
count the words in each chunk 
update the database 

With a reasonable chunk size, performing just a few db queries (assuming database access is the bottleneck) would be much better, IMO.
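A minimal sketch of that chunk-and-flush idea, assuming SQLite is an acceptable database (words can be any iterable of strings, and chunk_size is a tuning knob; both are placeholders for your setup):

import sqlite3
from collections import defaultdict

def count_words_to_db(words, db_path="wordcounts.db", chunk_size=1000000):
    # words: any iterable of strings; only one chunk is ever held in RAM
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)"
    )
    chunk = defaultdict(int)
    for word in words:
        chunk[word] += 1
        if len(chunk) >= chunk_size:   # flush before the dict gets big
            flush(conn, chunk)
            chunk.clear()
    flush(conn, chunk)                 # final, partial chunk
    conn.close()

def flush(conn, chunk):
    # one transaction per chunk; the UPSERT syntax needs SQLite >= 3.24
    # (on older versions, do INSERT OR IGNORE followed by an UPDATE)
    with conn:
        conn.executemany(
            "INSERT INTO counts (word, n) VALUES (?, ?) "
            "ON CONFLICT(word) DO UPDATE SET n = n + excluded.n",
            chunk.items(),
        )

Memory stays bounded by chunk_size no matter how many unique words there are in total; the database merges the partial counts.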