Even if you have a 64-bit system with loads of memory, I don't think it is feasible to keep track of them with a dict. You should use a database.
/* If we added a key, we can safely resize. Otherwise just return!
* If fill >= 2/3 size, adjust size. Normally, this doubles or
* quadruples the size, but it's also possible for the dict to shrink
* (if ma_fill is much larger than ma_used, meaning a lot of dict
* keys have been deleted).
*
* Quadrupling the size improves average dictionary sparseness
* (reducing collisions) at the cost of some memory and iteration
* speed (which loops over every possible entry). It also halves
* the number of expensive resize operations in a growing dictionary.
*
* Very large dictionaries (over 50K items) use doubling instead.
* This may help applications with severe memory constraints.
*/
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
    return 0;
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
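The code says that if too many items are inserted, the dict has to grow, making room not only for the items it already contains but also for slots for new items. Once more than 2/3 of the table is filled, the dict is resized to about twice the number of used entries (four times for dicts with 50,000 items or fewer). Personally, I use dicts to hold fewer than a few hundred thousand items. Even with fewer than a million items, a dict can consume several gigabytes and almost freeze my 8 GB Windows 7 machine.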
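To make the growth rule concrete, here is a rough Python model of the check quoted above. It is only a sketch: the initial table size of 8 and the rounding up to a power of two are assumptions about what dictresize does in that CPython version, and the function names are made up for illustration. It also assumes no deletions, so fill equals used.

```python
MINSIZE = 8  # assumed initial table size; not shown in the quoted excerpt


def next_table_size(used):
    """Table size after a resize, following the quoted 2x/4x rule."""
    factor = 2 if used > 50000 else 4
    min_used = factor * used
    size = MINSIZE
    # Assumption: the table grows to the first power of two above min_used.
    while size <= min_used:
        size <<= 1
    return size


def simulate_inserts(n):
    """Insert n distinct keys and return (item count, new table size) per resize."""
    size = MINSIZE
    resizes = []
    for used in range(1, n + 1):
        fill = used  # no deletions, so every filled slot is in use
        if fill * 3 >= size * 2:  # the 2/3 fill check from the C code
            size = next_table_size(used)
            resizes.append((used, size))
    return resizes


if __name__ == "__main__":
    # Under these assumptions: [(6, 32), (22, 128), (86, 512)]
    print(simulate_inserts(100))
```

Under this model the table is always allocated well ahead of the number of stored items, which is why memory use climbs in large steps rather than smoothly.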
If you are simply counting the items, you could:
split the words into chunks
count the words in each chunk
update the database
With a reasonable chunk size, running a few DB queries per chunk would be better IMO (assuming database access is going to be the bottleneck).
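The three steps could look something like the sketch below. It assumes SQLite via the standard sqlite3 module and a table named counts; the table name, the helper names, and the default chunk size are all placeholders, not something the question specifies.

```python
import sqlite3
from collections import Counter
from itertools import islice


def chunks(words, size):
    """Yield lists of at most `size` words from any iterable."""
    it = iter(words)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_words(words, conn, chunk_size=100000):
    """Count words chunk by chunk, merging each partial count into the DB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts "
        "(word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for chunk in chunks(words, chunk_size):
        partial = Counter(chunk)  # only this chunk's counts live in memory
        with conn:  # one transaction per chunk
            conn.executemany(
                "INSERT OR IGNORE INTO counts (word, n) VALUES (?, 0)",
                ((w,) for w in partial),
            )
            conn.executemany(
                "UPDATE counts SET n = n + ? WHERE word = ?",
                ((c, w) for w, c in partial.items()),
            )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    count_words("a b a c b a".split(), conn, chunk_size=2)
    print(sorted(conn.execute("SELECT word, n FROM counts")))
```

Only one chunk's Counter is held in memory at a time, so the chunk size trades memory for the number of round trips to the database.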
How many unique words are there in the dataset? – NPE 2013-03-14 15:33:28
@NPE 20,000,000 – Baz 2013-03-14 15:35:34
Got it. What is the average length of a word? – NPE 2013-03-14 15:36:25