2010-04-06 34 views
2

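Python memory leak - solved, but still puzzled

I have successfully debugged my own memory leak problem. However, I noticed something very strange along the way.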

for fid, fv in freqDic.iteritems():
    outf.write(fid + "\t")  # document ID
    for i, term in enumerate(domain):  # TF-IDF vector, one value per term
        tfidf = self.tf(term, fv) * self.idf(term, docFreqDic)
        if i == len(domain) - 1:
            outf.write("%f\n" % tfidf)
        else:
            outf.write("%f\t" % tfidf)
    outf.flush()
    print "Memory increased by", int(self.memory_mon.usage()) - startMemory

outf.close()

def tf(self, term, freqVector):
    total = freqVector[TOTAL]
    if total == 0:
        return 0
    if term not in freqVector:  # without these two lines
        return 0                # the memory leak occurs
    return float(freqVector[term]) / total


def idf(self, term, docFrequencyPerTerm):
    if term not in docFrequencyPerTerm:
        return 0
    return math.log(float(docFrequencyPerTerm[TOTAL]) / docFrequencyPerTerm[term])

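Let me describe my problem:

1) I compute TF-IDF.
2) I traced the source of the memory leak to a defaultdict.
3) I use memory_mon from "How to get current CPU and RAM usage in Python?".
4) The cause of my memory leak is this: in self.tf, if the lines "if term not in freqVector: return 0" are left out, memory leaks. (I verified this with memory_mon and saw memory usage climb steadily.)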

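The explanation I found: since fv is a defaultdict, every lookup of a key that is not in fv creates a new entry. Over a very large domain, this is what makes memory grow.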
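The behavior described above can be reproduced in a few lines. This is only a minimal sketch in Python 3 syntax (the post itself uses Python 2); the keys and the variable names are made up for illustration.

```python
from collections import defaultdict

fv = defaultdict(int)
fv["apple"] = 3

# Indexing a missing key calls the default factory and stores the result.
_ = fv["banana"]
size_after_index = len(fv)

# A membership test and .get() only read; they never insert a key.
found = "cherry" in fv
fallback = fv.get("durian", 0)
size_after_safe_lookups = len(fv)

print(size_after_index, size_after_safe_lookups)
```

Only the plain fv[key] lookup grows the dictionary, which is why the "term not in freqVector" guard in self.tf makes such a difference.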

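I decided to use a dict instead of a defaultdict, and the memory problem did go away. My only puzzle is this: since fv is created in "for fid, fv in freqDic.iteritems():", shouldn't fv be destroyed at the end of every iteration? I tried putting gc.collect() at the end of the for loop, but gc could not collect anything (it returned 0). Yes, the defaultdict hypothesis is right, but if the for loop destroys all of its temporary variables, memory should stay fairly constant across iterations.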
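Switching to a plain dict is one fix. Another option, sketched here under assumptions, is to read with .get() so that the same function is safe for both dict and defaultdict. The value of TOTAL is not shown in the post, so the sentinel below is hypothetical, and tf is written as a free function in Python 3 rather than as the original method.

```python
from collections import defaultdict

TOTAL = "__total__"  # hypothetical sentinel key; the real TOTAL is not shown in the post


def tf(term, freq_vector):
    """Term frequency that never inserts keys into freq_vector."""
    total = freq_vector.get(TOTAL, 0)
    if total == 0:
        return 0.0
    # .get() returns the default without storing it, even on a defaultdict.
    return float(freq_vector.get(term, 0)) / total


fv = defaultdict(int, {TOTAL: 4, "cat": 1})
size_before = len(fv)
scores = [tf(term, fv) for term in ("cat", "dog", "emu")]
size_after = len(fv)

print(scores, size_before, size_after)
```

Missing terms score zero and the vector keeps its original size, however large the domain is.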

This is what the output looks like with the two lines in self.tf:

Memory increased by 12 
Memory increased by 948 
Memory increased by 28 
Memory increased by 36 
Memory increased by 36 
Memory increased by 32 
Memory increased by 28 
Memory increased by 32 
Memory increased by 32 
Memory increased by 32 
Memory increased by 40 
Memory increased by 32 
Memory increased by 32 
Memory increased by 28 

and without the two lines:

Memory increased by 1652 
Memory increased by 3576 
Memory increased by 4220 
Memory increased by 5760 
Memory increased by 7296 
Memory increased by 8840 
Memory increased by 10456 
Memory increased by 12824 
Memory increased by 13460 
Memory increased by 15000 
Memory increased by 17448 
Memory increased by 18084 
Memory increased by 19628 
Memory increased by 22080 
Memory increased by 22708 
Memory increased by 24248 
Memory increased by 26704 
Memory increased by 27332 
Memory increased by 28864 
Memory increased by 30404 
Memory increased by 32856 
Memory increased by 33552 
Memory increased by 35024 
Memory increased by 36564 
Memory increased by 39016 
Memory increased by 39924 
Memory increased by 42104 
Memory increased by 42724 
Memory increased by 44268 
Memory increased by 46720 
Memory increased by 47352 
Memory increased by 48952 
Memory increased by 50428 
Memory increased by 51964 
Memory increased by 53508 
Memory increased by 55960 
Memory increased by 56584 
Memory increased by 58404 
Memory increased by 59668 
Memory increased by 61208 
Memory increased by 62744 
Memory increased by 64400 

I look forward to your answers.

Edit: It seems my terminology may be wrong (or may appear to be wrong).

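  1. The memory leak I am referring to is not the one caused by freqVector[term] (looking up a nonexistent key in a defaultdict).
  2. The actual memory leak I am talking about comes from "for fid, fv in freqDic.iteritems()"! I know that fv grows because of 1), but it should still be destroyed at the end of each iteration! Memory should not keep growing. Isn't that a memory leak?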

Answers

2

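Iterating over freqDic does not create new values; it hands you references to the values the dict already holds. This means that any new entries you add to fv are still held by freqDic, even after the loop.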

Another solution is to clear freqDic once the loop has finished.
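A small sketch of this point, using Python 3 syntax (items() instead of iteritems()) and invented sample data; freq_dic and domain stand in for the poster's freqDic and domain.

```python
from collections import defaultdict

freq_dic = {
    "doc1": defaultdict(int, {"cat": 1}),
    "doc2": defaultdict(int, {"dog": 2}),
}
domain = ["cat", "dog", "emu", "fox"]

same_object = []
for fid, fv in freq_dic.items():
    # fv is not a copy: it is the very object stored in freq_dic.
    same_object.append(fv is freq_dic[fid])
    for term in domain:
        fv[term]  # a defaultdict lookup inserts each missing term

# The loop variable is gone, but the growth lives on inside freq_dic.
sizes_after_loop = {fid: len(vector) for fid, vector in freq_dic.items()}
print(same_object, sizes_after_loop)

# Clearing the outer dict drops its references to the grown vectors.
freq_dic.clear()
```

Every vector ends up with one entry per term in the domain, so memory grows with each document processed and is only released once freq_dic lets go of the vectors.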

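In general, Python passes references to objects everywhere, although it sometimes looks otherwise. Strings and integers are immutable: when they are "changed", the name is simply bound to a new object and the original object is left untouched.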
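The difference between mutating a shared object and rebinding a name can be seen in a short sketch; the function names grow and bump are invented for this illustration.

```python
def grow(vector):
    # Mutates the dict object that the caller also holds a reference to.
    vector["extra"] = 1


def bump(count):
    # Ints are immutable: += binds the local name to a new int object.
    count += 1
    return count


shared = {"cat": 1}
grow(shared)

n = 5
result = bump(n)

print(shared, n, result)
```

The dict changes for the caller, while n keeps its old value; only the returned object carries the new number.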

+0

Thanks. That makes sense. – disappearedng 2010-04-06 15:14:13

0

This is not a memory leak, because no memory is leaking; it is being consumed by your defaultdict. For example:

from collections import defaultdict 

d = defaultdict(int) 
for i in xrange(10**7): 
    a = d[i] 

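Would you call this a memory leak? You are assigning values to a dictionary, and memory usage grows because of that, so it is equivalent to this: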

d = {} 
for i in xrange(10**7): 
    d[i] = 0 

which is not a memory leak.

+0

Please read my edit. – disappearedng 2010-04-06 15:13:15

1

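I suspect that Python's memory usage may be growing because floats are also objects in Python, and the interpreter maintains an unbounded, immortal free list of floats. So whenever a float calculation produces a float that has not occurred before, Python allocates a new float object in the free list and then keeps that object around in case it is needed later.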

See a similar discussion in the Python bug tracker here.