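Python memory leak - solved, but still puzzled

I have successfully debugged my own memory leak problem. However, I noticed some very strange behavior along the way.
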
for fid, fv in freqDic.iteritems():
    outf.write(fid + "\t")  # ID
    for i, term in enumerate(domain):  # Vector
        tfidf = self.tf(term, fv) * self.idf(term, docFreqDic)
        if i == len(domain) - 1:
            outf.write("%f\n" % tfidf)
        else:
            outf.write("%f\t" % tfidf)
    outf.flush()
    print "Memory increased by", int(self.memory_mon.usage()) - startMemory
outf.close()

def tf(self, term, freqVector):
    total = freqVector[TOTAL]
    if total == 0:
        return 0
    if term not in freqVector:  # Without these two lines the memory leak occurs
        return 0                #
    return float(freqVector[term]) / freqVector[TOTAL]

def idf(self, term, docFrequencyPerTerm):
    if term not in docFrequencyPerTerm:
        return 0
    return math.log(float(docFrequencyPerTerm[TOTAL]) / docFrequencyPerTerm[term])

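Let me describe my problem:

1) I am doing TF-IDF calculations.
2) I traced the source of the memory leak to a defaultdict.
3) I am using memory_mon from "How to get current CPU and RAM usage in Python?".
4) The reason for my memory leak is this: in self.tf, if the lines "if term not in freqVector: return 0" are not present, memory leaks. (I verified this with memory_mon and saw memory usage rise sharply and keep rising.)
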
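The solution to my problem:

1) Since fv is a defaultdict, any lookup of a key that is not in fv creates an entry for it. Over a very large domain, this causes the memory growth.
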
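The insert-on-lookup behavior described above can be reproduced in a few lines. This is a minimal sketch in Python 3 (the code in the question is Python 2); the keys and the small `fv` stand-in are made up for illustration and are not the real data.

```python
from collections import defaultdict

fv = defaultdict(int)   # illustrative stand-in for one frequency vector
fv["apple"] = 3

# Indexing a missing key calls the default factory AND stores the result.
_ = fv["banana"]
size_after_index = len(fv)      # "banana" is now a real entry

# Membership tests and .get() only read; they never insert.
present = "cherry" in fv
value = fv.get("durian", 0)
size_after_reads = len(fv)

print(size_after_index, size_after_reads, present, value)
```

This prints `2 2 False 0`: the indexed lookup grew the dict, while the `in` test and `.get()` left it unchanged. That matches the effect of the two guard lines in `self.tf`.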
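I decided to use a dict instead of a defaultdict, and the memory problem did go away. My only puzzle is this: since fv is created in "for fid, fv in freqDic.iteritems():", shouldn't fv be destroyed at the end of every loop iteration? I tried putting gc.collect() at the end of the for loop, but gc was not able to collect anything (it returns 0). So yes, the defaultdict hypothesis is right, but if the for loop destroys all of its temporary variables, memory should stay fairly constant from one iteration to the next.
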
This is what the output looks like with those two lines in self.tf:

Memory increased by 12
Memory increased by 948
Memory increased by 28
Memory increased by 36
Memory increased by 36
Memory increased by 32
Memory increased by 28
Memory increased by 32
Memory increased by 32
Memory increased by 32
Memory increased by 40
Memory increased by 32
Memory increased by 32
Memory increased by 28

And without the two lines:

Memory increased by 1652
Memory increased by 3576
Memory increased by 4220
Memory increased by 5760
Memory increased by 7296
Memory increased by 8840
Memory increased by 10456
Memory increased by 12824
Memory increased by 13460
Memory increased by 15000
Memory increased by 17448
Memory increased by 18084
Memory increased by 19628
Memory increased by 22080
Memory increased by 22708
Memory increased by 24248
Memory increased by 26704
Memory increased by 27332
Memory increased by 28864
Memory increased by 30404
Memory increased by 32856
Memory increased by 33552
Memory increased by 35024
Memory increased by 36564
Memory increased by 39016
Memory increased by 39924
Memory increased by 42104
Memory increased by 42724
Memory increased by 44268
Memory increased by 46720
Memory increased by 47352
Memory increased by 48952
Memory increased by 50428
Memory increased by 51964
Memory increased by 53508
Memory increased by 55960
Memory increased by 56584
Memory increased by 58404
Memory increased by 59668
Memory increased by 61208
Memory increased by 62744
Memory increased by 64400

I look forward to your answers.

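EDIT: It seems my terminology may be wrong (or at least appears to be).

- The memory leak I am referring to is not the one generated by freqVector[term] (looking up a non-existent key in a defaultdict).
- The actual memory leak I am talking about comes from

for fid, fv in freqDic.iteritems()

I know that, because of 1), fv grows in size, but it should still be destroyed at the end of each loop iteration! Memory should not keep growing. Isn't that a memory leak?
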
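One way to test the premise that fv is destroyed after each iteration is to check whether the loop variable is a copy or a reference. The sketch below is in Python 3 and uses a made-up two-document `freq_dic` and a four-term `domain`; it is offered as a possible explanation to verify, not a confirmed diagnosis of the original program.

```python
from collections import defaultdict

# Hypothetical miniature of freqDic: document ID -> frequency vector.
freq_dic = {
    "doc1": defaultdict(int, {"apple": 2}),
    "doc2": defaultdict(int, {"pear": 1}),
}
domain = ["apple", "pear", "plum", "fig"]

sizes_before = {fid: len(fv) for fid, fv in freq_dic.items()}

for fid, fv in freq_dic.items():
    # fv is the same object that freq_dic[fid] refers to, not a copy.
    assert fv is freq_dic[fid]
    for term in domain:
        _ = fv[term]    # unguarded lookup inserts every missing term

sizes_after = {fid: len(fv) for fid, fv in freq_dic.items()}
print(sizes_before)
print(sizes_after)
```

In this sketch each vector grows from 1 entry to 4 and stays that size after the loop, because the outer dict still holds a reference to it. Rebinding the name `fv` on the next iteration does not free anything, which would also be consistent with gc.collect() returning 0: the grown dicts are still reachable, so there is nothing to collect.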

Thanks. That makes sense. – disappearedng 2010-04-06 15:14:13