This method iterates over the list of terms in the database, checks whether each term appears in the text passed as an argument, and if it does, replaces it with a link to the search page using the term as the parameter. Why is this Python method leaking memory?
The number of terms is high (about 100,000), so the process is very slow, but that's fine since it runs as a cron job. However, it makes the script's memory consumption skyrocket and I can't find out why:
import re

from django.db import models


class SearchedTerm(models.Model):

    [...]

    @classmethod
    def add_search_links_to_text(cls, string, count=3, queryset=None):
        """
        Take a list of all researched terms and search for them in the
        text. If they exist, turn them into links to the search page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites have different URL schemas, we don't
        provide direct links; we inject the {% url %} tag instead, so it
        must be rendered before display. You can use the `eval` tag
        from `libs` for this. Since they have different namespaces as
        well, we insert a generic 'namespace' and delegate to the
        template to replace it with the proper one.

        If you have a batch process to run, you can pass a queryset
        that will be used instead of fetching all searched terms on
        each call.
        """
        found = 0

        terms = queryset or cls.on_site.all()

        # To avoid replacing duplicate searched terms twice, keep a set of
        # already linkified content, seeded with the words we are going to
        # insert with the link so they won't match on later passes.
        processed = set((u'video', u'streaming', u'title',
                         u'search', u'namespace', u'href', u'url'))

        for term in terms:

            text = term.text.lower()

            # Skip small words, and do a quick containment check to avoid
            # all the rest of the matching.
            if len(text) < 3 or text not in string:
                continue

            if found and cls._is_processed(text, processed):
                continue

            # Match the search word with accents, in any case; ensure it is
            # not part of a bigger word by requiring a 'non-letter' character
            # (or the string boundary) on both ends of the word.
            pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text,
                                 re.UNICODE | re.IGNORECASE)

            if re.search(pattern, string):
                found += 1

                # Create the link string and replace the word in the
                # description. Use back references (\1, \2, etc.) to preserve
                # the original formatting, and raw unicode strings (ur"string"
                # notation) to avoid problems with accents and escaping.
                query = '-'.join(term.text.split())
                url = ur'{%% url namespace:static-search "%s" %%}' % query
                replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

                string = re.sub(pattern, replace_with, string)

                processed.add(text)

                if found >= count:
                    break

        return string
You will probably also want the code of _is_processed():
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def _is_processed(cls, text, processed):
        """
        Check if the text is part of an already processed string.

        We don't use `in` on the set itself, but `in` on each string of
        the set, to avoid substring matches that would destroy the tags.

        This is mainly a utility function, so you probably won't use it
        directly.
        """
        if text in processed:
            return True
        return any((text in string) for string in processed)
I really have only two objects holding references that could be the suspects here: terms and processed. But I can't see any reason why they wouldn't be garbage collected.
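One rough way to check whether anything from a call actually stays alive is to compare the garbage collector's object count before and after; this is only a diagnostic sketch (some_text below is a placeholder description, not part of the real code):

import gc

some_text = u"some video description used only for this test"

def live_object_count():
    # Collect first so that only genuinely reachable objects are counted.
    gc.collect()
    return len(gc.get_objects())

before = live_object_count()
SearchedTerm.add_search_links_to_text(some_text)
after = live_object_count()
# If terms and processed are really collected, this difference should stay
# small and stable across repeated calls.
print(after - before)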
EDIT:
I guess I should mention that this method is itself called from inside a Django model method. I don't know if it's relevant, but here is the code:
class Video(models.Model):

    [...]

    def update_html_description(self, links=3, queryset=None):
        """
        Take a list of all researched terms and search for them in the
        description. If they exist, turn them into links to the search
        engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and therefore has the same
        limitations.

        It DOESN'T call save().
        """
        queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
        text = self.description or self.title
        self.html_description = SearchedTerm.add_search_links_to_text(text,
                                                                      links,
                                                                      queryset)
I can imagine that the automatic Python regex caching eats up some memory. But it should do it only once, yet the memory consumption goes up with every call to update_html_description.
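If the regex cache were the culprit, emptying it explicitly should flatten the curve: CPython's re module keeps an internal cache of compiled patterns, and re.purge() clears it. The wrapper below is only a sketch (its name is made up):

import re

def update_description_and_purge(video, queryset=None):
    # Hypothetical wrapper: run the normal update, save, then drop re's
    # internal pattern cache. If memory keeps growing anyway, the regex
    # cache is not what is holding on to it.
    video.update_html_description(queryset=queryset)
    video.save()
    re.purge()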
The problem is not only that it consumes a lot of memory; the problem is that it doesn't release it: every call takes about 3% of the RAM, eventually filling it up and making the script crash with "cannot allocate memory".
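For the record, this is roughly how the growth per call can be measured (a sketch only, Unix-specific: ru_maxrss is the peak resident set size, reported in kilobytes on Linux and bytes on OS X; the loop over Video is just an example of what the cron job does):

import resource

def peak_rss():
    # Peak resident set size of the current process so far.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for video in Video.objects.all().iterator():
    video.update_html_description()
    video.save()
    print('video %s -> peak RSS %s' % (video.pk, peak_rss()))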
"It's nearly impossible to leak memory in a garbage-collected language like Python. Strictly speaking, a memory leak is memory that nothing refers to anymore. In C++, you can leak memory if you allocate it in a class but don't declare a destructor. What you have here is just high memory consumption." –
:-) OK. Still, I get a higher and higher memory consumption after each call. But since this is a method, and since I don't keep a reference to anything once it is done, why would something still be consuming memory? –
I updated the question with this. –