2013-01-18 43 views

I've managed to write a simple indexer script for mongoDB using pymongo, but I don't understand why indexing, adding documents, and querying takes up 96GB of RAM on the server. Why does mongoDB use so much RAM?

Is it because my queries are not optimized? How can I optimize a query like database.find_one({"eng":src})?

How can I optimize my indexer script?

So my input is this (the actual data input has 2,000,000+ lines of sentences of varying length):

#srcfile

You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka. 
One of the people assassinated very recently in Sri Lanka was Mr Kumar Ponnambalam, who had visited the European Parliament just a few months ago. 
Would it be appropriate for you, Madam President, to write a letter to the Sri Lankan President expressing Parliament's regret at his and the other violent deaths in Sri Lanka and urging her to do everything she possibly can to seek a peaceful reconciliation to a very difficult situation? 
Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate. 
If the House agrees, I shall do as Mr Evans has suggested. 

#trgfile

Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten. 
Zu den Attentatsopfern, die es in jüngster Zeit in Sri Lanka zu beklagen gab, zählt auch Herr Kumar Ponnambalam, der dem Europäischen Parlament erst vor wenigen Monaten einen Besuch abgestattet hatte. 
Wäre es angemessen, wenn Sie, Frau Präsidentin, der Präsidentin von Sri Lanka in einem Schreiben das Bedauern des Parlaments zum gewaltsamen Tod von Herrn Ponnambalam und anderen Bürgern von Sri Lanka übermitteln und sie auffordern würden, alles in ihrem Kräften stehende zu tun, um nach einer friedlichen Lösung dieser sehr schwierigen Situation zu suchen? 
Ja, Herr Evans, ich denke, daß eine derartige Initiative durchaus angebracht ist. 
Wenn das Haus damit einverstanden ist, werde ich dem Vorschlag von Herrn Evans folgen. 

An example document looks like this:

{ 
    "_id" : ObjectId("50f5fe8916174763f6217994"), 
    "deu" : "Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri 
      Lanka mehrere Bombenexplosionen mit zahlreichen Toten.\n", 
    "uid" : 13, 
    "eng" : "You will be aware from the press and television that there have been a 
      number of bomb explosions and killings in Sri Lanka." 
} 

My code:

# -*- coding: utf8 -*-
import codecs, glob, os
from pymongo import MongoClient
from itertools import izip

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Gets first instance of a matching key given a value and a dictionary.
def getKey(dic, value):
    return [k for k, v in dic.items() if v == value]

def langiso(lang, isochar=3):
    languages = {"en": "eng",
                 "da": "dan", "de": "deu",
                 "es": "spa",
                 "fi": "fin", "fr": "fre",
                 "it": "ita",
                 "nl": "nld",
                 "zh": "mcn"}
    if len(lang) == 2 and isochar == 3:  # was `or`: a 3-letter code with the default isochar raised KeyError
        return languages[lang]
    if len(lang) == 3 and isochar == 2:
        return getKey(languages, lang)  # was getKey(lang): the dictionary argument was missing

def txtPairs(bitextDir):
    txtpairs = {}
    for infile in glob.glob(os.path.join(bitextDir, '*')):
        #print infile
        k = infile[-8:-3]; lang = infile[-2:]
        try:
            txtpairs[k] = (txtpairs[k], infile) if lang == "en" else (infile, txtpairs[k])
        except KeyError:  # a bare except would also hide unrelated errors
            txtpairs[k] = infile
    for i in txtpairs.keys():  # iterate over a copy of the keys while deleting
        if len(txtpairs[i]) != 2:
            del txtpairs[i]
    return txtpairs

def indexEuroparl(sfile, tfile, database):
    trglang = langiso(tfile[-2:])  #; srclang = langiso(sfile[-2:])

    maxdoc = database.find().sort("uid", -1).limit(1)
    uid = 1 if maxdoc.count() == 0 else maxdoc[0]["uid"] + 1  # was maxdoc[0], a whole document rather than an integer

    counter = 0
    for src, trg in izip(codecs.open(sfile, "r", "utf8"),
                         codecs.open(tfile, "r", "utf8")):
        quid = database.find_one({"eng": src})
        # If the sentence already exists in the db
        if quid is not None:
            if trglang in quid:  # was database.find({trglang: {"$exists": True}}), which is always truthy
                print "Sentence uniqID", quid["uid"], "already exists."
                continue
            else:
                print "Reindexing uniqID", quid["uid"], "..."
                database.update({"uid": quid["uid"]}, {"$push": {trglang: trg}})
        else:
            print "Indexing uniqID", uid, "..."
            doc = {"uid": uid, "eng": src, trglang: trg}
            database.insert(doc)
            uid += 1
        if counter == 1000:
            for i in database.find():
                print i
            counter = 0
        counter += 1

connection = MongoClient()
db = connection["europarl"]
v7 = db["v7"]

srcfile = "eng-deu.en"; trgfile = "eng-deu.de"
indexEuroparl(srcfile, trgfile, v7)

# After indexing the english-german pair, perform the same indexing on the other language pairs
srcfile = "eng-spa.en"; trgfile = "eng-spa.es"
indexEuroparl(srcfile, trgfile, v7)

To save us from having to work through your code, can you show us a sample document? – Sammaye


So you are querying for the very thing you need to get back, i.e. the translation? Correct me if I'm wrong, but to run that query you must already know which document you want to pull out, which raises the question: why run the query at all? – Sammaye


Post the output of 'getIndexes' and an 'explain()' of the query –

Answer


After a few rounds of profiling the code, I found where the RAM was leaking.

First, if I am going to query the "eng" field frequently, I should create an index on that field:

v7.ensure_index([("eng", 1)], unique=True)

This resolves the sequential scan that every query over the unindexed "eng" field was performing.
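The effect of that index can be pictured with a plain-Python analogy (a hypothetical illustration, not pymongo code): find_one on an unindexed field behaves like a linear scan over a list of documents, while an index behaves like a dict keyed on the field:

```python
# Hypothetical in-memory stand-ins for a collection and an index on "eng".
docs = [{"uid": i, "eng": "sentence %d" % i} for i in range(5)]

# Without an index: examine every document until the field matches (a collection scan).
def find_one_scan(collection, src):
    for d in collection:
        if d["eng"] == src:
            return d
    return None

# With an index: a single keyed lookup, no scan.
eng_index = {d["eng"]: d for d in docs}

print(find_one_scan(docs, "sentence 3")["uid"])  # 3
print(eng_index["sentence 3"]["uid"])            # 3
```

With 2,000,000+ documents, the scan cost is paid on every find_one call, which is why the index matters here.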

Second, the RAM-bleeding problem was caused by this expensive set of calls:

doc = {"uid": uid, "eng": src, trglang: trg}
if counter == 1000:
    for i in database.find():
        print i
    counter = 0
counter += 1

What MongoDB does is keep those results in RAM, as @Sammaye has noted. Every time I called database.find() with no filter, it pulled back the whole set of documents I had added to the collection so far. That is how I burned through 96GB of RAM. The code above had to be changed to:

doc = {"uid": uid, "eng": src, trglang: trg}
if counter == 1000:
    print doc
counter += 1

By removing database.find() and creating an index on the "eng" field, I use at most 25GB of RAM, and indexing the 2 million sentences finished in under an hour.
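The memory difference between materializing the whole result set and streaming one document at a time can be sketched in plain Python (an analogy, not pymongo code): a list holds every element at once, while a generator yields them one by one:

```python
import sys

N = 100000
# Materialized: every "document" lives in memory at once, like dumping an
# unfiltered find() over the whole collection on each progress check.
materialized = [{"uid": i} for i in range(N)]

# Streamed: only one "document" exists at a time.
streamed = ({"uid": i} for i in range(N))

# The container overhead alone differs by orders of magnitude.
print(sys.getsizeof(materialized) > 100 * sys.getsizeof(streamed))  # True
```

Printing only the current doc, as in the fixed code above, is the streaming side of this trade-off.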
