2014-04-18
6

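Python defaultdict(list) de/serialization performance

I am working on a script that needs to process a rather large (620 000 words) lexicon on start. The input lexicon is processed word by word into a defaultdict(list), with keys being letter bigrams and trigrams and values being lists of the words that contain that n-gram, using: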

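from collections import defaultdict

lexicon = defaultdict(list)
for word in lexicon_file:
    word = word.strip().lower()  # strip() drops the trailing newline
    for n in (2, 3):  # letter bigrams and trigrams
        for i in range(len(word) - n + 1):
            lexicon[word[i:i + n]].append(word)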

> lexicon["ab"] 
["abracadabra", "abbey", "abnormal"] 
Each key is a letter n-gram, and each value is the list of words containing it.

The resulting structure contains 25 000 keys, each holding a list of between 1 and 133 000 strings (average 500, median 20). All strings are in windows-1250 encoding.
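Figures like these can be checked on your own lexicon with the standard library. The following is a minimal sketch for Python 3; `describe_lexicon` and the tiny sample data are hypothetical and not part of the original script.

```python
from collections import defaultdict
from statistics import mean, median


def describe_lexicon(lexicon):
    """Return (key count, shortest, longest, mean, median) of the value-list lengths."""
    lengths = [len(words) for words in lexicon.values()]
    return len(lengths), min(lengths), max(lengths), mean(lengths), median(lengths)


# Hypothetical sample data, only here to make the sketch runnable.
sample = defaultdict(list)
sample["ab"].extend(["abracadabra", "abbey", "abnormal"])
sample["bb"].append("abbey")

stats = describe_lexicon(sample)
print(stats)
```

A large gap between mean and median, as in the question, suggests that a few very common n-grams hold most of the strings.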

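This processing takes a long time (negligible compared with the expected real run time of the script, but generally taxing when testing). Since the lexicon itself never changes, I thought it might be faster to serialize the resulting defaultdict(list) once and then deserialize it on every subsequent start.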
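The build-once, load-afterwards idea described above could look roughly like this. It is a sketch under assumptions: it targets Python 3 (where `cPickle` has been merged into `pickle`), and `build_lexicon`, `load_or_build` and the cache file name are hypothetical names chosen for illustration.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def build_lexicon(words):
    """Index every word under each of its letter bigrams and trigrams."""
    lexicon = defaultdict(list)
    for word in words:
        word = word.lower()
        for n in (2, 3):
            for i in range(len(word) - n + 1):
                lexicon[word[i:i + n]].append(word)
    return lexicon


def load_or_build(words, cache_path):
    """Load the cached lexicon if the file exists; otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    lexicon = build_lexicon(words)
    with open(cache_path, "wb") as f:
        pickle.dump(lexicon, f)
    return lexicon


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "lexicon.pickle")
    first = load_or_build(["Abbey", "abnormal"], cache)  # builds and writes the cache
    second = load_or_build([], cache)                    # served from the cache file
```

Whether the cached load actually beats rebuilding depends on the data and the serialization format, so it is worth timing both paths.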

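What I found is that, even when using cPickle, deserialization takes about twice as long as simply processing the lexicon, with average values close to: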

> normal lexicon creation 
45 seconds 
> cPickle deserialization 
80 seconds 

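I have no experience with serialization, but I expected deserialization to be faster than normal processing, at least with the cPickle module.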

My question is: is this result to be expected? Why? Is there any way to store/load my structure faster?

+1

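Which pickle protocol format are you specifying? It defaults to ASCII. Select binary version 2 (or -1 for the latest) by passing it as the third argument to 'pickle.dump()'. –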
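To illustrate the protocol argument mentioned in this comment, here is a small sketch. It assumes Python 3, where the default protocol is already binary, so protocol 0 is requested explicitly to show the ASCII format; the sample data is made up.

```python
import io
import pickle
from collections import defaultdict

lexicon = defaultdict(list)
lexicon["ab"].extend(["abracadabra", "abbey", "abnormal"])

# Protocol 0 is the ASCII format; -1 selects the highest binary protocol.
# The protocol is the third argument of pickle.dump().
ascii_buf = io.BytesIO()
pickle.dump(lexicon, ascii_buf, 0)
binary_buf = io.BytesIO()
pickle.dump(lexicon, binary_buf, -1)

ascii_blob = ascii_buf.getvalue()
binary_blob = binary_buf.getvalue()

# Both blobs restore an equal defaultdict; only the on-disk encoding differs.
restored_ascii = pickle.loads(ascii_blob)
restored_binary = pickle.loads(binary_blob)
print(len(ascii_blob), len(binary_blob))
```

When writing a binary protocol to a real file, open it with `'wb'` and read it back with `'rb'`.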

+0

@KevinThibedeau Specifying the protocol format helped a lot. Unpickling is now comparable to normal processing, but still slower. – Deutherius

Answer

2

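The best approach is to write a set of tests and use timeit to see which is faster. I ran some tests below, but you should try them with your own lexicon dictionary, since results may vary.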

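If you want the timings to be more stable (accurate), you can increase the number argument to timeit; it just makes the tests take longer. Also note that the value timeit returns is the total execution time for all runs, not the time per run.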
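As a small illustration of the total-versus-per-run point, the sketch below uses `timeit.repeat` and divides by `number`. The `work` function is a hypothetical stand-in for whatever you want to measure, and the code targets Python 3.

```python
from timeit import repeat


def work():
    """Stand-in workload; replace with the function you want to measure."""
    return sorted(range(1000), reverse=True)


number = 100
# repeat() returns one *total* time per round, each round covering `number` calls.
totals = repeat(work, number=number, repeat=3)
# Taking the fastest round and dividing by `number` gives a per-call estimate.
per_call = min(totals) / number
print("best total: %.6f s, per call: %.8f s" % (min(totals), per_call))
```

Using the minimum of several rounds tends to reduce noise from other processes on the machine.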

testing with 10 keys... 
serialize flat: 2.97198390961 
serialize eval: 4.60271120071 
serialize defaultdict: 20.3057091236 
serialize dict: 20.2011070251 
serialize defaultdict new pickle: 14.5152060986 
serialize dict new pickle: 14.7755970955 
serialize json: 13.5039670467 
serialize cjson: 4.0456969738 
unserialize flat: 1.29577493668 
unserialize eval: 25.6548647881 
unserialize defaultdict: 10.2215960026 
unserialize dict: 10.208122015 
unserialize defaultdict new pickle: 5.70747089386 
unserialize dict new pickle: 5.69750404358 
unserialize json: 5.34811091423 
unserialize cjson: 1.50241613388 
testing with 100 keys... 
serialize flat: 2.91076397896 
serialize eval: 4.72978711128 
serialize defaultdict: 21.331786871 
serialize dict: 21.3218340874 
serialize defaultdict new pickle: 15.7140991688 
serialize dict new pickle: 15.6440980434 
serialize json: 14.3557379246 
serialize cjson: 5.00576901436 
unserialize flat: 1.6677339077 
unserialize eval: 22.9142649174 
unserialize defaultdict: 10.7773029804 
unserialize dict: 10.7524499893 
unserialize defaultdict new pickle: 6.13370203972 
unserialize dict new pickle: 6.18057107925 
unserialize json: 5.92281794548 
unserialize cjson: 1.91151690483 

Code:

import cPickle 
import json 
try: 
    import cjson # not Python standard library 
except ImportError: 
    cjson = False 
from collections import defaultdict 

dd1 = defaultdict(list) 
dd2 = defaultdict(list) 

for i in xrange(1000000): 
    dd1[str(i % 10)].append(str(i)) 
    dd2[str(i % 100)].append(str(i)) 

dt1 = dict(dd1) 
dt2 = dict(dd2) 

from timeit import timeit 

def testdict(dd, dt):
    def serialize_defaultdict():
        # Default protocol (0 under Python 2) is ASCII, so text mode works.
        with open('defaultdict.pickle', 'w') as f:
            cPickle.dump(dd, f)

    def serialize_p2_defaultdict():
        # Protocol -1 (highest available) is binary: open the file in binary mode.
        with open('defaultdict.pickle2', 'wb') as f:
            cPickle.dump(dd, f, -1)

    def serialize_dict():
        with open('dict.pickle', 'w') as f:
            cPickle.dump(dt, f)

    def serialize_p2_dict():
        with open('dict.pickle2', 'wb') as f:
            cPickle.dump(dt, f, -1)

    def serialize_json():
        with open('dict.json', 'w') as f:
            json.dump(dt, f)

    if cjson:
        def serialize_cjson():
            with open('dict.cjson', 'w') as f:
                f.write(cjson.encode(dt))

    def serialize_flat():
        # One line per key: the key followed by its values, space-separated.
        with open('dict.flat', 'w') as f:
            f.write('\n'.join([' '.join([k] + v) for k, v in dt.iteritems()]))

    def serialize_eval():
        # One line per key: the key, a tab, then the repr() of the value list.
        with open('dict.eval', 'w') as f:
            f.write('\n'.join([k + '\t' + repr(v) for k, v in dt.iteritems()]))

    def unserialize_defaultdict():
        with open('defaultdict.pickle') as f:
            assert cPickle.load(f) == dd

    def unserialize_p2_defaultdict():
        # Binary pickle protocol: read the file in binary mode.
        with open('defaultdict.pickle2', 'rb') as f:
            assert cPickle.load(f) == dd

    def unserialize_dict():
        with open('dict.pickle') as f:
            assert cPickle.load(f) == dt

    def unserialize_p2_dict():
        with open('dict.pickle2', 'rb') as f:
            assert cPickle.load(f) == dt

    def unserialize_json():
        with open('dict.json') as f:
            assert json.load(f) == dt

    if cjson:
        def unserialize_cjson():
            with open('dict.cjson') as f:
                assert cjson.decode(f.read()) == dt

    def unserialize_flat():
        # First field on each line is the key, the rest are the values.
        with open('dict.flat') as f:
            dtx = {}
            for line in f:
                vals = line.split()
                dtx[vals[0]] = vals[1:]
            assert dtx == dt

    def unserialize_eval():
        # Key and repr() of the list are tab-separated; eval() rebuilds the list.
        with open('dict.eval') as f:
            dtx = {}
            for line in f:
                vals = line.split('\t')
                dtx[vals[0]] = eval(vals[1])
            assert dtx == dt

    print 'serialize flat:', timeit(serialize_flat, number=10) 
    print 'serialize eval:', timeit(serialize_eval, number=10) 
    print 'serialize defaultdict:', timeit(serialize_defaultdict, number=10) 
    print 'serialize dict:', timeit(serialize_dict, number=10) 
    print 'serialize defaultdict new pickle:', timeit(serialize_p2_defaultdict, number=10) 
    print 'serialize dict new pickle:', timeit(serialize_p2_dict, number=10) 
    print 'serialize json:', timeit(serialize_json, number=10) 
    if cjson: 
        print 'serialize cjson:', timeit(serialize_cjson, number=10)
    print 'unserialize flat:', timeit(unserialize_flat, number=10) 
    print 'unserialize eval:', timeit(unserialize_eval, number=10) 
    print 'unserialize defaultdict:', timeit(unserialize_defaultdict, number=10) 
    print 'unserialize dict:', timeit(unserialize_dict, number=10) 
    print 'unserialize defaultdict new pickle:', timeit(unserialize_p2_defaultdict, number=10) 
    print 'unserialize dict new pickle:', timeit(unserialize_p2_dict, number=10) 
    print 'unserialize json:', timeit(unserialize_json, number=10) 
    if cjson: 
        print 'unserialize cjson:', timeit(unserialize_cjson, number=10)

print 'testing with 10 keys...' 
testdict(dd1, dt1) 

print 'testing with 100 keys...' 
testdict(dd2, dt2) 
+0

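Thank you for the excellent and exhaustive comparison of the available methods. I went with the flat serialize/unserialize approach, since speed is pretty much the only concern here. I also tried playing with cjson, but it seems to use unicode for encoding/decoding by default, which creates inconsistencies in my script. – Deutherius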
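For reference, a flat-file round trip for a `defaultdict(list)` might be sketched as below. This is an assumption-laden illustration rather than the commenter's actual code: it targets Python 3, `save_flat` and `load_flat` are hypothetical helper names, words are assumed to contain no spaces, and `cp1250` is Python's codec name for windows-1250.

```python
import os
import tempfile
from collections import defaultdict


def save_flat(lexicon, path, encoding="cp1250"):
    """Write one line per key: the key followed by its words, space-separated."""
    with open(path, "w", encoding=encoding) as f:
        for key, words in lexicon.items():
            f.write(" ".join([key] + words) + "\n")


def load_flat(path, encoding="cp1250"):
    """Rebuild the defaultdict(list) from the flat file."""
    lexicon = defaultdict(list)
    with open(path, encoding=encoding) as f:
        for line in f:
            fields = line.rstrip("\n").split(" ")
            if fields[0]:  # skip empty lines
                lexicon[fields[0]] = fields[1:]
    return lexicon


# Hypothetical sample; "\u017e\u00e1ba" contains characters present in windows-1250.
sample = defaultdict(list)
sample["ab"].extend(["\u017e\u00e1ba", "abeceda"])
sample["ba"].append("\u017e\u00e1ba")

with tempfile.TemporaryDirectory() as tmp:
    flat_path = os.path.join(tmp, "dict.flat")
    save_flat(sample, flat_path)
    restored = load_flat(flat_path)
```

Passing the encoding explicitly keeps the file in windows-1250 on disk; in Python 3 the loaded values are `str` objects, which differs from the byte strings a Python 2 script would see.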

+0

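Also, in my case the cjson file takes almost twice as much disk space as the flat file (242 MB for cjson vs 129 MB for flat), and it is about twice as slow (14.6 s to deserialize json vs 6 s for flat). – Deutherius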
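A size comparison like this one can be reproduced with `os.path.getsize`. The sketch below uses made-up sample data and the standard `json` module rather than cjson, so it only shows how to measure, not what the numbers will be for a real lexicon. It targets Python 3.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the real lexicon.
data = {"ab": ["abracadabra", "abbey", "abnormal"], "bb": ["abbey"]}

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "dict.json")
    flat_path = os.path.join(tmp, "dict.flat")

    with open(json_path, "w") as f:
        json.dump(data, f)
    with open(flat_path, "w") as f:
        for key, words in data.items():
            f.write(" ".join([key] + words) + "\n")

    # os.path.getsize reports the on-disk size in bytes.
    json_size = os.path.getsize(json_path)
    flat_size = os.path.getsize(flat_path)

print("json: %d bytes, flat: %d bytes" % (json_size, flat_size))
```

JSON spends extra bytes on quotes, commas and brackets around every string, which the space-separated flat format avoids.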