2016-03-08 66 views
8

有一个C++相比,从列表的列表获取列表的工会:The fastest way to find union of sets的最快方法 - Python的

而且还有其他几个蟒蛇相关的问题,但没有提出建立工会的名单最快的方法:

从答案,我收集了该疗法e为至少2种方式来做到这一点:

>>> from itertools import chain 
>>> x = [[1,2,3], [3,4,5], [1,7,8]] 
>>> list(set().union(*x)) 
[1, 2, 3, 4, 5, 7, 8] 
>>> list(set(chain(*x))) 
[1, 2, 3, 4, 5, 7, 8] 

请注意,我铸造一套事后列出,因为我需要在列表的顺序是固定作进一步处理。

经过一番比较后,好像list(set(chain(*x)))更稳定,花费较少的时间:

from itertools import chain 
import time 
import random 

# Dry run. 
x = [[random.choice(range(10000)) 
    for i in range(10)] for j in range(10)] 
list(set().union(*x)) 
list(set(chain(*x))) 

y_time = 0 
z_time = 0 

for _ in range(1000): 
    x = [[random.choice(range(10000)) 
     for i in range(10)] for j in range(10)] 
    start = time.time() 
    y = list(set().union(*x)) 
    y_time += time.time() - start 
    #print 'list(set().union(*x)):\t', y_time 
    start = time.time() 
    z = list(set(chain(*x))) 
    z_time += time.time() - start 
    #print 'list(set(chain(*x))):\t', z_time 
    assert sorted(y) == sorted(z) 
    #print 

print y_time/1000. 
print z_time/1000. 

[出]:

1.39586925507e-05 
1.09834671021e-05 

取出铸造套,以列表的变量:

y_time = 0 
z_time = 0 

for _ in range(1000): 
    x = [[random.choice(range(10000)) 
     for i in range(10)] for j in range(10)] 

    start = time.time() 
    y = set().union(*x) 
    y_time += time.time() - start 

    start = time.time() 
    z = set(chain(*x)) 
    z_time += time.time() - start 

    assert sorted(y) == sorted(z) 

print y_time/1000. 
print z_time/1000. 

[out]:

1.22241973877e-05 
1.02684497833e-05 

下面是完整的输出,当我尝试打印中间计时(不含名单铸造):http://pastebin.com/raw/y3i6dXZ8

为什么是它list(set(chain(*x)))花费较少的时间比list(set().union(*x))

是否有另一种方法来实现相同的列表联合?使用numpypandassframe什么的?替代方案是否更快?

+0

的内部列表排序? – fl00r

+0

不,内部列表没有明确排序。假定列表的输入列表的顺序为未知。 – alvas

回答

7

什么是最快取决于x本质 - 无论它是一个长长的清单或名单,有许多子列表或几个子列表,子列表是否是长或短,是否有很多重复或几个副本。

以下是一些timeit结果比较一些替代方案。有这么多的可能性,这绝不是一个完整的分析,但也许这会给你一个研究你的用例的框架。

func     | x     | time 
unique_concatenate | many_uniques   | 0.863 
empty_set_union  | many_uniques   | 1.191 
short_set_union_rest | many_uniques   | 1.192 
long_set_union_rest | many_uniques   | 1.194 
set_chain   | many_uniques   | 1.224 

func     | x     | time 
long_set_union_rest | many_duplicates  | 0.958 
short_set_union_rest | many_duplicates  | 0.969 
empty_set_union  | many_duplicates  | 0.971 
set_chain   | many_duplicates  | 1.128 
unique_concatenate | many_duplicates  | 2.411 

func     | x     | time 
empty_set_union  | many_small_lists  | 1.023 
long_set_union_rest | many_small_lists  | 1.028 
set_chain   | many_small_lists  | 1.032 
short_set_union_rest | many_small_lists  | 1.036 
unique_concatenate | many_small_lists  | 1.351 

func     | x     | time 
long_set_union_rest | few_large_lists  | 0.791 
empty_set_union  | few_large_lists  | 0.813 
unique_concatenate | few_large_lists  | 0.814 
set_chain   | few_large_lists  | 0.829 
short_set_union_rest | few_large_lists  | 0.849 

一定要在自己的机器上运行timeit基准测试,因为结果可能会有所不同。


from __future__ import print_function 
import random 
import timeit 
from itertools import chain 
import numpy as np 

def unique_concatenate(x): 
    return np.unique(np.concatenate(x)) 

def short_set_union_rest(x): 
    # This assumes x[0] is the shortest list in x 
    return list(set(x[0]).union(*x[1:])) 

def long_set_union_rest(x): 
    # This assumes x[-1] is the longest list in x 
    return list(set(x[-1]).union(*x[1:])) 

def empty_set_union(x): 
    return list(set().union(*x)) 

def set_chain(x): 
    return list(set(chain(*x))) 


big_range = list(range(10**7)) 
small_range = list(range(10**5)) 
many_uniques = [[random.choice(big_range) for i in range(j)] 
       for j in range(10, 10000, 10)] 
many_duplicates = [[random.choice(small_range) for i in range(j)] 
       for j in range(10, 10000, 10)] 
many_small_lists = [[random.choice(big_range) for i in range(10)] 
        for j in range(10, 10000, 10)] 
few_large_lists = [[random.choice(big_range) for i in range(1000)] 
        for j in range(10, 100, 10)] 

if __name__=='__main__': 
    for x, n in [('many_uniques', 1), ('many_duplicates', 4), 
       ('many_small_lists', 800), ('few_large_lists', 800)]: 
     timing = dict() 
     for func in [ 
       'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest', 
       'empty_set_union', 'set_chain']: 
      timing[func, x] = timeit.timeit(
       '{}({})'.format(func, x), number=n, 
       setup='from __main__ import {}, {}'.format(func, x)) 
     print('{:20} | {:20} | {}'.format('func', 'x', 'time')) 
     for key, t in sorted(timing.items(), key=lambda item: item[1]): 
      func, x = key 
      print('{:20} | {:20} | {:.3f}'.format(func, x, t)) 
     print(end='\n')