For large DataFrames with many duplicate rows, the pandas groupby/count approach can be faster than using collections.Counter:
In [75]: df = pd.DataFrame(np.random.randint(2, size=(10000,4)))
In [76]: df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index']
Out[76]:
{(0, 0, 0, 0): 639,
(0, 0, 0, 1): 621,
(0, 0, 1, 0): 658,
(0, 0, 1, 1): 595,
(0, 1, 0, 0): 601,
(0, 1, 0, 1): 640,
(0, 1, 1, 0): 643,
(0, 1, 1, 1): 632,
(1, 0, 0, 0): 637,
(1, 0, 0, 1): 644,
(1, 0, 1, 0): 574,
(1, 0, 1, 1): 642,
(1, 1, 0, 0): 612,
(1, 1, 0, 1): 667,
(1, 1, 1, 0): 588,
(1, 1, 1, 1): 607}
In [77]: collections.Counter(df.itertuples(index=False))
Out[77]: Counter({Pandas(_0=1, _1=1, _2=0, _3=1): 667, Pandas(_0=0, _1=0, _2=1, _3=0): 658, Pandas(_0=1, _1=0, _2=0, _3=1): 644, Pandas(_0=0, _1=1, _2=1, _3=0): 643, Pandas(_0=1, _1=0, _2=1, _3=1): 642, Pandas(_0=0, _1=1, _2=0, _3=1): 640, Pandas(_0=0, _1=0, _2=0, _3=0): 639, Pandas(_0=1, _1=0, _2=0, _3=0): 637, Pandas(_0=0, _1=1, _2=1, _3=1): 632, Pandas(_0=0, _1=0, _2=0, _3=1): 621, Pandas(_0=1, _1=1, _2=0, _3=0): 612, Pandas(_0=1, _1=1, _2=1, _3=1): 607, Pandas(_0=0, _1=1, _2=0, _3=0): 601, Pandas(_0=0, _1=0, _2=1, _3=1): 595, Pandas(_0=1, _1=1, _2=1, _3=0): 588, Pandas(_0=1, _1=0, _2=1, _3=0): 574})
In [78]: %timeit collections.Counter(df.itertuples(index=False))
100 loops, best of 3: 12.8 ms per loop
In [79]: %timeit df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index']
100 loops, best of 3: 3.74 ms per loop
For DataFrames with few duplicates, the speeds are comparable:
In [80]: df = pd.DataFrame(np.random.randint(1000, size=(10000,4)))
In [81]: %timeit collections.Counter(df.itertuples(index=False))
100 loops, best of 3: 11.2 ms per loop
In [82]: %timeit df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index']
100 loops, best of 3: 11.7 ms per loop
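As a quick sanity check, the two approaches should produce identical counts. A minimal sketch (note the `name=None` argument to `itertuples`, which yields plain tuples instead of the `Pandas(...)` namedtuples shown above, so the Counter keys compare equal to the groupby keys):

```python
import collections

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(2, size=(1000, 4)))

# groupby/count: group on all columns; the surviving 'index' column holds the counts.
grouped = df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index']

# Counter over plain row tuples (name=None disables the namedtuple wrapper).
counted = collections.Counter(df.itertuples(index=False, name=None))

assert grouped == dict(counted)
```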
Exactly! Oh, how did I miss that in the docs (facepalm!) – minerals