You can use the DataFrame constructor:
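import numpy as np
import pandas as pd
# newsampledata and sampledata are existing DataFrames defined outside this snippet
N = 10
df = pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)
print(df)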
   float_col  int_col  str_col  r    v  new_coltest       eddd
0        0.1        1        a  5  1.0          0.1  -0.539783
1        0.1        1        a  5  1.0          0.1  -0.539783
2        0.1        1        a  5  1.0          0.1  -0.539783
3        0.1        1        a  5  1.0          0.1  -0.539783
4        0.1        1        a  5  1.0          0.1  -0.539783
5        0.1        1        a  5  1.0          0.1  -0.539783
6        0.1        1        a  5  1.0          0.1  -0.539783
7        0.1        1        a  5  1.0          0.1  -0.539783
8        0.1        1        a  5  1.0          0.1  -0.539783
9        0.1        1        a  5  1.0          0.1  -0.539783
print(df.dtypes)
float_col      float64
int_col          int64
str_col         object
r                int64
v              float64
new_coltest    float64
eddd           float64
dtype: object
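Timings:
For a small DataFrame, the sample and reindex methods are faster; for a large DataFrame (N = 100000), the constructor method is marginally faster.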
N = 1000
In [88]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
1000 loops, best of 3: 745 µs per loop
In [89]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 470 µs per loop
In [90]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
1000 loops, best of 3: 476 µs per loop
N = 10000
In [92]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
1000 loops, best of 3: 946 µs per loop
In [93]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
1000 loops, best of 3: 775 µs per loop
In [94]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
1000 loops, best of 3: 827 µs per loop
N = 100000
In [97]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns))
The slowest run took 12.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 6.93 ms per loop
In [98]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True))
100 loops, best of 3: 7.07 ms per loop
In [99]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True))
100 loops, best of 3: 7.87 ms per loop
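For anyone who wants to try the three approaches side by side, here is a minimal self-contained sketch. The one-row frame `row` and its three columns are made up for illustration; they only stand in for `newsampledata`. One assumption to flag: recent pandas versions may refuse to stretch a single row across an N-long index in the constructor, so this sketch repeats the row's list N times explicitly instead of passing `index=np.arange(N)`. That is a swapped-in variant, not the exact call timed above.

```python
import pandas as pd

# Hypothetical one-row frame standing in for `newsampledata`.
row = pd.DataFrame({"float_col": [0.1], "int_col": [1], "str_col": ["a"]})
N = 5

# 1. sample with replacement: with a single source row, every draw is that row.
by_sample = row.sample(N, replace=True).reset_index(drop=True)

# 2. reindex: repeat the only index label N times, then reindex against it.
by_reindex = row.reindex(row.index.repeat(N)).reset_index(drop=True)

# 3. constructor (variant): repeat the row's values N times as a list of lists.
by_ctor = pd.DataFrame(row.values.tolist() * N, columns=row.columns)

print(by_sample.equals(by_reindex))
print(by_ctor.dtypes)
```

The first two keep the original dtypes because they never leave pandas; the constructor route goes through Python lists, so the dtypes are inferred again from the values.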
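One of the good solutions; it seems to work without problems, and I agree it is faster. I didn't know how to set the index, will have to remember this one! – rajan
In a previous version you had a numpy version, with the drawback that the dtypes were converted to object. How does that solution compare in performance once you convert back to the original dtypes? Maybe numpy is still faster ;) – Quickbeam2k1
@Quickbeam2k1 - I'll try. – jezrael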