Pandas: replicate a single row to fill a DataFrame

I'm stuck at a dead end: I'm using some decidedly non-pandas code for what should be a very simple task in pandas. I'm sure there is a better way.

I have a DataFrame from which I extract a single row to create a new DataFrame, like this:

>>> sampledata 
float_col int_col str_col r v new_coltest  eddd 
0  0.1  1  a 5 1.0   0.1 -0.539783 
1  0.2  2  b 5 NaN   0.2 -1.394550 
2  0.2  6 None 5 NaN   0.2 0.290157 
3  10.1  8  c 5 NaN   10.1 -1.799373 
4  NaN  -1  a 5 NaN   NaN 0.694682 
>>> newsampledata = sampledata[(sampledata.new_coltest == 0.1) & (sampledata.float_col == 0.1)] 
>>> newsampledata 
float_col int_col str_col r v new_coltest  eddd 
0  0.1  1  a 5 1.0   0.1 -0.539783

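What I want to do is replicate the single row of "newsampledata" n times, where n is a known integer. Ideally, the final DataFrame with n rows would overwrite the single-row "newsampledata", but that is not essential.

I am currently using a for loop that runs pd.concat n-1 times to fill the DataFrame, but because of how concat works, this is not fast. I also tried the same kind of strategy with append, and that is slightly slower than concat.

I have seen some other questions about similar topics, but none that address this exact problem. Also, for performance reasons I have been steering away from map/apply, but if you have seen good performance from that approach, let me know and I will try it too.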

TIA

Source

2016-12-06 rajan

You can use the DataFrame constructor:

N = 10
df = pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)
print(df)
    float_col int_col str_col r v new_coltest  eddd 
0  0.1  1  a 5 1.0   0.1 -0.539783 
1  0.1  1  a 5 1.0   0.1 -0.539783 
2  0.1  1  a 5 1.0   0.1 -0.539783 
3  0.1  1  a 5 1.0   0.1 -0.539783 
4  0.1  1  a 5 1.0   0.1 -0.539783 
5  0.1  1  a 5 1.0   0.1 -0.539783 
6  0.1  1  a 5 1.0   0.1 -0.539783 
7  0.1  1  a 5 1.0   0.1 -0.539783 
8  0.1  1  a 5 1.0   0.1 -0.539783 
9  0.1  1  a 5 1.0   0.1 -0.539783 

print (df.dtypes) 
float_col  float64 
int_col   int64 
str_col   object 
r    int64 
v    float64 
new_coltest float64 
eddd   float64 
dtype: object


Timings:

For a small DataFrame the sample and reindex methods are faster; for a large DataFrame the constructor method is faster.

N = 1000 
In [88]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)) 
1000 loops, best of 3: 745 µs per loop 

In [89]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True)) 
The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 470 µs per loop 

In [90]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True)) 
1000 loops, best of 3: 476 µs per loop

N = 10000 
In [92]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)) 
1000 loops, best of 3: 946 µs per loop 

In [93]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True)) 
1000 loops, best of 3: 775 µs per loop 

In [94]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True)) 
1000 loops, best of 3: 827 µs per loop

N = 100000 
In [97]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)) 
The slowest run took 12.98 times longer than the fastest. This could mean that an intermediate result is being cached. 
100 loops, best of 3: 6.93 ms per loop 

In [98]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True)) 
100 loops, best of 3: 7.07 ms per loop 

In [99]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True)) 
100 loops, best of 3: 7.87 ms per loop

Source

2016-12-06 07:29:37 jezrael

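One of the good solutions. It seems to work without problems, and I agree that it is faster. I didn't know you could set the index that way, I'll have to remember this one! – rajan

In a previous version you had a numpy version, whose drawback was that the dtypes were converted to object. How does this solution compare in performance once you cast back to the original dtypes? Maybe numpy is still faster ;) – Quickbeam2k1

@Quickbeam2k1 - I'll try. – jezrael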
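The numpy variant mentioned in the comments is not shown in this thread. As a rough illustration of the dtype issue only, the sketch below repeats the row with np.repeat and then casts back; it is one possible numpy-based approach and not necessarily the one jezrael originally posted. The frame `row` and the count `N` are hypothetical stand-ins for `newsampledata` and the desired number of copies.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame with mixed dtypes, standing in for `newsampledata`.
row = pd.DataFrame({"float_col": [0.1], "int_col": [1], "str_col": ["a"]})
N = 4  # assumed number of copies

# With mixed column types, .values is a 2-D object array of shape (1, 3);
# np.repeat along axis 0 turns it into shape (N, 3).
repeated = np.repeat(row.values, N, axis=0)

# Rebuilding a frame from the object array loses the numeric dtypes.
as_object = pd.DataFrame(repeated, columns=row.columns)

# Cast every column back to the dtype it had in the original row.
restored = as_object.astype(row.dtypes.to_dict())

print(as_object.dtypes)
print(restored.dtypes)
```

The extra astype step is the cost Quickbeam2k1 is asking about: any timing comparison with the constructor approach would need to include it.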

I think you can just sample it with replacement

newsampledata.sample(n, replace=True).reset_index(drop=True)

or reindex

newsampledata.reindex(newsampledata.index.repeat(n)).reset_index(drop=True)
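To try both one-liners outside the original session, here is a minimal self-contained sketch. The frame `row` is a hypothetical stand-in for `newsampledata`, and `n` is an assumed copy count. Note that the reindex variant relies on the source frame having a unique index, which holds for a single-row frame.

```python
import pandas as pd

# Hypothetical stand-in for the single-row `newsampledata` frame.
row = pd.DataFrame({"float_col": [0.1], "int_col": [1], "str_col": ["a"]})
n = 3  # assumed number of copies

# Sampling n rows with replacement from a one-row frame can only return that row.
by_sample = row.sample(n, replace=True).reset_index(drop=True)

# Repeating the single index label n times and reindexing selects the row n times.
by_reindex = row.reindex(row.index.repeat(n)).reset_index(drop=True)

print(by_sample.equals(by_reindex))
print(by_reindex.dtypes.equals(row.dtypes))
```

Both variants keep the original column dtypes, so no cast back is needed afterwards.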

Source

2016-12-06 07:27:48

I think you can use concat without writing an explicit for loop.

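import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [.1]})
repetitions = 4
# Concatenate a list that holds the same frame `repetitions` times, in a single call
res = pd.concat([df] * repetitions)
print(res)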

Output

   a    b
0  1  0.1
0  1  0.1
0  1  0.1
0  1  0.1

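On my sample frame this is indeed about 5 times faster than using a loop. However, I would expect other solutions that do not use concat to be significantly faster.

To show how slow concat is, I benchmarked it against jezrael's solution.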
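The benchmark figures themselves are not reproduced in this thread. A comparison of this kind could be sketched roughly as follows, using the standard timeit module. The function names, the frame `row`, and the value of `n` are all assumptions made for illustration, and the reindex/repeat one-liner from the other answer stands in for the non-concat approach.

```python
import timeit

import pandas as pd

# Hypothetical one-row frame and copy count, chosen only for illustration.
row = pd.DataFrame({"a": [1], "b": [0.1]})
n = 200

def concat_loop():
    # Grow the frame one row at a time; each call copies all rows collected so far.
    out = row
    for _ in range(n - 1):
        out = pd.concat([out, row])
    return out.reset_index(drop=True)

def concat_list():
    # A single concat call over a list of n references to the same frame.
    return pd.concat([row] * n, ignore_index=True)

def reindex_repeat():
    # No concat at all: repeat the index label and reindex.
    return row.reindex(row.index.repeat(n)).reset_index(drop=True)

for fn in (concat_loop, concat_list, reindex_repeat):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Absolute numbers will depend on the machine and the pandas version, so only the relative ordering is worth reading from a run like this.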

Source

2016-12-06 07:47:16 Quickbeam2k1

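At the end of the day, concat is very slow. A one-row DataFrame took 1.5 s with n = 10,000 –

You are right. However, this solution is at least faster than using a for loop directly. – Quickbeam2k1

I ran some benchmarks against jezrael's solution to show how slow concat is – Quickbeam2k1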

There are a bajillion ways to do this

pd.concat([df.query('new_coltest == 0.1 & float_col == 0.1')] * 4)

Source

2016-12-06 07:55:59 piRSquared
