Splitting a large pandas DataFrame

I have a large DataFrame with 423244 rows. I want to split it into 4. I tried the following code, which gave an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this DataFrame into 4 groups?


2013-06-26 Nilani Algiriyage

Use np.array_split:

Docstring: 
Split an array into multiple sub-arrays. 

Please refer to the ``split`` documentation. The only difference 
between these functions is that ``array_split`` allows 
`indices_or_sections` to be an integer that does *not* equally 
divide the axis.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8),
   ...:                    'D': np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
 0  foo  one -0.174067 -0.608579
 1  bar  one -0.860386 -1.210518
 2  foo  two  0.614102  1.689837,
      A      B         C         D
 3  bar  three -0.284792 -1.071160
 4  foo    two  0.843610  0.803712
 5  bar    two -1.514722  0.870861,
      A      B         C         D
 6  foo    one  0.131529 -0.968151
 7  foo  three -1.002946 -0.257468]
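
The difference described in the docstring can be seen on a plain NumPy array. This is only an illustrative sketch; the array `a` and the variable names are made up for the example and are not part of the original answer.

```python
import numpy as np

a = np.arange(10)  # 10 elements cannot be divided evenly into 4 parts

try:
    np.split(a, 4)
    split_failed = False
except ValueError:
    # np.split insists on equal-sized parts and raises otherwise
    split_failed = True

# np.array_split accepts the same arguments but allows uneven parts
parts = np.array_split(a, 4)
sizes = [len(p) for p in parts]
print(split_failed, sizes)  # True [3, 3, 2, 2]
```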


2013-06-26 09:07:14 root

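Thank you very much! In addition, I want to apply a function to each group. How do I access the groups one by one?

@NilaniAlgiriyage - `array_split` returns a list of DataFrames, so you can just loop over the list... – root

I am splitting the DataFrame because it is too large. I want to take the first group and apply the function, then take the second group and apply the function, and so on. So how do I access each group?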
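
A minimal sketch of what looping over the groups might look like. The `process` function and the `value` column are hypothetical stand-ins for whatever needs to be applied. As an assumption made here (not something from the thread), the sketch splits the row positions with np.array_split and slices the DataFrame with `iloc`, which gives the same groups without passing the DataFrame itself to NumPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(10)})

def process(chunk):
    # Hypothetical stand-in for the function to apply to each group.
    return int(chunk["value"].sum())

n_groups = 4
# Split the row positions, not the DataFrame, then slice by position.
position_groups = np.array_split(np.arange(len(df)), n_groups)

results = []
for positions in position_groups:
    chunk = df.iloc[positions]      # one group at a time
    results.append(process(chunk))  # apply the function to that group

print(results)
```

Each `chunk` is an ordinary DataFrame, so anything that works on `df` should work on it as well.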

Note:

np.array_split does not work with numpy-1.9.0. I checked: it works with 1.8.1.

Error:

Dataframe has no 'size' attribute


2014-11-17 16:15:55 yemu

I filed a bug on the pandas GitHub: https://github.com/pydata/pandas/issues/8846. It seems to be fixed already in pandas 0.15.2 – yemu

pandas 0.15.2 works. – pigletfly
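
If depending on a particular NumPy/pandas combination is a concern, one possible workaround is to compute the piece sizes by hand and slice with `iloc`. The helper below, `split_into_n`, is a hypothetical name introduced for this sketch; it aims to reproduce the sizes np.array_split would give (the first pieces get the extra rows).

```python
import pandas as pd

def split_into_n(df, n):
    """Split df into n consecutive pieces whose lengths differ by at most one row."""
    base, extra = divmod(len(df), n)
    pieces = []
    start = 0
    for i in range(n):
        # The first `extra` pieces get one additional row.
        stop = start + base + (1 if i < extra else 0)
        pieces.append(df.iloc[start:stop])
        start = stop
    return pieces

df = pd.DataFrame({"TEST": range(1, 12)})  # 11 rows
pieces = split_into_n(df, 3)
print([len(p) for p in pieces])  # [4, 4, 3]
```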

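I wanted to do the same thing. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a small function that works very well. I hope this helps!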

# input - df: a DataFrame, chunkSize: the chunk size
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller DataFrames of at most chunkSize rows (the last one may be smaller)
def splitDataFrameIntoSmaller(df, chunkSize = 10000):
    listOfDf = list()
    # ceiling division: avoids an empty trailing chunk when len(df) is a multiple of chunkSize
    numberChunks = -(-len(df) // chunkSize)
    for i in range(numberChunks):
        # iloc slices by row position regardless of the index type
        listOfDf.append(df.iloc[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
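
Since the motivation in this thread is that the DataFrame is too large, a generator variant of the same idea may be worth considering: it hands out one chunk at a time instead of building the whole list first. This is only a sketch; `iter_chunks` is a hypothetical name, not part of the answer above.

```python
import pandas as pd

def iter_chunks(df, chunk_size=10000):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

df = pd.DataFrame({"TEST": range(1, 12)})  # 11 rows
chunk_lengths = [len(chunk) for chunk in iter_chunks(df, chunk_size=3)]
print(chunk_lengths)  # [3, 3, 3, 2]
```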


2015-03-05 15:49:46 elixir

... than using np.array_split() – jgaw

Note that np.array_split(df, 3) splits the DataFrame into 3 sub-DataFrames, whereas splitDataFrameIntoSmaller(df, chunkSize = 3) splits the DataFrame every chunkSize rows.

Example:

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST']) 
df_split = np.array_split(df, 3)

You get 3 sub-DataFrames:

df_split[0] # 1, 2, 3, 4 
df_split[1] # 5, 6, 7, 8 
df_split[2] # 9, 10, 11

With:

df_split2 = splitDataFrameIntoSmaller(df, chunkSize = 3)

You get 4 sub-DataFrames:

df_split2[0] # 1, 2, 3 
df_split2[1] # 4, 5, 6 
df_split2[2] # 7, 8, 9 
df_split2[3] # 10, 11

I hope I'm right, and I hope this is useful.


2017-07-12 10:06:30 Gilberto

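Much faster. Is there a simple way to make this process random? I can only think of adding a random column, splitting, and then dropping the random column, but there may be a simpler way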
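
One possible way to randomize the split without a helper column is to shuffle the rows first and then split as before. This is a sketch rather than a confirmed answer from the thread; `DataFrame.sample(frac=1)` returns all rows in random order, and the `random_state` value here is an arbitrary choice made only so the example is repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"TEST": range(1, 12)})  # 11 rows

# Shuffle all rows; random_state is an arbitrary seed for repeatability.
shuffled = df.sample(frac=1, random_state=0)

n_groups = 3
# Split the row positions of the shuffled frame, then slice by position.
position_groups = np.array_split(np.arange(len(shuffled)), n_groups)
random_groups = [shuffled.iloc[positions] for positions in position_groups]

print([len(g) for g in random_groups])  # [4, 4, 3]
```

Every original row still lands in exactly one group; only the assignment of rows to groups is random.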

Do they have to be the same chunk size? – InquilineKea

You can use groupby, assuming you have an integer enumerated index:

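import math
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)

# group rows by their position divided by the subframe size; keep only the DataFrames
subframes = [i[1] for i in df.groupby(np.arange(len(df)) // rows_per_subframe)]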

Note: groupby yields tuples whose second element is the DataFrame, hence the slightly complicated extraction.

>>> len(subframes), [len(i) for i in subframes] 
(4, [25, 25, 25, 24])


2017-09-21 20:40:45 rumpel
