Concat changes the category dtype to object/float64. If I read just one chunk of the CSV, I get the following data structure:

<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 100000 entries, (2015-11-01 00:00:00, 4980770) to (2016-06-01 00:00:00, 8850573) 
Data columns (total 5 columns): 
CHANNEL   100000 non-null category 
MCC    92660 non-null category 
DOMESTIC_FLAG 100000 non-null category 
AMOUNT   100000 non-null float32 
CNT    100000 non-null uint8 
dtypes: category(3), float32(1), uint8(1) 
memory usage: 1.9+ MB 

If I read the whole CSV and concat the chunks as above, I get the following structure:

<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 30345312 entries, (2015-11-01 00:00:00, 4980770) to (2015-08-01 00:00:00, 88838) 
Data columns (total 5 columns): 
CHANNEL   object 
MCC    float64 
DOMESTIC_FLAG category 
AMOUNT   float32 
CNT    uint8 
dtypes: category(1), float32(1), float64(1), object(1), uint8(1) 
memory usage: 784.6+ MB 

Why are the categorical variables changed to object/float64? How can I avoid this type change, esp. to float64?

This is the concatenation code:

df = pd.concat([process(chunk) for chunk in reader]) 

The process function just does some cleaning and type assignment.
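
For context, the per-chunk frame above could come from a chunked read roughly like the sketch below; the file name, the index columns and the body of process are assumptions, since the question only posts the concat line and the resulting dtypes.

import pandas as pd

# hypothetical reader; the real CSV path and index columns are not given in the question
reader = pd.read_csv('transactions.csv', chunksize=100000,
                     parse_dates=[0], index_col=[0, 1])

def process(chunk):
    # "some cleaning and type assignment", per the question
    for col in ['CHANNEL', 'MCC', 'DOMESTIC_FLAG']:
        chunk[col] = chunk[col].astype('category')
    chunk['AMOUNT'] = chunk['AMOUNT'].astype('float32')
    chunk['CNT'] = chunk['CNT'].astype('uint8')
    return chunk

df = pd.concat([process(chunk) for chunk in reader])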

Could you post the code you are using to load and concatenate it? –

Categoricals also have issues with 'NaN' sometimes –

Added it to the text now – snovik

Answer

Consider the following sample DataFrames:

In [93]: df1 
Out[93]: 
    A B 
0 a a 
1 b b 
2 c c 
3 a a 

In [94]: df2 
Out[94]: 
    A B 
0 b b 
1 c c 
2 d d 
3 e e 

In [95]: df1.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 4 entries, 0 to 3 
Data columns (total 2 columns): 
A 4 non-null object 
B 4 non-null category 
dtypes: category(1), object(1) 
memory usage: 140.0+ bytes 

In [96]: df2.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 4 entries, 0 to 3 
Data columns (total 2 columns): 
A 4 non-null object 
B 4 non-null category 
dtypes: category(1), object(1) 
memory usage: 148.0+ bytes 

NOTE: the two DFs have different categories:

In [97]: df1.B.cat.categories 
Out[97]: Index(['a', 'b', 'c'], dtype='object') 

In [98]: df2.B.cat.categories 
Out[98]: Index(['b', 'c', 'd', 'e'], dtype='object') 

When we concatenate them, pandas will not merge the categories - it creates an object column instead:

In [99]: m = pd.concat([df1, df2]) 

In [100]: m.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 8 entries, 0 to 3 
Data columns (total 2 columns): 
A 8 non-null object 
B 8 non-null object 
dtypes: object(2) 
memory usage: 192.0+ bytes 
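
A minimal reconstruction of the two frames (the answer only shows their printed output, so the construction code below is inferred) reproduces the fallback to object:

import pandas as pd

df1 = pd.DataFrame({'A': list('abca'), 'B': list('abca')})
df1['B'] = df1['B'].astype('category')    # categories: ['a', 'b', 'c']

df2 = pd.DataFrame({'A': list('bcde'), 'B': list('bcde')})
df2['B'] = df2['B'].astype('category')    # categories: ['b', 'c', 'd', 'e']

m = pd.concat([df1, df2])
print(m.dtypes)    # B comes back as object because the category sets differ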

However, if we concatenate two DFs that use the same categories - everything works fine:

In [102]: m = pd.concat([df1.sample(frac=.5), df1.sample(frac=.5)]) 

In [103]: m 
Out[103]: 
    A B 
3 a a 
0 a a 
3 a a 
2 c c 

In [104]: m.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 4 entries, 3 to 2 
Data columns (total 2 columns): 
A 4 non-null object 
B 4 non-null category 
dtypes: category(1), object(1) 
memory usage: 92.0+ bytes 
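
If the category sets differ, one way to keep the category dtype is to align the categories of both frames before concatenating - a sketch, continuing from the df1/df2 above:

# make both B columns share one category set, then concat
all_cats = df1['B'].cat.categories.union(df2['B'].cat.categories)
df1['B'] = df1['B'].cat.set_categories(all_cats)
df2['B'] = df2['B'].cat.set_categories(all_cats)

m = pd.concat([df1, df2])
m.info()    # B stays category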

I see. So the only way is to reassign all the column types after the concat... – snovik

@snovik, AFAIK currently this is the way to go – MaxU
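
As discussed in the comments, the other (current) option is to simply reassign the dtypes once after the full concat; a rough sketch using the column names from the question:

# re-apply the dtypes that were lost during the concat
df = pd.concat([process(chunk) for chunk in reader])
df['CHANNEL'] = df['CHANNEL'].astype('category')
df['MCC'] = df['MCC'].astype('category')    # was promoted to float64 by the concat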