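Pandas apply on DataFrame groups

If I group by (the g object below) and then apply the function below to the first 1000 rows of df, it works. However, if I apply it to the whole DataFrame, I get this exception: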
def calc_load(x):
    x.sort('log_timestamp')
    x['time_stddev'] = x['time'].std()
    x['time_mean'] = x['time'].mean()
    return x

c = g.apply(calc_load)
---------------------------------------------------------------------------
........
ValueError Traceback (most recent call last)
<ipython-input-262-f2fe1f013907> in <module>()
----> 1 c=g.apply(calc_load)
2215 tuple(map(int, [tot_items] + list(block_shape))),
-> 2216 tuple(map(int, [len(ax) for ax in axes]))))
2217
2218
ValueError: Shape of passed values is (10, 3943482), indices imply (10, 410450)
What is the cause here, and how can I fix it?
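For reference, the per-group mean and standard deviation that calc_load computes can also be attached without apply. This is only a minimal sketch on made-up data, assuming a recent pandas where sort_values replaces the older sort; the sample rows are hypothetical and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "host": ["h1", "h1", "h2", "h2"],
    "operation": ["get", "get", "put", "put"],
    "log_timestamp": [2, 1, 4, 3],
    "time": [1.0, 3.0, 2.0, 6.0],
})

# sort_values returns a new frame, so the result is assigned to a name.
result = df.sort_values(["host", "operation", "log_timestamp"])

# transform broadcasts each group's statistic back onto that group's rows.
grouped_time = df.groupby(["host", "operation"])["time"]
result["time_mean"] = grouped_time.transform("mean")
result["time_stddev"] = grouped_time.transform("std")

print(result)
```

The assignment aligns on index labels, so this sketch relies on the index being unique. Groups with a single row get NaN for the standard deviation.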
UPDATE:
I read this table from an HDF5 store like this:
prob2
Out[374]:
<class 'pandas.io.pytables.HDFStore'>
File path: /tmp/test2.h5
/mytable frame_table (typ->appendable,nrows->410450,ncols->8,indexers->[index])
a=prob2.mytable
a
Out[376]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 9999
Data columns (total 8 columns):
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(2), object(5)
If I do a round trip through CSV as below, the exception does not occur:
a.to_csv('/tmp/test2.csv')
b=pd.read_csv('/tmp/test2.csv')
b
Out[379]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 9 columns):
Unnamed: 0 410450 non-null values
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(3), object(5)
bg = b.groupby(['host','operation'])
bg.apply(calc_load)
Out[381]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 11 columns):
Unnamed: 0 410450 non-null values
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
time_stddev 410371 non-null values
time_mean 410450 non-null values
dtypes: float64(3), int64(3), object(5)
The DataFrames before (a) and after (b) the round trip look similar, but they are not identical!
a
Out[386]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 9999
Data columns (total 8 columns):
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(2), object(5)
b
Out[387]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 9 columns):
Unnamed: 0 410450 non-null values
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(3), object(5)
Uh, what is going on here?
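One difference visible in the output above is the index: a has 410450 entries labelled 0 to 9999, so labels must repeat, while b runs from 0 to 410449. Whether that is what triggers the ValueError is an assumption, not something the output confirms. A minimal sketch for checking and removing duplicate labels, using a small hypothetical stand-in for a:

```python
import pandas as pd

# Hypothetical stand-in for `a`: two chunks appended with their own 0..n-1 labels,
# which leaves repeated index labels in the combined frame.
chunk = pd.DataFrame({"host": ["h1", "h2"], "time": [1.0, 2.0]})
a = pd.concat([chunk, chunk])

# Check whether every index label is unique.
has_unique_index = a.index.is_unique
print(has_unique_index)

# Replace the index with a fresh 0..n-1 range, discarding the old labels.
a_clean = a.reset_index(drop=True)
print(a_clean.index.is_unique)
```

If the real frame shows the same thing, grouping a.reset_index(drop=True) instead of a would be a cheap experiment to run before going through CSV.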
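You need to provide a working example, and perhaps share your frame via Dropbox (or create an example that reproduces the error) – Jeff
@Jeff, it is in the UPDATE. And thanks a million for the help! – LetMeSOThat4U
Can you do `df.head()` so the values can be seen? It looks like you have a string-like column (marked) as object dtype. Object dtypes should only be used for string-like data. You may need to do some conversions (even before putting the data into HDF5). Where does the data come from in the first place? – Jeff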
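The inspection and conversion suggested in the last comment could look roughly like this. It is a sketch only: the sample values are invented, and which of the question's object columns actually hold numbers or timestamps is an assumption.

```python
import pandas as pd

# Hypothetical frame: 'slot' and 'log_timestamp' are stored as strings here,
# which is an assumption about the data and not taken from the question.
df = pd.DataFrame({
    "host": ["h1", "h2"],
    "slot": ["1", "2"],
    "log_timestamp": ["2013-05-01 10:00:00", "2013-05-01 10:00:05"],
})

# Look at the first rows and at the dtype of every column.
print(df.head())
print(df.dtypes)

# Convert string columns to proper numeric and datetime dtypes.
# errors="coerce" turns unparseable values into NaN instead of raising.
df["slot"] = pd.to_numeric(df["slot"], errors="coerce")
df["log_timestamp"] = pd.to_datetime(df["log_timestamp"])

print(df.dtypes)
```

Doing the conversion before the frame is written to the HDF5 store means the stored table already carries the intended dtypes.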