2013-03-26 49 views
1

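Interesting results with pandas argsort

I think I have hit a pandas bug. I would appreciate some help either confirming the bug or finding where the logic error in my code is.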

My code is as follows:

import pandas, numpy, StringIO 

def sq_fixer(sr): 
    sr = sr.where(sr != '20200229') 
    ranks = sr.argsort().astype(float) 
    ranks[ranks == -1] = numpy.nan 

    return ','.join(ranks.astype(numpy.str)) 

def correct_date(sr): 

    date_fixer = lambda x: pandas.datetime(x.year -100, x.month, x.day) if x > pandas.datetime.now() else x 
    sr = pandas.to_datetime(sr).apply(date_fixer).astype(pandas.datetime) 

    return sr 

txt = '''ID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE 
1,2013-01-24,2013-01-02,,2013-02-03 
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06 
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29 
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11 
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12 
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10 
7,2013-01-26,,2013-01-12,2013-01-30 
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02 
9,2013-01-22,2013-01-13,2013-02-03, 
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11 
3347,,2008-02-27,2008-04-10,2008-02-13 
3588,2004-09-12,,2004-11-06,2004-09-06 
3784,2003-02-22,,2003-06-21,2003-02-19 
593,2009-04-03,,2009-06-01,2009-04-01 
4148,2003-03-21,2002-09-20,2003-04-01,2003-01-01 
4299,2004-05-24,2004-07-23,,2004-04-22 
4590,2005-05-05,2005-12-05,2005-04-05, 
4830,2001-06-12,2000-10-12,2001-07-28,2001-01-28 
4941,2006-11-08,2006-12-19,2006-07-19,2007-02-24 
1416,2004-04-03,2004-05-19,2004-02-06, 
1580,2008-12-20,,2009-03-19,2008-12-19 
1661,2005-10-03,2005-10-26,2005-09-12,2006-02-19 
1759,2001-10-18,,2002-01-17,2001-10-17 
1858,2003-04-14,2003-05-17,,2002-12-17 
1972,2003-06-01,2003-07-14,2002-12-14, 
5905,2000-11-18,2001-01-13,,2000-11-04 
2052,2002-06-11,,2002-08-23,2001-12-12 
2165,2006-10-01,,2007-02-27,2006-09-30 
2218,2007-09-19,,2008-02-06,2007-09-09 
2350,2000-08-08,,2000-09-22,2000-01-08 
2432,2001-08-22,,2001-09-25,2000-12-16 
2611,2005-05-07,,2005-06-05,2005-03-26 
2612,2005-05-06,,2005-05-26,2005-04-11 
7378,2009-08-07,2009-01-30,2010-01-20,2009-06-08 
7550,2006-04-08,,2006-06-01,2006-04-01 ''' 

df = pandas.read_csv(StringIO.StringIO(txt)) 

sequence_array = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE'] 
xsequence_array = ['X_RUN_START_DATE', 'X_PUSHUP_START_DATE', 'X_SITUP_START_DATE', 'X_PULLUP_START_DATE'] 

df[sequence_array] = df[sequence_array].apply(correct_date, axis=1) 

fix_day = lambda x: x if x > 0 else 29 
fix_month = lambda x: x if x > 0 else 02 
fix_year = lambda x: x if x > 0 else 2020 

for col in sequence_array: 

    xcol = 'X_{0}'.format(col) 
    df[xcol] = ['{0:04d}{1:02d}{2:02d}'.format(fix_year(c.year), fix_month(c.month), fix_day(c.day)) for c in df[col]] 

df['X_AS_SEQUENCE'] = df[xsequence_array].apply(sq_fixer, axis=1) 

When I run the code, most of the results are correct. For example, index 6:

In [31]: df.ix[6] 
Out[31]: 
ID          7 
RUN_START_DATE   2013-01-26 00:00:00 
PUSHUP_START_DATE      NaN 
SITUP_START_DATE  2013-01-12 00:00:00 
PULLUP_START_DATE  2013-01-30 00:00:00 
X_RUN_START_DATE     20130126 
X_PUSHUP_START_DATE    20200229 
X_SITUP_START_DATE    20130112 
X_PULLUP_START_DATE    20130130 
X_AS_SEQUENCE    1.0,nan,0.0,2.0 

However, some indices seem to throw pandas argsort() for a loop. Take index 10, for example:

In [32]: df.ix[10] 
Out[32]: 
ID         3347 
RUN_START_DATE       NaN 
PUSHUP_START_DATE  2008-02-27 00:00:00 
SITUP_START_DATE  2008-04-10 00:00:00 
PULLUP_START_DATE  2008-02-13 00:00:00 
X_RUN_START_DATE     20200229 
X_PUSHUP_START_DATE    20080227 
X_SITUP_START_DATE    20080410 
X_PULLUP_START_DATE    20080213 
X_AS_SEQUENCE    nan,2.0,0.0,1.0 

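The argsort should return nan,1.0,2.0,0.0 rather than nan,2.0,0.0,1.0.

I have been at this for three days. At this point I am not sure whether it is me or a bug, and I am not sure how to trace back through it to find the answer. Any help is greatly appreciated!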

+0

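What version of pandas are you using? I am on 0.11.0-dev and get the error AttributeError: 'float' object has no attribute 'year' at line 71. – waitingkuo 2013-03-26 07:09:31

+1

@waitingkuo There was a small mistake in the code. I have put the corrected code on pastebin. The long and short of it is that I used pandas.datetime instead of numpy.datetime64. http://pastebin.com/Fwbmsk5F – BigHandsome 2013-03-27 12:29:04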

Answers

4

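You may be misinterpreting the result of argsort. argsort does not give the ranks of the values. If you want to rank the values, use the rank method.

The values in the Series returned by argsort give the corresponding positions of the original values after the NaNs have been dropped. In your case, since you convert 20200229 to NaN, you are calling argsort on NaN, 20080227, 20080410, 20080213. The non-NaN values are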

nonnan = [20080227, 20080410, 20080213] 

The result, NaN, 2, 0, 1, says:

argsort  sorted values 
    NaN  NaN 
    2  nonnan[2] = 20080213 
    0  nonnan[0] = 20080227 
    1  nonnan[1] = 20080410 

So it looks OK to me.
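As a minimal sketch of the difference, the snippet below takes the three non-NaN values from row 10 of the question and compares numpy's argsort with the pandas rank method. The variable names are illustrative only, and the NaN is deliberately kept out of the argsort call, since how Series.argsort reports NaN may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Non-NaN values from row 10 of the question (PUSHUP, SITUP, PULLUP).
nonnan = np.array([20080227, 20080410, 20080213])

# argsort: order[i] is the position in `nonnan` of the i-th smallest value.
order = np.argsort(nonnan)
print(order.tolist())          # [2, 0, 1]
print(nonnan[order].tolist())  # [20080213, 20080227, 20080410]

# rank: the 1-based rank of each value, reported at its original position.
# NaN entries are left as NaN by default.
row = pd.Series([np.nan, 20080227, 20080410, 20080213])
ranks = row.rank()
print(ranks.tolist())          # [nan, 2.0, 3.0, 1.0]
```

Subtracting 1 from the ranks gives nan, 1, 2, 0, which is the sequence the question expected.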

+0

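I was confusing argsort with rank. With a quick fix using a double argsort, my code is ready to go. Thank you very much for pointing me in the right direction! – BigHandsome 2013-03-26 15:10:54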

+0

If you want to rank the values, use the 'rank' method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.rank.html#pandas.Series.rank – 2013-03-26 15:22:36
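The double-argsort fix mentioned in the comments is not shown in the thread, so the following is only a sketch of how it might look. The helper name zero_based_ranks is hypothetical, and the NaN masking is an assumption about how missing dates should be handled. Applying argsort twice turns the sorting permutation into zero-based ranks of the original values.

```python
import numpy as np

def zero_based_ranks(values):
    """Return zero-based ranks of the non-NaN entries; NaN stays NaN.

    Hypothetical helper, not taken from the original thread.
    """
    values = np.asarray(values, dtype=float)
    ranks = np.full(values.shape, np.nan)
    mask = ~np.isnan(values)
    # First argsort: the positions that would sort the non-NaN values.
    order = np.argsort(values[mask], kind="stable")
    # Second argsort: inverts that permutation, giving each value's rank.
    ranks[mask] = np.argsort(order, kind="stable")
    return ranks

# Row 10 of the question: RUN is missing.
row10 = zero_based_ranks([np.nan, 20080227, 20080410, 20080213])
print(row10.tolist())  # [nan, 1.0, 2.0, 0.0]

# Row 6 of the question: PUSHUP is missing.
row6 = zero_based_ranks([20130126, np.nan, 20130112, 20130130])
print(row6.tolist())   # [1.0, nan, 0.0, 2.0]
```

For row 6 the sorting permutation [1, 0, 2] is its own inverse, so a single argsort and the ranks coincide there. That is consistent with that row looking correct in the question while row 10 did not.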

0

If you want to sort a Series, just use sort_values(), or rank() if you need each value's rank:

In [2]: a=pd.Series([3,2,1]) 

In [3]: a 
Out[3]: 
0 3 
1 2 
2 1 
dtype: int64 
In [4]: a.sort_values() 
Out[4]: 
2 1 
1 2 
0 3 
dtype: int64 

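If you use argsort(), you get, for each slot of the sorted Series, the position of the original element that belongs there. In this case 1 (originally at position 2) goes to slot 0, 2 (position 1) goes to slot 1, and 3 (position 0) goes to slot 2: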

In [5]: a.argsort() 
Out[5]: 
0 2 
1 1 
2 0 
dtype: int64
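With [3, 2, 1] the argsort output happens to equal the zero-based ranks, which can hide the difference. Here is a small sketch with an input where the two disagree. The Series values are made up for illustration.

```python
import pandas as pd

a = pd.Series([30, 10, 20])

# argsort: positions of the original elements, listed in sorted order.
order = a.argsort()
print(order.tolist())            # [1, 2, 0]

# Indexing with those positions reproduces the sorted values.
sorted_by_order = a.iloc[order.to_numpy()]
print(sorted_by_order.tolist())  # [10, 20, 30]

# rank: where each original element lands in the sorted order (1-based),
# shifted here to zero-based for comparison with argsort.
zero_based_rank = a.rank(method="first").astype(int) - 1
print(zero_based_rank.tolist())  # [2, 0, 1]
```

Here argsort gives [1, 2, 0] while the zero-based ranks are [2, 0, 1], so the two cannot be used interchangeably.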