2016-06-14 37 views

SIGSEGV error in pandas Series.rank(ascending=False)

1. I want to read an HDF5 file and rank one of its columns.

import pandas as pd

def test_df_ranks(f):
    # Load the DataFrame stored under key "t" in the HDF5 file.
    df = pd.read_hdf(f, key="t")
    print(df.shape)
    print(type(df))
    print(df)
    s = df.non_current_asset_to_total_asset
    # s.rank()  # rank() works properly
    s.rank(ascending=False)  # rank(ascending=False) crashes

Then I get a SIGSEGV error. Here is the version list:

numpy==1.11.0 
pandas==0.17.1 
pymongo==3.2.2 
python-dateutil==2.5.3 
pytz==2016.4 
ricequant-utility==0.1.0 
six==1.10.0 
tables==3.2.2 
os: 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux 
gcc: 4.8.5 

I tried using gdb, but it shows different stacks on different runs:

#1: 
.... 
#7 OBJECT_compare (ip1=0x47a3ef4b2e420, ip2=0x7f5c5413f128, __NPY_UNUSED_TAGGEDap=0x7f5cd0100760) at numpy/core/src/multiarray/arraytypes.c.src:2753 
#8 0x00007f5d0142c50e in npy_aquicksort ([email protected]=0x7f5c5413f060, [email protected]=0x7f5c5413cc80, [email protected]=52, [email protected]=0x7f5cd0100760) at numpy/core/src/npysort/quicksort.c.src:480 
#9 0x00007f5d0139a78a in _new_argsortlike ([email protected]=0x7f5cd0100760, axis=0, [email protected]=0x7f5d0142c310 <npy_aquicksort>, [email protected]=0x0, [email protected]=0x0, [email protected]=0) 
at numpy/core/src/multiarray/item_selection.c:1035 
#10 0x00007f5d0139dd7b in PyArray_ArgSort ([email protected]=0x7f5cd0100760, axis=0, which=<optimized out>) at numpy/core/src/multiarray/item_selection.c:1309 
#11 0x00007f5d013dd012 in array_argsort (self=0x7f5cd0100760, args=<optimized out>, kwds=<optimized out>) at numpy/core/src/multiarray/methods.c:1278 
#12 0x00007f5cf4eef28f in __Pyx_PyObject_Call (func=0x7f5cd1a1acc8, arg=0x7f5d0f900048, kw=0x0) at pandas/algos.c:201388 
#13 0x00007f5cf504e006 in __pyx_pf_6pandas_5algos_8rank_1d_generic ([email protected]=0x7f5cd0100620, __pyx_v_retry=1, __pyx_v_ties_method=0x7f5cf6999768, __pyx_v_ascending=0x7f5d0f6bd700 <_Py_FalseStruct>, 
__pyx_v_na_option=<optimized out>, __pyx_v_pct=0x7f5d0f6bd700 <_Py_FalseStruct>, __pyx_self=<optimized out>) at pandas/algos.c:14942 
#14 0x00007f5cf5050481 in __pyx_pw_6pandas_5algos_9rank_1d_generic (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=0x7f5cd8659488) at pandas/algos.c:14439 
#15 0x00007f5d0f3b9477 in PyEval_EvalFrameEx() from /lib64/libpython3.4m.so.1.0 
#16 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx() from /lib64/libpython3.4m.so.1.0 
#17 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx() from /lib64/libpython3.4m.so.1.0 
#18 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx() from /lib64/libpython3.4m.so.1.0 
#19 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx() from /lib64/libpython3.4m.so.1.0 
#20 0x00007f5d0f3b8e40 in PyEval_EvalFrameEx() from /lib64/libpython3.4m.so.1.0 
#21 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx() from /lib64/libpython3.4m.so.1.0 
#22 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx() from /lib64/libpython3.4m.so.1.0 
#23 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx() from /lib64/libpython3.4m.so.1.0 
#24 0x00007f5d0f32a4b3 in function_call() from /lib64/libpython3.4m.so.1.0 
#25 0x00007f5d0f301dcc in PyObject_Call() from /lib64/libpython3.4m.so.1.0 
#26 0x00007f5d0f3b57c9 in PyEval_EvalFrameEx() from /lib64/libpython3.4m.so.1.0 

...

And now the stack is:

#0 0x00007ffff6c985f7 in raise() from /lib64/libc.so.6 
#1 0x00007ffff6c99ce8 in abort() from /lib64/libc.so.6 
#2 0x00007ffff6cd8317 in __libc_message() from /lib64/libc.so.6 
#3 0x00007ffff6ce0023 in _int_free() from /lib64/libc.so.6 
#4 0x00007fffd15785a9 in H5FL_reg_gc_list() from /lib64/libhdf5.so.8 
#5 0x00007fffd1578626 in H5FL_reg_gc() from /lib64/libhdf5.so.8 
#6 0x00007fffd157b0be in H5FL_garbage_coll() from /lib64/libhdf5.so.8 
#7 0x00007fffd157b34e in H5FL_term_interface() from /lib64/libhdf5.so.8 
#8 0x00007fffd14ae466 in H5_term_library() from /lib64/libhdf5.so.8 
#9 0x00007ffff6c9be69 in __run_exit_handlers() from /lib64/libc.so.6 
#10 0x00007ffff6c9beb5 in exit() from /lib64/libc.so.6 
#11 0x00007ffff6c84b1c in __libc_start_main() from /lib64/libc.so.6 
#12 0x0000000000400b89 in _start() 
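Alongside gdb, Python's standard-library faulthandler module can print the Python-level traceback when the interpreter receives SIGSEGV, which helps tie the native frames back to a Python call. A minimal sketch follows; the helper name enable_crash_traceback is made up for illustration, and it has not been run against the versions listed in this thread:

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Ask CPython to dump the Python traceback of every thread on a fatal signal.

    `stream` must be a real file with a file descriptor and must stay open
    for as long as the handler is enabled.
    """
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Typical use (left commented out so that importing this snippet has no side effects):
# enable_crash_traceback()
# test_df_ranks("test.h5")  # function from the question; the path is a placeholder
```

The same effect is available without code changes by running the script with python -X faulthandler.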

2. I saved my data to CSV and loaded it back with pd.read_csv(). With that data, both series.rank(ascending=True) and series.rank(ascending=False) return a pd.Series.
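One possible reason the CSV path behaves differently: the gdb frames above pass through rank_1d_generic and OBJECT_compare, which suggests (an assumption, not confirmed in this thread) that the HDF5 column has object dtype. A CSV round trip of an all-NaN column yields float64 instead. The sketch below uses a made-up stand-in column named ratio rather than the real file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real column: object dtype, every value NaN.
s = pd.Series([np.nan] * 5, index=list("abcde"), dtype=object, name="ratio")

# Write to an in-memory CSV and read it back, as in step 2.
buf = io.StringIO()
s.to_frame().to_csv(buf)
buf.seek(0)
roundtrip = pd.read_csv(buf, index_col=0)["ratio"]

# The original is object dtype; the round-tripped column is inferred as float64.
print(s.dtype, roundtrip.dtype)
```

If the dtypes differ like this on the real data, the two rank() calls would be going through different code paths, which would explain why only the HDF5 version crashes.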

3. Could the problem be in tables (PyTables)? Or in HDF5? My HDF5 data: https://github.com/HaoXJ/codefail/blob/master/data/test.h5

4. I need your help.

Answer


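First of all, your non_current_asset_to_total_asset column does not contain any non-NaN values, but this still looks like a NumPy or pandas bug. You may want to check whether it has already been reported as a bug, or open a new issue.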

In [1]: fn = r'D:\download\test.h5' 

In [2]: df = pd.read_hdf(fn, key='t') 

List the rows where non_current_asset_to_total_asset is not NaN:

In [3]: df[pd.notnull(df.non_current_asset_to_total_asset)] 
Out[3]: 
Empty DataFrame 
Columns: [pb_ratio, pe_ratio_1, inc_operating_revenue, inc_total_asset, non_current_asset_to_total_asset] 
Index: [] 

Note: there are no rows where non_current_asset_to_total_asset is not NaN.
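The same check can be written as a small helper that reports the column's dtype together with the number of usable values. This is only a sketch: describe_column is a hypothetical name, and the sample Series stands in for the real column.

```python
import numpy as np
import pandas as pd


def describe_column(s):
    """Return (dtype, number of non-NaN values) for a Series."""
    return s.dtype, int(s.notnull().sum())


# Stand-in for df.non_current_asset_to_total_asset: every value is NaN.
s = pd.Series([np.nan, np.nan, np.nan], dtype=object)
dtype, n_valid = describe_column(s)
print(dtype, n_valid)
```

A count of zero together with an object dtype would match what the empty DataFrame above shows.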

In [4]: df.head() 
Out[4]: 
             pb_ratio  pe_ratio_1  inc_operating_revenue  inc_total_asset  non_current_asset_to_total_asset
000022.XSHE    7.4091     14.9739                30.5996          13.1342                               NaN
000089.XSHE    1.7244     14.3574                 7.5837           2.8343                               NaN
000099.XSHE    1.7782     23.6805                 8.7495          -0.0933                               NaN
000429.XSHE    1.7264     17.5882                15.1496          -0.9485                               NaN
000507.XSHE    1.1563     46.9562                26.9032           4.4909                               NaN

rank(ascending=True) works:

In [10]: df.non_current_asset_to_total_asset.rank(ascending=True).head() 
Out[10]: 
000022.XSHE NaN 
000089.XSHE NaN 
000099.XSHE NaN 
000429.XSHE NaN 
000507.XSHE NaN 
Name: non_current_asset_to_total_asset, dtype: float64 

rank(ascending=False) crashes my IPython:

In [5]: df.non_current_asset_to_total_asset.rank(ascending=False) 

Crash info:

<EXE NAME="multiarray.cp35-win_amd64.pyd" FILTER="CMI_FILTER_THISFILEONLY"> 
    <MATCHING_FILE NAME="multiarray.cp35-win_amd64.pyd" SIZE="1510912" CHECKSUM="0xD8B922AB" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="05/02/2016 21:19:46" UPTO_LINK_DATE="05/02/2016 21:19:46" EXPORT_NAME="multiarray.cp35-win_amd64.pyd" EXE_WRAPPER="0x0" /> 
</EXE> 
<EXE NAME="kernel32.dll" FILTER="CMI_FILTER_THISFILEONLY"> 
    <MATCHING_FILE NAME="kernel32.dll" SIZE="1163264" CHECKSUM="0xADFC88B8" BIN_FILE_VERSION="6.1.7601.23418" BIN_PRODUCT_VERSION="6.1.7601.23418" PRODUCT_VERSION="6.1.7601.18015" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="6.1.7601.18015 (win7sp1_gdr.121129-1432)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERDATEHI="0x0" VERDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0x122E58" LINKER_VERSION="0x60001" UPTO_BIN_FILE_VERSION="6.1.7601.23418" UPTO_BIN_PRODUCT_VERSION="6.1.7601.23418" LINK_DATE="04/09/2016 07:00:43" UPTO_LINK_DATE="04/09/2016 07:00:43" EXPORT_NAME="KERNEL32.dll" VER_LANGUAGE="English (United States) [0x409]" EXE_WRAPPER="0x0" /> 
</EXE> 
</DATABASE> 
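Until the underlying bug is sorted out, one way to sidestep the crash might be to convert the column to a numeric dtype before ranking and to skip rank() entirely when nothing is left to rank. This is a workaround sketch under the assumption that the column is object dtype; rank_descending is a hypothetical helper, not a pandas API, and it has not been tested against the versions in this thread:

```python
import numpy as np
import pandas as pd


def rank_descending(s):
    """Rank a Series in descending order after coercing it to a numeric dtype."""
    # Values that cannot be parsed as numbers become NaN.
    numeric = pd.to_numeric(s, errors="coerce")
    if not numeric.notnull().any():
        # Nothing to rank: return all-NaN ranks without calling rank() at all.
        return pd.Series(np.nan, index=s.index, name=s.name, dtype="float64")
    return numeric.rank(ascending=False)


# Usage with made-up data: the largest value gets rank 1, NaN stays NaN.
print(rank_descending(pd.Series([3.0, 1.0, np.nan, 2.0], dtype=object)))
```

This does not fix the bug itself; it only keeps an all-NaN object column away from the code path seen in the stack trace.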

My versions:

In [5]: pd.show_versions() 

INSTALLED VERSIONS 
------------------ 
commit: None 
python: 3.5.1.final.0 
python-bits: 64 
OS: Windows 
OS-release: 7 
machine: AMD64 
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel 
byteorder: little 
LC_ALL: None 
LANG: en_US 

pandas: 0.18.1 
nose: 1.3.7 
pip: 8.1.2 
setuptools: 21.2.1 
Cython: 0.23.4 
numpy: 1.10.4 
scipy: 0.17.0 
statsmodels: None 
xarray: None 
IPython: 4.2.0 
sphinx: 1.4 
patsy: None 
dateutil: 2.5.3 
pytz: 2016.4 
blosc: None 
bottleneck: None 
tables: 3.2.2 
numexpr: 2.5.2 
matplotlib: 1.5.1 
openpyxl: 2.3.5 
xlrd: 0.9.4 
xlwt: None 
xlsxwriter: 0.8.7 
lxml: 3.6.0 
bs4: 4.4.1 
html5lib: 0.9999999 
httplib2: None 
apiclient: None 
sqlalchemy: 1.0.13 
pymysql: None 
psycopg2: None 
jinja2: 2.8 
boto: None 
pandas_datareader: 0.2.1 

I have reported the bug to pandas. Thanks.