2017-04-11

I have a DataFrame of shape (14407, 2564). I am trying to remove low-variance features with VarianceThreshold, but when I call fit_transform I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

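import numpy as np

# Treat 'null' strings and blank cells as missing, then fill with column medians.
df.replace('null', np.nan, inplace=True)
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
df.fillna(value=df.median(), inplace=True)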

After that, I check my DataFrame for any null or infinite values with:

m = df.isnull().any()
print("========= COLUMNS WITH NULL VALUES =================")
print(m[m])
print("========= COLUMNS WITH INFINITE VALUES =================")
# np.isinf flags infinite entries (np.isfinite would flag the finite ones).
m = np.isinf(df.select_dtypes(include=['float64'])).any()
print(m[m])

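I run the replacement code shown earlier before using VarianceThreshold. The check prints an empty Series in both cases, which means that none of my columns has any missing values. The output is: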

========= COLUMNS WITH NULL VALUES =================
Series([], dtype: bool)
========= COLUMNS WITH INFINITE VALUES =================
Series([], dtype: bool)

Full error traceback:

Traceback (most recent call last):
  File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 222, in <module>
    main()
  File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 218, in main
    getAllData()
  File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 95, in getAllData
    predictors, labels, dropped_features = fselector.process(variance=True, corr=True, bestf=True, bestfk=200)
  File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 54, in process
    self.getVariance(threshold=(.95 * (1 - .95)))
  File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 136, in getVariance
    self.removeLowVarianceColumns(df=self.X, thresh=threshold)
  File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 213, in removeLowVarianceColumns
    selector.fit_transform(df)
  File "/usr/lib64/python2.7/site-packages/sklearn/base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/usr/lib64/python2.7/site-packages/sklearn/feature_selection/variance_threshold.py", line 64, in fit
    X = check_array(X, ('csr', 'csc'), dtype=np.float64)
  File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

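So I don't know what else to check. This does not look like a missing-value problem, but I also have not been able to find out which columns or values cause the error.

I have seen several threads about this error here, and in all of them the cause turned out to be a missing value, but that does not seem to be the problem in my case.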

You should always post the full stack trace of the error.

@VivekKumar I have added it to the post. – Sarah

First convert it to a NumPy array: 'X = np.asanyarray(df)'. Then check whether each of the following two statements returns True or False: 1) 'np.isfinite(X.sum())' 2) 'np.isfinite(X).all()'

Answers

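I solved this by converting my data to numeric. It turns out that, although the error message mentions 'float64', all of my columns were of object dtype, and objects are not compatible with fit_transform.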
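
To illustrate how both checks in the question can print an empty Series while the cast to float64 still produces a non-finite value, here is a minimal sketch. The toy DataFrame and the 'nan' string in it are assumptions made for illustration; they are not the asker's actual data, and this is only one possible way such an error can arise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every column is object dtype, and one cell holds
# the string 'nan' (an assumption for illustration only).
df = pd.DataFrame({"a": ["1.0", "2.5", "nan"], "b": ["3", "4", "5"]}, dtype=object)

# Strings are never null, so the null check reports nothing.
null_cols = df.isnull().any()

# There are no float64 columns, so the infinity check has nothing to inspect.
float_cols = df.select_dtypes(include=["float64"])

# Cast to float64, similar to the check_array call visible in the traceback.
X = np.asarray(df, dtype=np.float64)
all_finite = bool(np.isfinite(X).all())

print(df.dtypes)             # both columns are object
print(null_cols[null_cols])  # empty Series
print(float_cols.shape)      # zero columns selected
print(all_finite)
```

Printing df.dtypes first is a cheap way to spot this situation: if the columns show object rather than float64 or int64, a check restricted to float64 columns cannot see anything.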

Changing my data to floats with df = df.apply(lambda x: pd.to_numeric(x, errors='ignore')) solved the problem.
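
A possible variant of this conversion is sketched below. It uses errors='coerce' rather than the errors='ignore' from the answer, which is a deliberate substitution: with 'coerce', cells that cannot be parsed become NaN, so the median fill from the question can then handle them. The toy DataFrame and the name numeric are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, all object dtype; 'abc' cannot be parsed as a number.
df = pd.DataFrame({"a": ["1.0", "2.5", "abc"], "b": ["3", "4", "5"]}, dtype=object)

# errors='coerce' turns unparseable cells into NaN instead of leaving
# the whole column as object.
numeric = df.apply(lambda col: pd.to_numeric(col, errors="coerce"))

# The median fill now operates on real numeric columns.
numeric = numeric.fillna(numeric.median())

# Verify that the float64 array contains only finite values.
X = numeric.to_numpy(dtype=np.float64)
print(numeric.dtypes)
print(np.isfinite(X).all())
```

Once the array passes the np.isfinite check, it should be in a form that VarianceThreshold's fit_transform can accept. Whether silently coercing bad cells to NaN is appropriate depends on the data, so it may be worth counting the coerced cells before filling them.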
