2017-05-09

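Python - feature hashing a list of strings within a list of strings

I want to be able to take a list of dictionaries (records) in which some columns have a list of values as the cell value. Here is an example: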

[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}] 

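How can I take this input and perform feature hashing on it? (My dataset has thousands of columns.) I am currently using one-hot encoding, but it seems to consume a lot of memory, more than my system has.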

I tried passing my dataset in the form shown above and got an error:

x__ = h.transform(data) 

Traceback (most recent call last): 

    File "<ipython-input-14-db4adc5ec623>", line 1, in <module> 
    x__ = h.transform(data) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform 
    _hashing.transform(raw_X, self.n_features, self.dtype) 

    File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103) 

TypeError: a float is required 

I also tried converting it into a DataFrame and passing that to the hasher:

x__ = h.transform(x_y_dataframe) 

Traceback (most recent call last): 

    File "<ipython-input-15-109e7f8018f3>", line 1, in <module> 
    x__ = h.transform(x_y_dataframe) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform 
    _hashing.transform(raw_X, self.n_features, self.dtype) 

    File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr> 
    raw_X = (_iteritems(d) for d in raw_X) 

    File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems 
    return d.iteritems() if hasattr(d, "iteritems") else d.items() 

AttributeError: 'unicode' object has no attribute 'items' 

Any ideas how I can achieve this with pandas or sklearn? Or could I perhaps build the dummy variables a few thousand rows at a time?

Here is how I currently build my dummy variables with pandas:

import pandas

def one_hot_encode(categorical_labels):
    # x is the source DataFrame, defined elsewhere
    res = []
    for col in categorical_labels:
        # str() turns a list cell into "['a', 'b']"; strip the brackets and split on ', '
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')  # can't set a prefix
        res.append(v)
        if len(res) == 2:
            # collapse the pair into one frame so few intermediates stay alive
            res = [pandas.concat(res, axis=1)]
    result = pandas.concat(res, axis=1)
    return result

You could convert the lists to tuples, which are hashable. – IanS

Answer


Consider the following approach:

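import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

# Join list cells into one space-separated string so each item becomes a token
vect = CountVectorizer()
X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))

# X is a scipy sparse matrix; it is densified here only for display
# (on scikit-learn >= 1.0 use vect.get_feature_names_out() instead)
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names(), index=df.index)

df.join(r)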

Result:

In [66]: r 
Out[66]: 
    apple banana 
0  1  0 
1  1  1 

In [67]: df.join(r) 
Out[67]: 
    age   fruit apple banana 
0 27   apple  1  0 
1 32 [apple, banana]  1  1 
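Since the question asks about feature hashing specifically, here is a minimal sketch of one way to feed such records to `FeatureHasher`. It is an assumption about what would suit the data, not part of the original answer. The `TypeError: a float is required` in the question is consistent with the hasher receiving a list where it expects a number or a string, so the hypothetical helper `flatten` below expands every list cell into one `column=value` key per item. The choice of `n_features=2 ** 10` is arbitrary, and the sketch is written for Python 3.

```python
from sklearn.feature_extraction import FeatureHasher

records = [{'fruit': 'apple', 'age': 27},
           {'fruit': ['apple', 'banana'], 'age': 32}]


def flatten(record):
    """Hypothetical helper: expand list cells into 'column=value' keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            # One indicator feature per item in the list
            for item in value:
                flat['{}={}'.format(key, item)] = 1
        elif isinstance(value, str):
            # A single categorical value becomes one indicator feature
            flat['{}={}'.format(key, value)] = 1
        else:
            # Numeric values pass through unchanged
            flat[key] = value
    return flat


# The output width is fixed up front, regardless of how many categories exist
hasher = FeatureHasher(n_features=2 ** 10, input_type='dict')
X = hasher.transform([flatten(r) for r in records])

print(X.shape)  # (2, 1024)
```

The result is a scipy sparse matrix whose width does not grow with the number of distinct values, which is the memory property one-hot encoding lacks. The trade-off is that hashed columns can collide and cannot be mapped back to names.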

UPDATE: starting with Pandas 0.20.1 we can create a SparseDataFrame directly from a sparse matrix:

In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0) 

In [14]: r 
Out[14]: 
    apple banana 
0  1  0 
1  1  1 

In [15]: r.memory_usage() 
Out[15]: 
Index  80 
apple  16 # 2 * 8 byte (np.int64) 
banana  8 # 1 * 8 byte (as there is only one `1` value) 
dtype: int64 

In [16]: r.dtypes 
Out[16]: 
apple  int64 
banana int64 
dtype: object 
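Note that `SparseDataFrame` was removed in pandas 1.0. On newer versions, a rough equivalent of the update above is `pd.DataFrame.sparse.from_spmatrix`. The sketch below assumes a recent pandas and scikit-learn; the `hasattr` fallback is only there so it also runs on scikit-learn releases that predate `get_feature_names_out()`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

vect = CountVectorizer()
X = vect.fit_transform(
    df['fruit'].map(lambda x: ' '.join(x) if isinstance(x, list) else x))

# Pick whichever feature-name accessor this scikit-learn version provides
if hasattr(vect, 'get_feature_names_out'):
    names = list(vect.get_feature_names_out())
else:
    names = vect.get_feature_names()

# Build a DataFrame with sparse columns without densifying X
r = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=names)

print(r.dtypes)
print(r.sparse.density)
```

Each column keeps a sparse dtype, so only the non-zero entries are stored, as with the `SparseDataFrame` shown above.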

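This does work, although I seem to run out of memory (32 GB); I guess I have a lot of columns. I also noticed that when I split the df up in order to do this, I get a lot of NaNs (even though I dropped all NaNs from my DataFrame beforehand) – Kevin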


I realized that I was getting NaNs because I had not set axis to 1 – Kevin


@Kevin, in Pandas 0.20.1 you can create a SparseDataFrame directly from a sparse matrix (the result of CountVectorizer). Please check my updated answer – MaxU
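The NaN issue Kevin describes in the comments is consistent with how `pandas.concat` behaves by default. It stacks frames row-wise (`axis=0`) and fills any column missing from a frame with NaN. A small illustration, using made-up frames `a` and `b` rather than the commenter's actual data:

```python
import pandas as pd

# Two dummy-variable blocks for the same two rows but different columns
a = pd.DataFrame({'apple': [1, 1]})
b = pd.DataFrame({'banana': [0, 1]})

# Default axis=0: rows are stacked and the missing columns become NaN
stacked = pd.concat([a, b])

# axis=1: frames are placed side by side, aligned on the row index
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))            # (4, 2) 4
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))  # (2, 2) 0
```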