Python - Feature hashing a list of strings within a list of strings

I want to be able to take a list of dictionaries (records) in which some columns hold a list of values as the cell value. Here is an example:

[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]

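How can I take this input and perform feature hashing on it? (My dataset has thousands of columns.) I am currently using one-hot encoding, but it seems to consume a lot of memory, more than my system has.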

I tried passing my dataset in the form shown above and got an error:

x__ = h.transform(data)

Traceback (most recent call last):
  File "<ipython-input-14-db4adc5ec623>", line 1, in <module>
    x__ = h.transform(data)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103)
TypeError: a float is required
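
The TypeError above is consistent with the hasher receiving a list where it expects a number or a string as the dict value. A minimal sketch of one workaround, assuming Python 3 and a scikit-learn version that provides `FeatureHasher`: expand every list-valued cell into one `column=value` key per element before hashing. The helper name `flatten_record` and the choice of `n_features=2**10` are illustrative assumptions, not part of the original post.

```python
from sklearn.feature_extraction import FeatureHasher


def flatten_record(record):
    """Expand list-valued cells into one 'column=value' key per element."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            # One indicator feature per list element.
            for item in value:
                flat["{}={}".format(key, item)] = 1
        elif isinstance(value, str):
            # Single categorical value: same 'column=value' naming.
            flat["{}={}".format(key, value)] = 1
        else:
            # Numeric values are passed through unchanged.
            flat[key] = value
    return flat


data = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]

# n_features is an arbitrary illustrative size; larger values reduce collisions.
hasher = FeatureHasher(n_features=2**10, input_type='dict')
X_hashed = hasher.transform([flatten_record(r) for r in data])
```

The result is a SciPy sparse matrix with one row per record, so memory grows with the number of non-zero entries rather than with the number of distinct categories.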

I also tried turning it into a DataFrame and passing that to the hasher:

x__ = h.transform(x_y_dataframe)

Traceback (most recent call last):
  File "<ipython-input-15-109e7f8018f3>", line 1, in <module>
    x__ = h.transform(x_y_dataframe)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()
AttributeError: 'unicode' object has no attribute 'items'
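
The AttributeError above is what one would expect if the hasher iterates over the DataFrame directly: iterating a pandas DataFrame yields its column labels (strings), not its rows. A small sketch, assuming Python 3 and pandas, showing the difference and how to get a list of row dicts back with `to_dict('records')`:

```python
import pandas as pd

df = pd.DataFrame([{'fruit': 'apple', 'age': 27},
                   {'fruit': ['apple', 'banana'], 'age': 32}])

# Iterating the DataFrame gives the column labels, which are plain strings.
column_labels = list(df)

# to_dict('records') gives one dict per row, the shape a dict-based hasher expects.
records = df.to_dict('records')
```

List-valued cells survive the round trip unchanged, so they still need to be flattened before they reach the hasher.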

Any ideas on how I can achieve this with pandas or sklearn? Or could I perhaps build my dummy variables a few thousand rows at a time?
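
Because `FeatureHasher` is stateless (it needs no fitting), the "few thousand rows at a time" idea can be sketched by hashing each chunk separately and stacking the sparse blocks. This is a sketch under assumptions: Python 3, SciPy and scikit-learn available, and the helper names `to_tokens` and `hash_in_chunks`, the chunk size and `n_features` are all made up for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher


def to_tokens(record, columns):
    """Turn the chosen columns of one record into 'column=value' tokens."""
    tokens = []
    for col in columns:
        value = record[col]
        values = value if isinstance(value, (list, tuple)) else [value]
        tokens.extend("{}={}".format(col, v) for v in values)
    return tokens


def hash_in_chunks(records, columns, chunk_size=1000, n_features=2**18):
    """Hash records chunk by chunk and stack the sparse results row-wise."""
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    blocks = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        blocks.append(hasher.transform([to_tokens(r, columns) for r in chunk]))
    return sparse.vstack(blocks, format='csr')


data = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
X_chunks = hash_in_chunks(data, columns=['fruit'], chunk_size=1)
```

Every chunk maps to the same fixed number of columns, so the blocks can be stacked without aligning vocabularies between chunks.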

Here is how I build my dummy variables with pandas:

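import pandas

def one_hot_encode(categorical_labels):
    # `x` is the module-level DataFrame holding the raw data.
    res = []
    for col in categorical_labels:
        # Render each cell as a string, strip the list brackets, split on ', '.
        # Note: str.get_dummies cannot set a column prefix.
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')
        if len(res) == 2:
            # Merge the frames collected so far to keep the list short.
            res = [pandas.concat(res, axis=1)]
        # Always keep the current column's dummies (the original dropped them
        # whenever a merge happened).
        res.append(v)
    return pandas.concat(res, axis=1)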
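
If exact one-hot columns are still wanted for a list-valued column, one memory-lighter option is a sparse indicator matrix. A minimal sketch, assuming Python 3 and scikit-learn's `MultiLabelBinarizer`; the variable `fruit_column` stands in for a single column of the data and is an assumption made for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One column of the example data: scalars and lists mixed.
fruit_column = ['apple', ['apple', 'banana']]

# Wrap scalar cells so that every cell is a list of labels.
cells = [v if isinstance(v, list) else [v] for v in fruit_column]

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
X_sparse = mlb.fit_transform(cells)
labels = list(mlb.classes_)
```

Unlike hashing, this keeps a one-to-one mapping between columns and category labels, at the cost of having to fit on the full set of labels first.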


2017-05-09 Kevin

You can convert the lists to tuples, which are hashable. – IanS
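
A short sketch of the tuple suggestion, assuming Python 3; the helper name `make_hashable` is made up for illustration. Note that this makes a cell hashable in the Python sense (usable as a set member or dict key), which is a different thing from feature hashing.

```python
record = {'fruit': ['apple', 'banana'], 'age': 32}


def make_hashable(rec):
    """Replace list values with tuples so every value can be hashed."""
    return {k: tuple(v) if isinstance(v, list) else v for k, v in rec.items()}


hashable = make_hashable(record)

# Tuples can be stored in a set; the original list cannot.
seen = {hashable['fruit']}

try:
    hash(record['fruit'])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
```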

Consider the following approach:

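import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

# Join list-valued cells into one space-separated string so CountVectorizer can tokenize them.
vect = CountVectorizer()
X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))

# toarray() densifies the sparse result (same as the older X.A shortcut).
# Newer scikit-learn versions rename get_feature_names() to get_feature_names_out().
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names(), index=df.index)
df.join(r)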

Result:

In [66]: r
Out[66]:
   apple  banana
0      1       0
1      1       1

In [67]: df.join(r)
Out[67]:
   age            fruit  apple  banana
0   27            apple      1       0
1   32  [apple, banana]      1       1
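
Since the question asks about feature hashing specifically, the same join-the-list preprocessing can also feed a hashing vectorizer, which needs no vocabulary and produces a fixed number of columns. A minimal sketch, assuming Python 3 and a scikit-learn version whose `HashingVectorizer` accepts `alternate_sign`; the size `n_features=2**10` is an arbitrary illustrative choice.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

# Same preprocessing as above: join list cells into one string per row.
docs = df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x)

# alternate_sign=False and norm=None keep raw non-negative counts.
hv = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
X_h = hv.transform(docs)
```

The trade-off compared with `CountVectorizer` is that column indices can no longer be mapped back to labels, and distinct values may collide in the same column.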

UPDATE: starting from Pandas 0.20.1 we can create a SparseDataFrame directly from a sparse matrix:

In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0)

In [14]: r
Out[14]:
   apple  banana
0      1       0
1      1       1

In [15]: r.memory_usage()
Out[15]:
Index     80
apple     16   # 2 * 8 byte (np.int64)
banana     8   # 1 * 8 byte (as there is only one `1` value)
dtype: int64

In [16]: r.dtypes
Out[16]:
apple     int64
banana    int64
dtype: object
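
`SparseDataFrame` was removed in pandas 1.0. On newer pandas versions a comparable result can be sketched with `pd.DataFrame.sparse.from_spmatrix`, assuming Python 3 and pandas 0.25 or later. The column names are taken from `vect.vocabulary_` here only to avoid depending on which `get_feature_names` variant the installed scikit-learn provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
df = pd.DataFrame(lst)

vect = CountVectorizer()
X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))

# vocabulary_ maps each term to its column index; sort the terms by that index.
cols = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Build a DataFrame with sparse columns directly from the SciPy sparse matrix.
r = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=cols)
```

`r.sparse.density` then reports the fraction of stored (non-fill) values, which gives a quick check of how sparse the frame actually is.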


2017-05-09 13:22:06 MaxU

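This does work, although I seem to run out of memory (32 GB); I suppose I have a lot of columns. I also noticed that when I split the df up in order to be able to do this, I got a lot of NaNs (even though I dropped all NaNs from my DataFrame beforehand). – Kevin

I realized the reason I was getting NaNs was that I had not set axis to 1. – Kevin

@Kevin, in Pandas 0.20.1 you can create a SparseDataFrame directly from a sparse matrix (the result of CountVectorizer). Please check my updated answer. – MaxU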
