Consider the following demo:
Source DF:
In [2]: df
Out[2]:
                     text
0        is it good movie
1  wooow is it very goode
2               bad movie
Solution: let's create a SparseDataFrame from the TF-IDF sparse matrix:
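import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
# build a SparseDataFrame from the scipy sparse matrix returned by fit_transform
sdf = pd.SparseDataFrame(vect.fit_transform(df['text']),
                         columns=vect.get_feature_names(),
                         default_fill_value=0)
# attach the original text as a regular column
sdf['text'] = df['text']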
Result:
In [13]: sdf
Out[13]:
   bad  good     goode     wooow                    text
0  0.0   1.0  0.000000  0.000000        is it good movie
1  0.0   0.0  0.707107  0.707107  wooow is it very goode
2  1.0   0.0  0.000000  0.000000               bad movie
In [14]: sdf.memory_usage()
Out[14]:
Index    80
bad       8
good      8
goode     8
wooow     8
text     24
dtype: int64
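P.S. Pay attention to the .memory_usage() output: we didn't lose the "sparsity". If we used pd.concat, join, etc., we would lose the "sparsity", because all of these methods generate a new regular (non-sparse) copy of the merged DataFrames.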
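Whether a given operation keeps the columns sparse may depend on the pandas version, so it can be worth checking directly instead of assuming. The sketch below is one way to do that, assuming pandas >= 1.0. The helper sparse_columns and the frames a and b are hypothetical names introduced for illustration only.

```python
import pandas as pd

def sparse_columns(frame):
    """Return the names of the columns that have a sparse dtype."""
    return [col for col, dtype in frame.dtypes.items()
            if isinstance(dtype, pd.SparseDtype)]

# Hypothetical toy frames: one sparse column, one regular column
a = pd.DataFrame({"x": pd.arrays.SparseArray([0.0, 1.0, 0.0])})
b = pd.DataFrame({"y": [1.0, 2.0, 3.0]})

combined = pd.concat([a, b], axis=1)

print(sparse_columns(a))         # sparse columns before combining
print(sparse_columns(combined))  # sparse columns that survived pd.concat
print(combined.memory_usage())
```

Comparing the two lists, together with .memory_usage(), shows whether the combined frame still holds sparse data on the installed version.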