scikit-learn pipeline

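Each sample in my (i.i.d.) dataset looks like this:
x = [a_1, a_2, ..., a_N, b_1, b_2, ..., b_M]

I also have a label for each sample (this is supervised learning).

The a features are very sparse (a bag-of-words representation), while the b features are dense (integers, there are ~45 of those).

I am using scikit-learn, and I want to use GridSearchCV with a pipeline.

The question: is it possible to use one CountVectorizer on the type a features and another CountVectorizer on the type b features?

What I want can be thought of as: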

pipeline = Pipeline([ 
    ('vect1', CountVectorizer()), #will work only on features [0,(N-1)] 
    ('vect2', CountVectorizer()), #will work only on features [N,(N+M-1)] 
    ('clf', SGDClassifier()), #will use all features to classify 
]) 

parameters = { 
    'vect1__max_df': (0.5, 0.75, 1.0),  # type a features only 
    'vect1__ngram_range': ((1, 1), (1, 2)), # type a features only 
    'vect2__max_df': (0.5, 0.75, 1.0),  # type b features only 
    'vect2__ngram_range': ((1, 1), (1, 2)), # type b features only 
    'clf__alpha': (0.00001, 0.000001), 
    'clf__penalty': ('l2', 'elasticnet'), 
    'clf__n_iter': (10, 50, 80), 
} 

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1) 
grid_search.fit(X, y)

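Is this possible?

A nice idea was suggested by @Andreas Mueller. However, I want to keep the original, non-selected features as well, so I cannot tell the pipeline up front (before it starts) the column indices for each stage.

For example, if I set CountVectorizer(max_df=0.75), it may drop some terms, and the original column indices will change.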
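
To illustrate the concern, here is a minimal sketch with made-up documents (the `docs` list is an assumption for illustration, not my real data). It shows how lowering `max_df` removes a term and therefore changes the number of output columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: "apple" occurs in every one of them.
docs = ["apple banana", "apple cherry", "apple date", "apple egg"]

# max_df=1.0 keeps every term, however frequent it is.
full = CountVectorizer(max_df=1.0)
X_full = full.fit_transform(docs)

# max_df=0.75 ignores terms that occur in more than 75% of the documents.
reduced = CountVectorizer(max_df=0.75)
X_reduced = reduced.fit_transform(docs)

print(X_full.shape, X_reduced.shape)
print(sorted(reduced.vocabulary_))
```

Because the width of the vectorizer output depends on its parameters, column positions after this stage cannot be fixed before the grid search runs.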

Thanks

Source

2015-05-31 omerbp

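Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the features, and the transformer in each branch needs to select its features and transform them. One way to do that is to build a pipeline from a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features into different values in a dictionary, but you don't need to do that. Also have a look at the related issue for selecting columns, which contains code for the transformer you need.

It would look something like this with the current code: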

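from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union

# FeatureSelector is the column-selecting transformer you write yourself
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())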
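
A fuller sketch of the same idea, under some loudly stated assumptions: `ColumnSelector` is a hypothetical stand-in for the selector you would write yourself, the data is made up, and each feature type is stored as one column of raw text (CountVectorizer expects raw documents, not precomputed counts). Explicit step names are used instead of `make_pipeline`/`make_union` so that the grid-search parameter names stay readable:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: return one column of a 2-D array as a 1-D sequence."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X)[:, self.column]


# Made-up data: column 0 holds the type a text, column 1 the type b text.
X = np.array([
    ["red apple pie", "x1 x2"],
    ["green apple tart", "x1 x3"],
    ["red cherry pie", "x2 x3"],
    ["green cherry tart", "x1 x2"],
    ["blue car engine", "y1 y2"],
    ["fast car wheel", "y1 y3"],
    ["blue bike wheel", "y2 y3"],
    ["fast bike engine", "y1 y2"],
], dtype=object)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

pipeline = Pipeline([
    ("union", FeatureUnion([
        # Each branch receives the full X and selects its own column.
        ("a", Pipeline([("select", ColumnSelector(0)),
                        ("vect", CountVectorizer())])),
        ("b", Pipeline([("select", ColumnSelector(1)),
                        ("vect", CountVectorizer())])),
    ])),
    ("clf", SGDClassifier(random_state=0)),
])

# Parameters are addressed as <step>__<branch>__<step>__<parameter>.
parameters = {
    "union__a__vect__ngram_range": [(1, 1), (1, 2)],
    "union__b__vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-5, 1e-6],
}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Since every branch of the FeatureUnion sees the original input, terms dropped by one vectorizer do not shift the columns seen by the other. In scikit-learn 0.20 and later, `sklearn.compose.ColumnTransformer` covers this use case without a hand-written selector.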

Source

2015-06-01 13:33:12

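Hi, I read what you attached and edited my original question; feel free to respond. – omerbp

You need to use FeatureUnion to extract multiple features in parallel. –

Yes, and put it as the first step in the pipeline... it will take some time to build. However, as I see it, I can't use the FeatureSelector, because I want to do the selection twice, once for each feature type, right? – omerbp

We developed PipeGraph, an extension to scikit-learn pipelines that lets you access intermediate data, build graph-like workflows and, in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph)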

Source

2018-02-18 22:15:14

Cool, I'll check it out! – omerbp
