I am trying to replicate the Binary Classification: Twitter sentiment analysis project in Python, using a sparse matrix and a pandas DataFrame.

The steps of this project are:

Step 1: Get data 
Step 2: Text preprocessing using R 
Step 3: Feature engineering 
Step 4: Split the data into train and test 
Step 5: Train prediction model 
Step 6: Evaluate model performance 
Step 7: Publish prediction web service 

I am at Step 4 now, but I don't think I can continue.

import pandas 
import re 
from sklearn.feature_extraction import FeatureHasher 

from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 

from sklearn import cross_validation 

#read the dataset of tweets 

header_row=['sentiment','tweetid','date','query', 'user', 'text'] 
train = pandas.read_csv("training.1600000.processed.noemoticon.csv",names=header_row) 

#keep only the right columns 

train = train[["sentiment","text"]] 

#remove punctuation, special characters, numbers and lower case the text

def remove_spch(text): 

    return re.sub("[^a-z]", ' ', text.lower()) 

train['text'] = train['text'].apply(remove_spch) 


#Feature Hashing 

def tokens(doc): 
    """Extract tokens from doc. 

    This uses a simple regex to break strings into tokens. 
    """ 
    return (tok.lower() for tok in re.findall(r"\w+", doc)) 

n_features = 2**18 
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True) 
X = hasher.transform(tokens(d) for d in train['text']) 

#Feature Selection and choose the best 20,000 features using Chi-Square

X_new = SelectKBest(chi2, k=20000).fit_transform(X, train['sentiment']) 

#Using Stratified KFold, split my data to train and test 

skf = cross_validation.StratifiedKFold(X_new, n_folds=2) 

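I believe the last line is wrong, because it only contains the 20,000 features and not the Sentiment column from the pandas DataFrame. How can I "join" the sparse matrix X_new with the DataFrame train, so that I can include it in cross_validation and then use it for a classifier?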

Answer

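You should pass your class labels to StratifiedKFold and then use skf as an iterator. On each iteration it yields the indices of the training set and the test set, which you can use to split your dataset. There is no need to join the sparse matrix with the DataFrame: the same row indices select from both X_new and the label column.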

See the code example in the official scikit-learn documentation: StratifiedKFold
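As a rough illustration of the advice above, here is a minimal sketch on toy data. It assumes a recent scikit-learn, where StratifiedKFold lives in sklearn.model_selection and takes the labels in split(); in the older cross_validation module used in the question, the labels were passed to the constructor instead, e.g. cross_validation.StratifiedKFold(y, n_folds=2). The small random matrix and the 0/1 label array are made-up stand-ins for X_new and train['sentiment'].

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins (assumptions for illustration only):
# X_new plays the role of the sparse feature matrix,
# y plays the role of the train['sentiment'] labels.
rng = np.random.RandomState(0)
X_new = csr_matrix(rng.randint(0, 3, size=(8, 5)))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2)
folds = []

# split() stratifies on y and yields positional row indices per fold.
for train_idx, test_idx in skf.split(X_new, y):
    # The same indices slice the sparse matrix and the label array,
    # so the two never have to be joined into one object.
    X_train, X_test = X_new[train_idx], X_new[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    folds.append((train_idx, test_idx))
    print(X_train.shape, y_train, X_test.shape, y_test)
```

If the labels are kept as a pandas Series rather than a NumPy array, positional selection would be done with y.iloc[train_idx] (or on y.to_numpy()), since the fold indices are row positions and not index labels. The X_train / y_train pairs can then be handed to any classifier's fit method.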

Thanks to your answer, I found out that the problem lies somewhere else, so I opened another question. – Tasos