2011-08-24

Bad results using a precomputed chi2 kernel with libsvm (MATLAB)

I'm experimenting with libsvm, following the example that trains an SVM on the heart_scale data shipped with the software. I want to use my own precomputed chi2 kernel. With it, the classification rate on the training data drops to 24%. I'm fairly sure I'm computing the kernel correctly, but I guess I must be doing something wrong. The code is below. Can you see any mistake? Help would be greatly appreciated.

%read in the data: 
[heart_scale_label, heart_scale_inst] = libsvmread('heart_scale'); 
train_data = heart_scale_inst(1:150,:); 
train_label = heart_scale_label(1:150,:); 

%read somewhere that the kernel should not be sparse 
ttrain = full(train_data)'; 
ttest = full(test_data)'; 

precKernel = chi2_custom(ttrain', ttrain'); 
model_precomputed = svmtrain2(train_label, [(1:150)', precKernel], '-t 4'); 

This is how the kernel is precomputed:

function res=chi2_custom(x,y) 
a=size(x); 
b=size(y); 
res = zeros(a(1,1), b(1,1)); 
for i=1:a(1,1) 
    for j=1:b(1,1) 
     resHelper = chi2_ireneHelper(x(i,:), y(j,:)); 
     res(i,j) = resHelper; 
    end 
end 
function resHelper = chi2_ireneHelper(x,y) 
a=(x-y).^2; 
b=(x+y); 
resHelper = sum(a./(b + eps)); 

With a different SVM implementation (vlfeat) I get a classification rate on the training data of about 90% (yes, I tested on the training data, just to see what is going on). So I'm pretty sure the libsvm result is wrong.

Answers

The problem is in the following line:

resHelper = sum(a./(b + eps)); 

It should be:

resHelper = 1-sum(2*a./(b + eps)); 
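As a quick sanity check (a hypothetical NumPy sketch, not part of the original thread), the two formulas can be compared directly: the original returns a chi-squared distance, which is 0 for identical vectors, while the corrected one returns a kernel value, which is 1 for identical vectors:

```python
import numpy as np

eps = np.finfo(float).eps

def chi2_original(x, y):
    # The asker's version: a chi-squared *distance*, not a kernel value
    a = (x - y) ** 2
    b = x + y
    return np.sum(a / (b + eps))

def chi2_corrected(x, y):
    # Corrected version: identical vectors now score 1 (a similarity)
    a = (x - y) ** 2
    b = x + y
    return 1 - np.sum(2 * a / (b + eps))

x = np.array([0.2, 0.5, 0.3])
print(chi2_original(x, x))   # 0.0  (distance of a vector to itself)
print(chi2_corrected(x, x))  # 1.0  (kernel value of a vector with itself)
```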
Thank you for answering my question; I only just saw your response. – Sallos

@Sallos: Although your formula is slightly off, the real problem is normalization of the data. See my answer. – Amro


When working with SVMs, it is very important to normalize the dataset as a preprocessing step. Normalization puts the attributes on the same scale and prevents attributes with large values from biasing the result. It also improves numerical stability (minimizing the chance of overflow and underflow due to floating-point representation).
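The per-attribute min-max rescaling described here can be sketched in NumPy (an illustrative example with made-up data; the thread's actual code is MATLAB):

```python
import numpy as np

# Hypothetical example data: two attributes on very different scales.
data = np.array([[1.0, 200.0],
                 [3.0, 400.0],
                 [2.0, 300.0]])

# Rescale each column (attribute) to the [0, 1] range.
mn = data.min(axis=0)
mx = data.max(axis=0)
scaled = (data - mn) / (mx - mn)

print(scaled.min(), scaled.max())  # 0.0 1.0
```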

To be precise, your computation of the chi-squared kernel is slightly off. Instead, take the definition below and use this faster implementation of it:

K(x, y) = 1 - Σᵢ (xᵢ - yᵢ)² / ((xᵢ + yᵢ) / 2)

function D = chi2Kernel(X,Y) 
    D = zeros(size(X,1),size(Y,1)); 
    for i=1:size(Y,1) 
     d = bsxfun(@minus, X, Y(i,:)); 
     s = bsxfun(@plus, X, Y(i,:)); 
     D(:,i) = sum(d.^2 ./ (s/2+eps), 2); 
    end 
    D = 1 - D; 
end 
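For readers outside MATLAB, a rough NumPy port of the same kernel looks like this (an illustrative sketch; names and test data are my own):

```python
import numpy as np

def chi2_kernel(X, Y):
    """Chi-squared kernel: K(x, y) = 1 - sum_i (x_i - y_i)^2 / ((x_i + y_i)/2)."""
    eps = np.finfo(float).eps
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(Y.shape[0]):
        d = X - Y[i, :]  # row-wise differences against the i-th row of Y
        s = X + Y[i, :]  # row-wise sums against the i-th row of Y
        D[:, i] = np.sum(d ** 2 / (s / 2 + eps), axis=1)
    return 1 - D

X = np.random.rand(5, 3)  # made-up nonnegative data
K = chi2_kernel(X, X)
print(np.allclose(np.diag(K), 1.0))  # True: K(x, x) = 1
```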

Now consider the following example, using the same dataset as yours (code adapted from my previous answer):

%# read dataset 
[label,data] = libsvmread('./heart_scale'); 
data = full(data);  %# sparse to full 

%# normalize data to [0,1] range 
mn = min(data,[],1); mx = max(data,[],1); 
data = bsxfun(@rdivide, bsxfun(@minus, data, mn), mx-mn); 

%# split into train/test datasets 
trainData = data(1:150,:); testData = data(151:270,:); 
trainLabel = label(1:150,:); testLabel = label(151:270,:); 
numTrain = size(trainData,1); numTest = size(testData,1); 

%# compute kernel matrices between every pairs of (train,train) and 
%# (test,train) instances and include sample serial number as first column 
K = [ (1:numTrain)' , chi2Kernel(trainData,trainData) ]; 
KK = [ (1:numTest)' , chi2Kernel(testData,trainData) ]; 

%# view 'train vs. train' kernel matrix 
figure, imagesc(K(:,2:end)) 
colormap(pink), colorbar 

%# train model 
model = svmtrain(trainLabel, K, '-t 4'); 

%# test on testing data 
[predTestLabel, acc, decVals] = svmpredict(testLabel, KK, model); 
cmTest = confusionmat(testLabel,predTestLabel) 

%# test on training data 
[predTrainLabel, acc, decVals] = svmpredict(trainLabel, K, model); 
cmTrain = confusionmat(trainLabel,predTrainLabel) 

The results on the test data:

Accuracy = 84.1667% (101/120) (classification) 
cmTest = 
    62  8 
    11 39 

and on the training data we get around 90% accuracy, as you expected:

Accuracy = 92.6667% (139/150) (classification) 
cmTrain = 
    77  3 
    8 62 

(figure: the 'train vs. train' kernel matrix produced by the imagesc call above)

Oh cool - this is a detailed answer. Thanks for taking the time to think about my question. It definitely helps. – Sallos

@Sallos: Glad I could help; please consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) an answer if it solved the problem. – Amro
