
In this tutorial, we show DIDO-based (decision-in, decision-out) MKL-SVM, i.e., post-decision fusion, specifically the fusion of SVMs trained with different kernels, via decision-level fuzzy integral multiple kernel learning with lp-norm regularization (DeFIMKLp). We do this on the benchmark breast cancer data set. For the full math and algorithm details, see our paper: A. Pinar, J. Rice, L. Hu, D. T. Anderson, and T. C. Havens, "Efficient Multiple Kernel Classification using Feature and Decision Level Fusion," IEEE Trans. on Fuzzy Systems, 2016.

First, close any open figures and clear the workspace.

close all; % close any open figures
clear;     % clear workspace variables

Next, let's go with 5 kernels and load the breast cancer data set. The arguments, in order, are: the first 1 selects single- versus double-precision data, the next two 1's scale the data, per feature, to [-1,1], rand()*1000 is a random seed, and 0.75 is the fraction of the data used for training. Note that we picked 5 kernels, as well as which kernels and what parameters; I call this a "configuration". DeFIMKL optimizes relative to such a configuration. However, the search for a configuration is another learning step in itself, and the same holds for DeFIMKL, GAMKL, MKLGL, etc. It is a very challenging space to optimize. To date, researchers have focused their efforts on quality optimizers relative to a known configuration; in practice, most perform some form of gridded search or experimentally vary the configuration to find a quality solution (a sketch of such a search follows the next code block).

% how many kernels do we want?
nk = 5;

% load our data set
[X,L,X2,L2] = fi_defimkl_load_bcancer(1,1,1,rand()*1000,0.75);
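
As an aside, here is a minimal sketch of what such a gridded configuration search could look like. This is not part of the library; the candidate gamma sets are made up for illustration, and it simply reuses fi_defimkl_kernel and fi_defimkl from later in this tutorial. In practice, score candidates via cross-validation on the training data rather than on the test set, which is done here only for brevity.

% hypothetical gridded search over RBF bandwidth configurations
candidates = { 2.^(-4:0), 2.^(-3:1), 2.^(-2:2) }; % made-up candidate gamma sets
best_acc = -inf; best_gammas = [];
for c = 1:numel(candidates)
    gammas = candidates{c};
    Kc  = single( zeros(size(X,1),  size(X,1), numel(gammas)) );
    Kc2 = single( zeros(size(X2,1), size(X,1), numel(gammas)) );
    for k = 1:numel(gammas)
        Kc(:,:,k)  = single( fi_defimkl_kernel('rbf2',X',X',gammas(k)) );
        Kc2(:,:,k) = single( fi_defimkl_kernel('rbf2',X2',X',gammas(k)) );
    end
    % keep the best scoring configuration
    acc = fi_defimkl(L, L2, Kc, Kc2);
    if( acc > best_acc ), best_acc = acc; best_gammas = gammas; end
end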

Next, we make our kernels (again, relative to a single configuration). I did NOT take time to carefully generate an optimal set of kernels and parameters for this problem; my goal is to show you how to compute DeFIMKL, not how to search for a quality configuration.

% number of data points 
n = size(X,1); % for training
n2 = size(X2,1); % for testing

% preallocate the kernel matrices
K = single( zeros(n,n,nk) );
K2 = single( zeros(n2,n,nk) );

% make your kernels
d = size(X,2);
Xdists = pdist2(X,X);
Med_dist = median( Xdists(:) );
% for training
K(:,:,1) = single( fi_defimkl_kernel('rbf2',X',X',1/d) ); 
K(:,:,2) = single( fi_defimkl_kernel('rbf2',X',X',1/Med_dist) ); 
K(:,:,3) = single( fi_defimkl_kernel('rbf2',X',X',2^-3) );  
K(:,:,4) = single( fi_defimkl_kernel('rbf2',X',X',2^-2) );  
K(:,:,5) = single( fi_defimkl_kernel('rbf2',X',X',2^-1) );  
% for testing
K2(:,:,1) = single( fi_defimkl_kernel('rbf2',X2',X',1/d) );
K2(:,:,2) = single( fi_defimkl_kernel('rbf2',X2',X',1/Med_dist) );
K2(:,:,3) = single( fi_defimkl_kernel('rbf2',X2',X',2^-3) );  
K2(:,:,4) = single( fi_defimkl_kernel('rbf2',X2',X',2^-2) );  
K2(:,:,5) = single( fi_defimkl_kernel('rbf2',X2',X',2^-1) ); 
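
For reference, if 'rbf2' is the standard radial basis function kernel, k(x,y) = exp(-gamma*||x - y||^2) (an assumption about this helper; check fi_defimkl_kernel if in doubt), then an equivalent Gram matrix can be computed directly:

% sketch: RBF Gram matrix between the rows of X2 and the rows of X;
% comparable to K2(:,:,1) above if the 'rbf2' assumption holds
gamma = 1/d;
G = exp( -gamma .* pdist2(X2,X).^2 ); % n2 x n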

Now, train via DeFIMKL and print out some of the results.

% train now
[accuracy, resub, baseaccuracytrain, baseaccuracytest, FM] = fi_defimkl(L, L2, K, K2);
baseaccuracytrain
resub
baseaccuracytest
accuracy
[Phi,SEntropy,SGini] = fi_defimkl_shapley(nk,FM)

Our results are below. It's interesting: the third kernel (RBF with gamma = 2^-3) is perfect on training (100 percent), but it does not generalize well to our test data (87.8 percent). However, DeFIMKL mitigated this by learning a combination of all 5 kernels (see Phi). In return, on the test data DeFIMKL maintained the highest score (tied with some of the base kernels that would not have been preferred during training, where kernel 3 looked best). Note that, in general, MKL can, like many algorithms, become extremely overfit. This is one of the reasons we introduced lp-norm regularization into DeFIMKL (an attempt to learn fuzzy measures with low function error and also lower complexity).

baseaccuracytrain =

    0.9820    0.9760    1.0000    0.9820    0.9820

resub =

    0.9820

baseaccuracytest =

    0.9756    0.9756    0.8780    0.9024    0.9756

accuracy =

    0.9756
	
Phi =

    0.1638    0.1638    0.3446    0.1638    0.1638
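
For the curious, the Shapley values can be computed directly from the learned fuzzy measure. Here is a minimal standalone sketch (not the toolbox's fi_defimkl_shapley), assuming FM(j) stores the measure of the subset whose members are the set bits of j, with g(empty set) = 0; the toolbox's internal ordering may differ:

function Phi = shapley_sketch(nk, g)
    % Shapley value of each of the nk inputs w.r.t. the fuzzy measure g
    Phi = zeros(1,nk);
    for i = 1:nk
        for A = 0:(2^nk - 1)
            if( bitand(A, 2^(i-1)) == 0 )                % subsets A not containing i
                a = sum( bitget(A, 1:nk) );              % |A|
                w = factorial(nk-a-1)*factorial(a)/factorial(nk);
                gA = 0; if( A > 0 ), gA = g(A); end      % measure of A (empty set -> 0)
                Phi(i) = Phi(i) + w * ( g(A + 2^(i-1)) - gA );
            end
        end
    end
end

The values sum to g(X) = 1, so each entry of Phi can be read as that kernel's share of the overall measure (here, kernel 3 carries the largest share).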

Here is how we do DeFIMKL. First, we train our base SVMs:

function [accuracy, resub, baseaccuracytrain, baseaccuracytest, FM] = fi_defimkl(train_lbl, test_lbl, K, K_test)

    % ----------- training data --------------------    
    
    % number of kernels
    nk = size(K,3); 
    % number of inputs
    NumOfInputs = nk;      
    % number of monotonicity constraints on the fuzzy measure
    N = NumOfInputs*(2^(NumOfInputs-1)-1);     
    % measure length (one value per nonempty subset)
    g = 2^(NumOfInputs)-1;                      
        
    %train classifiers
    for k = 1:NumOfInputs,
        %train
        svm_type = 0;
        kernel_type = 4; % == 4 means custom kernel
        cost = 1;    
        options = ['-s ', num2str(svm_type),...
                    ' -t ', num2str(kernel_type),...
                    ' -c ', num2str(cost),...
                    ' -e ', num2str(0.03),...
                    ' -m ', num2str(4000),... 
                    ' -q'];
        model{k} = libsvmtrain_d1(double(train_lbl),[(1:length(train_lbl))' double(K(:,:,k))],[],options); % call whatever SVM you like
        %evaluate (test)
        [~,dvtrain(:,k)] = libsvmpredict_d1(double(train_lbl), [(1:length(train_lbl))' double(K(:,:,k))] , model{k});
        % libsvm orients decision values toward the first training label; flip so positive means class +1
        if( train_lbl(1) == -1 )
            dvtrain(:,k) = (-1) .* dvtrain(:,k);
        end
        decvv = dvtrain(:,k); decvv( decvv >= 0 ) = 1; decvv( decvv < 0 ) = -1;
        counts = abs( decvv - train_lbl );
        baseaccuracytrain(k) = length( find( counts == 0 ) ) / length(counts);
    end
		

Next, we learn the fuzzy measure via quadratic programming:

    % ----------- now learn the measure --------------------    
    
    % range compress the decision values into (-1,1) (algebraic sigmoid)
    dvtrain = dvtrain./sqrt(1+dvtrain.^2);
    
    % setup the QP
    [C,D,f,Z,Gamma]=fi_defimkl_qpmatrices(NumOfInputs,dvtrain,train_lbl,[]);
    options = optimset('Display','off');
    
    % lambda is the regularization strength; you can turn it off if you like (see the commented-out line below), but it helps!
    lambda = 8; 
    FM = quadprog((2*D+lambda*eye(g)),f,C,zeros(size(C,1),1),[],[],[zeros(g-1,1); 1],ones(g,1),[],options);
    %FM = quadprog(2*D,f,C,zeros(size(C,1),1),[],[],[zeros(g-1,1); 1],ones(g,1),[],options);  % no reg
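
For context, the quadprog call above (recall that quadprog minimizes 0.5*u'*H*u + f'*u) solves the l2-regularized measure learning problem, i.e., the p = 2 case of the lp-norm regularizer from the paper:

\[
\min_{\mathbf{u}} \; \mathbf{u}^{T} D\, \mathbf{u} + \mathbf{f}^{T}\mathbf{u} + \tfrac{\lambda}{2}\lVert\mathbf{u}\rVert_{2}^{2}
\quad \text{s.t.} \quad C\mathbf{u} \le \mathbf{0}, \;\; \mathbf{0} \le \mathbf{u} \le \mathbf{1}, \;\; u_{g} = 1,
\]

where u holds the g = 2^nk - 1 fuzzy measure values, Cu <= 0 encodes the monotonicity constraints, and the bounds together with u_g = 1 enforce a normalized fuzzy measure.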
		

Next, we run on the test data and use the fused result:

    % ----------- testing (using the fused result) --------------------

    % classify all points by all classifiers
    for k = 1:NumOfInputs,
        [~,dvtest(:,k)] = libsvmpredict_d1(double(test_lbl), [(1:length(test_lbl))' double(K_test(:,:,k))] , model{k});            
    end;
    if( train_lbl(1) == -1 )
        dvtest = (-1) .* dvtest;
    end
    % normalize and fuse them!
    dvtest = dvtest./sqrt(1+dvtest.^2);
    decv = fi_defimkl_chi( dvtest, FM' );
    
    decvv = decv; decvv( decvv >= 0 ) = 1; decvv( decvv < 0 ) = -1;
    counts = abs( decvv - test_lbl );
    accuracy = length( find( counts == 0 ) ) / length(counts);
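
For completeness, fi_defimkl_chi computes the discrete Choquet integral of the per-kernel decision values with respect to the learned fuzzy measure. Below is a minimal standalone sketch of that calculation (not the toolbox's implementation), again assuming the measure is indexed by binary-coded subsets; fi_defimkl_chi's internal ordering may differ:

function out = choquet_sketch(h, g)
    % h: n x nk matrix of decision values; g: fuzzy measure with one
    % value per nonempty binary-coded subset, g(full set) = 1
    [n, nk] = size(h);
    out = zeros(n,1);
    for j = 1:n
        [hs, idx] = sort(h(j,:), 'descend'); % h(pi(1)) >= ... >= h(pi(nk))
        A = 0; gprev = 0;
        for k = 1:nk
            A = A + 2^(idx(k)-1);            % A_k = {pi(1),...,pi(k)}
            out(j) = out(j) + hs(k) * ( g(A) - gprev );
            gprev = g(A);
        end
    end
end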
		

Now, we can calculate the resubstitution (resub) accuracy with respect to the fused result if we like:

    % ----------- training (resub w.r.t. fused result) --------------------
    
    % classify all points by all classifiers
    for k = 1:NumOfInputs,
        [~,dvtest3(:,k)] = libsvmpredict_d1(double(train_lbl), [(1:length(train_lbl))' double(K(:,:,k))] , model{k});            
    end;
    if( train_lbl(1) == -1 )
        dvtest3 = (-1) .* dvtest3;
    end
    % normalize and fuse them!
    dvtest3 = dvtest3./sqrt(1+dvtest3.^2);
    decv = fi_defimkl_chi( dvtest3, FM' );
    
    decvv = decv; decvv( decvv >= 0 ) = 1; decvv( decvv < 0 ) = -1;
    counts = abs( decvv - train_lbl );
    resub = length( find( counts == 0 ) ) / length(counts);  
		

Finally, I also calculate the performance of the base learners on the test data for comparison:

    % ----------- testing (but wrt base learners) --------------------
  
    for k = 1:NumOfInputs,
        [~,dvtrain2(:,k)] = libsvmpredict_d1(double(test_lbl), [(1:length(test_lbl))' double(K_test(:,:,k))] , model{k});
        if( train_lbl(1) == -1 )
            dvtrain2(:,k) = (-1) .* dvtrain2(:,k);
        end
        decvv = dvtrain2(:,k); decvv( decvv >= 0 ) = 1; decvv( decvv < 0 ) = -1;
        counts = abs( decvv - test_lbl );
        baseaccuracytest(k) = length( find( counts == 0 ) ) / length(counts);
    end