Slides are here

In this tutorial, we demonstrate genetic algorithm based multiple kernel learning (MKL) with p-norm regularization (GAMKLp) and MKL group lasso (MKLGL) for FIFO, coupled with an SVM classifier. Both are p-norm linear convex sum (LCS) MKL formulations. We use the benchmark breast cancer data set. See our paper for the full MKL math and algorithm details: A. Pinar, J. Rice, L. Hu, D. T. Anderson, T. C. Havens, "Efficient Multiple Kernel Classification using Feature and Decision Level Fusion," IEEE Trans. on Fuzzy Systems, accepted September 2016.
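
Both methods learn one combined kernel as a weighted sum of the base kernels; they differ only in how the weight vector is learned (a group-lasso style optimization in MKLGL, a genetic algorithm in GAMKLp). As a minimal sketch (illustrative variable names, not code from the library), the combination the rest of this tutorial builds toward looks like this:

% Minimal sketch of a linear convex sum (LCS) of kernels (illustrative names).
% K is an n x n x nk stack of base kernel matrices and sigma is a 1 x nk weight
% vector with sigma >= 0; for the convex sum, sum(sigma) = 1 (MKLGL works with
% a p-norm constraint on sigma instead).
Kcombined = zeros(size(K,1));
for k = 1:size(K,3)
    Kcombined = Kcombined + sigma(k) .* K(:,:,k);
end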

First, close any open figures and clear the workspace.

		  

close all; 
clear all;

Next, let's do 2 folds, use 75% of our kernel data (with respect to our discussed Nystrom-based sampling and linearization for kernel compression and SVM speedup), and repeat the experiment two times.
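
As a quick aside, the Nystrom idea referenced above works like this: sample a subset of the training points, eigendecompose the small sub-kernel on that subset, and map every point to virtual linear features whose inner products approximate the full kernel, so a fast linear SVM can be used. A toy sketch is below (illustrative variable names, not part of the experiment); the same steps show up again in the single-kernel training code later on.

% Toy sketch of Nystrom kernel compression (illustrative only).
ntoy = 200; Xtoy = randn(ntoy,5);                       % toy data
Ktoy = exp( -(1/5) .* pdist2(Xtoy,Xtoy).^2 );           % full ntoy x ntoy RBF kernel
ztoy = randperm(ntoy); ztoy = ztoy(1:ceil(0.75*ntoy));  % sample 75% of the points
[Utoy,Vtoy] = eig(Ktoy(ztoy,ztoy));                     % eigendecompose the small sub-kernel
Xlin = Ktoy(:,ztoy)*Utoy*diag(1./sqrt(diag(Vtoy)));     % virtual (linearized) features
RecErr = norm(Ktoy - Xlin*Xlin','fro');                 % error of the approximation K ~ Xlin*Xlin'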

		  

% how many folds (in a N-fold cross validation setup) do we want to do?
NF = 2;

% I put in some example Nystrom kernel matrix sampling
% so, how much do we want to sample the kernel matrices?
% = 1 means use all data
% = 0.1 means use 10% of the data
PercentageOfKernelReduction = 0.75;

% data structures to store results
SingleKernelResults = zeros(1,NF);
MKLGLResults = zeros(1,NF);
GAMKLResults = zeros(1,NF);
RunGAMKLp = 1; % flag: set to 0 to skip the GAMKLp step

% number of times to run? (repeat our experiments)
NumberOfRepeats = 2;

% Main loop (number of repeated experiments)
for main_loop=0:(NumberOfRepeats-1)

    % Print out iteration #
    fprintf(1,'######################## Iteration %d\n',main_loop);
    
    % How many folds?
    for t=1:NF

        fprintf(1,'####### Fold %d\n',t);

For demonstration, I use 5 kernels and load the data. Again, as in our DeFIMKL example, I just picked some "configuration". The search for a configuration, i.e., which kernels, which parameters, etc., is a process that sits on top of MKLGL and GAMKLp.

		  

        % Number of kernels (hard coded here)
        nk = 5;

        % Load our data set
        %[X,L,X2,L2] = mkl_load_sonar(1,1,1,(rand()*1000)*(main_loop*t+t),0.80);
        [X,L,X2,L2] = mkl_load_bcancer(1,1,1,(rand()*1000)*(main_loop*t+t),0.80);

        % Number of data points
        n = size(X,1);
        n2 = size(X2,1);

Next, I build the training kernels.

		  

        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        % Training data
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

        % Allocate kernel variables
        K = single( zeros(n,n,nk) );

        % Make kernels
        d = size(X,2);
        Xdists = pdist2(X,X);
        Med_dist = median( Xdists(:) );
        K(:,:,1) = single( mkl_kernel('rbf2',X',X',1/d) );
        K(:,:,2) = single( mkl_kernel('rbf2',X',X',1/Med_dist) );
        K(:,:,3) = single( mkl_kernel('rbf2',X',X',2^-3) );  
        K(:,:,4) = single( mkl_kernel('rbf2',X',X',2^-2) );  
        K(:,:,5) = single( mkl_kernel('rbf2',X',X',2^-1) ); 
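
A note on the kernels: I use mkl_kernel with the 'rbf2' option and several bandwidths (1/d, one based on the median pairwise distance, and a few powers of two). Assuming 'rbf2' computes the usual Gaussian kernel exp(-gamma*||x - y||^2) (check mkl_kernel in the library for its exact definition), an equivalent direct computation would be:

        % Hedged equivalent of one RBF kernel matrix; this assumes 'rbf2' is the
        % standard Gaussian kernel and is shown only for illustration.
        gam = 1/d;                                % example bandwidth
        Kexample = exp( -gam .* pdist2(X,X).^2 ); % n x n RBF kernel matrix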

Next, I build the test kernels.

		  

        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        % Testing data
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        
        % Allocate kernel variables
        K2 = single( zeros(n2,n,nk) );

        % Make kernels
        d = size(X,2);
        Xdists = pdist2(X,X);
        Med_dist = median( Xdists(:) );
        K2(:,:,1) = single( mkl_kernel('rbf2',X2',X',1/d) );
        K2(:,:,2) = single( mkl_kernel('rbf2',X2',X',1/Med_dist) );
        K2(:,:,3) = single( mkl_kernel('rbf2',X2',X',2^-3) );  %rbf
        K2(:,:,4) = single( mkl_kernel('rbf2',X2',X',2^-2) );  %rbf
        K2(:,:,5) = single( mkl_kernel('rbf2',X2',X',2^-1) );  %rbf

Next, train a single-kernel SVM for each kernel and evaluate it (we take the best one).

		  

        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %STEP 1: train single kernel and evaluate (take the best one)
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
		
        SingleKernelIndResults = zeros(1,nk);
        for looper=1:nk
            % Do Nystrom sampling and eigen analysis, then use a linear SVM (equivalent to a kernel SVM in the RKHS)
            % These steps are not re-commented in the later uses of this code
            z = randperm(size(K(:,:,looper),1)); % step 1, random permutation of indices
            s = ceil( size(K(:,:,looper),1) * PercentageOfKernelReduction ); % step 2: pick samples 
            z = z(1:s); % select them
            Kzz = K(z,z,looper); % Build our first matrix
            Kz = K(:,z,looper); % Build our second matrix
            [U,V]=eig(Kzz); % step 3: do the eigen decomp
            Xz = Kz*U*diag(1./sqrt(diag(V)));  % step 4: compute the virtual (linearized) features; Xz*Xz' approximates the kernel
            %Kp = Xz*Xz'; % can expand to get reconstructed kernel matrix
            %err = sum((FinalKernel(:)-Kp(:)).^2); % can calc the final error of the reconstruction
            model = train(double(L'),sparse(double(Xz)),'-q'); % train with our svm (dot product kernel)
            % Now, test, have to follow same sampling
            Kz = K2(:,z,looper); % use the test kernel for this same kernel index (and the same sampled columns)
            Xz = Kz*U*diag(1./sqrt(diag(V)));
            %Kp = Xz*Xz'; %can look at reconstructed kernel matrix if desired
            [predicted_label, accuracy, decv] = predict(double(L2'),sparse(double(Xz)), model, '-q'); % predict
            SingleKernelIndResults(looper) = accuracy(1) / 100; % calc acc
        end
        % take the best individual
        SingleKernelResults(main_loop*NF + t) = max( SingleKernelIndResults );
        % normalize the scores (sum to 1)
        SingleKernelResultsnorm = SingleKernelIndResults ./ sum( SingleKernelIndResults );

Next, I run MKLGL twice, once initialized with uniform weights and once seeded with the relative performances of the base learners. I take the best score.

		  

        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %STEP 2: MKLGL
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                    
        % Try based on quality of base learner and uniform initial weights

        % Start with uniform weights
        [model,sigmaK,acc,dval,pred] = mkl_mklgl(L',K,1,1); % call solver
        FinalK = single(zeros(size(K2(:,:,1)))); % allocate final kernel matrix
        for j=1:nk
            FinalK = FinalK + sigmaK(j) .* K2(:,:,j); % do the aggregation (could be vectorized for speed)
        end
        [~,decv] = libsvmpredict_d1(double(L2'), [(1:length(L2))' double(FinalK)] , model); % predict labels
        % Have to deal with the case where L(1) is not +1
        if( L(1) == -1 )
            decv = decv * (-1);
        end
        %Make score
        decv( decv <= 0 ) = -1;
        decv( decv > 0 ) = 1;
        SameSign = L2' .* decv;
        SameSign( SameSign < 0 ) = 0;
        Acc1 = sum(SameSign) / length(decv);        

        % Now, try to seed with base learner qualities

        SingleKernelResultsnorm = SingleKernelIndResults ./ sum( SingleKernelIndResults );
        [model,sigmaK,acc,dval,pred] = mkl_mklgl(L',K,1,1,SingleKernelResultsnorm);   
        FinalK = single(zeros(size(K2(:,:,1))));
        for j=1:nk
            FinalK = FinalK + sigmaK(j) .* K2(:,:,j); % again, could speed up if desired
        end
        [~,decv] = libsvmpredict_d1(double(L2'), [(1:length(L2))' double(FinalK)] , model);
        % Have to deal with the case where L(1) is not +1
        if( L(1) == -1 )
            decv = decv * (-1);
        end
        % Make score
        decv( decv < 0 ) = -1;
        decv( decv >= 0 ) = 1;
        SameSign = L2' .* decv;
        SameSign( SameSign < 0 ) = 0;
        Acc2 = sum(SameSign) / length(decv);

        MKLGLResults(main_loop*NF + t) = max( [ Acc1 Acc2 ] ); 

Next, GAMKLp (via a GA to find the optimal weights to combine the kernels). You can look in the library to see how I implemented the gamklp function using MATLAB's genetic algorithm library. Note, it's not the most efficient implementation, but it's easy to read.
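
For a rough idea of how such a GA can be set up with MATLAB's Global Optimization Toolbox, the sketch below shows the general pattern. This is not the library's gamklp code; myFitness is a hypothetical function that would build the combined training kernel from a candidate weight vector, train and evaluate an SVM, and return a cost (e.g., negative accuracy) to minimize.

        % Hedged sketch of optimizing nk kernel weights with MATLAB's ga()
        % (myFitness is hypothetical, not a library function).
        nvars = nk;                   % one weight per kernel
        lb = zeros(1,nvars);          % weights are nonnegative
        ub = ones(1,nvars);           % and at most one
        Aeq = ones(1,nvars); beq = 1; % convex sum: weights sum to one
        opts = optimoptions('ga','Display','off');
        wbest = ga(@(w) myFitness(w,K,L), nvars, [],[], Aeq, beq, lb, ub, [], opts);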

		  

        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        %STEP 3: GAMKL (via GA to find out the optimal weights to combine kernels)
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

        if(RunGAMKLp==1)
            
            % Train base learners ----------------------------------------------

            Blearners = [];
            for i=1:nk % for the ith kernel
                z = randperm(size(K(:,:,i),1));
                s = ceil( size(K(:,:,i),1) * PercentageOfKernelReduction );
                z = z(1:s);
                Kzz = K(z,z,i);
                Kz = K(:,z,i);
                [U,V]=eig(Kzz);
                Xz = Kz*U*diag(1./sqrt(diag(V)));
                %Kp = Xz*Xz';        
                %err = sum((FinalKernel(:)-Kp(:)).^2);
                model = train(double(L'),sparse(double(Xz)),'-q');
                % store the results (have to use them below)
                Blearners{i}.model = model;
                Blearners{i}.z = z;
                Blearners{i}.Kzz = Kzz;
                Blearners{i}.Kz = Kz;
                Blearners{i}.U = U;
                Blearners{i}.V = V;
                Blearners{i}.Xz = Xz;
            end

            %GA learner -------------------------------------------------------
            tic;
            [ sol ] = gamklp( nk , K , X , L , Blearners , SingleKernelResultsnorm , PercentageOfKernelReduction );   
            toc

            % Then, train a SVM ------------------------------------------------
            Kt = sum(bsxfun(@times,K,reshape(sol,1,1,nk)),3);        
            z = randperm(size(Kt,1));
            s = ceil( size(Kt,1) * PercentageOfKernelReduction );
            z = z(1:s);
            Kzz = Kt(z,z);
            Kz = Kt(:,z);
            [U,V]=eig(Kzz);
            Xz = Kz*U*diag(1./sqrt(diag(V)));
            %Kp = Xz*Xz';        
            %err = sum((FinalKernel(:)-Kp(:)).^2);
            model = train(double(L'),sparse(double(Xz)),'-q');

            EachScore = zeros(1,size(L2,2));
            for i=1:size(L2,2)
                % Then, evaluate this data point -----------------------------------    
                Kt = sum(bsxfun(@times,K2(i,:,:),reshape(sol,1,1,nk)),3);
                Kz = Kt(1,z);
                Xz = Kz*U*diag(1./sqrt(diag(V)));
                %Kp = Xz*Xz';
                [predicted_label, accuracy, decv] = predict(double(L2(i)),sparse(double(Xz)), model, '-q');
                EachScore(i) = accuracy(1) / 100;
            end
            GAMKLResults(main_loop*NF + t) = sum( EachScore ) / length( EachScore ); 

        end
        
    end
    
end

Last, print off our statistics.
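
The reporting code is not shown above; a minimal sketch of what it could look like, assuming we summarize each method's accuracy over all folds and repeats, is:

% Minimal sketch of the reporting step (assumed, not the library's code).
fprintf(1,'Single kernel is %f and std of %f (max of %f)\n', ...
    mean(SingleKernelResults), std(SingleKernelResults), max(SingleKernelResults));
fprintf(1,'MKLGL is %f and std of %f (max of %f)\n', ...
    mean(MKLGLResults), std(MKLGLResults), max(MKLGLResults));
fprintf(1,'GAMKL is %f and std of %f (max of %f)\n', ...
    mean(GAMKLResults), std(GAMKLResults), max(GAMKLResults));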

		  

Single kernel is 0.957317 and std of 0.050282 (max of 1.000000)
MKLGL is 0.969512 and std of 0.030690 (max of 1.000000)
GAMKL is 0.981707 and std of 0.023352 (max of 1.000000)