In this tutorial, we demonstrate how to manually import data (a .mat file) from MATLAB into Python 3 and how to prepare that data for TensorFlow. The data prepared in MATLAB will be the CIFAR-10 dataset.

Part 1: Preparing the Data in MATLAB (R2017a)

First, we will create the CIFAR-10 dataset files using MATLAB (R2017a). We can do this with a few straightforward scripts. Directly below is the main script (Create_Cifar.m), which loads the CIFAR-10 training dataset from the downloaded "cifar-10-batches-mat" folder. That folder is extracted from the CIFAR-10 MATLAB archive: https://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz


clc, clear all, close all;                                                                          % Clear workspace and environment

%-----------------------------------------------------------------------------------
% Load in training data
%-----------------------------------------------------------------------------------

cifar10Data = 'data';                                                                               % Folder where CIFAR-10 data(cifar-10-batches-mat) is stored

% Load the CIFAR-10 training and test data.
[trainingImages, trainingLabels, testImages, testLabels] = helperCIFAR10Data.load(cifar10Data);     % Load Cifar data with the helperCIFAR10Data class (listed at the end of Part 1)
info = load('data/cifar-10-batches-mat/batches.meta.mat');                                          % Load in Cifar class information

%-----------------------------------------------------------------------------------
% Create Training .mat File
%-----------------------------------------------------------------------------------

train.data = trainingImages;                                                                        % Training Data images
train.data = Find_zeroMean(train.data);                                                             % Take Zero Mean of Training Images

train.labels = Find_label(trainingLabels');                                                         % Training Data labels
meta.classes = {'airplane','automobile','bird', 'cat','deer', ...
	    'dog', 'frog', 'horse', 'ship', 'truck'};                                               % Cifar Class Information

imdb.meta = meta;                                                                                   % Setting up class info
imdb.train = train;                                                                                 % Setting up training info

fprintf('\n***** Training File has been created! *****\n');
save('imdb_train.mat', 'imdb', '-v7.3');

clear imdb;

%-----------------------------------------------------------------------------------
% Create Testing .mat File
%-----------------------------------------------------------------------------------

test.data = testImages;                                                                             % Testing Data images
test.data = Find_zeroMean(test.data);                                                               % Take Zero Mean of Testing Images
test.labels = Find_label(testLabels');                                                              % Testing Data labels

imdb.meta = meta;                                                                                   % Setting up class info
imdb.test = test;                                                                                   % Setting up testing info

fprintf('\n***** Testing File has been created! *****\n');
save('imdb_test.mat', 'imdb', '-v7.3');


Let's look at some support functions that help set up the CIFAR-10 dataset for TensorFlow. First, we will look at the function Find_label(). The helperCIFAR10Data.load() method imports the CIFAR-10 labels as categorical values, and they need to be converted to numeric class indices (0-9) for TensorFlow.



function [y_] = Find_label(y)                                                                       % Function: Convert Cifar-10 labels from string to numeric values

	  y_ = zeros(size(y, 1), size(y, 2));                                                           % Initialize new labels, same size as input labels

	  for i = 1:length(y)                                                                           % Loop over all input labels
	      current = string(y(i));                                                                   % Convert current label to a string for the switch
	      switch current                                                                            % Switch Statement to match label to value
	          case "airplane"                                                                       % Case: Airplane
	              y_(i) = 0;
	          case "automobile"                                                                     % Case: Automobile
	              y_(i) = 1;
	          case "bird"                                                                           % Case: Bird
	              y_(i) = 2;
	          case "cat"                                                                            % Case: Cat
	              y_(i) = 3;
	          case "deer"                                                                           % Case: Deer
	              y_(i) = 4;
	          case "dog"                                                                            % Case: Dog
	              y_(i) = 5;
	          case "frog"                                                                           % Case: Frog
	              y_(i) = 6;
	          case "horse"                                                                          % Case: Horse
	              y_(i) = 7;
	          case "ship"                                                                           % Case: Ship
	              y_(i) = 8;
	          case "truck"                                                                          % Case: Truck
	              y_(i) = 9;
	      end
	  end
end
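
Note that this step can also be done on the Python side after import, if preferred. Purely as an illustration (this helper is hypothetical and not part of the original scripts), an equivalent NumPy mapping might look like this:

#----------------------------------------------------------
# Hypothetical Python equivalent of Find_label()
#----------------------------------------------------------

import numpy

CLASS_NAMES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']                         # Same ordering as Find_label()
NAME_TO_INDEX = {name: i for i, name in enumerate(CLASS_NAMES)}                 # Map class name -> numeric value 0-9

def find_label(names):                                                          # names: iterable of class-name strings
    return numpy.array([NAME_TO_INDEX[str(n)] for n in names], dtype=numpy.uint8)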

Next, let's look at the function Find_zeroMean(). This takes the input image data (training or testing) and subtracts each sample's mean, producing zero-mean images.


function [data] = Find_zeroMean(input)                                                              % Function: Finds Zero-mean of input data

	  data = zeros(size(input));                                                                    % Create zero-mean output, same size as input data
	  for i = 1:size(input, 4)                                                                      % Loop over samples along the 4th dimension
	     current = single(input(:,:,:,i));                                                          % Find the current sample
	     current_u = mean(current(:));                                                              % Scalar mean over the whole sample
	     data(:,:,:,i) = current - current_u;                                                       % Subtract the mean to zero-center the sample
	  end
end
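
For reference, the same per-sample mean subtraction is easy to reproduce (or verify) once the images are in a NumPy array on the Python side. This is a hypothetical sketch, assuming the sample index sits on the first axis, as it will after the h5py import in Part 2:

#----------------------------------------------------------
# Hypothetical NumPy equivalent of Find_zeroMean()
#----------------------------------------------------------

import numpy

def find_zero_mean(images):                                                     # images: 4-D array with the sample index on axis 0
    images = images.astype(numpy.float32)                                       # Match the single() cast in the MATLAB code
    means = images.reshape(images.shape[0], -1).mean(axis=1)                    # One scalar mean per sample
    return images - means.reshape((-1,) + (1,) * (images.ndim - 1))             # Broadcast-subtract each sample's mean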

Lastly, here is the helperCIFAR10Data class for any reader who needs it. This class reads the individual data_batch files from the cifar-10-batches-mat folder and organizes the training and testing data (images and labels).


%----------------------------------------------------------------------------
% This is helper class to download and import the CIFAR-10 dataset. The
% dataset is downloaded from:
%
%  https://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz
%
% References
% ----------
% Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of
% features from tiny images." (2009).
%----------------------------------------------------------------------------
classdef helperCIFAR10Data

	  methods(Static)

	      %------------------------------------------------------------------
	      function download(url, destination)
	          if nargin == 1
	              % Only the destination folder was supplied; use the default CIFAR-10 URL.
	              destination = url;
	              url = 'https://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz';
	          end

	          unpackedData = fullfile(destination, 'cifar-10-batches-mat');
	          if ~exist(unpackedData, 'dir')
	              fprintf('Downloading CIFAR-10 dataset...');
	              untar(url, destination);
	              fprintf('done.\n\n');
	          end
	      end
	      %------------------------------------------------------------------
	      % Return CIFAR-10 Training and Test data.
	      function [XTrain, TTrain, XTest, TTest] = load(dataLocation)

	          location = fullfile(dataLocation, 'cifar-10-batches-mat');

	          [XTrain1, TTrain1] = loadBatchAsFourDimensionalArray(location, 'data_batch_1.mat');
	          [XTrain2, TTrain2] = loadBatchAsFourDimensionalArray(location, 'data_batch_2.mat');
	          [XTrain3, TTrain3] = loadBatchAsFourDimensionalArray(location, 'data_batch_3.mat');
	          [XTrain4, TTrain4] = loadBatchAsFourDimensionalArray(location, 'data_batch_4.mat');
	          [XTrain5, TTrain5] = loadBatchAsFourDimensionalArray(location, 'data_batch_5.mat');

	          XTrain = cat(4, XTrain1, XTrain2, XTrain3, XTrain4, XTrain5);
	          TTrain = [TTrain1; TTrain2; TTrain3; TTrain4; TTrain5];

	          [XTest, TTest] = loadBatchAsFourDimensionalArray(location, 'test_batch.mat');

	      end
	  end
end

function [XBatch, TBatch] = loadBatchAsFourDimensionalArray(location, batchFileName)
		load(fullfile(location,batchFileName));
		XBatch = data';
		XBatch = reshape(XBatch, 32,32,3,[]);
		XBatch = permute(XBatch, [2 1 3 4]);
		TBatch = convertLabelsToCategorical(location, labels);
end

function categoricalLabels = convertLabelsToCategorical(location, integerLabels)
		load(fullfile(location,'batches.meta.mat'));
		categoricalLabels = categorical(integerLabels, 0:9, label_names);
end


Part 2: Importing the Data from MATLAB into Python and Setting It Up for TensorFlow

Once the main script has been executed, imdb_train.mat and imdb_test.mat will be created. For this example, move (or copy) them into a folder named "data" inside your local TensorFlow project directory. Once the files are in place, we can load them into Python. To begin, import each essential Python library. NumPy is a matrix (array) library that we will use instead of plain Python lists, and h5py is the library that lets us load v7.3 .mat files saved from MATLAB.

#----------------------------------------------------------
# Import Helpful Libraries
#----------------------------------------------------------

import h5py                                                                     # Library for reading HDF5-based (v7.3 and later) .mat files
import numpy                                                                    # Library for matrix (array) representation and computation
import tensorflow                                                               # TensorFlow library (used once the data is prepared)
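
Before hard-coding any paths, it can help to confirm what the .mat file actually exposes. A quick check (using the training file created in Part 1) lists every group and dataset inside it:

#----------------------------------------------------------
# Optional: inspect the HDF5 layout of a v7.3 .mat file
#----------------------------------------------------------

with h5py.File('data/imdb_train.mat', 'r') as f:                                # Open the v7.3 .mat file read-only
    f.visit(print)                                                              # Print every group/dataset path, e.g. imdb/train/data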

Next, let's load the training and testing data (images and labels) from both .mat files. We can do this easily with the h5py library. Each .mat file holds a MATLAB structure, so we can access its components (images, labels) by treating the structure fields as an HDF5 path (e.g. MATLAB imdb.train.labels becomes imdb/train/labels in Python). Afterwards, we convert all of the data into NumPy arrays, which is the format we will hand to TensorFlow, and confirm the size of the training and testing sets.


#----------------------------------------------------------
# Import (.mat) Train & Test Data, Labels
#----------------------------------------------------------

trainFile = h5py.File('data/imdb_train.mat', 'r')                            	# Import .mat train file (read-only)
train_data = trainFile.get('imdb/train/data')                               	# Provide .mat filepath to data
train_labels = trainFile.get('imdb/train/labels')                           	# Provide .mat filepath to labels

testFile = h5py.File('data/imdb_test.mat', 'r')                              	# Import .mat test file (read-only)
test_data = testFile.get('imdb/test/data')                                  	# Provide .mat filepath to data
test_labels = testFile.get('imdb/test/labels')                              	# Provide .mat filepath to labels

classes = trainFile.get('imdb/meta/classes')                                	# Provide .mat filepath to classes

#-----------------------------------------------------------
# Prepare Data: Numpy-Array, Labels: One-Hot-Vector
#-----------------------------------------------------------

train_data = numpy.array(train_data, dtype = numpy.float32)                 	# Converting train data into numpy array format
train_L = numpy.array(train_labels, dtype = numpy.uint8)                   	 	# Converting train labels into numpy array format
test_data = numpy.array(test_data, dtype = numpy.float32)                   	# Converting test data into numpy array format
test_L = numpy.array(test_labels, dtype = numpy.uint8)                      	# Converting test labels into numpy array format

print('Size of Training Data: ', len(train_data))                           	# Confirm size of training data
print('Size of Testing Data: ', len(test_data))                             	# Confirm size of testing data
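
One caveat worth checking here: MATLAB stores arrays in column-major order, so a 32x32x3xN image array saved from MATLAB typically comes back through h5py with its dimensions reversed, i.e. roughly (N, 3, 32, 32). If your TensorFlow model expects the usual NHWC layout (N, 32, 32, 3), you may need to reorder the axes after loading. A hedged sketch, assuming the reversed layout just described:

#-----------------------------------------------------------
# Optional: reorder axes from (N, 3, 32, 32) to (N, 32, 32, 3)
#-----------------------------------------------------------

print('Loaded train_data shape: ', train_data.shape)                           # Verify the layout before reordering
if train_data.ndim == 4 and train_data.shape[1] == 3:                          # Channels arrived on axis 1 (reversed MATLAB order)
    train_data = numpy.transpose(train_data, (0, 3, 2, 1))                     # Undo the reversal: samples, rows, columns, channels
    test_data = numpy.transpose(test_data, (0, 3, 2, 1))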


Once our training and testing data is loaded into Python and converted to NumPy arrays, we need to convert the labels into one-hot vectors, the label format TensorFlow's classification losses expect. The oneHotVector() helper used below is defined immediately after the call.


#-----------------------------------------------------------
# Convert Labels into One Hot Vectors
#-----------------------------------------------------------

ohv_train = oneHotVector(train_L, len(classes))                             	# Acquire One-Hot-Vector training labels
ohv_test = oneHotVector(test_L, len(classes))                               	# Acquire One-Hot-Vector testing labels


#----------------------------------------------------------
# Convert Labels to One-Hot-Vectors
#----------------------------------------------------------

def oneHotVector(labels, num_classes):                                          # Build one-hot vectors (all zeros except a 1 at the label's class index)
    num_labels = labels.shape[0]                                                # Acquire number of labels
    index_offset = numpy.arange(num_labels) * num_classes                       # Create a row offset into the flattened output for each label
    ohv = numpy.zeros((num_labels, num_classes))                                # Initialize one-hot vectors
    ohv.flat[index_offset + labels.ravel()] = 1                                 # Set the 1 at each label's class position
    return ohv
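
As a quick sanity check (optional, and not part of the original walkthrough), the same one-hot matrix can be built in a single line with numpy.eye and compared against the helper's output:

#-----------------------------------------------------------
# Optional: verify the one-hot encoding with numpy.eye
#-----------------------------------------------------------

ohv_check = numpy.eye(len(classes))[train_L.ravel()]                            # Row i of the identity matrix is the one-hot vector for class i
print('One-hot vectors match: ', numpy.array_equal(ohv_check, ohv_train))       # Should print True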

Lastly, once our training and testing labels are in one-hot format, we need a way to keep all of the training and testing data together. To do this, we will implement a pseudo-structure using a simple class.


#----------------------------------------------------------
# Class Data: Represent Dataset Information
#----------------------------------------------------------

class Data:                                                                     # Class to represent dataset information
    def __init__(self):                                                         # Class initializer
        self.num_examples = 0                                                   # Initialize data sample size
        self.index_in_epoch, self.epochs_completed = 0, 0                       # Initialize batch position, number of epochs completed
        self.train_data, self.train_L, self.ohv_train = [], [], []              # Initialize training information
        self.test_data, self.test_L, self.ohv_test = [], [], []                 # Initialize testing information


#-----------------------------------------------------------
# Create Pseudo-Structure(Class) to unite Dataset
#-----------------------------------------------------------

dataset = Data()                                                            	# Create object of class
dataset.num_examples = len(train_data)                                      	# Record number of training samples
dataset.train_data = train_data                                             	# Assign Train data
dataset.train_L = train_L                                                   	# Assign Train labels
dataset.ohv_train = ohv_train                                               	# Assign Train One-Hot-Vectors
dataset.test_data = test_data                                               	# Assign Test data
dataset.test_L = test_L                                                     	# Assign Test labels
dataset.ohv_test = ohv_test                                                 	# Assign Test One-Hot-Vectors
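
The index_in_epoch and epochs_completed fields exist so that mini-batches can later be served to TensorFlow during training. The original script does not include a batching routine, but a minimal sketch of one (a hypothetical helper, with shuffling left out for brevity) could look like this:

#-----------------------------------------------------------
# Hypothetical: serve sequential mini-batches from the dataset
#-----------------------------------------------------------

def next_batch(data, batch_size):                                               # data: a Data object filled in as above
    start = data.index_in_epoch                                                 # Position of the current batch within the epoch
    end = start + batch_size
    if end > len(data.train_data):                                              # Epoch exhausted: wrap around and count it
        data.epochs_completed += 1
        start, end = 0, batch_size
    data.index_in_epoch = end                                                   # Remember where the next batch starts
    return data.train_data[start:end], data.ohv_train[start:end]                # Images and their matching one-hot labels

# Example: batch_images, batch_labels = next_batch(dataset, 128)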


Here is the complete Python file, showing how to import the .mat data files from MATLAB and prepare them for TensorFlow.



#----------------------------------------------------------
# Import Helpful Libraries
#----------------------------------------------------------

import h5py                                                                     # Library for reading HDF5-based (v7.3 and later) .mat files
import numpy                                                                    # Library for matrix (array) representation and computation
import tensorflow                                                               # TensorFlow library (used once the data is prepared)

#----------------------------------------------------------
# Class Data: Represent Dataset Information
#----------------------------------------------------------

class Data:                                                                     # Class to represent dataset information
    def __init__(self):                                                         # Class initializer
        self.num_examples = 0                                                   # Initialize data sample size
        self.index_in_epoch, self.epochs_completed = 0, 0                       # Initialize batch position, number of epochs completed
        self.train_data, self.train_L, self.ohv_train = [], [], []              # Initialize training information
        self.test_data, self.test_L, self.ohv_test = [], [], []                 # Initialize testing information


#----------------------------------------------------------
# Convert Labels to One-Hot-Vectors
#----------------------------------------------------------

def oneHotVector(labels, num_classes):                                          # Build one-hot vectors (all zeros except a 1 at the label's class index)

    num_labels = labels.shape[0]                                                # Acquire number of labels
    index_offset = numpy.arange(num_labels) * num_classes                       # Create a row offset into the flattened output for each label
    ohv = numpy.zeros((num_labels, num_classes))                                # Initialize one-hot vectors
    ohv.flat[index_offset + labels.ravel()] = 1                                 # Set the 1 at each label's class position
    return ohv

#----------------------------------------------------------
# Load Dataset From .Mat File (Matlab v7.3 or higher)
#----------------------------------------------------------

def load_data():

	  #----------------------------------------------------------
	  # Import (.mat) Train & Test Data, Labels
	  #----------------------------------------------------------

	  trainFile = h5py.File('data/imdb_train.mat', 'r')                         # Import .mat train file (read-only)
	  train_data = trainFile.get('imdb/train/data')                             # Provide .mat filepath to data
	  train_labels = trainFile.get('imdb/train/labels')                         # Provide .mat filepath to labels

	  testFile = h5py.File('data/imdb_test.mat', 'r')                           # Import .mat test file (read-only)
	  test_data = testFile.get('imdb/test/data')                                # Provide .mat filepath to data
	  test_labels = testFile.get('imdb/test/labels')                            # Provide .mat filepath to labels

	  classes = trainFile.get('imdb/meta/classes')                              # Provide .mat filepath to classes

	  #-----------------------------------------------------------
	  # Prepare Data: Numpy-Array, Labels: One-Hot-Vector
	  #-----------------------------------------------------------

	  train_data = numpy.array(train_data, dtype = numpy.float32)               # Converting train data into numpy array format
	  train_L = numpy.array(train_labels, dtype = numpy.uint8)                  # Converting train labels into numpy array format
	  test_data = numpy.array(test_data, dtype = numpy.float32)                 # Converting test data into numpy array format
	  test_L = numpy.array(test_labels, dtype = numpy.uint8)                    # Converting test labels into numpy array format

	  print('Size of Training Data: ', len(train_data))                         # Confirm size of training data
	  print('Size of Testing Data: ', len(test_data))                           # Confirm size of testing data

	  #-----------------------------------------------------------
	  # Convert Labels into One Hot Vectors
	  #-----------------------------------------------------------

	  ohv_train = oneHotVector(train_L, len(classes))                           # Acquire One-Hot-Vector training labels
	  ohv_test = oneHotVector(test_L, len(classes))                             # Acquire One-Hot-Vector testing labels

	  #print(train_L[0:20])                                                     # First print training labels to confirm
	  #print(ohv_train[0:20])                                                   # Print One-Hot-Vectors to compare
	  #pauseMe = input('Press Enter to Continue...')                            # Useful pause statement

	  #-----------------------------------------------------------
	  # Create Pseudo-Structure(Class) to unite Dataset
	  #-----------------------------------------------------------

	  dataset = Data()                                                          # Create object of class
	  dataset.num_examples = len(train_data)                                    # Record number of training samples
	  dataset.train_data = train_data                                           # Assign Train data
	  dataset.train_L = train_L                                                 # Assign Train labels
	  dataset.ohv_train = ohv_train                                             # Assign Train One-Hot-Vectors
	  dataset.test_data = test_data                                             # Assign Test data
	  dataset.test_L = test_L                                                   # Assign Test labels
	  dataset.ohv_test = ohv_test                                               # Assign Test One-Hot-Vectors

	  return dataset                                                            # Pseudo-Structure
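
The file above defines load_data() but never calls it. As a short usage example (the shapes in the comments are what you would expect for CIFAR-10, before any axis reordering), the script could end with:

#-----------------------------------------------------------
# Example usage of load_data()
#-----------------------------------------------------------

if __name__ == '__main__':
    dataset = load_data()                                                       # Build the pseudo-structure from the .mat files
    print('Train images shape: ', dataset.train_data.shape)                     # e.g. (50000, 3, 32, 32)
    print('Train one-hot labels shape: ', dataset.ohv_train.shape)              # e.g. (50000, 10)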