libmsvm documentation

Under construction, may contain inaccurate or erroneous information.

Introduction

This package contains several extensions to the original libsvm package. Two of them motivated the development of this package:

  • Multiple SVM (MSVM): this extension has been developed for dealing with highly imbalanced data. Though classical SVM can handle some imbalance between classes by giving higher weights to the minority class, ensemble learning is an alternative solution which performs well for highly and moderately imbalanced classes [1].
  • Factorized SVM or MSVM (FSVM or FMSVM): this extension has been developed for speeding up the simultaneous classification of a large number of concepts [2].

Several other extensions have been made, either for speed or for ease of use.

1. Non-MSVM extensions

Libmsvm includes extensions to the original libsvm software written by Chih-Chung Chang and Chih-Jen Lin, which is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm . These extensions were included to ease the MSVM implementation, but they can also be useful with the regular SVM library.

The svm-train and svm-predict programs are compatible with the original version but include a number of additional features.

The svm library and API are not fully compatible, as some changes have been made to the original data structures, but they provide the same functionality as well as additional features.

The new features for both the high level program and the library include:

  • Dense vector representation. The principle is similar to the implementation proposed by Ming-Fang Weng in the libsvm-dense package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_dense_data . Our implementation is slightly different, however, and both the sparse and dense representations can be managed in a single executable program. The mode can be chosen dynamically at execution time according to the observed density of the vectors in the training or testing data.
  • Block processing for the predict functions, combined with BLAS routines. This significantly speeds up the predict procedures. Some BLAS routines are also used for training but yield more modest speed-ups. Block processing is only used with the dense representation.
  • Additional kernel functions: Laplacian (L1) RBF and Chi square RBF.

1.1. svm-train and svm-predict Usage

See the original libsvm README for the original usage. The changes are as follows.

svm-train and svm-predict have two additional flags:

-D : forces dense representation
-S : forces sparse representation

By default the program automatically selects the representation mode according to the observed density of the vectors in training or testing data (dense if at least half of the vector components are non-zero, sparse otherwise).

svm-train has the same additional command line parameter as in libsvm-weights for specifying a weight file:

-W weight_file : file containing the weights for each instance

Two additional kernel functions in svm-train are defined as:

      5 -- radial basis function: exp(-gamma*L1_distance(u,v))
      6 -- radial basis function: exp(-gamma*Chi_square_distance(u,v))

1.2. Library Usage

See the original libsvm README for the original usage. The changes are as follows.

struct svm_problem describing the problem has been changed from:

struct svm_problem
{
      int l;
      double *y;
      struct svm_node **x;
};

to:

struct svm_problem
{
      int l;
      double *y;
      union svm_vector *x;
      double *W; /* instance weights, NULL if unused */
};

with:

union svm_vector
{
      double *values;          /* if dense */
      struct svm_node *nodes;  /* if sparse */
}; 

When the dense representation is used, the dimensionality of the vectors is stored in a single global variable that has to be initialized using the new void svm_set_dense_rep(int dim) function.

The purpose of this change is to hide everything specific to the dense or sparse representations at the higher levels. It also makes block processing easier, as a union svm_vector can contain either a single data instance vector or a block of them, contiguous in memory.

Similarly, in struct svm_model, the reference to support vectors has been changed from:

struct svm_node **SV;		/* SVs (SV[l]) */

to:

union svm_vector *SV;

The block representation is compatible with the non-block one, but not necessarily the reverse. It is compatible in the sense that prob->x[i].values[j] always works for accessing the jth component of the ith vector (the same holds for the model and its SVs). The difference lies in how the data is allocated.

In block mode, a single array of double of size prob->l*dim is allocated for all vectors at once, and prob->x[i].values = prob->x[0].values + i*dim. In non-block mode, an array of double of size dim is allocated for each vector and there is no relation between the prob->x[i].values pointers. The block/non-block choice has to be made when creating or loading the problems or models, and when freeing them. All other functions work the same on both versions, except those requiring the block structure, e.g. svm_predict_block().

2. MSVM extensions

The libmsvm package includes two additional extensions to libsvm-plus:

  • the use of multiple learners for better dealing with imbalanced data sets (original MSVM);
  • the possibility to consider several target concepts at once; in this case, the problem is defined with a single file containing all the data samples and annotations (factorized SVM or MSVM).

The objective of merging all the binary classification problems into a single file is to avoid duplicating sample vectors when different concepts are annotated on the same samples. While training still has to be done separately for each target concept, prediction can be done at once, and more efficiently, for the full set of concepts.

Several extensions to the libsvm data and model formats have been made in order to support these added functionalities.

Several restrictions come with these added functionalities; the main one is that only the C-SVC binary classification mode is supported (neither multi-class nor regression is compatible with the MSVM extensions).

2.1. Terminology

The extension to several concepts corresponds to what is classically referred to as “multi-label” classification. It differs from the “multi-class” classification of the original libsvm mainly in that the target categories are neither exclusive nor complementary. We consider here n independent binary classification problems instead of a single n-ary classification problem.

In order to avoid confusion, we will refer to the different classification targets as “concepts” or “categories”, and the term “label” will be used only for the two binary classes when considering any of the targets. Such labels may only have values within {-1, 0, +1}: “-1” and “+1” correspond respectively to the negative and positive instances of a target concept, and “0” means that the sample should be ignored for the training of this concept. This is different from a three-class problem in the original libsvm: although the formats are compatible in the case of a single target concept, their interpretations differ.

2.2. Data file formats

The current version works with several formats for training and testing data files.

2.2.1. Original libsvm data file format

As defined in the original libsvm README.

2.2.2. Multi-label libsvm data file format

As used for the libsvm multi-label classification data sets.

2.2.3. Dense multi-label libmsvm data file format

The standard libsvm data file format has been extended to allow more than one label (n >= 1) on each line:

 <label1>,<label2>,...,<labeln>  <index1>:<value1> <index2>:<value2> ...
  .
  .
  .

Apart from the possibility of defining labels for several concepts in a single file, the format is the same as in libsvm: each line corresponds to an individual training sample and ends with a '\n' character.

For learning/testing, an additional parameter -CAT category_index, with 0 <= category_index < n (default 0), is then needed.

<label> is an integer indicating the class label. In MSVM, two annotation types are accepted for the multi-label format:

  • dense representation: the labels of all categories are given for each sample. A label can only take a value in {-1, 0, +1}, indicating respectively a negative, skipped or positive sample.
  • sparse representation: only the annotated categories are given for each sample. In this format, each line contains the annotated categories of a sample, given by category index (-CAT for negatives, +CAT for positives). If no negatives are provided, the system takes all non-positive samples as negatives.

With this format, there is a single training file and a single testing file for the n categories. The system automatically detects the format type of the labels file.

2. Additionally, the input problem can be split into two files containing respectively the training labels (<label1> <label2> … <labeln>) and the training vectors (<index1>:<value1> <index2>:<value2> …).

Both files have the same length (the same number of lines, corresponding to the total number of samples) and the same element format as above. This format is selected by giving the path of the training labels file (i.e. -LABF training_labels_file).

Separating the annotation and vector files is useful when the same training and testing samples can be represented by different feature vectors (e.g. color- or texture-based for still images).

3. The same as the previous one, but the vector file is binary. The number of samples is equal to the number of lines in the training labels file and the vector length is computed automatically. (TO TEST and FIX)

MSVM also allows training/predicting on a subsampled set of the input_file, by giving the path of an indices_file after the -SELF option.

The indices_file contains the indices to select from the given input_file. The file may have several lines; the subset to use is selected with -INDSEL line_num (default 0). The format is:

<label>; <index1> <index2> <index3> ... <indexm>
	   .
	   .
	   .
	   Each line of the indices_file may start with one label (followed by ';').
	   This is useful when running the program in parallel on different machines:
	   we provide the indices file and a line number (-INDSEL line_num) that
	   indicates the selection run. If a label is given on the line, it indicates
	   the category index (i.e. CAT) to learn.

Model file formats

The libmsvm package allows the use of two SVM model formats: the original libsvm format and the 'MSVM' format.

Original libsvm format

The original libsvm format starts with the header parameters (svm_type, kernel_type, gamma, nr_classes, total_sv, rho, etc.), then 'SV' followed by the support vectors that represent the SVM model.

	 <SVM model header>
	 SV
	 alpha1 <index1>:<value1> <index2>:<value2> ...
  	 alpha2 <index1>:<value1> <index2>:<value2> ...
	 .
	 .
	 .

MSVM format

The MSVM format is used when '-SVI' is given on the command line. It starts with the header parameters (svm_type, kernel_type, gamma, nr_categories, total_sv, rho, etc.), then 'SVI' followed by the indices of the found SVs, then 'alpha' followed by the weight of each SV index. The svm_type can only be msvm (svm_type m_svm).

          <MSVM model header>
	  SVI Input_file_name
	  <SV_index1> <SV_index2> <SV_index3> ...
	  alpha
	  <alpha_SV1> <alpha_SV2> <alpha_SV3> ...
	  .
	  .
	  .

msvm-train usage

msvm-train has the same usage as the original libsvm/libsvm-plus svm-train, with some additional options.

Usage:

msvm-train [options] training_set_file [model_file]

options:
	-s svm_type : set type of SVM (default 0)
		0 -- C-SVC		(multi-class classification)
		1 -- nu-SVC		(multi-class classification)
		2 -- one-class SVM
		3 -- epsilon-SVR	(regression)
		4 -- nu-SVR		(regression)
	-t kernel_type : set type of kernel function (default 2)
		0 -- linear: u'*v
		1 -- polynomial: (gamma*u'*v + coef0)^degree
		2 -- radial basis function: exp(-gamma*|u-v|^2)
		3 -- sigmoid: tanh(gamma*u'*v + coef0)
		4 -- precomputed kernel (kernel values in training_set_file)
		5 -- radial basis function: exp(-gamma*L1_distance(u,v))
		6 -- radial basis function: exp(-gamma*Chi_square_distance(u,v))
	-d degree : set degree in kernel function (default 3)
	-g gamma : set gamma in kernel function (default 1/num_features)
	-r coef0 : set coef0 in kernel function (default 0)
	-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
	-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
	-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
	-m cachesize : set cache memory size in MB (default 100)
	-e epsilon : set tolerance of termination criterion (default 0.001)
	-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
	-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
	-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
	-v n: n-fold cross validation mode
	-q : quiet mode (no outputs)
	-W weight_file: set weight file
	-D : a boolean parameter to force the use of dense representation
	-S : a boolean parameter to force the use of sparse representation
	-BIN : a boolean parameter to use binary files (default False)
	-MSVM : a boolean parameter to force the use of MSVM case (default False)
	-SVI : a boolean parameter to force the use of the SVI model format, i.e. writing the indices of the model's SVs instead of the complete vectors (default False)
	-SAVEM : a boolean parameter to force saving all the models in the MSVM case (default False)
	-MIR : maximum imbalance ratio (default 2.0)
	-MCF : majority class fraction (default 1.0)
	-MIRF : minority class fraction (default 1.0)	
	-NBL nbl: an integer to force the use of nbl learners in the MSVM case; when nbl = 0 the number of learners is computed automatically (default 0)
	-CAT cat: the index of the category to be learnt, starting from 0 (default 0)
	-SELF selFile: the path to the file of selected indices in the case of the subsampling approach (default NULL)
	-INDSEL indSel: the index of the learner to be run from the NBL learners, when the -SELF option is given (default 0)
	-LABF labelFile: the path to the file of labels, when labels are not included in the training_set_file (default NULL)

option -v randomly splits the data into n parts and calculates cross validation accuracy/mean squared error on them.

If the -S and -D options are both given or both absent (-S == -D), the system will choose the best representation (dense or sparse) for training.

msvm-predict usage

Usage:

msvm-predict [options] test_file model_file output_file

options:
	-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported
	-q : quiet mode (no outputs)
	-v block_size: number of samples to process at once (default 256)
	-D : a boolean parameter to force the use of dense representation
	-S : a boolean parameter to force the use of sparse representation
	-BIN : a boolean parameter to use binary files (default False)
	-SVI svi: the training file from which to get the SVs corresponding to the indices found in the model. This option is required when the model is in MSVM format
	-CAT cat: the index of the learnt concept, starting from 0. (default 0)
	-SELF selFile: the path to the file of selected indices in the case of testing a subset. (default NULL)
	-LABF labelFile: the path to the file of labels, when labels are not included in the test_file (default NULL)
	-NBV nbv : the number of examples in the binary test file (default 0)

If the -S and -D options are both given or both absent (-S == -D), the system will choose the best representation (dense or sparse) for prediction. -v indicates the block size; this parameter helps to control memory usage when processing large-scale datasets. model_file is the model file generated by msvm-train. test_file contains the test data you want to predict. msvm-predict will produce its output in output_file (a text file).

msvm-merge usage

Usage:

msvm-merge [options] input_model_file  output_model_file
options:
-SVI svi: the training file from which to get the SVs corresponding to the indices found in the model. This option is required when the models are in MSVM format
-DELAY delay: a parameter that makes the merging wait until all models are generated.
        If delay > 0, the process will keep waiting (checking all models, with a sleep(delay) between checks) until the models are generated (default 0)
-DELETE : a boolean parameter that forces the program to delete the individual models after fusion.

input_model_file is a text file containing the paths of all the models to be fused. All the fused models must have the same value of gamma and must be in the SVI format. The models to fuse can be one model per category or one model per learner. The final model can be used to predict n categories at once.

msvm-joblist usage

Usage:

msvm-joblist [MSVM options] [options] input_file_name

options:
	-modelf models_folder: the working directory where to save the models (default $PWD/)
	-log2c begin,end,step: set the range of c (default 0,0,1)
	-log2g begin,end,step: set the range of g (default 0,0,1)
	-log2Mir begin,end,step: set the range of MIR (default 1,1,1)
	-allcats : a boolean parameter to force generating the joblist for all the labeled categories (default False)
	-hval : a boolean parameter to force the use of h_values instead of gamma for the RBF kernel, with gamma = h_value/dm, where dm is the mean distance between the vectors in input_file_name (default False)
	-jobfile jobFile : the full path where to save the output joblist file
	-svmtrain pathname : set the msvm-train executable full path

If -log2c, -log2g or -log2Mir is not given, the system uses the corresponding default value as in msvm-train. The provided MSVM options are applied to all generated jobs and to all experimented categories. An msvm-merge run is needed after each experiment to generate a final model for that experiment.

Examples

time ./msvm-joblist -b 1 -t 2 -q -BIN -MSVM -SVI -D \
  -SELF /video/trecvid/sin12/2010a/tshots/select.txt \
  -LABF /video/trecvid/sin12/2010d/tshots/ann-msvm/2010e.ann \
  -log2Mir 1,1,1 \
  -log2g 1,1,1 \
  -hval -allcats \
  -modelf /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104 \
  -jobfile joblist_LIG_hg104.jbl \
  /video/trecvid/sin12/2012/tshots/LIG/hg104.bin

ls /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2/*.model > \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2_list.txt

time ./msvm-merge \
  -DELETE \
  -SVI /video/trecvid/sin12/2012/tshots/LIG/hg104.bin \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2_list.txt \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2.model

rm /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2_list.txt

time ./msvm-predict -b 1 -q -v 4096 -D -BIN \
  -SVI /video/trecvid/sin12/2012/tshots/LIG/hg104.bin \
  -SELF /video/trecvid/sin12/2010b/tshots/select.txt \
  -LABF /video/trecvid/sin12/2010d/tshots/ann-msvm/2010e.ann \
  /video/trecvid/sin12/2012/tshots/LIG/hg104.bin \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2.model \
  output_hg104_2010e.txt

documentation.txt · Last modified: 2015/07/24 15:02 by quenot
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International