====== libmsvm documentation ======

Under construction, may contain inaccurate or erroneous information.

===== Introduction =====

This package contains several extensions to the original libsvm package. Two of them motivated the development of this package:
  * Multiple SVM (MSVM): this extension has been developed for dealing with highly imbalanced data. Though classical SVM can handle some imbalance between classes by giving higher weights to the minority class, ensemble learning is an alternative solution which performs well for highly and moderately imbalanced classes [1].
  * Factorized SVM or MSVM (FSVM or FMSVM): this extension has been developed for speeding up the simultaneous classification of a large number of concepts [2].
Several other extensions have been made either for speed or for ease of use.


===== 1. Non-MSVM extensions =====

Libmsvm includes extensions to the original libsvm software
written by Chih-Chung Chang and Chih-Jen Lin, which is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm . These extensions were
included to ease the MSVM implementation, but they can also be useful
with the regular SVM library.

The svm-train and svm-predict programs are compatible with the
original version but include a number of additional features.

The svm library and its API are not fully compatible with the original,
as some changes have been made to the original data structures, but they
provide the same functionality as well as additional features.

The new features for both the high level program and the library
include:

  * Dense vector representation. The principle is similar to the implementation proposed by Ming-Fang Weng in the libsvm-dense package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_dense_data . Our implementation is slightly different, however, and both the sparse and dense representations can be managed in a single executable program. The mode can be chosen dynamically at execution time according to the observed density of the vectors in the training or testing data.
  
  * Weights for data instance. We have integrated this functionality by including the modifications to the original libsvm code added by Ming-Wei Chang, Hsuan-Tien Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu in their libsvm-weights package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances
  
  * Block processing for the predict functions, combined with BLAS routines. This significantly speeds up the predict procedures. Some BLAS routines are also used for training but yield only modest speed-ups. Block processing is only used with the dense representation.
  
  * Additional kernel functions: Laplacian (L1) RBF and Chi square RBF.


==== 1.1. svm-train and svm-predict Usage ====

See the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]] for the original usage.
The changes are as follows.

svm-train and svm-predict have two additional flags:

<file>
-D : forces dense representation
-S : forces sparse representation
</file>

By default the program automatically selects the representation
mode according to the observed density of the vectors in training
or testing data (dense if at least half of the vector components
are non-zero, sparse otherwise).
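
The default selection rule can be sketched as follows. This is only an illustration of the "at least half non-zero" criterion: the function name ''choose_dense'' and the way the density is measured (non-zero entries over n*dim) are assumptions, not the actual libmsvm code.

```c
/* Sparse node as in the original libsvm: a vector is an array of
   (index, value) pairs terminated by index == -1. */
struct svm_node
{
        int index;
        double value;
};

/* Hypothetical helper (not part of libmsvm): returns 1 if the dense
   representation should be used, i.e. if at least half of the n*dim
   vector components are non-zero, and 0 otherwise. */
int choose_dense(struct svm_node *const *x, int n, int dim)
{
        long nonzero = 0;
        long total = (long)n * dim;
        int i;

        for (i = 0; i < n; i++) {
                const struct svm_node *p;
                for (p = x[i]; p->index != -1; p++)
                        if (p->value != 0.0)
                                nonzero++;
        }
        return total > 0 && 2 * nonzero >= total;
}
```

The -D and -S flags bypass this kind of test and force one representation.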

svm-train has the same additional command line parameter as in
libsvm-weights for specifying a weight file:

<file>
-W weight_file : file containing the weights for each instance
</file>

Two additional kernel functions in svm-train are defined as:

        5 -- radial basis function: exp(-gamma*L1_distance(u,v))
        6 -- radial basis function: exp(-gamma*Chi_square_distance(u,v))


==== 1.2. Library Usage ====

See the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]] for the original usage.
The changes are as follows.

''struct svm_problem'' describing the problem has been changed from:

  struct svm_problem
  {
        int l;
        double *y;
        struct svm_node **x;
  };

to:
	
  struct svm_problem
  {
        int l;
        double *y;
        union svm_vector *x;
        double *W; /* instance weights, NULL if unused */
  };

with:

  union svm_vector
  {
        double *values;          /* if dense */
        struct svm_node *nodes;  /* if sparse */
  }; 

When the dense representation is used, the dimensionality of the
vectors is stored in a single global variable that has to be
initialized using the new void ''svm_set_dense_rep(int dim)'' function.

The purpose of this change is to hide everything specific to the
dense or sparse representations from the higher levels. It also
makes block processing easier, as ''union svm_vector'' can
contain either a single data instance vector or a block of them,
contiguous in memory.

Similarly, in ''struct svm_model'', the reference to support vectors
has been changed from:

	struct svm_node **SV;		/* SVs (SV[l]) */

to:

	union svm_vector *SV;

The block representation is compatible with the non-block one but
not necessarily the reverse. It is compatible because ''prob%%->%%x[i].values[j]''
always works for accessing the jth component of the ith vector
(the same holds for the model and its SVs). The difference is in the way
the data is allocated. In block mode, a single array of doubles
of size ''prob%%->%%l*dim'' is allocated for all vectors at once and
''prob%%->%%x[i].values = prob%%->%%x[0].values+i*dim''. In non-block mode, an array of
doubles of size dim is allocated for each vector and there is no
relation between the ''prob%%->%%x[i].values'' pointers. The block/non-block choice
has to be made when creating or loading the problems or models, or
when freeing them. All other functions work the same on both versions,
except those requiring the block structure, e.g. ''svm_predict_block()''.
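
The block layout described above can be sketched as follows. The helper names ''alloc_dense_block'' and ''free_dense_block'' are hypothetical and only illustrate the memory layout; the library's own allocation and loading functions should be used in practice.

```c
#include <stdlib.h>

struct svm_node
{
        int index;
        double value;
};

union svm_vector
{
        double *values;          /* if dense */
        struct svm_node *nodes;  /* if sparse */
};

/* Hypothetical helper: allocates n dense vectors of dimension dim as one
   contiguous, zero-initialized block, so that
   x[i].values == x[0].values + i*dim. Returns NULL on failure. */
union svm_vector *alloc_dense_block(int n, int dim)
{
        union svm_vector *x;
        double *block;
        int i;

        if (n <= 0 || dim <= 0)
                return NULL;
        x = malloc((size_t)n * sizeof *x);
        block = calloc((size_t)n * (size_t)dim, sizeof *block);
        if (x == NULL || block == NULL) {
                free(x);
                free(block);
                return NULL;
        }
        for (i = 0; i < n; i++)
                x[i].values = block + (size_t)i * (size_t)dim;
        return x;
}

/* In block mode the whole data array is released through x[0].values. */
void free_dense_block(union svm_vector *x)
{
        if (x != NULL) {
                free(x[0].values);
                free(x);
        }
}
```

In non-block mode each ''x[i].values'' would instead be allocated and freed separately, which is why the choice has to be known when freeing.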

===== 2. MSVM extensions =====

The libmsvm package includes two additional extensions to libsvm-plus:

  * the use of multiple learners for better dealing with imbalanced data sets (original MSVM);

  * the possibility to consider several target concepts at once; in this case, the problem is defined with a single file containing all the data samples and annotations (factorized SVM or MSVM).
  
The objective of merging all the binary classification problems into
a single file is to avoid duplication of sample vectors when annotations
for different concepts are made on the same samples. While the training
still has to be done separately for each target concept, the prediction
can be done at once and more efficiently for the full set.

Several extensions to the libsvm data and model formats have been made
in order to support these added functionalities.

Several restrictions come with these added functionalities. The main one is
that only the SVM-C binary classification mode is supported (neither
multi-class nor regression is compatible with the MSVM extensions).

==== 2.1. Terminology ====

The extension to several concepts corresponds to what is classically
referred to as "multi-label" classification. It mainly differs from the
"multi-class" classification of the original libsvm in that
the target categories are neither exclusive nor complementary. We consider
here that there are n independent binary classification problems
instead of a single n-ary classification problem.

In order to avoid confusion, we will refer to the different
classification targets as "concepts" or "categories" and the term
"label" will be used only for the two binary classes when considering
any of the targets. Such labels may have values only within {-1, 0, +1},
"-1" and "+1" corresponding respectively to the negative and positive
instances of a target concept and "0" meaning that the sample should be
ignored for the training of this concept. This is different from the
three-class problem with the original libsvm and although the formats
are compatible in the case of a single target concept, their
interpretation is different.
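
The role of the "0" label can be illustrated with a small sketch. The function ''select_for_concept'' and the row-major label array are assumptions made for this example only; they are not part of the libmsvm API.

```c
/* Hypothetical helper: labels is a row-major n x ncat array with values
   in {-1, 0, +1}. The indices of the samples usable for training concept
   cat (label != 0) are written to sel, which must have room for n entries.
   Returns the number of selected samples and stores the number of
   positives and negatives in *npos and *nneg. */
int select_for_concept(const int *labels, int n, int ncat, int cat,
                       int *sel, int *npos, int *nneg)
{
        int i, count = 0;

        *npos = 0;
        *nneg = 0;
        for (i = 0; i < n; i++) {
                int y = labels[i * ncat + cat];
                if (y == 0)
                        continue;       /* sample ignored for this concept */
                if (y > 0)
                        (*npos)++;
                else
                        (*nneg)++;
                sel[count++] = i;
        }
        return count;
}
```

The same sample can therefore be a positive for one concept, a negative for another and ignored for a third.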

==== 2.2. Data file formats ====

The current version works with several formats for training and
testing data files.

=== 2.2.1. Original libsvm data file format ===

As defined in the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]].

=== 2.2.2. Multi-label libsvm data file format ===

As used for the libsvm [[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html|data for multi-label classification]].

=== 2.2.3. Dense multi-label libmsvm data file format ===


The standard libsvm data file format has been extended to allow labels for more than one concept on each line (n >= 1).

   <label1>,<label2>,...,<labeln>  <index1>:<value1> <index2>:<value2> ...
    .
    .
    .
	
Apart from the possibility of defining labels for several
concepts in a single file, the format is the same as that of libsvm:
each line corresponds to an individual training sample and ends
with a '\n' character.

Thus, for training or testing, an additional parameter
-CAT category_index with 0 <= category_index < n (default 0) is needed.

<label> is an integer indicating the class label. In MSVM, different types of 
annotations for the multi-label format are accepted:
  * dense representation: the labels of all the categories are given for each sample. A label can only take a value in {-1, 0, +1}, indicating respectively negative, skipped and positive samples.
  * sparse representation: only the annotated categories are given for each sample. In this format, each line lists the annotated categories of a sample, and the annotations are given by the category index (-CAT, +CAT). If the negatives are not provided, the system treats all non-positive samples as negatives.

With this format, a single training file and a single testing file are enough for n categories.
The system automatically detects the format type of the labels file.
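
For the dense label representation, extracting the label of the category selected with -CAT can be sketched as follows. The function ''label_for_category'' is hypothetical and only illustrates the comma-separated label field; it does not handle the sparse representation and is not the parser used by libmsvm.

```c
#include <stdlib.h>

/* Hypothetical helper: reads the label of category cat (0-based) from the
   comma-separated label field at the start of a data line.
   Returns 0 on success and stores the label in *label; returns -1 if the
   field has fewer than cat+1 labels or a label is not in {-1, 0, +1}. */
int label_for_category(const char *line, int cat, int *label)
{
        const char *p = line;
        int k;

        for (k = 0; ; k++) {
                char *end;
                long v = strtol(p, &end, 10);

                if (end == p || v < -1 || v > 1)
                        return -1;      /* missing or invalid label */
                if (k == cat) {
                        *label = (int)v;
                        return 0;
                }
                if (*end != ',')
                        return -1;      /* label field ended before cat */
                p = end + 1;
        }
}
```

For example, on the line ''-1,0,+1 1:0.5 3:2'' the labels of categories 0, 1 and 2 are -1, 0 and +1, and asking for category 3 is an error.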

2. Additionally, the input problem can be split into two files
containing respectively the training_labels (<label1> <label2>
... <labeln>) and the training_vectors (<index1>:<value1>
<index2>:<value2> ...).

Both files have the same length (the same number of lines, corresponding
to the total number of samples) and the same element format as the
previous one. This format is selected by giving the
path of the training_labels file (i.e. -LABF training_labels_file).

The possibility of separating the annotation and vector files
is useful if the training and testing samples can be represented
by different feature vectors (e.g. color- or texture-based for
still images).

3. The same as the previous one, but the vector file is binary.
The number of samples is equal to the number of lines in the
training_labels file.  The vector length is computed automatically.
(TO TEST and FIX).


MSVM can also train or predict on a subsampled set of the input_file,
by giving the path of an indices_file after the -SELF option.

The indices_file contains the indices to select from the given input_file.
The file may have many lines, and the subset to use is selected with
-INDSEL line_num (default 0).

<file>
<label>; <index1> <index2> <index3> ... <indexm>
   .
   .
   .
</file>

Each line of the indices_file may start with a label (followed by ';').
This is useful when running the program in parallel on different machines:
the indices file is provided together with a line number (-INDSEL line_num)
that indicates the selection to run. If a label is given on the line, it
indicates the category index (i.e. CAT) to learn.
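
Reading one line of an indices_file can be sketched as follows. The function ''parse_selection_line'' is hypothetical, and returning -1 as the category when no label is present is a convention chosen for this example only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses one line of the form
   "<label>; <index1> <index2> ..." where "<label>;" is optional.
   Stores the label in *cat (-1 if absent), writes at most max indices
   to indices and returns the number of indices read. */
int parse_selection_line(const char *line, int *cat, int *indices, int max)
{
        const char *p = line;
        const char *semi = strchr(line, ';');
        int n = 0;

        *cat = -1;
        if (semi != NULL) {
                *cat = (int)strtol(line, NULL, 10);
                p = semi + 1;
        }
        while (n < max) {
                char *end;
                long v = strtol(p, &end, 10);

                if (end == p)
                        break;          /* no more numbers on the line */
                indices[n++] = (int)v;
                p = end;
        }
        return n;
}
```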

==== Model file formats ====

The libmsvm package allows the use of two SVM model formats: the original libsvm format and the 'MSVM' format.

=== Original libsvm format ===

The original libsvm format starts with the header parameters (svm_type, kernel_type, gamma, nr_classes, total_sv, rho, etc.), then 'SV' followed by the SVs that represent the SVM model.

<file>
	 <SVM model header>
	 SV
	 alpha1 <index1>:<value1> <index2>:<value2> ...
  	 alpha2 <index1>:<value1> <index2>:<value2> ...
	 .
	 .
	 .
</file> 
	 
=== MSVM format ===

The MSVM format is used when '-SVI' is given on the command line.
It starts with the header parameters (svm_type, kernel_type, gamma, nr_categories, total_sv, rho, etc.),
then 'SVI' followed by the indices of the SVs found, then 'alpha' followed by the weight of each SV index.
In this format, svm_type can only be msvm (svm_type m_svm).

<file>
          <MSVM model header>
	  SVI Input_file_name
	  <SV_index1> <SV_index2> <SV_index3> ...
	  alpha
	  <alpha_SV1> <alpha_SV2> <alpha_SV3> ...
	  .
	  .
	  .
</file>

==== msvm-train usage ====

msvm-train has the same usage as the original libsvm/libsvm-plus svm-train, with some additional options.

Usage:
<file>
msvm-train [options] training_set_file [model_file]

options:
	-s svm_type : set type of SVM (default 0)
		0 -- C-SVC		(multi-class classification)
		1 -- nu-SVC		(multi-class classification)
		2 -- one-class SVM
		3 -- epsilon-SVR	(regression)
		4 -- nu-SVR		(regression)
	-t kernel_type : set type of kernel function (default 2)
		0 -- linear: u'*v
		1 -- polynomial: (gamma*u'*v + coef0)^degree
		2 -- radial basis function: exp(-gamma*|u-v|^2)
		3 -- sigmoid: tanh(gamma*u'*v + coef0)
		4 -- precomputed kernel (kernel values in training_set_file)
		5 -- RBF chi-square distance exp(-gamma*(u-v)^2/(u+v))
		6 -- RBF L1-distance exp(-gamma*|u-v|)
	-d degree : set degree in kernel function (default 3)
	-g gamma : set gamma in kernel function (default 1/num_features)
	-r coef0 : set coef0 in kernel function (default 0)
	-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
	-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
	-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
	-m cachesize : set cache memory size in MB (default 100)
	-e epsilon : set tolerance of termination criterion (default 0.001)
	-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
	-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
	-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
	-v n: n-fold cross validation mode
	-q : quiet mode (no outputs)
	-W weight_file: set weight file
	-D : a boolean parameter to force the use of dense representation
	-S : a boolean parameter to force the use of sparse representation
	-BIN : a boolean parameter to use binary files (default False)
	-MSVM : a boolean parameter to force the use of MSVM case (default False)
	-SVI : a boolean parameter to force the use of the SVI model format, i.e. writing the indices of the model's SVs instead of the complete vectors (default False)
	-SAVEM : a boolean parameter to force saving all the models in the MSVM case (default False)
	-MIR : maximum imbalance ratio (default 2.0)
	-MCF : majority class fraction (default 1.0)
	-MIRF : minority class fraction (default 1.0)
	-NBL nbl: an integer to force the use of nbl learners in the MSVM case; when nbl = 0, the number of learners is computed automatically (default 0)
	-CAT cat: the index of the category to be learnt, starting from 0 (default 0)
	-SELF selFile: the path to the file of selected indices, for the subsampling approach (default NULL)
	-INDSEL indSel: the index of the learner to be run among the NBL learners, when the -SELF option is given (default 0)
	-LABF labelFile: the path to the file of labels, when the labels are not included in the training_set_file (default NULL)
</file>

Option -v randomly splits the data into n parts and calculates the cross
validation accuracy/mean squared error on them.

If -S and -D are both set or both unset, the system chooses the most
suitable representation (dense or sparse) for training.

==== msvm-predict usage ====

Usage:
<file>
msvm-predict [options] test_file model_file output_file

options:
	-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported
	-q : quiet mode (no outputs)
	-v block_size: number of samples to process at once (default 256)
	-D : a boolean parameter to force the use of dense representation
	-S : a boolean parameter to force the use of sparse representation
	-BIN : a boolean parameter to use binary files (default False)
	-SVI svi: the training file from which to get the SVs corresponding to the indices found in the model; this option is given when using a model in MSVM format
	-CAT cat: the index of the learnt concept, starting from 0. (default 0)
	-SELF selFile: the path to the file of selected indices in the case of testing a subset. (default NULL)
	-LABF labelFile: the path to the file of labels, when the labels are not included in the test_file (default NULL)
	-NBV nbv : the number of examples in the binary test file (default 0)
</file>

If -S and -D are both set or both unset, the system chooses the most
suitable representation (dense or sparse) for prediction.
-v sets the block size; this parameter helps control memory usage
when indexing large-scale datasets.
model_file is the model file generated by msvm-train.
test_file is the test data to predict.
msvm-predict writes its output to output_file (a text file).

==== msvm-merge usage ====

Usage:
<file>
msvm-merge [options] input_model_file  output_model_file
options:
-SVI svi: the training file from which to get the SVs corresponding to the indices found in the model. This option is given when using a model in MSVM format
-DELAY delay: a parameter that makes the merge wait until all models are generated.
        If '-DELAY > 0', the process keeps waiting (checking all models with cpu sleep(delay)) until the models are generated. (default 0)
-DELETE boolean: a parameter that forces the program to delete the models after fusion.

*input_model_file is a text file containing the paths of all the models to be fused
*all the fused models must have the same value of gamma and be in the SVI format.
the models to fuse can be one model per category or one model per learner.
the final model can be used for predicting n categories at once.
</file>

==== msvm-joblist usage ====

Usage:
<file>
msvm-joblist [MSVM options] [options] input_file_name

options:
-modelf models_folder: the working directory where to save the models (default $PWD/)
-log2c begin,end,step: set the range of c (default 0,0,1)
-log2g begin,end,step: set the range of g (default 0,0,1)
-log2Mir begin,end,step: set the range of MIR (default 1,1,1)
-allcats : a boolean parameter to force generating the joblist for all the labeled categories. (default False)
-hval : a boolean parameter to force the use of h_values instead of gamma for the RBF kernel, with gamma = h_value/dm,
        where dm is the mean distance between the vectors in the input_file_name. (default False)
-jobfile jobFile : the full path where to save the output joblist file
-svmtrain pathname : set msvm-train executable full path
</file>

If -log2c, -log2g or -log2Mir is not given, the system uses the corresponding default value as in msvm-train.
The provided MSVM options are applied to all generated jobs and all experimented categories.
An msvm-merge is needed after each experiment to generate a final model for that experiment.

==== Examples ====

<file>
time ./msvm-joblist -b 1 -t 2 -q -BIN -MSVM -SVI -D \
 -SELF /video/trecvid/sin12/2010a/tshots/select.txt \
 -LABF /video/trecvid/sin12/2010d/tshots/ann-msvm/2010e.ann \
 -log2Mir 1,1,1 \
 -log2g 1,1,1 \
 -hval -allcats \
 -modelf /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104 \
 -jobfile joblist_LIG_hg104.jbl \
 /video/trecvid/sin12/2012/tshots/LIG/hg104.bin


ls /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2/*.model > \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2_list.txt

time ./msvm-merge \
  -DELETE \
  -SVI /video/trecvid/sin12/2012/tshots/LIG/hg104.bin \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2_list.txt \
  /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2.model

rm /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2_list.txt


time ./msvm-predict -b 1 -q -v 4096 -D -BIN \
 -SVI /video/trecvid/sin12/2012/tshots/LIG/hg104.bin \
 -SELF /video/trecvid/sin12/2010b/tshots/select.txt \
 -LABF /video/trecvid/sin12/2010d/tshots/ann-msvm/2010e.ann \
 /video/trecvid/sin12/2012/tshots/LIG/hg104.bin \
 /video/trecvid/sin12/2010a/tshots/ann_msvm_t2_2010e/LIG/hg104_mir2_mcf1_mirf1_c1_g2_t2.model \
 output_hg104_2010e.txt
</file>