====== libmsvm documentation ======
Under construction, may contain inaccurate or erroneous information.
===== Introduction =====
This package contains several extensions to the original libsvm package. Two of them motivated the development of this package:
* Multiple SVM (MSVM): this extension has been developed for dealing with highly imbalanced data. Though classical SVM can handle some imbalance between classes by giving higher weights to the minority class, ensemble learning is an alternative solution which performs well for highly and moderately imbalanced classes [1].
* Factorized SVM or MSVM (FSVM or FMSVM): this extension has been developed for speeding up the simultaneous classification of a large number of concepts [2].
Several other extensions have been made, either for speed or for ease of use.
===== 1. Non-MSVM extensions =====
Libmsvm includes extensions to the original libsvm software
written by Chih-Chung Chang and Chih-Jen Lin, which is available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm . These extensions were
included to ease the MSVM implementation, but they can also be
useful when using the regular SVM library.
The svm-train and svm-predict programs are compatible with the
original version but include a number of additional features.
The svm library and API are not fully compatible, as some changes
have been made to the original data structures, but they provide
all of the original functionality as well as additional features.
The new features for both the high level program and the library
include:
* Dense vector representation. The principle is similar to the implementation proposed by Ming-Fang Weng in the libsvm-dense package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_dense_data . Our implementation is slightly different, however, and both the sparse and dense representations can be handled within a single executable program. The representation mode can be selected dynamically at execution time according to the observed density of the vectors in the training or testing data.
* Weights for data instances. We have integrated this functionality by including the modifications to the original libsvm code made by Ming-Wei Chang, Hsuan-Tien Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu in their libsvm-weights package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances .
* Block processing for the predict functions, combined with BLAS routines. This significantly speeds up the predict procedures. Some BLAS routines are also used for training but yield only modest speed-ups. Block processing is only used with the dense representation.
* Additional kernel functions: Laplacian (L1) RBF and Chi square RBF.
==== 1.1. svm-train and svm-predict Usage ====
See the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]] for the original usage.
The changes are as follows.
svm-train and svm-predict have two additional flags:
-D : forces dense representation
-S : forces sparse representation
By default the program automatically selects the representation
mode according to the observed density of the vectors in training
or testing data (dense if at least half of the vector components
are non-zero, sparse otherwise).
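For illustration, the selection rule amounts to the following check (a sketch only; the function name and the way the counts are gathered are not part of the actual library code):

  /* Sketch of the automatic representation choice: given the number of
     non-zero components observed while reading the data and the total
     number of components (number of vectors times dimensionality), the
     dense representation is used when at least half are non-zero. */
  static int use_dense_rep(long nonzero, long total)
  {
      return 2 * nonzero >= total;   /* 1: dense, 0: sparse */
  }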
svm-train has the same additional command line parameter as in
libsvm-weights for specifying a weight file:
-W weight_file : file containing the weights for each instance
Two additional kernel functions in svm-train are defined as:
5 -- radial basis function: exp(-gamma*L1_distance(u,v))
6 -- radial basis function: exp(-gamma*Chi_square_distance(u,v))
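As a point of reference, the two distances plugged into these kernels could be computed as below for dense vectors. This is only a sketch of the formulas above: the function names are illustrative, and the chi-square distance shown (sum of (u_i - v_i)^2 / (u_i + v_i) over components with a non-zero sum) is an assumed definition that should be checked against the libmsvm sources.

  #include <math.h>

  /* Kernel 5: Laplacian (L1) RBF, exp(-gamma * sum_i |u_i - v_i|) */
  double laplacian_rbf(const double *u, const double *v, int dim, double gamma)
  {
      double d = 0.0;
      for (int i = 0; i < dim; i++)
          d += fabs(u[i] - v[i]);
      return exp(-gamma * d);
  }

  /* Kernel 6: Chi-square RBF, exp(-gamma * sum_i (u_i - v_i)^2 / (u_i + v_i)),
     skipping components where u_i + v_i == 0 (assumed convention) */
  double chi_square_rbf(const double *u, const double *v, int dim, double gamma)
  {
      double d = 0.0;
      for (int i = 0; i < dim; i++) {
          double s = u[i] + v[i];
          if (s != 0.0)
              d += (u[i] - v[i]) * (u[i] - v[i]) / s;
      }
      return exp(-gamma * d);
  }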
==== 1.2. Library Usage ====
See the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]] for the original usage.
The changes are as follows.
''struct svm_problem'' describing the problem has been changed from:
  struct svm_problem
  {
      int l;
      double *y;
      struct svm_node **x;
  };
to:
  struct svm_problem
  {
      int l;
      double *y;
      union svm_vector *x;
      double *W; /* instance weights, NULL if unused */
  };
with:
  union svm_vector
  {
      double *values; /* if dense */
      struct svm_node *nodes; /* if sparse */
  };
When the dense representation is used, the dimensionality of the
vectors is stored in a single global variable that has to be
initialized using the new ''void svm_set_dense_rep(int dim)'' function.
The purpose of this change is to hide everything specific to the
dense or sparse representations from the higher levels. It also
makes block processing easier, as a ''union svm_vector'' can refer
either to a single data instance vector or to a block of them,
contiguous in memory.
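As a rough illustration of how these pieces fit together, the sketch below fills a dense, non-block problem with uniform instance weights. The field names come from the structures above; the helper function itself and the exact allocation conventions expected by the library are assumptions to be checked against the actual headers.

  #include <stdlib.h>
  /* assumes the libmsvm header declaring svm_problem, svm_vector and
     svm_set_dense_rep() has been included */

  /* Illustrative helper: builds a dense, non-block problem with l
     instances of dimensionality dim and unit instance weights. */
  struct svm_problem make_dense_problem(int l, int dim, const double *labels)
  {
      struct svm_problem prob;

      svm_set_dense_rep(dim);              /* declare the dense dimensionality */

      prob.l = l;
      prob.y = malloc(l * sizeof(double));
      prob.x = malloc(l * sizeof(union svm_vector));
      prob.W = malloc(l * sizeof(double)); /* instance weights */

      for (int i = 0; i < l; i++) {
          prob.y[i] = labels[i];
          prob.W[i] = 1.0;                 /* uniform weights */
          prob.x[i].values = calloc(dim, sizeof(double)); /* one array per vector */
          /* ... fill prob.x[i].values[0..dim-1] with the feature values ... */
      }
      return prob;
  }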
Similarly, in ''struct svm_model'', the reference to support vectors
has been changed from:
struct svm_node **SV; /* SVs (SV[l]) */
to:
union svm_vector *SV;
The block representation is compatible with the non-block one but
not necessarily the reverse. It is compatible in the sense that
''prob%%->%%x[i].values[j]'' always works for accessing the jth component
of the ith vector (the same holds for the model and its SV array).
The difference lies in the way the data is allocated. In block mode,
a single array of doubles of size ''prob%%->%%l*dim'' is allocated for all
vectors at once and ''prob%%->%%x[i].values = prob%%->%%x[0].values + i*dim''.
In non-block mode, an array of doubles of size dim is allocated for
each vector and there is no relation between the ''prob%%->%%x[i].values''
pointers. The block/non-block choice has to be made when creating,
loading or freeing the problems and models. All other functions work
the same on both versions, except those requiring the block structure,
e.g. ''svm_predict_block()''.
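A minimal sketch of the block layout described above is given below, using the same field names as before; the helper functions are only illustrative and the library may provide its own routines for allocating and freeing block problems.

  #include <stdlib.h>
  /* assumes the libmsvm header declaring svm_problem and svm_vector */

  /* Block mode: one contiguous array of prob->l * dim doubles holds all
     vectors, and each prob->x[i].values points inside it, so that
     prob->x[i].values[j] works exactly as in the non-block mode. */
  void alloc_block_vectors(struct svm_problem *prob, int dim)
  {
      double *block = calloc((size_t)prob->l * dim, sizeof(double));

      prob->x = malloc(prob->l * sizeof(union svm_vector));
      for (int i = 0; i < prob->l; i++)
          prob->x[i].values = block + (size_t)i * dim; /* x[i] = x[0] + i*dim */
  }

  /* Freeing must match the allocation mode: release the single block
     once (through x[0].values), then the array of union entries. */
  void free_block_vectors(struct svm_problem *prob)
  {
      free(prob->x[0].values);
      free(prob->x);
  }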
===== 2. MSVM extensions =====
The libmsvm package includes two additional extensions to libsvm-plus:
* the use of multiple learners for better dealing with imbalanced data sets (original MSVM);
* the possibility to consider several target concepts at once; in this case, the problem is defined with a single file containing all the data samples and annotations (factorized SVM or MSVM).
The objective of merging all the binary classification problems into
a single file is to avoid duplicating the sample vectors when several
concepts are annotated on the same samples. While the training
still has to be done separately for each target concept, the prediction
can be done at once and more efficiently for the full set.
Several extensions to the libsvm data and model formats have been made
in order to support these added functionalities.
Several restrictions come with these added functionalities; the main one is
that only the SVM-C binary classification mode is supported (neither
multi-class nor regression is compatible with the MSVM extensions).
==== 2.1. Terminology ====
The extension with several concepts corresponds to what is classically
referred to as "multi-label" classification. It mainly differs from the
"multi-class" classification of the original libsvm in that the target
categories are neither mutually exclusive nor complementary. We consider
here that there are n independent binary classification problems
instead of a single n-ary classification problem.
In order to avoid confusion, we will refer to the different
classification targets as "concepts" or "categories" and the term
"label" will be used only for the two binary classes when considering
any of the targets. Such labels may have values only within {-1, 0, +1},
"-1" and "+1" corresponding respectively to the negative and positive
instances of a target concept and "0" meaning that the sample should be
ignored when training for this concept. This is different from a
three-class problem in the original libsvm: although the formats
are compatible in the case of a single target concept, their
interpretation is different.
==== 2.2. Data file formats ====
The current version works with several formats for training and
testing data files.
=== 2.2.1. Original libsvm data file format ===
As defined in the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]].
=== 2.2.2. Multi-label libsvm data file format ===
As used for the libsvm [[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html|data for multi-label classification]].
=== 2.2.3. Dense multi-label libmsvm data file format ===
The standard libsvm data file format has been extended so that each line can carry labels for more than one concept (n >= 1):
<label_1>,<label_2>,...,<label_n> <index_1>:<value_1> <index_2>:<value_2> ...
.
.
.
Apart from the possibility of defining labels for several
concepts in a single file, the format is the same as that of libsvm:
each line corresponds to an individual training sample and is ended
by a '\n' character.
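For illustration, a file annotating three samples for n = 3 concepts could look as follows (the feature values are arbitrary); the second sample is a positive instance of the first concept, is ignored for the second one, and is a negative instance of the third:

  +1,-1,-1 1:0.43 3:0.12 7:1.5
  +1,0,-1 2:0.27 5:0.81
  -1,-1,+1 1:0.05 4:0.33 6:0.90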
Thus, for training or testing, an additional parameter is needed to select the target concept:
-CAT category_index with 0 <= category_index < n (default is 0)