====== libmsvm documentation ======

Under construction, may contain inaccurate or erroneous information.

===== Introduction =====

This package contains several extensions to the original libsvm package. Two of them motivated the development of this package:

  * Multiple SVM (MSVM): this extension was developed to deal with highly imbalanced data. Although a classical SVM can handle some imbalance between classes by giving higher weights to the minority class, ensemble learning is an alternative solution that performs well for highly and moderately imbalanced classes [1].
  * Factorized SVM or MSVM (FSVM or FMSVM): this extension was developed to speed up the simultaneous classification of a large number of concepts [2].

Several other extensions have been made, either for speed or for ease of use.

===== 1. Non-MSVM extensions =====

Libmsvm includes extensions to the original libsvm software written by Chih-Chung Chang and Chih-Jen Lin, which is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm . These extensions were included to ease the MSVM implementation, but they can also be useful with the regular SVM library. The svm-train and svm-predict programs are compatible with the original versions but include a number of additional features. The svm library and API are not fully compatible, as some changes have been made to the original data structures, but they provide the same functionality as well as additional features.

The new features for both the high-level programs and the library include:

  * Dense vector representation. The principle is similar to the implementation proposed by Ming-Fang Weng in the libsvm-dense package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_dense_data . Our implementation is slightly different, however: both the sparse and dense representations can be managed in a single executable program. The mode can be selected dynamically at execution time according to the observed density of the vectors in the training or testing data.
  * Weights for data instances. We have integrated this functionality by including the modifications to the original libsvm code added by Ming-Wei Chang, Hsuan-Tien Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu in their libsvm-weights package available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances
  * Block processing for the predict functions, combined with BLAS routines. This significantly speeds up prediction. Some BLAS routines are also used for training but yield only modest speed-ups. Block processing is only used with the dense representation.
  * Additional kernel functions: Laplacian (L1) RBF and Chi-square RBF.

==== 1.1. svm-train and svm-predict Usage ====

See the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]] for the original usage. The changes are as follows.

svm-train and svm-predict have two additional flags:

  -D : forces dense representation
  -S : forces sparse representation

By default, the program automatically selects the representation mode according to the observed density of the vectors in the training or testing data (dense if at least half of the vector components are non-zero, sparse otherwise).

svm-train has the same additional command line parameter as libsvm-weights for specifying a weight file:

  -W weight_file : file containing the weights for each instance

Two additional kernel functions are defined in svm-train:

  5 -- radial basis function: exp(-gamma*L1_distance(u,v))
  6 -- radial basis function: exp(-gamma*Chi_square_distance(u,v))

==== 1.2. Library Usage ====

See the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]] for the original usage. The changes are as follows.
''struct svm_problem'', which describes the problem, has been changed from:

  struct svm_problem
  {
          int l;
          double *y;
          struct svm_node **x;
  };

to:

  struct svm_problem
  {
          int l;
          double *y;
          union svm_vector *x;
          double *W; /* instance weights, NULL if unused */
  };

with:

  union svm_vector
  {
          double *values;         /* if dense */
          struct svm_node *nodes; /* if sparse */
  };

When the dense representation is used, the dimensionality of the vectors is stored in a single global variable that has to be initialized using the new ''void svm_set_dense_rep(int dim)'' function.

The purpose of this change is to hide everything specific to the dense or sparse representation from the higher levels. It also makes block processing easier, as ''union svm_vector'' can hold either a single data instance vector or a block of them, contiguous in memory.

Similarly, in ''struct svm_model'', the reference to the support vectors has been changed from:

  struct svm_node **SV; /* SVs (SV[l]) */

to:

  union svm_vector *SV;

The block representation is compatible with the non-block one, but not necessarily the reverse. It is compatible because ''prob%%->%%x[i].values[j]'' always works for accessing the jth component of the ith vector (the same holds for the model and its SVs). The difference is in the way the data is allocated. In block mode, a single array of double of size ''prob%%->%%l*dim'' is allocated for all vectors at once, and ''prob%%->%%x[i].values = prob%%->%%x[0].values + i*dim''. In non-block mode, an array of double of size ''dim'' is allocated for each vector and there is no relation between the ''prob%%->%%x[i].values'' pointers. The block/non-block choice has to be made when creating or loading the problems or models and when freeing them. All other functions work the same on both versions, except those requiring the block structure, e.g. ''svm_predict_block()''.

===== 2. MSVM extensions =====

The libmsvm package includes two additional extensions to libsvm-plus:

  * the use of multiple learners to better deal with imbalanced data sets (original MSVM);
  * the possibility of considering several target concepts at once; in this case, the problem is defined by a single file containing all the data samples and annotations (factorized SVM or MSVM).

The objective of merging all the binary classification problems into a single file is to avoid duplicating sample vectors when annotations for different concepts are made on the same samples. While training still has to be done separately for each target concept, prediction can be done at once, and more efficiently, for the full set.

Several extensions to the libsvm data and model formats have been made in order to support these added functionalities.

Several restrictions come with these added functionalities; the main one is that only the SVM-C binary classification mode is supported (neither multi-class nor regression is compatible with the MSVM extensions).

==== 2.1. Terminology ====

The extension to several concepts corresponds to what is classically referred to as "multi-label" classification. It mainly differs from the "multi-class" classification of the original libsvm in that the target categories need not be mutually exclusive or complementary. We consider here that there are n independent binary classification problems instead of a single n-ary classification problem.

In order to avoid confusion, we will refer to the different classification targets as "concepts" or "categories", and the term "label" will be used only for the two binary classes when considering any one of the targets. Such labels may only take values within {-1, 0, +1}: "-1" and "+1" correspond respectively to the negative and positive instances of a target concept, and "0" means that the sample should be ignored for the training of this concept.
This is different from the three-class problem in the original libsvm: although the formats are compatible in the case of a single target concept, their interpretation is different.

==== 2.2. Data file formats ====

The current version works with several formats for training and testing data files.

=== 2.2.1. Original libsvm data file format ===

As defined in the original libsvm [[https://github.com/cjlin1/libsvm/blob/master/README|README]].

=== 2.2.2. Multi-label libsvm data file format ===

As used for the libsvm [[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html|data for multi-label classification]].

=== 2.2.3. Dense multi-label libmsvm data file format ===

The standard libsvm data file format has been extended to allow labels for more than one concept on each line (''%%n >= 1%%''):

  <label1>,<label2>,...,<labeln> <index1>:<value1> <index2>:<value2> ...
  .
  .
  .

Apart from the possibility of defining labels for several concepts in a single file, the format is the same as that of libsvm: each line corresponds to an individual training sample and is terminated by a ''\n'' character.

Thus, for learning or testing, an additional parameter ''-CAT category_index'' with ''%%0 <= category_index < n%%'' (default is 0) is needed.
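
As an illustration of how a category index selects one label on a line, here is a minimal sketch in C. It assumes that each line starts with n comma-separated labels followed by the usual libsvm ''index:value'' pairs, and that labels are written as integers in {-1, 0, +1}. The function name ''read_category_label'' is hypothetical and is not part of libmsvm; the actual parser in the package may behave differently.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of libmsvm): read the label of one
 * category from the comma-separated label field that starts a line.
 * Assumes integer labels in {-1, 0, +1}.
 * Returns 0 on success and stores the label in *label;
 * returns -1 if the field is malformed or has too few labels.
 */
int read_category_label(const char *line, int category_index, int *label)
{
    const char *p = line;
    int i;

    if (line == NULL || label == NULL || category_index < 0)
        return -1;

    for (i = 0; ; i++) {
        char *end;
        long v;

        /* strtol() would silently skip blanks, so reject them first */
        if (isspace((unsigned char)*p))
            return -1;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p || errno != 0 || v < -1 || v > 1)
            return -1;      /* not a label in {-1, 0, +1} */

        if (i == category_index) {
            /* a label ends with ',', a blank, or the end of the line */
            if (*end != ',' && *end != '\0' &&
                !isspace((unsigned char)*end))
                return -1;
            *label = (int)v;
            return 0;
        }

        if (*end != ',')
            return -1;      /* fewer labels than category_index + 1 */
        p = end + 1;
    }
}
```

For example, on the line ''+1,0,-1 1:0.5 3:1.2'' the call ''read_category_label(line, 1, &label)'' returns 0 and sets ''label'' to 0, which means the sample is ignored when training the second concept. A line with a single label is handled the same way with category index 0, which matches the default mentioned above.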