Datasets are represented as matrices, with rows corresponding to samples and columns corresponding to variables. Three types of tasks can be distinguished, each with its own data format:
Data encoding in classification tasks differs somewhat from the encoding used by many other packages. For example, a number of neural-network libraries customarily encode the membership of an image in one of the classes as an NClasses-dimensional vector, setting the relevant coordinate to 1 and the rest to zero. This difference should be taken into account when ALGLIB is used together with other packages.
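When data has to move between ALGLIB and a library that expects the NClasses-dimensional label vectors described above, the labels must be converted. The sketch below assumes that the ALGLIB-side format stores the class as a single integer index in the range 0..NClasses-1 appended as the last column of each row. That layout is an assumption made for illustration, and all function names are hypothetical helpers, not part of any library.

```python
def one_hot_to_index(vec):
    """Return the position of the single 1 in a 1-of-N label vector."""
    if sorted(vec) != [0] * (len(vec) - 1) + [1]:
        raise ValueError("expected exactly one 1 and zeros elsewhere")
    return vec.index(1)


def index_to_one_hot(index, n_classes):
    """Build a 1-of-N label vector from an integer class index."""
    if not 0 <= index < n_classes:
        raise ValueError("class index out of range")
    return [1 if k == index else 0 for k in range(n_classes)]


def rows_to_index_format(features, one_hot_labels):
    """Append the integer class index as the last column of each row
    (assumed layout, see the note above)."""
    return [list(x) + [one_hot_to_index(y)]
            for x, y in zip(features, one_hot_labels)]


rows = rows_to_index_format([[0.5, 1.5], [2.0, 3.0]], [[0, 1, 0], [1, 0, 0]])
print(rows)
```

Keeping both directions of the conversion in one place makes it harder for the two conventions to drift apart when several packages share one dataset.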
Nominal variables can be encoded in several ways: as an integer, or using either "1-of-N" or "1-of-N-1" encoding. Most ALGLIB algorithms accept any encoding and do not ask which variables are real and which are nominal, or which encoding was used. The algorithm simply takes a real matrix and operates on it without examining its inner structure. This makes the package flexible and easy to use. However, different encodings affect the speed and quality of the algorithms differently. The following conventions are recommended:
The ALGLIB package can still be used if the data are encoded without regard to these recommendations. However, some models can take advantage of the recommended encoding to run faster and to improve the quality of the results.
At the time this article was written, none of the algorithms could operate on datasets with missing values. However, this restriction can be worked around by adding one more value, which marks the omission, to the set of values the variable can take. For example, if the accepted values of the variable are "0", "1", "2", then the non-missing values can be encoded as "1 0 0 0", "0 1 0 0", "0 0 1 0", and the missing value as "0 0 0 1". A real variable can be encoded analogously: a non-missing value x is encoded as "x 0", and a missing value as "0 1".
Another option is to replace the missing value with the average (or most probable) value of this variable.
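Both workarounds for missing values can be sketched in a few lines. The helpers below are hypothetical and use None to mark a missing value, which is an assumption of this example rather than an ALGLIB convention. The first two functions implement the indicator encodings from the example above, and the third implements replacement by the average.

```python
def encode_nominal(value, n_values):
    """1-of-(N+1) encoding: the extra last slot marks a missing value."""
    code = [0] * (n_values + 1)
    code[n_values if value is None else value] = 1
    return code


def encode_real(value):
    """Encode x as [x, 0] and a missing value as [0, 1]."""
    return [0.0, 1.0] if value is None else [float(value), 0.0]


def impute_mean(column):
    """Replace each missing entry with the mean of the observed entries."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]


print(encode_nominal(1, 3))      # value "1" out of "0", "1", "2"
print(encode_nominal(None, 3))   # missing value
print(encode_real(None))
print(impute_mean([1.0, None, 3.0]))
```

The indicator encoding adds a column but lets the model learn whether the fact of omission is itself informative. Mean replacement keeps the matrix width unchanged and discards that information.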
Many data analysis subroutines have an Info output parameter, which holds the return code of the subroutine. A positive value means normal completion; a negative value indicates an error. The subroutines in this section solve similar problems, so their error codes form a uniform system:
There are two basic views commonly held in statistics on what the solution to a classification problem should look like. The first view is that every object must be assigned to one and only one class. For example, in email classification the classes "spam" and "non-spam" can be distinguished. The classification may involve some uncertainty (an email can be somewhat similar to spam), but only the final decision, spam or non-spam, is returned.
The second approach is to return a vector of posterior probabilities, that is, a vector whose components are the probabilities that the object belongs to each class. The algorithm makes no decision about how to classify an email. It only reports the probability that a particular email is spam and the probability that it is not. The decision based on this information is left to the user.
The second approach is more flexible than the first one, and more reasonable: the classification algorithm cannot know the user's priorities. In some cases it is necessary to minimize the error made in one of the classes, e.g., the misclassification of a legitimate email as spam. An email is then classified as spam only if the probability that it is NOT spam is very small (e.g., less than 0.05%). In other cases all classes are equally important, and the class with the maximum posterior probability can simply be chosen. Therefore, the output of every classification algorithm in the ALGLIB package is a posterior probability vector rather than the class to which the object is assigned.
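The two decision rules mentioned above can be applied by the user to the same posterior probability vector. The sketch below is illustrative: the function names are hypothetical, and the ordering of the vector as [P(not spam), P(spam)] is an assumption of this example.

```python
def most_probable_class(posterior):
    """Pick the class with the largest posterior probability."""
    return max(range(len(posterior)), key=lambda k: posterior[k])


def is_spam(posterior, max_not_spam=0.0005):
    """Cautious rule: flag as spam only if P(not spam) is below the
    threshold (0.0005 corresponds to the 0.05% mentioned in the text).
    Assumes posterior = [P(not spam), P(spam)]."""
    return posterior[0] < max_not_spam


email = [0.30, 0.70]                 # fairly likely spam, but not certain
print(most_probable_class(email))    # equal-priority rule picks class 1
print(is_spam(email))                # cautious rule keeps the email
print(is_spam([0.0001, 0.9999]))     # near-certain spam is flagged
```

The same posterior vector yields different decisions under the two rules, which is why the choice of rule is left to the caller.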
After the model is built, its error on a test (or training) set needs to be estimated. For regression, three measures of error can be used: the root-mean-square error, the average error, and the average relative error (the latter is computed only over records with a nonzero value of the dependent variable). These three measures are well known and need no further discussion.
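For reference, the three regression measures can be written out as follows. This is a minimal sketch with a hypothetical function name. It takes "average error" to mean the mean absolute error, which is an assumption about the intended definition, and it skips records with a zero dependent variable when computing the relative error, as described above.

```python
import math


def regression_errors(predicted, actual):
    """Return (RMS error, average error, average relative error)."""
    n = len(actual)
    diffs = [p - a for p, a in zip(predicted, actual)]
    rms = math.sqrt(sum(d * d for d in diffs) / n)
    avg = sum(abs(d) for d in diffs) / n
    # Relative error is defined only where the true value is nonzero.
    rel = [abs(p - a) / abs(a) for p, a in zip(predicted, actual) if a != 0]
    avg_rel = sum(rel) / len(rel) if rel else 0.0
    return rms, avg, avg_rel


rms, avg, avg_rel = regression_errors([2.0, 3.0, 4.0, 6.0],
                                      [0.0, 1.0, 2.0, 4.0])
print(rms, avg, avg_rel)
```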
For a classification problem, five measures of error can be used. The first and best known is the classification error (the number or percentage of incorrectly classified cases). The second, equally well known, is cross-entropy. The ALGLIB package uses the average cross-entropy per record, measured in bits (base-2 logarithm). Using the average cross-entropy (instead of the total cross-entropy) makes estimates obtained on different test sets comparable.
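The first two classification measures can be sketched as follows. The function names are hypothetical. The example assumes that labels are integer class indices, that the predicted class is the one with the largest posterior probability, and that the probability assigned to the true class is strictly positive (otherwise the logarithm is undefined).

```python
import math


def classification_error(posteriors, labels):
    """Fraction of records whose most probable class is not the true one."""
    wrong = sum(
        1 for p, y in zip(posteriors, labels)
        if max(range(len(p)), key=p.__getitem__) != y
    )
    return wrong / len(labels)


def avg_cross_entropy_bits(posteriors, labels):
    """Average of -log2 P(true class) per record.
    Requires P(true class) > 0 for every record."""
    total = -sum(math.log2(p[y]) for p, y in zip(posteriors, labels))
    return total / len(labels)


posteriors = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 0]
print(classification_error(posteriors, labels))
print(avg_cross_entropy_bits(posteriors, labels))
```

Dividing by the number of records is what makes the value comparable between test sets of different sizes.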
The remaining three measures are again the root-mean-square error, the average error, and the average relative error. Unlike in the regression task, however, here they characterize the error in the posterior probability vector, that is, how much the probability vector computed by the classification algorithm differs from the vector derived from the test set (whose components are 0 or 1, depending on the class the object belongs to). The meaning of the root-mean-square error and the average error is straightforward: each is the error in approximating the posterior probabilities, averaged over all probabilities. The average relative error is the average error in approximating the probability that an object is classified correctly (for binary tasks it coincides with the average error).
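One reading of these three definitions is sketched below. The function name is hypothetical and the exact formulas are an interpretation of the text, not a confirmed description of the library's implementation. The target vector has a 1 at the true class and 0 elsewhere, and the relative error is computed only at the true class, where the target is nonzero.

```python
import math


def probability_errors(posteriors, labels):
    """Return (RMS error, average error, average relative error) of the
    posterior probabilities against 0/1 target vectors."""
    sq = ab = rel = 0.0
    count = 0
    for p, y in zip(posteriors, labels):
        for k, pk in enumerate(p):
            target = 1.0 if k == y else 0.0
            sq += (pk - target) ** 2
            ab += abs(pk - target)
            count += 1
        # The target at the true class is 1, so the relative error there
        # is simply |p[y] - 1|.
        rel += abs(p[y] - 1.0)
    return math.sqrt(sq / count), ab / count, rel / len(labels)


rms, avg, avg_rel = probability_errors([[0.75, 0.25], [0.5, 0.5]], [0, 1])
print(rms, avg, avg_rel)
```

In this two-class example the average error and the average relative error come out equal, which matches the remark above about binary tasks.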
This article is intended for personal use only.