Neural networks
This page describes the neural network source code. Before reading it, you should look through the paper on the general principles of data analysis methods. That paper contains important information that applies to every algorithm in this section and has been moved to a separate page to avoid duplication.
Neural networks in ALGLIB
Two principles form the basis of the neural network implementation in ALGLIB: (1) neural networks are treated as just one of the classification/regression algorithms, and (2) the NN API is kept reasonably simple. In line with the first principle, ALGLIB has no "special neural network section": neural networks take their place among the other data analysis algorithms. The second principle needs to be discussed in more detail.
Some neural network libraries offer the user the widest possible scope for tuning, down to setting individual activation functions or adding and deleting individual connections. In practice, however, such flexibility is often simply not needed: there are standard architectures that cannot be substantially improved by fine-tuning, and there are several "best" training methods. There is another reason not to give the user too varied a toolset: if a neural network needs such tuning, it presents no difficulty to the author of a program package, but the end user can often find himself at a deadlock. Overly complex fine-tuning is therefore unnecessary in an easy-to-use neural network package. In accordance with this principle, ALGLIB tries to resolve as many issues as possible automatically, leaving only the really significant decisions to the user.
Working with neural networks
A neural network in ALGLIB is represented by the MultiLayerPerceptron structure. Although this structure has public fields, do not access them directly; use ALGLIB subroutines to work with them. Operations with neural network models are performed in three stages:
- Architecture selection and initialization of the structure by the appropriate subroutine.
- Neural network training.
- Using the trained network (mapping inputs to outputs, serialization, etc.).
Available Architectures
The ALGLIB package supports neural networks without hidden layers, with one hidden layer, and with two hidden layers. "Shortcut" connections from the input layer directly to the output layer are not supported. Hidden layers use one of the standard sigmoid-like activation functions, but a wider choice is available for the output layer. The output layer can be linear (such networks are used in approximation tasks) or have a sigmoid-like activation function (outputs are bounded from above AND from below). Networks whose output activation function is bounded from above OR from below are also available. In the simplest case, this function tends to x as x tends to +∞ and tends exponentially to zero as x tends to -∞.
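The three kinds of output behavior described above can be illustrated with a short Python sketch. The specific functions are assumptions chosen for illustration: tanh stands in for a sigmoid-like function, and softplus, ln(1 + e^x), is one function that tends to x as x tends to +∞ and exponentially to zero as x tends to -∞. The functions ALGLIB actually uses are not stated on this page.

```python
import math

def linear(x):
    """Unbounded output, used for approximation (regression) tasks."""
    return x

def sigmoid_like(x):
    """Bounded from above AND from below; tanh is an assumed stand-in."""
    return math.tanh(x)

def bounded_below(x):
    """Bounded from below only; softplus is an assumed stand-in.

    Written in a numerically stable form: max(x, 0) + log1p(exp(-|x|)).
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Compare the three output types far to the left, at zero, and far to the right.
for x in (-20.0, 0.0, 20.0):
    print(x, linear(x), sigmoid_like(x), bounded_below(x))
```

For large positive x the last column is close to x itself, and for large negative x it is close to zero while staying positive.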
Neural networks with a linear output layer and SOFTMAX normalization of the outputs are a special case. They are used for classification tasks, where the network outputs must be nonnegative and must sum to exactly one, so that they can be interpreted as the probabilities that the input vector belongs to each of the classes (in the limit, the outputs of the trained network converge to these probabilities). Such a network always has at least two outputs (a restriction imposed by elementary logic).
This set of architectures, although minimalistic, is sufficient to solve most practical problems. One can concentrate on the problem itself (classification or approximation) without paying undue attention to secondary details (e.g., the choice of a specific hidden layer activation function usually has little effect on the result).
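A minimal Python sketch of SOFTMAX normalization follows. It only illustrates the mathematical operation on the linear outputs; the function name is hypothetical and this is not ALGLIB's code.

```python
import math

def softmax(z):
    """Map linear outputs z to nonnegative values that sum to one."""
    m = max(z)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Three linear outputs, read as scores for three classes.
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

The largest linear output receives the largest probability, and the ordering of the outputs is preserved.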
Training
Three algorithms are available for neural network training. The first is the L-BFGS algorithm (limited-memory BFGS), a quasi-Newton method with a fixed iteration cost of O(WCount·NPoints) and moderate memory requirements of O(WCount). This algorithm is ideally suited to large-scale problems and also copes well with problems of small and medium dimension. Training stops when either the step size becomes small (less than the WStep value passed to the subroutine) or the specified number of iterations (the MaxIts parameter) is exceeded.
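The two stopping criteria can be sketched in Python as follows. Plain gradient descent is used here as a stand-in optimizer, not L-BFGS; only the stopping logic mirrors the text. The function name, the `lr` parameter, and the toy error function are assumptions made for illustration.

```python
import math

def minimize(grad, w, wstep, maxits, lr=0.1):
    """Iterate until the step is shorter than wstep or maxits is reached.

    Plain gradient descent is a stand-in for the real optimizer; only
    the two stopping criteria follow the description in the text.
    """
    for it in range(1, maxits + 1):
        step = [-lr * g for g in grad(w)]
        w = [wi + si for wi, si in zip(w, step)]
        if math.sqrt(sum(s * s for s in step)) < wstep:
            return w, it, "small step"
    return w, maxits, "iteration limit"

# Toy error function E(w) = (w0 - 3)^2 + (w1 + 1)^2 and its gradient.
def grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w, its, reason = minimize(grad, [0.0, 0.0], wstep=0.01, maxits=1000)
print(w, its, reason)
```

With a generous iteration limit the run ends on the step-size criterion; with a very small limit (for example `maxits=5`) it ends on the iteration limit instead.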
Note #1
A value on the order of 0.01 is a reasonable choice for WStep. If the problem is very difficult to solve, it can be reduced to 0.0001, but 0.01 is usually sufficient.
Note #2
Many neural network packages use a sufficiently small value of the error function as a stopping criterion. The problem is that, when dealing with a real problem rather than an educational one, you do not know beforehand how well it can be solved. Some problems can be solved with a very low error, whilst for others a classification error of 26% is regarded as a good result. Therefore, there is no point in specifying "a sufficiently small error" as a stopping criterion. UNTIL you solve a problem, you do not know what value should be specified, whereas AFTER the problem is solved, there is no need to specify any stopping criterion.
The second algorithm is a modified Levenberg-Marquardt method that uses the exact Hessian of the error function (NOT a linearized approximation). For networks with up to several hundred weights, this algorithm is comparable with L-BFGS and is often faster. Its main advantage, however, is not so much its speed as the fact that it does not require any stopping criteria to be specified. The method almost always converges exactly to one of the minima of the function. Its disadvantages show on large-scale problems: a high iteration cost of O(NPoints·WCount²) and high memory requirements of O(WCount²).
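To get a feeling for how quickly the quadratic memory requirement grows, the following Python sketch counts the weights of a fully connected network and estimates the size of a dense WCount × WCount matrix. It assumes one bias per neuron and 8-byte values; the way ALGLIB itself counts weights may differ, and both function names are hypothetical.

```python
def weight_count(nin, nout, hidden=()):
    """Weights in a fully connected network, one bias per neuron (assumption)."""
    sizes = [nin, *hidden, nout]
    return sum((a + 1) * b for a, b in zip(sizes, sizes[1:]))

def hessian_megabytes(wcount, bytes_per_value=8):
    """Memory needed for a dense WCount x WCount matrix of doubles."""
    return wcount * wcount * bytes_per_value / 1e6

# 10 inputs, 2 outputs, and three different hidden layer configurations.
for hidden in [(5,), (20,), (50, 50)]:
    w = weight_count(nin=10, nout=2, hidden=hidden)
    print(hidden, w, round(hessian_megabytes(w), 3), "MB")
```

Under these assumptions, a network with a few hundred weights needs well under a megabyte for the matrix, while a few thousand weights already require tens of megabytes.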
The third algorithm is the early stopping method. This method is used for building neural network ensembles.
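The general idea of early stopping can be sketched in Python as follows: training continues while the error on a held-out validation set keeps improving, and the best weights seen so far are returned. The callbacks, the `patience` rule, and the toy run are assumptions for illustration; the page does not describe how ALGLIB implements the method.

```python
def train_with_early_stopping(step, validation_error, weights, max_epochs, patience=3):
    """Keep the weights with the lowest validation error seen so far.

    step(weights) performs one training epoch and returns new weights;
    validation_error(weights) is measured on data held out from training.
    Both callbacks and the patience rule are hypothetical.
    """
    best_w, best_err = list(weights), validation_error(weights)
    since_best = 0
    for _ in range(max_epochs):
        weights = step(weights)
        err = validation_error(weights)
        if err < best_err:
            best_w, best_err, since_best = list(weights), err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # validation error stopped improving
    return best_w, best_err

# Toy run: each "epoch" moves the single weight by +1; the validation
# error is lowest at w = 5 and grows again if training continues.
best_w, best_err = train_with_early_stopping(
    step=lambda w: [w[0] + 1.0],
    validation_error=lambda w: (w[0] - 5.0) ** 2,
    weights=[0.0],
    max_epochs=100,
)
print(best_w, best_err)
```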
Note #3
Neither traditional BackProp nor RProp is implemented in ALGLIB, because both have gone entirely and unconditionally out of use. QuickProp is not implemented because it is unlikely to prove better than L-BFGS.
Regularization is another important issue. The ALGLIB package uses Tikhonov regularization (also known as weight decay). A well-chosen regularization factor can improve the generalization error of the trained neural network and accelerate training.
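In its common textbook form, weight decay adds a penalty proportional to the sum of squared weights to the error function. The Python sketch below assumes the form E = E0 + (Decay/2)·Σw²; the exact scaling used by ALGLIB is not stated on this page, and the function names are hypothetical.

```python
def regularized_error(weights, data_error, decay):
    """Data error plus a weight decay penalty, decay/2 * sum(w^2)."""
    return data_error + 0.5 * decay * sum(w * w for w in weights)

def regularized_gradient(weights, data_gradient, decay):
    """The penalty adds decay * w to each gradient component."""
    return [g + decay * w for g, w in zip(data_gradient, weights)]

w = [2.0, -1.0]
print(regularized_error(w, data_error=0.5, decay=0.1))
print(regularized_gradient(w, data_gradient=[0.0, 0.0], decay=0.1))
```

Even where the data gradient is zero, the penalty term still pulls every weight toward zero, which is why large Decay values produce smoother, more strongly regularized networks.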
Note #4
If you do not know what Decay value to choose, experiment with values in the range from 0.001 (weak regularization) to 100 (very strong regularization). Search through the values starting with the minimum and increasing Decay by a factor of 3 to 10 at each step, while checking the network's generalization error by cross-validation or on a test set. Note that if the Decay value you specify is too small (less than 0.001), it is automatically increased to the permissible minimum: the ALGLIB package always applies at least minimal regularization.
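The search procedure from this note can be sketched in Python as follows. The function names are hypothetical, and `fake_error` is a stand-in for "train a network with this Decay and measure its error on a test set or by cross-validation".

```python
def decay_grid(lo=0.001, hi=100.0, factor=10.0):
    """Geometric grid of Decay values from lo up to hi."""
    values, k = [], 0
    # The small tolerance keeps hi in the grid despite rounding error.
    while lo * factor ** k <= hi * (1 + 1e-9):
        values.append(lo * factor ** k)
        k += 1
    return values

def pick_decay(candidates, generalization_error):
    """Return the candidate with the lowest estimated generalization error."""
    return min(candidates, key=generalization_error)

# Hypothetical stand-in for training plus evaluation on held-out data.
def fake_error(decay):
    return (decay - 1.0) ** 2

grid = decay_grid()
print(grid)
print(pick_decay(grid, fake_error))
```

A factor of 10 gives a coarse grid of six values; a factor of 3 gives a finer grid of eleven values at the cost of more training runs.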
Cross-validation
Cross-validation is a well-known method of estimating the generalization ability of complex models. The ALGLIB package contains two subroutines for k-fold cross-validation. The first uses the Levenberg-Marquardt method as the underlying training algorithm, whilst the other uses the L-BFGS algorithm.
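The k-fold procedure itself can be sketched in Python as follows. This is a generic illustration, not the ALGLIB subroutines: the function names are hypothetical, and `train_and_test` is an assumed callback that trains a fresh network on one part of the data and returns its error on the other.

```python
import random

def kfold_indices(npoints, k, seed=0):
    """Split indices 0..npoints-1 into k disjoint folds of near-equal size."""
    idx = list(range(npoints))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(npoints, k, train_and_test):
    """Average the test error over k train/test splits.

    train_and_test(train_idx, test_idx) is a hypothetical callback that
    trains a fresh network on train_idx and returns its error on test_idx.
    """
    folds = kfold_indices(npoints, k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(train_and_test(train_idx, test_idx))
    return sum(errors) / k

# Dummy callback: reports the test fraction instead of a real error.
cv = cross_validate(10, 5, lambda tr, te: len(te) / (len(tr) + len(te)))
print(cv)
```

Every point is used for testing exactly once and for training k-1 times, so the averaged error is an estimate of how the model behaves on data it has not seen.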
Training Set Format
The training set format is described in the paper that is recommended at the top of the page. That paper also deals with such problems as missing values and nominal variable encoding. It should be noted that the dataset format depends on which problem - regression or classification - the network solves.
Preprocessing
Data preprocessing (training set standardization) is implemented in ALGLIB to speed up convergence and improve the generalization error. The preprocessing is implicit: the data are automatically preprocessed before being passed to the neural network, and the network's output undergoes the backward transformation.
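The forward and backward transformations can be sketched in Python as follows. The sketch assumes standardization by mean and standard deviation; the statistics ALGLIB actually uses are not stated on this page, and all function names are hypothetical.

```python
import math

def fit_scaler(column):
    """Mean and standard deviation of one variable (sigma is 1 if constant)."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(var) or 1.0

def forward(v, mean, sigma):
    """Standardize a raw value before it enters the network."""
    return (v - mean) / sigma

def backward(z, mean, sigma):
    """Map a network output back to the original scale."""
    return z * sigma + mean

targets = [10.0, 20.0, 30.0]
mean, sigma = fit_scaler(targets)
scaled = [forward(t, mean, sigma) for t in targets]
restored = [backward(z, mean, sigma) for z in scaled]
print(scaled, restored)
```

Because both transformations are applied automatically, the user supplies and receives values in the original units and never sees the standardized ones.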
This article is intended for personal use only.