This page contains a brief description of the RDF classification and regression algorithm. Prior to reading this page, it is necessary that you look through the paper on the general principles of data analysis methods. It contains important information which, to avoid duplication (as it is of great significance for each algorithm in this section), is removed to a separate page.
The RDF algorithm is a modification of the original Random Forest algorithm designed by Leo Breiman and Adele Cutler. Two ideas are in combination with each other in this algorithm: these are the use of a decision tree committee getting the result by voting, and the idea of training process randomization. The algorithm is briefly described below: /text:>
Advantages of the algorithm:
At the same time, the algorithm's disadvantages may also be noted:
Operations on the decision forest are performed in the following order:
The algorithm contains three parameters that need to be adjusted: the ratio r - size of the part of the training set which will be used for the construction of individual trees; the number of trees NTrees; as well as the number NFeatures indicating the number of variables used to grow individual trees. These parameters are more fully detailed below.
The ratio r, in the range of 0 to 1, impacts the algorithm's tolerance to the noise in the training set. The weak point of the original Breiman algorithm is its inability to offer full compensation for the individual trees' tendency to overfit.
Nonregularized decision tree can accurately remember the training set. The procedure of sampling N records with repetitions used in the original algorithm results in approximately 0.632 elements of a complete training set getting into the sample (which is equal to the value r=0.632). Thus, for any element of the training set there are roughly 63% of individual trees which remember it and will gain the majority in the course of voting. If there is noise in the training set, the Breiman algorithm will repeat all errors in the training set with a high degree of accuracy, having failed to find regular structure bihend the noise.
There are different solutions to this problem: either individual trees can be regularized (as it is suggested in the paper referred to above), or noise can be leveled down, by putting variety of individual trees under control through the ratio r (as implemented in the ALGLIB). The recommended values of r range from 0.66 (a low noise level) up to 0.05 (a very high noise level). Selection is made on the basis of the relationship between the error in the training set and the generalization error (calculated by means of a test set or by out-of-bag estimation): if the ratio is much smaller than one, the value r shall be decreased, and the model should be reconstructed.
The other algorithm's parameters are much easier to set. The number of trees NTrees is recommended to be made at a level of 50 to 100, and the number NFeatures is chosen automatically and is somewhere about the half of the total number of variables.
The training set format is described in the paper that is recommended at the top of the page. That paper also deals with such problems as missing values and nominal variable encoding. It should be noted that the dataset format depends on which problem - regression or classification - the network solves.
When a RDF algorithm is used, it is strongly recommended that one should read 'Nominal Variable Encoding' section and accurately comply with the encoding recommended. The algorithm is optimized just for this data representation scheme, although it can be used with any encoding (but less efficiently)
This article is intended for personal use only.
C++ source. MPFR/GMP is used.
GMP source is available from gmplib.org. MPFR source is available from www.mpfr.org.
Python version (CPython and IronPython are supported).