Decision forest

This page contains a brief description of the RDF classification and regression algorithm. Prior to reading this page, it is necessary that you look through the paper on the general principles of data analysis methods. It contains important information which, to avoid duplication (as it is of great significance for each algorithm in this section), is removed to a separate page.


    1 RDF (Random Decision Forest) Algorithm
    2 Algorithm Discussion
    3 Working with decision forests
    4 Tuning RDF Algorithm
    5 Training Set Format
    6 Nominal Variable Encoding
    7 Downloads section

RDF (Random Decision Forest) Algorithm

The RDF algorithm is a modification of the original Random Forest algorithm designed by Leo Breiman and Adele Cutler. Two ideas are in combination with each other in this algorithm: these are the use of a decision tree committee getting the result by voting, and the idea of training process randomization. The algorithm is briefly described below: /text:>

  1. Let us assume that the training set has a size N, and the number of independent variables is equal to M. Let us add three input parameters: a ratio r (0 ≤ r ≤ 1), a number of attributes m ≤ M, and a number of trees NTrees ≥ 1.
  2. Using primary training set, let us generate a random sample sized as rN (without repetitions). Let the training set elements that failed to get into the sample be used later on for estimating the generalization error.
  3. Based on the generated sample, let us grow a decision tree. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set. The tree is fully grown and is never pruned.
  4. The procedure is reiterated NTrees times. The trees grown unite to form a committee deciding by voting.

Algorithm Discussion

Advantages of the algorithm:

At the same time, the algorithm's disadvantages may also be noted:

Working with decision forests

Operations on the decision forest are performed in the following order:

  1. Selection of a tuning parameters (see below).
  2. Forest construction
  3. Using the forest built (data processing, serialization, etc).

Tuning RDF Algorithm

The algorithm contains three parameters that need to be adjusted: the ratio r - size of the part of the training set which will be used for the construction of individual trees; the number of trees NTrees; as well as the number NFeatures indicating the number of variables used to grow individual trees. These parameters are more fully detailed below.

The ratio r, in the range of 0 to 1, impacts the algorithm's tolerance to the noise in the training set. The weak point of the original Breiman algorithm is its inability to offer full compensation for the individual trees' tendency to overfit.

Note #1
Nonregularized decision tree can accurately remember the training set. The procedure of sampling N records with repetitions used in the original algorithm results in approximately 0.632 elements of a complete training set getting into the sample (which is equal to the value r=0.632). Thus, for any element of the training set there are roughly 63% of individual trees which remember it and will gain the majority in the course of voting. If there is noise in the training set, the Breiman algorithm will repeat all errors in the training set with a high degree of accuracy, having failed to find regular structure bihend the noise.

There are different solutions to this problem: either individual trees can be regularized (as it is suggested in the paper referred to above), or noise can be leveled down, by putting variety of individual trees under control through the ratio r (as implemented in the ALGLIB). The recommended values of r range from 0.66 (a low noise level) up to 0.05 (a very high noise level). Selection is made on the basis of the relationship between the error in the training set and the generalization error (calculated by means of a test set or by out-of-bag estimation): if the ratio is much smaller than one, the value r shall be decreased, and the model should be reconstructed.

The other algorithm's parameters are much easier to set. The number of trees NTrees is recommended to be made at a level of 50 to 100, and the number NFeatures is chosen automatically and is somewhere about the half of the total number of variables.

Training Set Format

The training set format is described in the paper that is recommended at the top of the page. That paper also deals with such problems as missing values and nominal variable encoding. It should be noted that the dataset format depends on which problem - regression or classification - the network solves.

Nominal Variable Encoding

When a RDF algorithm is used, it is strongly recommended that one should read 'Nominal Variable Encoding' section and accurately comply with the encoding recommended. The algorithm is optimized just for this data representation scheme, although it can be used with any encoding (but less efficiently)

This article is licensed for personal use only.

Download ALGLIB for C++ / C# / ...

ALGLIB Project offers you two editions of ALGLIB:

ALGLIB Free Edition:
delivered for free
offers full set of numerical functionality
extensive algorithmic optimizations
no low level optimizations
non-commercial license

ALGLIB Commercial Edition:
flexible pricing
offers full set of numerical functionality
extensive algorithmic optimizations
high performance (SMP, SIMD)
commercial license with support plan

Links to download sections for Free and Commercial editions can be found below:

ALGLIB 3.16.0 for C++

C++ library.
Delivered with sources.
Monolithic design.
Extreme portability.

ALGLIB 3.16.0 for C#

C# library with native kernels.
Delivered with sources.
VB.NET and IronPython wrappers.
Extreme portability.

ALGLIB 3.16.0 for Delphi

Delphi wrapper around C core.
Delivered as precompiled binary.
Compatible with FreePascal.

ALGLIB 3.16.0 for CPython

CPython wrapper around C core.
Delivered as precompiled binary.