Linear regression
Linear regression is one of the best-known regression methods. Its main advantages are simplicity and high speed. Its one notable disadvantage is that it is unsuitable for inherently nonlinear problems.
Linear Regression in ALGLIB
Operations with linear models are performed in two stages:
- Construct a linear model by calling one of the subroutines (which one depends on the problem being solved). The result is a LinearModel structure containing the fitted model.
- Work with the model (data processing, model copying/serialization, etc.).
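The two-stage pattern can be sketched in plain Python. This is only an illustration of the workflow, not ALGLIB's API: the names LinearModel, build_model and process are hypothetical stand-ins, and the fit uses the normal equations with Gauss-Jordan elimination for brevity rather than an SVD.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinearModel:
    """Hypothetical stand-in for a fitted model (not ALGLIB's structure)."""
    weights: List[float]   # one coefficient per variable
    intercept: float


def build_model(xy: List[List[float]]) -> LinearModel:
    """Stage 1: fit y = w.x + c. Each row of xy is [x_1, ..., x_M, y]."""
    m = len(xy[0]) - 1
    rows = [r[:m] + [1.0] for r in xy]          # append a constant column
    k = m + 1
    # Augmented normal equations: (A^T A | A^T y)
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * src[m] for r, src in zip(rows, xy))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes non-degenerate data; a zero pivot would raise ZeroDivisionError.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    coef = [aug[i][k] / aug[i][i] for i in range(k)]
    return LinearModel(weights=coef[:m], intercept=coef[m])


def process(model: LinearModel, x: List[float]) -> float:
    """Stage 2: apply the fitted model to one input vector."""
    return sum(w * v for w, v in zip(model.weights, x)) + model.intercept


# Stage 1: build the model from a small training set.
training = [[0.0, 0.0, 1.0], [1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [1.0, 1.0, 6.0]]
model = build_model(training)
# Stage 2: use the model on new data.
prediction = process(model, [2.0, 2.0])
```

Keeping the fitted model as a separate object is what makes the second stage (prediction, copying, serialization) independent of how the model was built.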
Algorithm Features and Improvements
The linear regression algorithm included in the ALGLIB package uses singular value decomposition (SVD). It improves on the classical approach in two ways:
- First, before doing any other work, the ALGLIB package standardizes the variables. This improves the condition number and avoids some of the problems that arise when variables are poorly scaled.
- Second, the ALGLIB package calculates a cross-validation estimate of the generalization error using a fast algorithm. This algorithm uses the Sherman-Morrison formula to update the SVD of the task matrix when one row is left out of the training set. The fast algorithm allows the generalization error to be calculated in O(N·M) time (where N is the size of the training set and M is the number of variables). For comparison, the time to solve a linear regression problem is O(N·M²), whereas straightforward "leave-one-out" cross-validation takes O(N²·M²) time.
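The leave-one-out shortcut can be illustrated on the simplest case, a single variable plus an intercept. The sketch below is not ALGLIB's implementation. It relies on a textbook identity that follows from the Sherman-Morrison formula: the leave-one-out residual of a point equals its ordinary residual divided by 1 - h, where h is the leverage of that point. The function names are hypothetical, and closed-form formulas take the place of the SVD.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def loo_errors_naive(xs, ys):
    """Refit the model N times, leaving one point out each time."""
    errors = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(ys[i] - (a * xs[i] + b))
    return errors


def loo_errors_fast(xs, ys):
    """Fit once, then rescale each residual by 1 / (1 - leverage)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    errors = []
    for x, y in zip(xs, ys):
        h = 1.0 / n + (x - mx) ** 2 / sxx     # leverage of this point
        errors.append((y - (a * x + b)) / (1.0 - h))
    return errors


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
naive = loo_errors_naive(xs, ys)
fast = loo_errors_fast(xs, ys)
# The two lists agree up to rounding error.
max_gap = max(abs(p - q) for p, q in zip(naive, fast))
```

The naive version performs N separate fits, while the fast version performs one fit followed by a constant amount of work per point, which is where the saving comes from.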
Two features of the fast cross-validation estimate should be mentioned. First, the "fast" formula applies only to non-degenerate problems: for a degenerate task, the dimension of the problem must first be reduced using the principal component method. This is done automatically when necessary, but it makes the algorithm roughly twice as slow. Second, there is a subtler kind of degeneracy: the problem is non-degenerate as a system of linear equations, yet the Sherman-Morrison formula is inapplicable to some vectors of the training set (for some vectors that look perfectly ordinary, it leads to division by zero). In theory, a training set may contain up to M+1 such "defective" elements (out of the available N), although in most cases there are none. These "defective" vectors are excluded when the cross-validation error estimate is computed. This slightly distorts the result, but it is not a bug (worth knowing if you test the fast cross-validation algorithm and happen to run into one of these vectors).
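A small example shows how a "defective" vector can arise. In the textbook form of the leave-one-out shortcut, each residual is divided by 1 - h, where h is the leverage of the point; a point with h = 1 therefore causes division by zero even though the full problem is solvable. The sketch below assumes that form of the shortcut for a single variable plus an intercept, and uses a hypothetical function name and tolerance. It skips such points, as described above, and reports which ones were skipped.

```python
import math


def loo_rms_skipping_defective(xs, ys, eps=1e-12):
    """Leave-one-out RMS error for y = a*x + b, skipping points with leverage 1."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)   # sxx > 0: the full problem is non-degenerate
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    squared, skipped = [], []
    for i, (x, y) in enumerate(zip(xs, ys)):
        h = 1.0 / n + (x - mx) ** 2 / sxx  # leverage of point i
        if 1.0 - h <= eps:                 # shortcut would divide by zero
            skipped.append(i)
            continue
        squared.append(((y - (a * x + b)) / (1.0 - h)) ** 2)
    if not squared:                        # every point was defective
        return float("nan"), skipped
    return math.sqrt(sum(squared) / len(squared)), skipped


# Three points share x = 0; the fourth alone determines the slope.
# Removing it leaves a degenerate problem, so its leverage is 1.
xs = [0.0, 0.0, 0.0, 5.0]
ys = [1.0, 2.0, 3.0, 10.0]
rms, skipped = loo_rms_skipping_defective(xs, ys)
```

Here the estimate is averaged over the three usable points only, which is why the result differs from what a naive leave-one-out loop (which would fail on the fourth point) might be expected to report.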