![]() |
The Newton method is a classical method. It is studied first in the "Optimization Methods" course. This old method isn't used, but it becomes a basis for a whole family of quasi-Newton methods. One of those methods is the L-BFGS algorithm.
The classical Newton method uses the Hessian of a function. The step of the method is defined as a product of an inverse Hessian matrix and a function gradient. If the function is a positive definite quadratic form, we can reach the function minimum in one step. In case of an indefinite quadratic form (which has no minimum), we will reach the maximum or saddle point. In short, the method finds the stationary point of a quadratic form. In practice, we usually have functions which are not quadratic forms. If such a function is smooth, it is sufficiently good described by a quadratic form in the minimum neighbourhood. However, the Newton method can converge both to a minimum and a maximum (taking a step into the direction of a function increasing).
Quasi-Newton methods solve this problem as follows: they use a positive definite approximation instead of a Hessian. If Hessian is positive definite, we make the step using the Newton method. If Hessian is indefinite, we modify it to make it positive definite, and then perform a step using the Newton method. The step is always performed in the direction of the function decrement. In case of a positive definite Hessian, we use it to generate a quadratic surface approximation. This should make the convergence better. If Hessian is indefinite, we just move to where function decreases.
It was stated above that we perform a step using the Newton method. Actually, it is not exactly so - in that way we just define a direction in which the step will be performed. Some modifications of quasi-Newton methods perform a precise linear minimum search along the indicated line, but it is proved that it's enough to sufficiently decrease the function value, and not necessary to find a precise minimum value. The L-BFGS algorithm tries to perform a step using the Newton method. If it does not lead to a function value decreasing, it lessens the step length to find a lesser function value.
The Hessian of a function isn't always available, more often we can only calculate the function gradient. Therefore, the following operation is used: the Hessian of a function is generated on the basis of the N consequent gradient calculations, and the quasi-Newton step is performed. There is a special formulas which allows to iteratively get a Hessian approximation. On each step approximation, the matrix remains positive definite. The algorithm uses the BFGS update scheme. BFGS stands for Broyden-Fletcher-Goldfarb-Shanno (more precisely, this scheme generates not the Hessian, but its inverse matrix, so we don't have to waste time inverting a Hessian).
The L letter in the scheme name comes from the words "limited memory". In case of big dimensions, the amount of memory required to store a Hessian (N 2) is too big, along with the machine time required to process it. Therefore, instead of using N gradient values to generate a Hessian we can use a smaller number of values, which requires a memory capacity of order of N·M. In practice, M is usually chosen from 3 to 7, in difficult cases it is reasonable to increase this constant to 20. Of course, as a result we'll get not the Hessian but its approximation. On the one hand, the convergence slows down. On the other hand, the performance could even grow up. At first sight, this statement is paradoxical. But it contains no contradictions: the convergence is measured by a number of iterations, whereas the performance depends on the number of processor's time units spent to calculate the result.
As a matter of fact, initially this method was designed to optimize the functions of a number of arguments (hundreds and thousands), because in this case it is worth having an increasing iteration number due to the lower approximation precision because the overheads become much lower. But we can use these methods for small dimension problems too. The main advantage of the method is scalability, because it provides high performance when solving high dimensionality problems, and it allows to solve small dimension problems too.
Do not calculate the function gradient on the basis of a two-point difference formula because it is insufficiently precise. In a number of cases, the algorithm will not be able to work and will return an error message. Use at least a four-point formula or analytical form of the gradient.
The optimization algorithm shall obtain values of a function/gradient during its operation. This problem is solved in most program packages by transferring the pointer to the function (C++, Delphi) or delegate (C#) which is used to calculate function value/gradient/Hessian.
The ALGLIB package, differently from other libraries, makes use of reverse communication to solve this problem. When a value/gradient of a function needs to be calculated, the algorithm state is stored within a special structure, control is returned to the calling program, which makes all calculations and recalls the computing subroutine.
Thus, the optimization algorithm is operated in accordance in the following order:
The following fields of the MinLBFGSState structure are used in order to exchange information with the user:
The MinLBFGSIteration subprogram can set one, and only one, of the following fields to True, subject to what exactly needs to be calculated:
| C++ | minlbfgs.h | |
| C# | minlbfgs.cs | |
| MPFR | minlbfgs.h | |
| Delphi | minlbfgs.pas | |
| FreePascal | minlbfgs.pas | |
| VBA | minlbfgs.bas |
This article is intended for personal use only.
C# source.
C++ source.
C++ source. MPFR/GMP is used.
GMP source is available from gmplib.org. MPFR source is available from www.mpfr.org.
FreePascal source.
Delphi source.
VBA source.
|
ALGLIB project, 1999-2010 |