These notes discuss the problem of locally minimizing a smooth nonlinear function without constraints (i.e., the feasible region is all of ). We consider methods which begin at a specified initial point and iteratively produce a sequence of points which (hopefully!) converge to a local minimizer of The simplest example of such a method is gradient descent, which is defined by the recurrence
Here, is a parameter called the step size or, in the context of machine learning, the learning rate. Another example is Newton's method, which is defined by the recurrence
Here, denotes the Hessian matrix of the objective function and denotes its inverse, which is then applied to the gradient vector Note that Newton's method doesn't require its user to manually specify a learning rate; instead, this formula can be thought of as determining a natural learning rate via the curvature of the objective function.
For these notes, I will assume that you are already familiar with gradient descent, Newton's method, and their respective performance characteristics. I assume you are aware that gradient descent takes less computational work per iteration than Newton's method, but converges much more slowly for ill-conditioned objective functions (i.e., functions that exhibit high curvature in some directions, but low curvature in others). Before we proceed, I'll take a moment to introduce some standard notation in the nonlinear optimization literature:
Using these abbreviations, we can rewrite Newton's method as follows:
Newton's method is the gold standard for local nonlinear optimization of smooth functions. However, for large-scale optimization problems, it can be very time-consuming to compute the exact Hessian matrix and solve a linear system to obtain . The idea of a quasi-Newton method is to substitute an approximation for the true Hessian which is good enough that we can still make progress, but is easier to compute than the full matrix of second partial derivatives.
In order to develop this idea, we need to have a rigorous notion of what it means for to be “close enough” to For example, one obvious way to define “close enough” would be to require that in some suitably chosen matrix norm, such as the operator norm or Frobenius norm. This would be an unwise definition, because we would have to calculate the exact Hessian matrix in order to verify that is small enough, when the whole idea of quasi-Newton methods was to avoid computing in the first place.
Instead, we will consider the algebraic properties of and declare to be “close enough” if it satisfies those same properties. The defining characteristic of the Hessian matrix is that it measures second-order change in the objective function or equivalently, first-order change in its gradient
Discarding higher-order terms yields the approximate identity Since this is the defining property of the Hessian it is natural that we ask our approximation to satisfy In the optimization literature, this is known as the secant equation.
We also note that the Hessian matrix is symmetric (since partial derivative operators commute) and positive-definite (in a neighborhood of a local minimum). We will therefore require our approximation to also be symmetric and positive-definite.
Now, these three requirements (symmetry, positive-definiteness, and the secant equation) are far from enough to uniquely pin down a choice of We will use our remaining degrees of freedom to pick a matrix which is as easy as possible to work with. Recall that it takes time to solve a general linear system in order to calculate If we choose a matrix which has some special structure (e.g., a diagonal matrix), we can reduce the cost to or possibly even
The simplest possible choice of would be a positive scalar multiple of the identity matrix, When this is certainly symmetric and positive-definite. However, unless and are parallel vectors (which will generally not be the case), there is very little hope that a scalar multiple of the identity matrix will be able to satisfy the secant equation . Nonetheless, we can still proceed by asking that the error in the secant equation be as small as possible. That is, we determine the optimal value of by minimizing the squared Euclidean norm of the residual:
This is a one-dimensional optimization problem which we can solve analytically by differentiating with respect to and setting the derivative equal to zero.
By plugging this optimal choice of into the quasi-Newton update equation , we obtain the following iterative formula:
This is called the Barzilai–Borwein method, first published in 1988.
At this point, we pause to make an important observation. If we take the secant equation and multiply both sides by , we arrive at the equivalent form We will call this the dual secant equation, and to clarify the distinction, we will call the original secant equation the primal secant equation. (Note that these names are not standardized in the nonlinear optimization literature.) Now, we just derived the Barzilai–Borwein method by determining the value of that minimizes the residual in the primal secant equation. What if we had instead minimized the residual in the dual secant equation?
This alternative optimal value of gives rise to a new iterative formula, called the dual Barzilai–Borwein method.
Observe that this method is nearly identical to the original Barzilai–Borwein method, except that the original scaling factor has been replaced by its reciprocal and the vectors and have been interchanged. This makes perfect sense, because the dual secant equation is nothing more than the primal secant equation after replacing by and exchanging the roles of and This observation leads us to the following principle of duality:
The two Barzilai–Borwein methods we have just derived constitute our first example of a primal/dual pair of quasi-Newton methods.
Symmetric first-order update ansatz:
Impose the primal secant equation:
This shows that the SR1 method is self-dual.