Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. But if we instead take steps proportional to the positive of the gradient, we approach a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is generally attributed to Cauchy, who first suggested it in 1847,[1] but its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944.[2]
Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, that is, $-\nabla F(\mathbf{a})$.
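To make the "decreases fastest" claim concrete, here is a minimal numeric sketch, assuming a simple test function $F(x, y) = x^2 + y^2$ (my own example, not from the original article): a small step along the normalized negative gradient lowers $F$ more than a step of the same length in the other directions tried.

```python
import numpy as np

def F(p):
    # Assumed test function: F(x, y) = x^2 + y^2
    return p[0] ** 2 + p[1] ** 2

def grad_F(p):
    # Analytic gradient of F: (2x, 2y)
    return np.array([2 * p[0], 2 * p[1]])

a = np.array([1.0, 2.0])   # illustrative point
eps = 1e-3                 # small step length

# Candidate unit directions, including the normalized negative gradient
neg_grad = -grad_F(a) / np.linalg.norm(grad_F(a))
directions = {
    "negative gradient": neg_grad,
    "-x axis": np.array([-1.0, 0.0]),
    "-y axis": np.array([0.0, -1.0]),
    "diagonal": np.array([-1.0, -1.0]) / np.sqrt(2.0),
}

for name, d in directions.items():
    decrease = F(a) - F(a + eps * d)
    print(f"{name:18s} decrease in F: {decrease:.6f}")
# The largest decrease is obtained along the negative gradient direction.
```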
It follows that, if $\mathbf{a}_{n+1} = \mathbf{a}_n - \gamma \nabla F(\mathbf{a}_n)$ for $\gamma \in \mathbb{R}_{+}$ small enough, then $F(\mathbf{a}_n) \geq F(\mathbf{a}_{n+1})$.
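The qualifier "small enough" matters. As a minimal sketch (again my own illustration, assuming the same test function $F(x, y) = x^2 + y^2$), the one-step inequality holds for modest step sizes but can fail when $\gamma$ is too large:

```python
import numpy as np

def F(p):
    return p[0] ** 2 + p[1] ** 2          # assumed test function

def grad_F(p):
    return np.array([2 * p[0], 2 * p[1]])

a = np.array([1.0, 2.0])
for gamma in [0.1, 0.5, 1.5]:              # the last step size is deliberately too large
    a_next = a - gamma * grad_F(a)         # a_{n+1} = a_n - gamma * grad F(a_n)
    print(f"gamma={gamma}: F(a)={F(a):.3f}, F(a_next)={F(a_next):.3f}, "
          f"decreased={F(a_next) <= F(a)}")
```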
In other words, the term $\gamma \nabla F(\mathbf{a})$ is subtracted from $\mathbf{a}$ because we want to move against the gradient, toward the local minimum. With this observation in mind, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$, and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla F(\mathbf{x}_n), \quad n \geq 0.$$
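A minimal sketch of this iteration, again assuming the test function $F(x, y) = x^2 + y^2$, with an arbitrarily chosen fixed step size $\gamma = 0.1$ and starting point $\mathbf{x}_0 = (1, 2)$:

```python
import numpy as np

def F(p):
    return p[0] ** 2 + p[1] ** 2          # assumed test function

def grad_F(p):
    return np.array([2 * p[0], 2 * p[1]])

x = np.array([1.0, 2.0])   # initial guess x_0
gamma = 0.1                # fixed step size (illustrative choice)

for n in range(50):
    x_next = x - gamma * grad_F(x)        # x_{n+1} = x_n - gamma * grad F(x_n)
    assert F(x_next) <= F(x)              # the sequence F(x_n) is non-increasing
    x = x_next

print(x, F(x))  # x approaches the minimizer (0, 0), F(x) approaches 0
```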
We have a monotonic sequence

$$F(\mathbf{x}_0) \geq F(\mathbf{x}_1) \geq F(\mathbf{x}_2) \geq \cdots,$$

so hopefully the sequence $(\mathbf{x}_n)$ converges to the desired local minimum. Note that the value of the step size $\gamma$ is allowed to change at every iteration. With certain assumptions on the function $F$ (for example, $F$ convex and $\nabla F$ Lipschitz) and particular choices of $\gamma$ (e.g., chosen either via a line search that satisfies the Wolfe conditions, or via the Barzilai–Borwein method[3][4] shown below), convergence to a local minimum can be guaranteed:

$$\gamma_n = \frac{\left|(\mathbf{x}_n - \mathbf{x}_{n-1})^{\mathsf T}\left[\nabla F(\mathbf{x}_n) - \nabla F(\mathbf{x}_{n-1})\right]\right|}{\left\|\nabla F(\mathbf{x}_n) - \nabla F(\mathbf{x}_{n-1})\right\|^{2}}.$$
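As a rough sketch of the iteration using the Barzilai–Borwein step size above (the test function, starting point, and initial step are my own illustrative assumptions, not from the source), $\gamma_n$ is recomputed at each iteration from the most recent pair of iterates and gradients:

```python
import numpy as np

def F(p):
    # Assumed ill-conditioned test quadratic: F(x, y) = x^2 + 10*y^2
    return p[0] ** 2 + 10 * p[1] ** 2

def grad_F(p):
    return np.array([2 * p[0], 20 * p[1]])

x_prev = np.array([1.0, 1.0])             # illustrative starting point
x = x_prev - 0.01 * grad_F(x_prev)        # one small fixed step to obtain two iterates

for n in range(30):
    g = grad_F(x)
    if np.linalg.norm(g) < 1e-10:         # already (numerically) at a stationary point
        break
    s = x - x_prev                        # x_n - x_{n-1}
    y = g - grad_F(x_prev)                # grad F(x_n) - grad F(x_{n-1})
    gamma = abs(s @ y) / (y @ y)          # Barzilai-Borwein step size gamma_n
    x_prev, x = x, x - gamma * g

print(x, F(x))  # approaches the minimizer (0, 0)
```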
Source: https://www.cnblogs.com/yuelien/p/13711173.html