Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. But if we instead take steps proportional to the positive of the gradient, we approach a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is generally attributed to Cauchy, who first suggested it in 1847,[1] but its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944.[2]
Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, that is, $-\nabla F(\mathbf{a})$.
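To make the "decreases fastest" claim concrete, here is a minimal numeric sketch, assuming a simple test function $F(x, y) = x^2 + y^2$ (my own example, not from the original article): a small step along the normalized negative gradient lowers $F$ more than a step of the same length in the other directions tried.

```python
import numpy as np

def F(p):
    # Assumed test function: F(x, y) = x^2 + y^2
    return p[0] ** 2 + p[1] ** 2

def grad_F(p):
    # Analytic gradient of F: (2x, 2y)
    return np.array([2 * p[0], 2 * p[1]])

a = np.array([1.0, 2.0])   # illustrative point
eps = 1e-3                 # small step length

# Candidate unit directions, including the normalized negative gradient
neg_grad = -grad_F(a) / np.linalg.norm(grad_F(a))
directions = {
    "negative gradient": neg_grad,
    "-x axis": np.array([-1.0, 0.0]),
    "-y axis": np.array([0.0, -1.0]),
    "diagonal": np.array([-1.0, -1.0]) / np.sqrt(2.0),
}

for name, d in directions.items():
    decrease = F(a) - F(a + eps * d)
    print(f"{name:18s} decrease in F: {decrease:.6f}")
# The largest decrease is obtained along the negative gradient direction.
```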
It follows that, if $\mathbf{a}_{n+1} = \mathbf{a}_n - \gamma \nabla F(\mathbf{a}_n)$ for $\gamma \in \mathbb{R}_{+}$ small enough, then $F(\mathbf{a}_n) \geq F(\mathbf{a}_{n+1})$.
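The qualifier "small enough" matters. As a minimal sketch (again my own illustration, assuming the same test function $F(x, y) = x^2 + y^2$), the one-step inequality holds for modest step sizes but can fail when $\gamma$ is too large:

```python
import numpy as np

def F(p):
    return p[0] ** 2 + p[1] ** 2          # assumed test function

def grad_F(p):
    return np.array([2 * p[0], 2 * p[1]])

a = np.array([1.0, 2.0])
for gamma in [0.1, 0.5, 1.5]:              # the last step size is deliberately too large
    a_next = a - gamma * grad_F(a)         # a_{n+1} = a_n - gamma * grad F(a_n)
    print(f"gamma={gamma}: F(a)={F(a):.3f}, F(a_next)={F(a_next):.3f}, "
          f"decreased={F(a_next) <= F(a)}")
```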
In other words, the term $\gamma \nabla F(\mathbf{a})$ is subtracted from $\mathbf{a}$ because we want to move against the gradient, toward the local minimum. With this observation in mind, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$, and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla F(\mathbf{x}_n), \quad n \geq 0.$$
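A minimal sketch of this iteration, again assuming the test function $F(x, y) = x^2 + y^2$, with an arbitrarily chosen fixed step size $\gamma = 0.1$ and starting point $\mathbf{x}_0 = (1, 2)$:

```python
import numpy as np

def F(p):
    return p[0] ** 2 + p[1] ** 2          # assumed test function

def grad_F(p):
    return np.array([2 * p[0], 2 * p[1]])

x = np.array([1.0, 2.0])   # initial guess x_0
gamma = 0.1                # fixed step size (illustrative choice)

for n in range(50):
    x_next = x - gamma * grad_F(x)        # x_{n+1} = x_n - gamma * grad F(x_n)
    assert F(x_next) <= F(x)              # the sequence F(x_n) is non-increasing
    x = x_next

print(x, F(x))  # x approaches the minimizer (0, 0), F(x) approaches 0
```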
We have a monotonic sequence

$$F(\mathbf{x}_0) \geq F(\mathbf{x}_1) \geq F(\mathbf{x}_2) \geq \cdots,$$

so hopefully the sequence $(\mathbf{x}_n)$ converges to the desired local minimum. Note that the value of the step size $\gamma$ is allowed to change at every iteration. With certain assumptions on the function $F$ (for example, $F$ convex and $\nabla F$ Lipschitz) and particular choices of $\gamma$ (e.g., chosen either via a line search that satisfies the Wolfe conditions, or via the Barzilai–Borwein method[3][4] shown below), convergence to a local minimum can be guaranteed:

$$\gamma_n = \frac{\left|(\mathbf{x}_n - \mathbf{x}_{n-1})^{\mathsf T}\left[\nabla F(\mathbf{x}_n) - \nabla F(\mathbf{x}_{n-1})\right]\right|}{\left\|\nabla F(\mathbf{x}_n) - \nabla F(\mathbf{x}_{n-1})\right\|^{2}}.$$
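As a rough sketch of the iteration using the Barzilai–Borwein step size above (the test function, starting point, and initial step are my own illustrative assumptions, not from the source), $\gamma_n$ is recomputed at each iteration from the most recent pair of iterates and gradients:

```python
import numpy as np

def F(p):
    # Assumed ill-conditioned test quadratic: F(x, y) = x^2 + 10*y^2
    return p[0] ** 2 + 10 * p[1] ** 2

def grad_F(p):
    return np.array([2 * p[0], 20 * p[1]])

x_prev = np.array([1.0, 1.0])             # illustrative starting point
x = x_prev - 0.01 * grad_F(x_prev)        # one small fixed step to obtain two iterates

for n in range(30):
    g = grad_F(x)
    if np.linalg.norm(g) < 1e-10:         # already (numerically) at a stationary point
        break
    s = x - x_prev                        # x_n - x_{n-1}
    y = g - grad_F(x_prev)                # grad F(x_n) - grad F(x_{n-1})
    gamma = abs(s @ y) / (y @ y)          # Barzilai-Borwein step size gamma_n
    x_prev, x = x, x - gamma * g

print(x, F(x))  # approaches the minimizer (0, 0)
```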
Source: https://www.cnblogs.com/yuelien/p/13711173.html