Linear Regression with multiple variables
\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector
\(J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a function of the \((n + 1)\)-dimensional parameter vector
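As a sanity check on the notation, here is a minimal Octave sketch of the hypothesis and cost function; it assumes the design matrix X already contains a leading column of ones so that \(x_0 = 1\) for every example (the function and variable names are illustrative, not from the original notes).

function J = computeCost(X, y, theta)
  % X: m x (n+1) design matrix whose first column is all ones, y: m x 1, theta: (n+1) x 1
  m = length(y);                          % number of training examples
  predictions = X * theta;                % h_theta(x^(i)) for every example at once
  J = (1 / (2 * m)) * sum((predictions - y) .^ 2);
end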
Repeat {
    \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n)\)
}
Previously (n = 1):
Repeat {
    \(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)
    \(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)
}
New algorithm (\(n \geq 1\)):
Repeat {
    \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\) (simultaneously update \(\theta_j\) for \(j = 0, \ldots, n\))
}
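A hedged Octave sketch of this update in vectorized form, with all \(\theta_j\) updated simultaneously; X, y and theta are assumed to be set up as in the computeCost sketch above.

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % Vectorized form of the update above: X' * (X*theta - y) stacks
  % sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i) for every j at once.
  m = length(y);
  for iter = 1:num_iters
    theta = theta - (alpha / m) * (X' * (X * theta - y));
  end
end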
Gradient descent in practice I: Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. \(x_1\) = size (0-2000 \({feet}^2\)), \(x_2\) = number of bedrooms (1-5)
---> \(x_1 = \frac{\text{size}({feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)
Get every feature into approximately a \(-1 \leq x_i \leq 1\) range; a range that is far too small or far too large is not acceptable.
\(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)
Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply this to \(x_0 = 1\)).
E.g. \(x_1 = \frac{\text{size} - 1000}{2000}\), \(x_2 = \frac{\text{bedrooms} - 2}{5}\). --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).
More generally: \(x_i := \frac{x_i - \mu_i}{S_i}\),
where \(\mu_i\) is the average value of \(x_i\) over the training set and \(S_i\) is the range of that feature (maximum minus minimum), or alternatively its standard deviation.
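A small Octave sketch of this mean-normalization rule (my own helper, not from the original notes); it operates on the raw feature matrix, before the \(x_0 = 1\) column is added.

function [X_norm, mu, S] = featureNormalize(X)
  % Apply x_i := (x_i - mu_i) / S_i to every column.
  mu = mean(X);               % 1 x n vector of feature means
  S  = max(X) - min(X);       % range of each feature; std(X) is a common alternative
  X_norm = (X - mu) ./ S;     % relies on Octave's automatic broadcasting
end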
Gradient descent in practice II: Learning Rate
\(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; once the curve has essentially flattened out, gradient descent has converged.
Note: the number of iterations gradient descent needs can vary widely from one problem to another.
Example automatic convergence test:
Declare convergence if \(J(\theta)\) decreases by less than some small value \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration.
Note: it is usually quite hard to choose an appropriate \(\varepsilon\), so the first method (plotting \(J(\theta)\)) is generally used instead.
Summary
If \(\alpha\) is too small: slow convergence.
If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration; may not converge.
Choose \(\alpha\)
try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
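One way to apply both tips is to record \(J(\theta)\) at every iteration and plot it. The Octave sketch below assumes X (with the ones column) and y are already loaded and normalized; alpha and num_iters are example values only.

alpha = 0.01;  num_iters = 400;             % example values; try several alphas from the list above
m = length(y);
theta = zeros(size(X, 2), 1);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
  theta = theta - (alpha / m) * (X' * (X * theta - y));
  J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);   % record J(theta)
end
plot(1:num_iters, J_history);               % the curve should flatten out as it converges
xlabel('Number of iterations');  ylabel('J(\theta)');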
\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)
--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)
Sometimes, looking at the problem from a different angle and defining a new feature, rather than using the original features directly, does give a better model.
For example: when a straight line does not fit the data well, choose a quadratic or cubic model.
\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(size) + \theta_2(size)^2 + \theta_3(size)^3\)
where \(x_1 = (size), x_2 = (size)^2, x_3 = (size)^3\).
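A hedged Octave sketch of building those polynomial features from a raw size column (the numbers are made-up example data). Feature scaling matters a lot here, because \((size)^3\) is enormous compared with \(size\).

sizes = [1000; 1500; 2000];                                        % made-up sizes in feet^2
X_poly = [ones(length(sizes), 1), sizes, sizes .^ 2, sizes .^ 3];  % columns: x_0, x_1, x_2, x_3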
Normal equation: a method for solving for \(\theta\) analytically.
Unlike gradient descent, this method solves for the optimal value of \(\theta\) directly, in a single step.
If 1D (\(\theta \in R\)), i.e. \(\theta\) is a single real number:
set \(\frac{d}{d\theta}J(\theta) = 0\) and solve for \(\theta\).
\(\theta \in R^{n+1}\), \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\)
Set \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) (for every \(j\)) and solve for \(\theta_0, \theta_1,\ldots,\theta_n\).
Written in matrix form, \(\theta\) can be computed directly from the following (proof omitted): \(X\theta = y\) \(\rightarrow\) \(X^TX\theta = X^Ty\) \(\rightarrow\) \(\theta = (X^TX)^{-1}X^Ty\).
In Octave: theta = pinv(X'*X)*X'*y
Here \(m\) is the number of training examples and \(n\) is the number of features.
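A self-contained Octave sketch of the normal equation on a tiny made-up data set (4 training examples, 2 features); the intercept column of ones is added first, and pinv is used so the expression still behaves when \(X^TX\) is not invertible.

X = [ones(4, 1), [2104 5; 1416 3; 1534 3; 852 2]];    % design matrix with a leading x_0 = 1 column
y = [460; 232; 315; 178];                             % made-up target values
theta = pinv(X' * X) * X' * y                         % optimal theta in one step, no iterations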
The normal equation and non-invertibility
What if \(X^TX\) is non-invertible (singular/degenerate)?
Redundant features (linearly dependent): delete the redundant ones.
E.g. \(x_1\) = size in \(feet^2\), \(x_2\) = size in \(m^2\)
Too many features (e.g. \(m \leq n\)).
Delete some features, or use regularization.
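A short Octave sketch (hypothetical data) of the redundant-feature case: when \(x_2\) is a constant multiple of \(x_1\), \(X^TX\) loses rank and is not invertible, but pinv still returns a usable solution.

x1 = [2104; 1416; 1534; 852];          % size in feet^2
x2 = x1 / 10.764;                      % size in m^2: a constant multiple of x1, so linearly dependent
X  = [ones(4, 1), x1, x2];
rank(X' * X)                           % prints 2 rather than 3: X'X is singular
theta = pinv(X' * X) * X' * [460; 232; 315; 178];   % pinv still returns a (minimum-norm) solution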
Machine Learning (Andrew Ng) study notes (4)
Original post: https://www.cnblogs.com/songjy11611/p/12191297.html