# PyTorch_Part6_Regularization

## 一、Regularization with weight_decay

### 2. Loss function: measures the discrepancy between model output and ground-truth labels

L1 Regularization: $$\sum_i |w_i|$$ Because the constrained optimum often lies on a coordinate axis (a vertex of the L1 ball), L1 regularization tends to produce sparse parameters.

L2 Regularization: $$\frac{\lambda}{2}\sum_i w_i^2$$ With the regularized objective $Obj = Loss + \frac{\lambda}{2}\sum_i w_i^2$ (learning rate omitted for simplicity), the update becomes $$w_{i+1} = w_i - Obj' = w_i - \left(\frac{\partial Loss}{\partial w_i}+\lambda w_i\right) = w_i(1-\lambda) - \frac{\partial Loss}{\partial w_i}$$ so each step shrinks the weight by a factor of $(1-\lambda)$, which is why L2 regularization is commonly called weight decay.
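The update rule above matches what `weight_decay` does in `torch.optim.SGD`; a minimal numeric sketch (toy parameter, loss, and hyperparameter values chosen purely for illustration):

```python
import torch

# One scalar parameter; SGD with weight_decay adds lambda*w to the gradient
# before the update: w <- w - lr * (dLoss/dw + lambda * w)
w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.01)

loss = (w ** 2).sum()   # dLoss/dw = 2w = 2.0
loss.backward()
opt.step()

# Manual check: w - lr*(2.0 + 0.01*1.0) = 1 - 0.1*2.01 = 0.799
print(w.item())
```

Setting `weight_decay=0` recovers plain gradient descent, so the difference between the two runs is exactly the `(1-\lambda)`-style shrinkage derived above.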

### 3. Example: a simple three-layer MLP

```python
import torch

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))
```

```python
import torch.nn as nn

# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

# example settings (the notes do not specify these values)
n_hidden = 200
lr_init = 0.01
max_iter = 2000

net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# ============================ step 3/5 optimizers ============================
# optim_wdecay includes weight_decay (L2 regularization)
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)
```

```python
from torch.utils.tensorboard import SummaryWriter

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    # clear stale gradients before each backward pass
    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    ...
```


## 二、Regularization with Dropout

### 2. Dropout has three main effects:

• Reduced co-dependence between features
• Weight values are averaged out
• The data scale shrinks

At test time all units are active, but during training only a fraction $1-p$ of them survives. With 100 units each contributing $W_x$ and $p = 0.3$:

Test: $$100 = \sum_{100} W_x$$
Train: $$70 = \sum_{70} W_x \Longrightarrow 100 = \sum_{70} W_x/(1-p)$$

so PyTorch divides the surviving activations by $1-p$ during training, keeping the expected scale consistent between training and test.

The `Net` used below is not defined in the notes; a minimal reconstruction consistent with the printed output (10000 inputs of 1 feeding a single linear unit whose weights are all set to 1):

```python
class Net(nn.Module):   # reconstructed so the demo runs
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(nn.Dropout(d_prob),
                                     nn.Linear(neural_num, 1, bias=False),
                                     nn.ReLU(inplace=True))

    def forward(self, x):
        return self.linears(x)

input_num = 10000
x = torch.ones((input_num,), dtype=torch.float32)

net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)

net.train()     # switch back to training mode after testing is done
y = net(x)
print("output in training mode", y)

net.eval()      # call when testing starts; dropout becomes a no-op
y = net(x)
print("output in eval mode", y)
```

```
output in training mode tensor([9942.], grad_fn=<ReluBackward1>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward1>)
```


### 3. Again taking linear regression as the example:

```python
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)
```


## 三、Batch Normalization

### 1. Batch Normalization: batch standardization

1. Allows larger learning rates, accelerating convergence
2. Reduces the need for carefully designed weight initialization
3. Dropout can be removed or weakened
4. L2 regularization / weight decay can be removed or weakened
5. LRN (local response normalization) is no longer needed

### 3. _BatchNorm

In PyTorch, nn.BatchNorm1d, nn.BatchNorm2d and nn.BatchNorm3d all inherit from _BatchNorm, with the following parameters:

```python
__init__(self, num_features,        # number of features per sample (most important)
         eps=1e-5,                  # small term added to the denominator for stability
         momentum=0.1,              # exponential moving average for the running mean/var
         affine=True,               # whether to apply an affine transform
         track_running_stats=True)  # training mode vs. evaluation mode
```

Main attributes of a BatchNorm layer:

• running_mean: the running mean
• running_var: the running variance
• weight: gamma in the affine transform
• bias: beta in the affine transform

In training mode the running statistics are updated as:

```python
running_mean = (1 - momentum) * running_mean + momentum * mean_t
running_var = (1 - momentum) * running_var + momentum * var_t
```
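As a sanity check, the running-mean update rule above can be verified directly on an `nn.BatchNorm1d` layer (a minimal sketch; the shapes and seed are chosen purely for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=3, momentum=0.1)   # running_mean starts at zeros
x = torch.randn(8, 3)

bn.train()
bn(x)   # one forward pass in training mode updates the running statistics

# Manual computation of (1 - momentum) * running_mean + momentum * mean_t
expected = 0.9 * torch.zeros(3) + 0.1 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected, atol=1e-5))
```

In `eval()` mode no further updates happen and the stored `running_mean`/`running_var` are used for normalization instead of the batch statistics.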

### 4. Again taking the RMB binary classification example:

```python
import torch.nn as nn
import torch.nn.functional as F

class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)

        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = out.view(out.size(0), -1)

        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 1)
                m.bias.data.zero_()
```

1. Using net = LeNet(classes=2) without initialization;

2. With the carefully designed initialization net.initialize_weights();

3. Using net = LeNet_bn(classes=2): even though the loss still has unstable intervals, its maximum never exceeds 1.5, unlike the previous two settings.

## 四、Normalization_layers

### 1. Layer Normalization

1. No more running_mean and running_var
2. gamma and beta are per-element

```python
nn.LayerNorm(normalized_shape,        # shape of the features in this layer
             eps=1e-05,
             elementwise_affine=True  # whether to apply an affine transform
             )
```
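A minimal sketch of those two points (shapes chosen purely for illustration): `LayerNorm` keeps one gamma/beta per element of `normalized_shape` and normalizes each sample independently, with no running statistics involved:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=[4])   # gamma/beta each have shape (4,)
x = torch.randn(2, 3, 4)                  # normalized over the last dimension

y = ln(x)
print(ln.weight.shape)                    # per-element gamma
print(y.mean(dim=-1))                     # ~0 for every (sample, position)
```

Because the statistics come from each sample alone, LayerNorm behaves identically in `train()` and `eval()` mode, unlike BatchNorm.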


### 2. Instance Normalization

```python
nn.InstanceNorm2d(num_features,
                  eps=1e-05,
                  momentum=0.1,
                  affine=False,
                  track_running_stats=False)
# 1d and 3d variants also exist
```
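A minimal illustration (shapes chosen arbitrarily): `InstanceNorm2d` computes mean/variance per sample and per channel over the spatial dimensions only, which is why it is popular in tasks like style transfer:

```python
import torch
import torch.nn as nn

inorm = nn.InstanceNorm2d(num_features=3)   # affine=False by default
x = torch.randn(2, 3, 8, 8)
y = inorm(x)

# Each (sample, channel) feature map is normalized to ~zero mean.
print(y.mean(dim=(2, 3)))
```

With the default `track_running_stats=False`, no statistics are shared across samples at any point.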


### 3. Group Normalization

```python
nn.GroupNorm(num_groups,     # number of groups; must divide num_channels
             num_channels,
             eps=1e-05,
             affine=True)
```

1. No more running_mean and running_var
2. gamma and beta are per-channel
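A minimal sketch (group/channel counts chosen purely for illustration): with 6 channels split into 3 groups, statistics are computed per (sample, group), while gamma/beta remain per-channel and no running statistics are kept:

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=3, num_channels=6)   # 3 groups of 2 channels
x = torch.randn(2, 6, 4, 4)
y = gn(x)

print(gn.weight.shape)              # per-channel gamma: torch.Size([6])
print(hasattr(gn, "running_mean"))  # no running statistics
```

Setting `num_groups=1` recovers LayerNorm-like behavior over all channels, and `num_groups=num_channels` recovers InstanceNorm-like behavior.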

### 4. Summary
