```python
optimizer = optim.SGD(model.parameters(), lr=0.01)
for epoch in range(N):
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
```
This is the simplest and most common way to use it.

SGD (stochastic gradient descent) by definition uses only a single sample to compute the gradient, avoiding the heavy computation of (batch) gradient descent, which uses all the samples. But in the code above, loss.backward() computes the gradient using all the samples, so how does optimizer.step() pick samples at random?

Below is the source code of SGD's step(). Which step implements the random sampling?
```python
def step(self, closure=None):
    """Performs a single optimization step.

    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        with torch.enable_grad():
            loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad
            if weight_decay != 0:
                d_p = d_p.add(p, alpha=weight_decay)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                else:
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                if nesterov:
                    d_p = d_p.add(buf, alpha=momentum)
                else:
                    d_p = buf

            p.add_(d_p, alpha=-group['lr'])

    return loss
```
> SGD (stochastic gradient descent) by definition uses only a single sample to compute the gradient, avoiding the heavy computation of (batch) gradient descent, which uses all the samples. But in the code above, loss.backward() computes the gradient using all the samples, so how does optimizer.step() pick samples at random?
In today's mainstream frameworks, what is called SGD is actually Mini-batch Gradient Descent (MBGD, commonly still referred to as SGD). For a dataset with N training samples, each parameter update computes the gradient from only a subset of the data. Mini-batch gradient descent keeps training fast while still converging to good final accuracy.
The random sampling does not happen in the optimizer: the optimizer only applies an update to each parameter using the gradient that loss.backward() has already stored in p.grad. The sampling happens when the data is fetched at each iteration.
```python
for epoch in range(N):
    optimizer.zero_grad()
    output = model(input)
```
In practice, a more complete version of that loop looks like this:
```python
for epoch in range(N):
    for batch in data_loader:  # <== the random subset of data is produced by the data_loader you construct
        optimizer.zero_grad()
        output = model(batch)  # <== the input here is a random mini-batch of data
```