CS231n Notes

A few memos from CS231n.

Course materials

Softmax Loss

The softmax loss for sample $i$ is usually computed with the following formula

$L_i = -\log\left(\cfrac{\mathrm{e}^{s_{y_i}}}{\sum_j \mathrm{e}^{s_j}}\right)$

While training the TwoLayerNet from the Assignment, I occasionally ran into

RuntimeWarning: overflow encountered in exp

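and the result becomes NaN. This happens because $\mathrm{e}^x$ grows exponentially, so even moderately large scores push np.exp beyond the largest value a float can represent (for float64, inputs above roughly 709 overflow). Assignment 2 contains a fixed softmax function: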

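import numpy as np

def softmax_loss(x, y):
  # x: (N, C) class scores; y: (N,) integer labels in [0, C)
  # Shift each row by its max so that np.exp never sees a positive input.
  probs = np.exp(x - np.max(x, axis=1, keepdims=True))
  probs /= np.sum(probs, axis=1, keepdims=True)
  N = x.shape[0]
  # Mean negative log-probability of the correct classes.
  loss = -np.sum(np.log(probs[np.arange(N), y])) / N
  # Gradient of the loss w.r.t. the scores: (probs - one_hot(y)) / N.
  dx = probs.copy()
  dx[np.arange(N), y] -= 1
  dx /= N
  return loss, dx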

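Subtracting np.max(x, axis=1, keepdims=True) from x normalizes the argument so that np.exp never receives an input greater than 0, which keeps its output in the range (0, 1]. The shift does not change the probabilities, because the common factor cancels between the numerator and the denominator.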
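To make the effect of the shift concrete, here is a minimal sketch that compares a direct transcription of the formula with the max-shifted version. The function names naive_softmax and stable_softmax and the sample scores are made up for this illustration; they are not part of the assignment code.

```python
import numpy as np

def naive_softmax(scores):
    """Direct transcription of the formula; overflows for large scores."""
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def stable_softmax(scores):
    """Same quantity, but each row is shifted so its largest entry is 0."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0  # same score differences, much larger magnitude

# Silence the RuntimeWarnings here; the naive version yields inf / inf = NaN.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_softmax(large)

stable_large = stable_softmax(large)

print(naive_large)   # every entry is NaN
print(stable_large)  # matches stable_softmax(small)
```

Because softmax depends only on the differences between scores, the shifted version returns the same probabilities for small and large, while the naive version only works for small.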

Another common situation is

RuntimeWarning: invalid value encountered in log

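Precision loss when computing $\mathrm{prob}_k = \cfrac{\mathrm{e}^{s_k}}{\sum_j \mathrm{e}^{s_j}}$ rounds a very small result to 0, and np.log then breaks on it. Raising the numerical precision fixes this; the blunt alternative of adding a constant $\epsilon = 10^{-7}$ to the probability before taking the log handles such cases more gracefully at essentially no cost.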
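A small sketch of the underflow case, assuming the constant is added inside the logarithm. The scores, labels, and variable names are invented for the illustration. The last two lines show a different technique that is not mentioned above: computing the log-probabilities directly in log-sum-exp form, which never takes the log of an underflowed probability.

```python
import numpy as np

EPS = 1e-7  # the constant from the text

scores = np.array([[0.0, 800.0]])  # the correct class (index 0) has a tiny probability
labels = np.array([0])
rows = np.arange(scores.shape[0])

shifted = scores - np.max(scores, axis=1, keepdims=True)
probs = np.exp(shifted)  # exp(-800) underflows to exactly 0.0 in float64
probs /= np.sum(probs, axis=1, keepdims=True)

# Plain log of an underflowed probability: log(0) = -inf, so the loss is inf.
with np.errstate(divide="ignore"):
    plain_loss = -np.log(probs[rows, labels]).mean()

# Adding EPS keeps the loss finite, but caps it at -log(EPS), about 16.1.
eps_loss = -np.log(probs[rows, labels] + EPS).mean()

# Alternative (not from the text): log-softmax via log-sum-exp.
log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
lse_loss = -log_probs[rows, labels].mean()

print(plain_loss, eps_loss, lse_loss)
```

The epsilon version is cheap but clips the loss, so very confident mistakes all look the same; the log-sum-exp form keeps the true value (800 here) at the price of restructuring the computation.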

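The "Stochastic" in SGD

After seeing the term stochastic gradient descent (SGD) so many times, I had come to treat it as the same concept as gradient descent. Only on my second pass through the course, while sorting out the concepts, did I realize that the "stochastic" in SGD refers to randomly sampling a batch to compute the gradient (I had assumed it referred to the random initial weights, that is, the random starting point of the descent).

SGD differs from GD (gradient descent) in that each SGD update uses the gradient computed on a batch, whereas each GD update uses the gradient computed over the entire training set. SGD tends to converge better than GD, and with a reasonable learning rate it is more likely to reach the global optimum; this way of improving results by injecting randomness easily brings the random forest algorithm to mind.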

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.

——Wikipedia: Stochastic gradient descent
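The difference between the two update rules can be sketched on a toy least-squares problem. Everything here is an assumption made for the illustration: the function names gd and sgd, the synthetic noiseless data, the learning rate, and the batch size. The only point is where the randomness enters, namely in the choice of batch indices.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / X.shape[0]

def gd(X, y, lr=0.1, steps=500):
    """Gradient descent: every update uses the full training set."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * gradient(w, X, y)
    return w

def sgd(X, y, lr=0.1, steps=500, batch_size=32, seed=0):
    """Mini-batch SGD: every update uses a randomly sampled batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])  # same deterministic start as gd
    for _ in range(steps):
        # The "stochastic" part: which samples contribute to this gradient.
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        w -= lr * gradient(w, X[idx], y[idx])
    return w

# Synthetic, noiseless regression data (hypothetical).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w_gd = gd(X, y)
w_sgd = sgd(X, y)
print(w_gd, w_sgd)
```

Both routines start from the same zero vector, so any difference between them comes from batch sampling alone. On this convex, noiseless problem both recover true_w; each SGD step touches 32 samples instead of 1000.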

Adam

The course note introduces the Adam algorithm and provides a simplified implementation.

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

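On my first pass I used this code as-is (in hindsight, the solution would hardly be pasted straight into the course note). The problem only surfaced later, when Xin Qiu helped review my code.

I looked up the algorithm listing in the paper: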

[Figure: Adam Algorithm]

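It is easy to see that as $t\to\infty$, $\hat{m}_t\to{m_t}$ and $\hat{v}_t\to{v_t}$ (the correction factors $1-\beta_1^t$ and $1-\beta_2^t$ both tend to 1), which yields the simplified version in the course note. The simplified version also makes it clear that Adam combines Momentum and RMSProp. My impression is that bias correction is the key trick that lets this combination converge correctly; the convergence proof is in the paper's appendix.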
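For comparison with the simplified snippet, here is a minimal sketch of one Adam step with bias correction, written from the update rule in the Adam paper. The function name adam_update, its signature, and the convention of returning the updated state are assumptions of this sketch, not the assignment's API; the default hyper-parameters are the values suggested in the paper, and t is assumed to start at 1.

```python
import numpy as np

def adam_update(x, dx, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * dx         # first-moment (Momentum-like) estimate
    v = beta2 * v + (1 - beta2) * (dx ** 2)  # second-moment (RMSProp-like) estimate
    m_hat = m / (1 - beta1 ** t)             # undo the bias toward the zero init
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# First step from zero-initialized moments with a gradient of 2.0.
x, m, v = adam_update(x=1.0, dx=2.0, m=0.0, v=0.0, t=1)
print(x, m, v)
```

On this first step the corrected estimates equal the raw gradient and its square, so x moves by about learning_rate. Without the correction the ratio m / sqrt(v) would be about 0.2 / 0.063, roughly 3.16, so the first step would be about three times larger; the two versions agree only once t is large.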

Hyper-parameter checklist

Optimizer
- Batch Size
- Update Rule (e.g. Momentum / RMSProp / Adam)
- Learning Rate
- Iterations
Model
- Weight Initialization (e.g. Random / Xavier)
- Weight Scale
- Regularization Function
- Regularization Strength
- Loss Function
- Activation Function (e.g. ReLU / Maxout / ELU)
Architecture
- Dropout
- Batch Normalization
- Fully-connected Layer
  - Number of Neurons
- Convolution
  - Number of Filters
  - Filter Size
Pre-processing
- Zero Centering
- PCA and Whitening
- Data Augmentation

That's all.