Andrew Ng
Split the huge training set into many small pieces (mini-batches).
With 5,000,000 examples and 1000 examples per mini-batch, you get 5000 mini-batches: $X^{\{1\}}, \dots, X^{\{5000\}}$ and $Y^{\{1\}}, \dots, Y^{\{5000\}}$.
An epoch is a single pass through the training set.
With batch gradient descent, the cost decreases on every iteration.
With mini-batch gradient descent, the cost may not decrease on every iteration: it trends downward, but it is a little noisier.
mini-batch size = m: Batch gradient descent
mini-batch size = 1: Stochastic gradient descent
It does not converge; it ends up oscillating around the minimum.
Typical mini-batch sizes are 64, 128, 256, or 512.
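A minimal NumPy sketch of the partitioning described above (the function name, the shuffling step, and the default `batch_size` are illustrative assumptions, not the course's code):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Split (X, Y) into shuffled mini-batches.

    X has shape (n_x, m) and Y has shape (1, m), one example per column.
    With m = 5,000,000 and batch_size = 1000 this yields 5000 mini-batches.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)              # shuffle examples before slicing
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, batch_size):      # last batch may be smaller
        batches.append((X[:, k:k + batch_size], Y[:, k:k + batch_size]))
    return batches
```

One full loop over all 5000 mini-batches is one epoch.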
$V_t = \beta V_{t-1} + (1-\beta)\theta_t$
$V_t$ approximately averages over the last $\frac{1}{1-\beta}$ values.
Iterating the formula a few times and expanding the recurrence shows why.
Initialize $V_0 = 0$.
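Expanding the recurrence for a few steps (using $V_0 = 0$) makes the exponentially decaying weights explicit:

$$V_t = (1-\beta)\theta_t + (1-\beta)\beta\,\theta_{t-1} + (1-\beta)\beta^2\,\theta_{t-2} + \dots + (1-\beta)\beta^{t-1}\theta_1$$

Since $\beta^{1/(1-\beta)} \approx 1/e$, the weights become negligible after about $\frac{1}{1-\beta}$ steps, which is why $V_t$ behaves like an average over roughly that many values.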
It is a good choice in terms of computation and memory efficiency: only a single running value needs to be stored.
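A minimal sketch of the update loop; note that only the single running value `v` is kept in memory (names are illustrative):

```python
def exp_weighted_average(thetas, beta=0.9):
    """Exponentially weighted averages of a sequence of values."""
    v = 0.0                                  # V_0 = 0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta    # V_t = beta*V_{t-1} + (1-beta)*theta_t
        averages.append(v)
    return averages
```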
Bias correction makes the estimate more accurate during the initial phase: use $\frac{V_t}{1-\beta^t}$ instead of $V_t$.
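A sketch of the same loop with bias correction applied, assuming the update above (the division by $1-\beta^t$ matters only for small $t$, since the factor approaches 1 as $t$ grows):

```python
def bias_corrected_average(thetas, beta=0.9):
    """Exponentially weighted average, corrected for the V_0 = 0 start-up bias."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # V_t / (1 - beta^t)
    return corrected
```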
Momentum damps the oscillations of gradient descent on its way to the minimum.
Analogy: a ball rolling down a bowl.
usually $\beta = 0.9$
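A sketch of a single gradient-descent-with-momentum update for parameters `w`, `b` with gradients `dw`, `db` (the function signature is an assumption; the update rule follows the course):

```python
def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """One parameter update of gradient descent with momentum."""
    v_dw = beta * v_dw + (1 - beta) * dw   # velocity: smoothed gradient for w
    v_db = beta * v_db + (1 - beta) * db   # velocity: smoothed gradient for b
    w = w - alpha * v_dw                   # step along the smoothed direction
    b = b - alpha * v_db
    return w, b, v_dw, v_db
```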
RMSprop (root mean square prop)
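RMSprop keeps an exponentially weighted average of the squared gradients and divides the update by its square root, damping directions that oscillate strongly. A sketch (signature and defaults are illustrative; $\epsilon$ guards against division by zero):

```python
def rmsprop_step(w, dw, s_dw, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop update for a parameter w with gradient dw."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2     # average of squared gradients
    w = w - alpha * dw / (s_dw ** 0.5 + eps)        # shrink steps where s_dw is large
    return w, s_dw
```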
Adam (Adaptive Moment Estimation)
combines Momentum and RMSprop
widely applicable
Hyperparameters:
$\alpha$: needs to be tuned; $\beta_1 = 0.9$; $\beta_2 = 0.999$; $\epsilon = 10^{-8}$
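A sketch of one Adam update, combining the momentum and RMSprop averages with bias correction and the default hyperparameters above (`t` counts updates starting from 1; the signature is illustrative):

```python
def adam_step(w, dw, v_dw, s_dw, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum + RMSprop with bias correction."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw           # first moment (Momentum)
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2      # second moment (RMSprop)
    v_hat = v_dw / (1 - beta1 ** t)                  # bias-corrected moments
    s_hat = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_hat / (s_hat ** 0.5 + eps)
    return w, v_dw, s_dw
```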
Gradually decrease the learning rate as training iterates.
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \, \alpha_0$
$\alpha = 0.95^{\text{epoch\_num}} \, \alpha_0$
$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \, \alpha_0$
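Sketches of the three schedules above (function and argument names are illustrative; `epoch_num` is assumed to start at 1):

```python
def decay_inverse(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def decay_exponential(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha0"""
    return base ** epoch_num * alpha0

def decay_sqrt(alpha0, k, epoch_num):
    """alpha = k / sqrt(epoch_num) * alpha0"""
    return k / epoch_num ** 0.5 * alpha0
```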
discrete staircase decay
manual decay: manually controlling $\alpha$ during training (practical only for a small model)