1. Basic
2. Recurrent Neural Networks
  2.1. Different types of RNNs
3. Forward propagation
4. Back propagation
  4.1. Gradient descent
5. Vanishing and Exploding Gradients
6. Dropout Regularization
7. Gradient Checking
8. Mini-batch Gradient Descent
9. Speed up the learning by optimizing the algorithms related to the learning rate
  9.1. Exponentially weighted averages
  9.2. Momentum
  9.3. RMSprop
  9.4. Adam Optimization Algorithm
  9.5. Learning rate decay
10. Hyperparameter tuning
11. Batch Normalization
  11.1. Gradient descent with batch normalization
    11.1.1. Forward propagation
    11.1.2. Backward propagation
  11.2. Batch normalization at test time
12. Softmax Regression
13. Training Tricks
  13.1. The dataset
  13.2. Error analysis
    13.2.1. Training-dev set
14. Transfer Learning
15. Multi-task learning
16. Convolutional Neural Networks
  16.1. Padding
  16.2. A convolution layer
  16.3. A ConvNet
  16.4. Pooling layers
  16.5. A NN Example
  16.6. Why use the convolutional layer
17. Classic Networks
  17.1. LeNet-5
  17.2. AlexNet
  17.3. VGG-16
  17.4. ResNet
  17.5. DenseNet
  17.6. 1x1 convolution
  17.7. Inception Network
  17.8. Benchmarking
18. Object Detection
  18.1. Localization / Landmark detection
  18.2. Sliding windows detection
  18.3. Turning FC layers into Conv layers
  18.4. Convolutional implementation of sliding windows
  18.5. You Only Look Once
  18.6. IoU
  18.7. Non-max suppression
  18.8. Anchor boxes
  18.9. YOLO algorithm
  18.10. R-CNN
    18.10.1. Fast R-CNN
    18.10.2. Faster R-CNN
19. Face Recognition
  19.1. One Shot Learning
    19.1.1. Siamese Network
      19.1.1.1. Training
20. Neural Style Transfer
  20.1. Content cost function
  20.2. Style cost function
21. RNN
  21.1. GRU
  21.2. LSTM
22. Embeddings
  22.1. Analogy reasoning
  22.2. Training and predicting word embeddings
  22.3. Negative Sampling
23. Seq2Seq
  23.1. Encoder-Decoder
24. Attention
  24.1. Attention scoring function
    24.1.1. Masked softmax
    24.1.2. Additive attention
    24.1.3. Scaled dot-product attention
    24.1.4. Bahdanau attention
    24.1.5. Self-attention
    24.1.6. Multi-head attention
    24.1.7. Positional encoding
25. Transformer
  25.1. Encoder

Deep Learning Notes

This post is just a set of notes, not a complete tutorial.

Basic

$$J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}||w||^2$$

The goal of training is to minimize $J(w,b)$.

Here, $L(\hat{y}^{(i)},y^{(i)})$ is the loss function, and $\frac{\lambda}{2m}||w||^2$ is the regularization term.

Regularization is used to prevent overfitting. There are two types of regularization:

  • L1 regularization: $||w||_1$
  • L2 regularization: $||w||_2$
  • L2 regularization is more common.

When $\lambda$ is large, the weights will be small, so $z$ stays in the nearly linear region of the activation function and the model behaves more like a simpler (closer to linear) function.

```latex
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}[domain=0:4,scale=1.1]
\draw[very thin,color=gray] (-0.1,-1.1) grid (3.9,3.9);
\draw[->] (-0.2,0) -- (4.2,0) node[right] {$x$};
\draw[->] (0,-1.2) -- (0,4.2) node[above] {$f(x)$};
\draw[color=red] plot (\x,\x) node[right] {$f(x) =x$};
\draw[color=blue] plot (\x,{sin(\x r)}) node[right] {$f(x) = \sin x$};
\draw[color=orange] plot (\x,{0.05*exp(\x)}) node[right] {$f(x) = \frac{1}{20} \mathrm e^x$};
\end{tikzpicture}
\end{document}
```

Recurrent Neural Networks


  • Use the previous activation $a^{<t-1>}$ (through the weights $W_{aa}$) and the new input $x^{<t>}$ (through $W_{ax}$) to make a prediction.
  • Shortcoming: an RNN doesn't use the later information in the sequence, e.g. He said, "Teddy Roosevelt was a man" vs. He said, "Teddy bears are cute": at the word "Teddy" the two sentences look identical so far.

Different types of RNNs


  • One-to-one: standard neural network
  • One-to-many: music generation
  • Many-to-one: sentiment classification
  • Many-to-many: machine translation

Forward propagation

$$a^{<t>}=g(W_a[a^{<t-1>},x^{<t>}]+b_a)$$

$$\hat{y}^{<t>}=g(W_{ya}a^{<t>}+b_y)$$
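
A minimal numpy sketch of one forward step of this cell (the function and parameter names such as `Waa`, `Wax`, `Wya` are mine, and I assume a tanh hidden activation and a softmax output, which the note does not pin down):

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One RNN time step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)       # new hidden state
    z_y = Wya @ a_t + by
    y_hat = np.exp(z_y) / np.sum(np.exp(z_y), axis=0)  # softmax output
    return a_t, y_hat
```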

Back propagation

$$L^{<t>}(y^{<t>},\hat{y}^{<t>})=-y^{<t>}\log(\hat{y}^{<t>})-(1-y^{<t>})\log(1-\hat{y}^{<t>})$$

$$L=\sum_{t=1}^{T}L^{<t>}$$

Gradient descent

  • Goal: minimize $L$
  • $\frac{\partial L}{\partial W_{ya}}=\sum_{t=1}^{T}\frac{\partial L^{<t>}}{\partial W_{ya}} = \sum_{t=1}^{T}\frac{\partial L^{<t>}}{\partial \hat{y}^{<t>}}\frac{\partial \hat{y}^{<t>}}{\partial W_{ya}} = \sum_{t=1}^{T}(\hat{y}^{<t>}-y^{<t>})a^{<t>}$

Vanishing and Exploding Gradients

  • Vanishing gradients: when the eigenvalues of the weight matrix $W$ are smaller than 1, the gradients decay exponentially with depth.

  • Exploding gradients: when the eigenvalues of the weight matrix $W$ are greater than 1, the gradients grow exponentially with depth.

Why? Here is an example:

$$a^{[l]} = g(W^{[l]}a^{[l-1]}+b^{[l]})$$

$$a^{[l]} = g(W^{[l]}g(W^{[l-1]}a^{[l-2]}+b^{[l-1]})+b^{[l]})$$

$$a^{[l]} = g(W^{[l]}g(W^{[l-1]}g(W^{[l-2]}a^{[l-3]}+b^{[l-2]})+b^{[l-1]})+b^{[l]})$$

$$a^{[l]} = g(W^{[l]}g(W^{[l-1]}g(W^{[l-2]}g(W^{[l-3]}a^{[l-4]}+b^{[l-3]})+b^{[l-2]})+b^{[l-1]})+b^{[l]})$$

We can see that the output $a^{[l]}$ is a function of $W^{[l]}, W^{[l-1]}, W^{[l-2]}, W^{[l-3]}$, of $a^{[l-1]}, a^{[l-2]}, a^{[l-3]}, a^{[l-4]}$, and of $b^{[l]}, b^{[l-1]}, b^{[l-2]}, b^{[l-3]}$.

The output $a^{[l]}$ depends on the weights and biases of all the layers before it, so during back propagation the per-layer gradient factors get multiplied together.

So if the weights are consistently large, these products (and hence the gradients) become large, and if the weights are consistently small, the gradients become small.

For this problem, we can change the way $W$ is initialized, or use the ReLU activation function.

$W^{[l]} = \texttt{np.random.randn(shape)} \times \sqrt{\frac{2}{n^{[l-1]}}}$ (He initialization, suitable for ReLU)

Dropout Regularization

  • Randomly shut down some neurons in each layer.
  • It prevents overfitting.
  • It is used mostly in computer vision.
  • A drawback is that it makes the cost function $J$ harder to define and monitor.

```python
import numpy as np

keep_prob = 0.8
# inverted dropout on layer 3's activations a3
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)  # shut down the masked neurons
a3 /= keep_prob           # scale up so the expected value of a3 stays the same
```

Why divide by `keep_prob`? It keeps the expected value of $a^{[3]}$ the same, so that the cost function $J$ doesn't change. It also simplifies the test-time code, because it removes the scaling problem that would otherwise be introduced.

Gradient Checking

  • Used to check whether the back-propagation implementation computes the correct gradients.
  • Don't use it in training, only for debugging.
  • Doesn't work with dropout.

$$\frac{\partial J}{\partial \theta} \approx \frac{J(\theta+\varepsilon)-J(\theta-\varepsilon)}{2\varepsilon}$$

Check

$$\frac{\| \nabla_{\text{approx}} - \nabla_{\text{backprop}} \|_2}{\| \nabla_{\text{approx}} \|_2 + \| \nabla_{\text{backprop}} \|_2} \approx 10^{-7}$$
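
A minimal sketch of this check for a flat (1-D float) parameter vector; the names `grad_check`, `J` and `grads` are mine, not from the note:

```python
import numpy as np

def grad_check(J, theta, grads, eps=1e-7):
    """Compare backprop gradients to centered finite differences of J at theta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - grads)
    den = np.linalg.norm(approx) + np.linalg.norm(grads)
    return num / den  # should be around 1e-7 if backprop is correct
```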

Mini-batch Gradient Descent

For example, we have 10,000 examples, and we set the batch size to 64. Then we have $10000/64 \approx 156$ mini-batches.

The $t$-th mini-batch of $x$ and $y$ is written $(x^{\{t\}},y^{\{t\}})$.

  • Batch Gradient Descent: use all m examples in each iteration.
  • Stochastic Gradient Descent: use 1 example in each iteration (too noisy, it never quite converges, and you lose the speed advantage of vectorized/parallel computation).
  • Mini-batch Gradient Descent: use b examples in each iteration.

Exponentially weighted averages

This is used to smooth a noisy series of data.

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$

When $\beta = 0.9$, this roughly averages over the last 10 days. Why?

$$v_t = 0.9v_{t-1} + 0.1\theta_t$$

$$v_{t-1} = 0.9v_{t-2} + 0.1\theta_{t-1}$$

so,

$$v_t = 0.9^{10}v_{t-10} + (1-0.9)\sum_{i=0}^{9}0.9^i\theta_{t-i}$$

$$\beta^{\frac{1}{1-\beta}} \approx \frac{1}{e}$$

When a weight drops below $1/e$, we can treat it as roughly 0, so the formula above is effectively averaging over the last $\frac{1}{1-\beta}=10$ days.

The advantage of this method is that it doesn’t need to store all the data we want to average.

The problem is that it’s not accurate at the beginning:

The first step: $v_1 = 0.1\theta_1$

The second step: $v_2 = 0.9v_1 + 0.1\theta_2 = 0.09\theta_1 + 0.1\theta_2$

You can see that at the beginning the average is not accurate, because $v_1 \ll \theta_1$ and $v_2 \ll \theta_2$. So we need to use bias correction:

$$v_t := \frac{v_t}{1-\beta^t}$$

The trick is that $1-\beta^t$ gets close to 1 as $t$ increases, so the correction only matters for the first few steps.
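
A small sketch of the exponentially weighted average with bias correction (the function name `ewa` is mine):

```python
import numpy as np

def ewa(theta, beta=0.9):
    """Exponentially weighted average of a 1-D series, with bias correction."""
    v, out = 0.0, []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))  # bias-corrected value
    return np.array(out)

# With beta = 0.9, ewa(series) roughly averages over the last 10 points.
```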

Momentum

The Momentum algorithm is used to speed up the gradient descent algorithm.

Plain gradient descent is sometimes slow because it oscillates back and forth on its way to the minimum.

In the Momentum algorithm, we compute an exponentially weighted average of the gradients and then use that gradient to update the weights. This can reduce the oscillation.

$$v_{t+1} = \beta v_t + dW$$

$$W = W - \alpha v_{t+1}$$

When $J(w,b)$ is shaped like a "bowl", the Momentum algorithm may not work well: the updates to $w$ and $b$ would naturally keep growing as the iterations go on, but Momentum slows that growth down because it uses the average of the gradients.

RMSprop

RMSprop is similar to Momentum, but it is used to solve the problem of the "bowl" shape of $J(w,b)$.

$$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2$$

$$S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$$

$$W = W - \alpha \frac{dw}{\sqrt{S_{dw}}+\varepsilon}$$

$$b = b - \alpha \frac{db}{\sqrt{S_{db}}+\varepsilon}$$

The $\varepsilon$ is used to avoid division by zero.

Adam Optimization Algorithm

Adam is the combination of Momentum and RMSprop.

$$v_{dw} = \beta_1 v_{dw} + (1-\beta_1)dw$$

$$v_{db} = \beta_1 v_{db} + (1-\beta_1)db$$

$$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2$$

$$S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$$

$$v_{dw}^{corrected} = \frac{v_{dw}}{1-\beta_1^t}$$

$$v_{db}^{corrected} = \frac{v_{db}}{1-\beta_1^t}$$

$$S_{dw}^{corrected} = \frac{S_{dw}}{1-\beta_2^t}$$

$$S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$$

$$W = W - \alpha \frac{v_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}}+\varepsilon}$$

$$b = b - \alpha \frac{v_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\varepsilon}$$

$$\epsilon = 10^{-8}$$
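
A compact sketch of one Adam update for a single parameter matrix; the names are mine, and the default values of `alpha`, `beta1` and `beta2` are the common choices from the Adam paper (the note itself only states $\epsilon = 10^{-8}$):

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter W with gradient dW; v and s are the running moments."""
    v = beta1 * v + (1 - beta1) * dW        # Momentum part
    s = beta2 * s + (1 - beta2) * dW ** 2   # RMSprop part
    v_hat = v / (1 - beta1 ** t)            # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```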

Learning rate decay

  • $\alpha = \frac{1}{1+\text{decay\_rate} \cdot \text{epoch\_num}} \alpha_0$

  • Exponential decay: $\alpha = 0.95^{\text{epoch\_num}} \alpha_0$

  • $\alpha = \frac{k}{\sqrt{t}} \alpha_0$ ($k$ is a constant, $t$ is the iteration number)

  • Discrete staircase decay: $\alpha = \alpha_0 \cdot 0.95^{\lfloor \text{epoch\_num} / \text{decay\_step} \rfloor}$ (see the sketch below)
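
A quick sketch of these schedules (the constants are hypothetical, chosen only for illustration):

```python
import numpy as np

alpha0, decay_rate, k, decay_step = 0.2, 1.0, 0.5, 10

def lr_inverse(epoch):     return alpha0 / (1 + decay_rate * epoch)
def lr_exponential(epoch): return alpha0 * 0.95 ** epoch
def lr_inv_sqrt(t):        return k / np.sqrt(t) * alpha0       # t is the iteration number, t >= 1
def lr_staircase(epoch):   return alpha0 * 0.95 ** (epoch // decay_step)
```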

Hyperparameter tuning

  • Randomly choose the hyperparameters.
  • Coarse to fine: first use a wide range of hyperparameters, and then use a narrower range to find the best hyperparameters.
  • Use an appropriate scale to pick hyperparameters (e.g. a log scale for the learning rate).

Batch Normalization

The normalization technique is applied not only to the input $x$ but also to the hidden layers: it normalizes the mean and variance of $z^{[l]}$ in each layer. (By adjusting the $\gamma$ and $\beta$ below, we can force the output of a unit, called $\tilde{z}$ below, to take on whatever mean and variance we want. This makes the distribution of $z^{[l]}$ more stable and the network easier to train, and it has several other benefits discussed below.)

Normalize $z^{[l]}$ in each layer just like we normalized the input $x$ before.

$$z^{[l]} = W^{[l]}a^{[l-1]}+b^{[l]}$$

$$a^{[l]} = g(z^{[l]})$$

$$\mu = \frac{1}{m}\sum_{i=1}^{m}z^{[l](i)}$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(z^{[l](i)}-\mu)^2$$

$$z_{norm}^{[l]} = \frac{z^{[l]}-\mu}{\sqrt{\sigma^2+\varepsilon}}$$

$$\tilde{z}^{[l]} = \gamma z_{norm}^{[l]} + \beta \quad \text{(formula 1)}$$

The $\gamma$ and $\beta$ are learnable parameters.

The bias $b^{[l]}$ can be dropped, because the normalization subtracts the mean $\mu$, which cancels out any constant added to $z^{[l]}$.

Sometimes we want $z$ in a hidden layer to keep a larger variance (or a non-zero mean) instead of always being standardized; that is why we need formula (1).

  • It speeds up training and makes the layer outputs more stable.
  • It makes the weights less sensitive to their initial values.
  • It makes learning less sensitive to the choice of hyperparameters (such as the learning rate).
  • In a fully connected layer, batch norm statistics are computed per feature; in a convolutional network, the mean and variance are computed per channel.

Gradient descent with batch normalization

Forward propagation

$$\mu = \frac{1}{m}\sum_{i=1}^{m}z^{[l](i)}$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(z^{[l](i)}-\mu)^2$$

$$z_{norm}^{[l](i)} = \frac{z^{[l](i)}-\mu}{\sqrt{\sigma^2+\varepsilon}}$$

$$\tilde{z}^{[l](i)} = \gamma z_{norm}^{[l](i)} + \beta$$

$$\hat{y} = a^{[l]} = g(\tilde{z}^{[l]})$$
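
A minimal numpy sketch of the batch-norm forward pass for one layer (the names are mine; I assume `Z` has shape `(features, m)`, i.e. one column per example):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each feature of Z over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)      # per-feature mean over the batch
    var = Z.var(axis=1, keepdims=True)      # per-feature variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta         # learnable scale and shift
    return Z_tilde, mu, var
```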

Backward propagation

Compute the gradients of $W$, $\gamma$ and $\beta$.

Update parameters:

$$W^{[l]} = W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}$$

$$\beta = \beta - \alpha \frac{\partial J}{\partial \beta}$$

$$\gamma = \gamma - \alpha \frac{\partial J}{\partial \gamma}$$

With batch normalization, training becomes easier: it weakens the coupling between the earlier and later hidden layers, so the later layers have an easier job learning. It also has a slight regularization effect. Regularization essentially reduces overfitting by adding some noise; batch normalization influences the distribution of a layer's input (i.e. the previous layer's output) through $\gamma$ and $\beta$, which reduces the effect that changes in the training set (or in the previous layers' outputs) have on the later layers, and this can also be viewed as adding a bit of noise.

Batch normalization at test time

When we train, we compute $\mu$ and $\sigma^2$ on each mini-batch.
At test time we may have to process examples one at a time, so we cannot compute $\mu$ and $\sigma^2$ from a batch. Instead, we use an exponentially weighted average over the training set's mini-batches to estimate $\mu$ and $\sigma^2$:

$$\mu := \beta \mu + (1-\beta)\mu^{\{t\}}$$

$$\sigma^2 := \beta \sigma^2 + (1-\beta)\sigma^{2\{t\}}$$

where $\mu^{\{t\}}$ and $\sigma^{2\{t\}}$ are the mean and variance computed on mini-batch $t$. At test time we then normalize with these running averages:

$$z_{norm}^{[l](i)} = \frac{z^{[l](i)}-\mu}{\sqrt{\sigma^2+\varepsilon}}$$

$$\tilde{z}^{[l](i)} = \gamma z_{norm}^{[l](i)} + \beta$$
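
A tiny runnable sketch of keeping the running statistics during training and using them at test time (all names and placeholder numbers are mine, for illustration only):

```python
import numpy as np

beta_avg, eps = 0.9, 1e-8
running_mu, running_var = 0.0, 1.0   # running estimates, updated after every mini-batch

# during training, after computing mu and var on the current mini-batch:
mu, var = 0.3, 0.8                   # placeholder batch statistics
running_mu = beta_avg * running_mu + (1 - beta_avg) * mu
running_var = beta_avg * running_var + (1 - beta_avg) * var

# at test time, normalize a single value z with the running statistics:
gamma, beta_bn, z = 1.0, 0.0, 0.5    # placeholder values
z_norm = (z - running_mu) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta_bn
```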

Softmax Regression

$$z^{[l]} = W^{[l]}a^{[l-1]}+b^{[l]}$$

$$t = e^{z^{[l]}}$$

$$a^{[l]} = \frac{e^{z^{[l]}}}{\sum_{i=1}^{c}t_i}$$

$$L(\hat{y},y) = -\sum_{i=1}^{c}y_i\log(\hat{y}_i)$$

$$J = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$$

$J$ is the cost function, and $L$ is the loss function.

  • Exponentiation makes larger values stand out much more, and dividing by $\sum_{i=1}^{c}t_i$ guarantees that the outputs lie between 0 and 1 and sum to 1.

  • The softmax function is used for multi-class classification problems.

For the loss function: since $y$ is usually a one-hot vector, only one term $y_i\log(\hat{y}_i)$ is non-zero, so $L(\hat{y},y) = -y_i\log(\hat{y}_i)$, where $i$ is the index of the non-zero entry of $y$ (the true class). The loss therefore focuses on the true class and pushes its predicted probability $\hat{y}_i$ to be as large as possible (maximum likelihood estimation).
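
A small numpy sketch of the softmax output and the cross-entropy loss (the function names are mine; subtracting the max is a standard numerical-stability trick not mentioned in the note):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())             # subtract max for numerical stability
    return t / t.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))   # only the true-class term is non-zero for one-hot y

y = np.array([0, 1, 0, 0])              # one-hot label
y_hat = softmax(np.array([1.0, 3.0, 0.2, -1.0]))
loss = cross_entropy(y, y_hat)
```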

Training Tricks

The dataset

  • 7/3: training set/dev set
  • 6/2/2: training set/dev set/test set
  • 98/1/1: training set/dev set/test set (for 1B+ dataset)
  • The three datasets should come from the same distribution.

Error analysis

  • Pick a baseline, which can be human-level performance (the human error rate). Compare the error rates of the training set, dev set and test set with the baseline and with each other, and tackle the largest gap first (for example, if the dev-set error is much higher than the training-set error, the first problem to address is probably overfitting). This makes the work more efficient.

  • For mislabeled examples, if the dataset is not too large, we can manually inspect them to see which features most often lead to the wrong classification.

  • Manually counting and categorizing the errors is actually very important.

  • For the data sources we care more about (e.g. for images, we sometimes care more about phone photos than web images), we can give that kind of data a larger share in the dev and test sets than in the training set.

Training-dev set

  • If the error gap between the training set and the training-dev set is large, the model is probably overfitting and cannot generalize even to other data from the same distribution.
  • If the error gap between the training-dev set and the dev set is large, the dev set and the training set probably come from different distributions (a data mismatch problem).

For the data mismatch problem, we can manually inspect the dev/test set to see what kind of problem we are facing, and we can also address it with artificially synthesized data.

Transfer Learning

  • Use the pre-trained model and fine-tune it.

Multi-task learning

  • Used to train multiple tasks at the same time; it can improve the model's ability to generalize.

Convolutional Neural Networks

Padding

  • Valid: no padding
  • Same: padding so that the output size is the same as the input size.

$$p = \frac{f-1}{2}$$

$p$ is the padding size, $f$ is the filter size.

Strided convolution

  • Stride: the number of pixels we move each time.

$$n_H = \left\lfloor\frac{n_H^{[0]}+2p-f}{s}+1\right\rfloor$$

where $p$ is the padding size, $f$ the filter size, and $s$ the stride.

A convolution layer

  • $l$ is the layer number.

  • The input (the previous layer's activation): $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$

  • Each filter: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$

  • Weights: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$

  • Bias: $n_c^{[l]}$

  • Activations (the output): $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$

When the activations are multiplied by the filter weights, the values from all channels are summed together.

$$n_w^{[l]} = \left\lfloor\frac{n_w^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor$$
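
A quick helper that computes these output sizes (the function name is mine):

```python
def conv_output_size(n_prev, f, p=0, s=1):
    """Spatial output size of a conv layer: floor((n + 2p - f) / s) + 1."""
    return (n_prev + 2 * p - f) // s + 1

# e.g. a 39x39 input with a 3x3 filter, no padding, stride 1 -> 37x37
assert conv_output_size(39, f=3, p=0, s=1) == 37
```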

A ConvNet

  • The width and height of the input decrease and the channel increases as the network goes deeper.

Pooling layers

  • Max pooling: take the max value in the pooling window.
  • Average pooling: take the average value in the pooling window.

A NN Example

In most convolutional networks, the activation size gets smaller and smaller as the layer index increases.

In the example, "parameters" means the number of weights and biases. Each filter has one bias.

Why use the convolutional layer

  • Parameter sharing: a feature detector that’s useful in one part of the image is probably useful in another part of the image.
  • Sparsity of connections: in each layer, each output value depends only on a small number of inputs.
  • Fewer parameters: thanks to parameter sharing and the sparsity of connections, the number of parameters is greatly reduced.

These properties make convolutional networks very effective for image recognition, e.g. they provide a degree of translation invariance: if an object moves within the image, the model should still be able to recognize it.

Classic Networks

LeNet-5

AlexNet

VGG-16


ResNet

ResNet addresses the vanishing and exploding gradient problems by using the skip connection (aka shortcut connection).

$$z^{[l+2]} = W^{[l+2]}a^{[l+1]}+b^{[l+2]}$$

$$a^{[l+2]} = g(z^{[l+2]}+a^{[l]})$$

Consider the extreme case: if $W^{[l+2]}$ equals 0, the output of the block is just $a^{[l]}$, the output of layer $l$ (with ReLU, $g(a^{[l]}) = a^{[l]}$), so the extra layers can easily learn the identity function.
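
A bare-bones numpy sketch of a residual block with this skip connection (the names are mine; it assumes the shapes of $a^{[l]}$ and $z^{[l+2]}$ match):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l] -> two linear+ReLU layers, with a[l] added back before the final ReLU."""
    a_mid = relu(W1 @ a_l + b1)
    z = W2 @ a_mid + b2
    return relu(z + a_l)   # skip connection: add the block's input back
```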

DenseNet

An improved version of ResNet: instead of adding the earlier output to the later output, it concatenates them. It also borrows the idea of a Taylor expansion. Suppose ResNet is written as:

$$z^{[l+2]} = a^{[l+1]}+g(a^{[l]})$$

or, more simply:

$$f(x) = x+g(x)$$

Then DenseNet just has more terms, and uses $[\cdot\,,\cdot]$ to concatenate them:

$$x \rightarrow [x,\; f_1(x),\; f_2([x, f_1(x)]),\; f_3([x, f_1(x), f_2([x, f_1(x)])])]$$

1x1 convolution

It’s used to reduce the number of channels. So we can reduce a large number of computations.

A reasonably built bottleneck layer can not only cut a large amount of computation, but also leave the performance of the model unaffected.

Inception Network

The Inception network addresses the problem of choosing a filter size (or choosing between convolution and pooling): it applies different filter sizes in parallel and concatenates the outputs, letting the network learn which filter size works best.

Benchmarking

  • Ensembling: train multiple networks independently and average their outputs.
  • Multi-crop at test time: crop the image in multiple ways and average the outputs.

Object Detection

Localization -> Detection

Localization / Landmark detection

Essentially, the training set needs to provide the bounding boxes, labels and landmark coordinates, which are then used for training.

Sliding windows detection

After we train a classifier that checks whether an image crop contains the object we are looking for, we can use sliding windows to detect the object. The algorithm works like this:

  1. Use a small window and check whether the object is inside it.
  2. Move the window to the right and check again, and so on over the whole image.
  3. If we don't find the object, use a larger window and check again.

Turning FC layers into Conv layers

We use filters with the same size as the input, so the output has size $1 \times 1 \times n$, where $n$ is the number of filters.

Convolutional implementation of sliding windows

If we implement sliding windows with convolutions, much of the computation can be shared across windows, so the total amount of computation drops dramatically (in the figure, the red shaded area marks the results that can be shared). Moreover, only a single convolutional pass is needed, and it parallelizes well.

You Only Look Once

aka YOLO, a real-time object detection algorithm. It divides the image into a grid and predicts the bounding box and object class for each grid cell.

IoU

Intersection over Union; it is used to measure the accuracy of an object detection algorithm (how well a predicted box overlaps the ground-truth box).

Non-max suppression

非极大值抑制 in Chinese.

When the detection algorithm produces multiple bounding boxes for the same object, we first discard all boxes whose probability is less than or equal to 0.6; the algorithm then keeps the box with the highest probability and removes the remaining boxes that have a high IoU (>= 0.5) with the kept one, repeating until no boxes are left.

Anchor boxes

Each object in training image is assigned to the grid that contains the object’s midpoint and the anchor box with the highest IoU with the object.

The training labels then have two parts, one per anchor box:

$ y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3, p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$

YOLO algorithm

Waiting for update…

R-CNN

Sometimes the detector runs on many rectangles that clearly contain no object, so we can use R-CNN to avoid this waste.

R-CNN uses a segmentation algorithm to propose regions of the image and then runs the object detector only on those proposed regions.

Fast R-CNN

Fast R-CNN is faster than R-CNN because it uses a convolutional network to extract features for the whole image once and then detects objects on that shared feature map.

Faster R-CNN

It uses a Region Proposal Network (a convolutional network) to generate the region proposals.

Face Recognition

One Shot Learning

A paradigm in machine learning whose goal is to let a model learn and recognize a new concept or class from only one (or very few) examples.

Siamese Network

A Siamese network uses two identical networks to compute features for the two images, and then uses the distance between the two feature vectors to decide whether the two images show the same person.

$$d(x_1,x_2) = ||f(x_1)-f(x_2)||^2$$

Training

$d$ is the distance between the two feature vectors, and $f$ is the encoding (embedding) function.

$$L(A,P,N) = \max(||f(A)-f(P)||^2-||f(A)-f(N)||^2+\alpha,\ 0)$$

(Triplet loss: $A$ is the anchor, $P$ is the positive example, $N$ is the negative example.)

Face Verification and Binary Classification

$$\hat{y} = \sum_{k=1}^{n} w_k\,|f(x^{(i)})_k-f(x^{(j)})_k|^2 + b$$

$n$ is the number of components of the encoding $f(x)$.
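
A small numpy sketch of the triplet loss defined above (the names are mine; `f_a`, `f_p`, `f_n` are the encodings of the anchor, positive and negative images, and the margin value 0.2 is only an illustrative choice):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    pos = np.sum((f_a - f_p) ** 2)   # squared distance to the positive example
    neg = np.sum((f_a - f_n) ** 2)   # squared distance to the negative example
    return max(pos - neg + alpha, 0.0)
```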

Neural Style Transfer

Neural style transfer applies the style of one image to another image.

paper: A Neural Algorithm of Artistic Style

Its cost function can be divided into two parts: a content cost function and a style cost function.

The basic idea of this algorithm (and of the previous one) is to use a pre-trained model to extract features from the images; the difference between two images is then measured as the distance between their features.

$$J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)$$

Content cost function

$$J_{content}(C,G) = \frac{1}{4n_Hn_Wn_C}\sum_{\text{all entries}}(a^{(C)}-a^{(G)})^2$$

Style cost function

$$J_{style}^{[l]}(S,G) = \frac{1}{4(n_Cn_Hn_W)^2}\lVert G^{[l](S)}-G^{[l](G)}\rVert_F^2$$

$$= \frac{1}{4(n_Cn_Hn_W)^2}\sum_{k=1}^{n_C}\sum_{k'=1}^{n_C}(G^{[l](S)}_{kk'}-G^{[l](G)}_{kk'})^2$$

$$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H}\sum_{j=1}^{n_W}a^{[l](S)}_{ijk}a^{[l](S)}_{ijk'}$$

$$G^{[l](G)}_{kk'} = \sum_{i=1}^{n_H}\sum_{j=1}^{n_W}a^{[l](G)}_{ijk}a^{[l](G)}_{ijk'}$$

$$J_{style}(S,G) = \sum_{l=1}^{L}\lambda^{[l]}J_{style}^{[l]}(S,G)$$

$k$ and $k'$ index the channels of the feature map, ranging from 1 to $n_C^{[l]}$.
$\lambda^{[l]}$ is a hyperparameter that controls the weight of each layer.

$G^{[l](S)}_{kk'}$ and $G^{[l](G)}_{kk'}$ are entries of the Gram matrix of the feature map; they measure the correlation between channel $k$ and channel $k'$. The idea behind this is that the style of an image is captured by the correlations between the channels of its feature maps.

(Source: https://www.cnblogs.com/yifanrensheng/p/12862174.html)
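
A short numpy sketch of the Gram matrix and the per-layer style cost above (the names are mine; `a` is assumed to have shape `(n_H, n_W, n_C)`):

```python
import numpy as np

def gram_matrix(a):
    """G[k, k'] = sum over spatial positions of a[..., k] * a[..., k']."""
    n_H, n_W, n_C = a.shape
    a_flat = a.reshape(n_H * n_W, n_C)   # each row is one spatial position
    return a_flat.T @ a_flat             # (n_C, n_C) channel-correlation matrix

def layer_style_cost(a_S, a_G):
    n_H, n_W, n_C = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (4 * (n_C * n_H * n_W) ** 2)
```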

RNN

waiting for update

GRU

waiting for update

LSTM

waiting for update

Embeddings

An embedding is a way to represent the words in a text.
Suppose there are 10,000 words; we can represent each word as a 300-dimensional vector, and this vector is the word's embedding. The 300 dimensions can be thought of as the number of (latent) categories or features.


Analogy reasoning

We can use cosine similarity to compute the similarity between two vectors.

Training and predicting word embeddings

We can use the skip-gram model: given a context word, it predicts a target word sampled from a window around it, and the output layer is a softmax. In practice the plain softmax is often replaced with a hierarchical softmax, which is essentially a binary tree whose leaves are the words: frequent words sit near the top of the tree and rare words near the bottom.

Negative Sampling

Negative Sampling is used to speed up training. The idea: instead of computing a softmax over all the words, we only compute it over a small subset, a few randomly chosen (negative) words plus our positive sample, and compute the loss by contrasting the positive sample with those negatives.

Seq2Seq

Encoder-Decoder

The encoder encodes the input sequence into a representation (its hidden states / final state), and the decoder decodes that representation into the output sequence.

Attention


  • Key: the objects we have (the items we can attend to).
  • Value: each key corresponds to one value.
  • Query: the object we want to find.

The attention mechanism is used to find the value that corresponds to the query.

We can abstract the problem as a function $f(x)$:
$x$ is the query, and $y_i$ is the value that corresponds to key $x_i$.

We want to find the value that best corresponds to the query.

$$f_1(x) = \frac{1}{n}\sum_{i=1}^{n}y_i$$

For $f_1$, the output is just the average of all the values, so it is not a good way to find the value that corresponds to the query; we spread our attention evenly over all the values!

$$f_2(x) = \sum_{i=1}^{n}\alpha_i(x, x_i)y_i$$

For $f_2$, we use $\alpha$ to weight the values, so it is better than $f_1$.

How should we choose $\alpha$?

$$\alpha_i(x, x_i) = \frac{\exp(-\frac{1}{2}(x-x_i)^2)}{\sum_{j=1}^{n}\exp(-\frac{1}{2}(x-x_j)^2)}$$

$$f_3(x) = \sum_{i=1}^{n}\alpha_i(x, x_i)y_i$$

$$f_3(x) = \sum_{i=1}^{n}\mathrm{softmax}\left(-\frac{1}{2}(x-x_i)^2\right)y_i$$

$||x-x_i||^2$ is the distance between the query and the key.

So $\alpha$ is the softmax of the (negative) distances between the query and the keys.

The value of $\alpha_i$ increases as $||x-x_i||^2$ decreases.

$$f_3(x) = \sum_{i=1}^{n}\frac{\exp(-\frac{1}{2}((x-x_i)w)^2)}{\sum_{j=1}^{n}\exp(-\frac{1}{2}((x-x_j)w)^2)}y_i$$

We add a learnable weight $w$ inside the distance between the query and the key, so the model can learn $w$ to find the value that best corresponds to the query!

attention scoring function

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}.$$

We use $a$ as the scoring function; it computes the similarity between the query and a key.

masked softmax

We can use a masked softmax to ignore positions we do not want to attend to: for those positions we set the score to a very large negative number, so their softmax output is effectively zero.

Additive attention

When the query and the keys are vectors of different lengths, we can use additive attention as the scoring function.

$$a(\mathbf q, \mathbf k) = \mathbf w_v^\top \tanh(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k) \in \mathbb{R},$$

where the learnable parameters are $\mathbf W_q\in\mathbb R^{h\times q}$, $\mathbf W_k\in\mathbb R^{h\times k}$ and $\mathbf w_v\in\mathbb R^{h}$.

Scaled dot-product attention

We will use this scoring function in the self-attention mechanism later.

When the query and the keys are vectors of the same length $d$, we can use the more efficient scaled dot-product attention as the scoring function.

$$a(\mathbf q, \mathbf k) = \mathbf{q}^\top \mathbf{k} /\sqrt{d}.$$

$$\mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{n\times v}.$$
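
A compact numpy sketch of scaled dot-product attention with an optional mask as discussed above (the names are mine; `Q`, `K`, `V` are assumed to have shapes `(n, d)`, `(m, d)` and `(m, v)`):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d)) V, optionally masking out some key positions."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # (n, m) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                          # (n, v)
```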

Bahdanau attention

Self-attention

We pass the input $\mathbf X$ through three fully connected layers to generate the queries $\mathbf Q$, the keys $\mathbf K$ and the values $\mathbf V$. The weights of these three layers are $\mathbf W_q\in\mathbb R^{d\times q}$, $\mathbf W_k\in\mathbb R^{d\times k}$ and $\mathbf W_v\in\mathbb R^{d\times v}$, respectively. This corresponds to:

$$\mathbf Q = \mathbf X \mathbf W_q, \quad \mathbf K = \mathbf X \mathbf W_k, \quad \mathbf V = \mathbf X \mathbf W_v.$$

Then we run scaled dot-product attention on the resulting $\mathbf Q$, $\mathbf K$ and $\mathbf V$:

$$\mathrm{Attention}(\mathbf Q, \mathbf K, \mathbf V) = \mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top}{\sqrt{d}}\right) \mathbf V.$$

What happened here? The queries, keys and values are all generated from the same input, and in the following computation the queries are dotted with the keys, which gives the relevance between them. In other words, the model can learn by itself how relevant each element $x_i$ of the input sequence is to every other element; this is the core idea of self-attention. Intuitively, self-attention is the sequence "attending to itself".

Multi-head attention


The input is passed through multiple sets of fully connected layers (i.e. multiple sets of trained weights) to generate multiple sets of queries, keys and values. These are fed into different attention heads; the heads' outputs are concatenated and then passed through a final fully connected layer to produce the output.

The advantage of multi-head attention is that each head has its own weights, so the model can learn to focus on different parts of the input sequence, which helps it capture the relationships within the sequence better.

Positional encoding

In self-attention, all inputs are processed in parallel, so the model itself has no information about the positions of the elements in the input sequence. To fix this, we introduce positional encoding: in practice, we simply add the encoded position information $\mathbf{P}$ to the input sequence $\mathbf{X}$ to get a new input $\mathbf{X}'$:

$$\mathbf{X}' = \mathbf{X} + \mathbf{P}$$

Why does simply adding $\mathbf{P}$ let the model learn position information? Because the positional encoding $\mathbf{P}$ is a matrix with the same shape as the input $\mathbf{X}$, and it follows a fixed pattern, so the model can learn that pattern and thereby learn the positions of the elements in the sequence. (?)

In the Transformer, the positional encoding is computed with the following formulas:

$$\begin{aligned} \mathrm{PE}_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d}}\right), \\ \mathrm{PE}_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d}}\right) \end{aligned}$$
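
A short numpy sketch of this sinusoidal positional encoding (the function name is mine; it returns a `(max_len, d)` matrix $\mathbf P$ to add to the embeddings, and assumes `d` is even):

```python
import numpy as np

def positional_encoding(max_len, d):
    """P[pos, 2i] = sin(pos / 10000^(2i/d)), P[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    P = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    j = np.arange(0, d, 2)[None, :]          # even dimension indices 2i
    angle = pos / np.power(10000, j / d)
    P[:, 0::2] = np.sin(angle)
    P[:, 1::2] = np.cos(angle)
    return P
```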

Transformer

Encoder

The raw input goes through an embedding layer, then positional encoding is added, and the result is fed into the first multi-head attention layer. This layer produces an output matrix with the same shape as the input sequence.

Block a:

Next comes a residual connection (adding the multi-head attention layer's input back) followed by layer normalization. Why layer normalization instead of batch normalization? Batch normalization normalizes each feature across different examples, while layer normalization normalizes across the features of a single example; since the inputs here are sequences, layer normalization is the better fit.

Then the result goes through a position-wise feed-forward network, which applies the same transformation to every position in the sequence: a fully connected layer, a ReLU activation, and another fully connected layer.

Then there is another residual connection and layer normalization.

End of block a.

Here, we can stack multiple copies of block a (the sketch below shows one such block).
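
A simplified, single-head numpy sketch of one such encoder block (all names are mine; it assumes square weight matrices so the shapes line up, and it omits the learnable scale/shift of layer normalization):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)              # normalize each position over its features

def encoder_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """Self-attention + add & layer norm, then position-wise FFN + add & layer norm."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values from the same X
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = (w / w.sum(axis=-1, keepdims=True)) @ V
    X = layer_norm(X + attn)                     # residual connection + layer norm
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2   # FC -> ReLU -> FC at every position
    return layer_norm(X + ffn)                   # second residual connection + layer norm
```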