This post is just a set of notes, not a complete tutorial.
Basic
The goal of training is to minimize the cost $J(w, b)$.
Here $J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$: the first term is the loss function, and the second is the regularization term.
Regularization is used to prevent overfitting. There are two types of regularization:
- L1 regularization: $\frac{\lambda}{m}\sum_j |w_j|$
- L2 regularization: $\frac{\lambda}{2m}\|w\|_2^2$
- L2 regularization is more common.
When $\lambda$ is large, the weights will be small, and the model will be simpler.
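As a minimal sketch of the L2-regularized cost above (the function and variable names are illustrative, not from the course):

```python
import numpy as np

def cost_with_l2(w, loss, lambd, m):
    """Total cost J = loss + (lambda / (2m)) * ||w||^2 (L2 regularization).

    w: weight matrix, loss: unregularized cost over the batch,
    lambd: regularization strength, m: number of training examples.
    """
    l2_term = (lambd / (2 * m)) * np.sum(np.square(w))
    return loss + l2_term
```

A larger `lambd` makes the penalty term dominate, which pushes the optimizer toward smaller weights.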
Recurrent Neural Networks
- Use the previous hidden state (activation) and the new input to make a prediction.
- Shortcoming: a unidirectional RNN doesn't use the later information in the sequence. Compare
He said, "Teddy Roosevelt was a great president"
with
He said, "Teddy bears are on sale!"
Deciding whether "Teddy" is a person's name requires the words that come after it.
Different types of RNNs
- One-to-one: standard neural network
- One-to-many: music generation
- Many-to-one: sentiment classification
- Many-to-many: machine translation
Forward propagation
$a^{\langle t\rangle} = g(W_{aa} a^{\langle t-1\rangle} + W_{ax} x^{\langle t\rangle} + b_a), \quad \hat{y}^{\langle t\rangle} = g(W_{ya} a^{\langle t\rangle} + b_y)$
Back propagation
Gradient descent
- Goal: minimize the overall loss $J = \sum_{t=1}^{T_y} L^{\langle t\rangle}(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle})$.
Vanishing and exploding gradients
- Vanishing gradients: when the eigenvalues of the weight matrix $W$ are smaller than 1, the gradients decay exponentially.
- Exploding gradients: when the eigenvalues of the weight matrix $W$ are larger than 1, the gradients grow exponentially.
Why? Here is an example: in a deep network, the output is a function of the weights and biases of all the layers before it ($W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$), so by the chain rule the gradients of the early layers are products of many per-layer factors.
If those factors are large, the product blows up; if they are small, it shrinks toward zero.
For this problem, we can change the initialization of $w$, or use the ReLU activation function. A common choice is He initialization, $\mathrm{Var}(w^{[l]}) = \frac{2}{n^{[l-1]}}$ (suited to ReLU).
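A minimal sketch of He initialization (the helper name is my own):

```python
import numpy as np

def he_init(n_out, n_in, seed=0):
    """He initialization: scale standard-normal weights by sqrt(2 / n_in),
    which keeps the variance of ReLU activations roughly constant
    from layer to layer."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
```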
Dropout Regularization
- Randomly shut down some neurons in each layer.
- It prevents overfitting.
- It is used more often in computer vision.
- Its drawback is that it makes the cost function less well-defined and harder to monitor.
```python
keep_prob = 0.8
```
Why divide by keep_prob? It keeps the expected value of the activations the same, so the cost function doesn't change scale. It also simplifies the test-time code, because it removes the scaling problem dropout would otherwise introduce.
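A minimal sketch of inverted dropout for one layer's activations (names are illustrative):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, seed=1):
    """Inverted dropout: zero out each unit with probability 1 - keep_prob,
    then divide by keep_prob so the expected activation is unchanged and
    no extra scaling is needed at test time."""
    rng = np.random.default_rng(seed)
    mask = rng.random(a.shape) < keep_prob
    return (a * mask) / keep_prob
```

At test time, dropout is simply switched off; no rescaling is needed because of the division above.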
Gradient Checking
- Used to check whether the backpropagation implementation is correct, i.e. whether the computed gradients are right.
- Don’t use in training, only for debugging.
- Doesn’t work with dropout.
Check that the numerical approximation $d\theta_{approx}[i] = \frac{J(\dots,\theta_i+\varepsilon,\dots) - J(\dots,\theta_i-\varepsilon,\dots)}{2\varepsilon}$ is close to the analytic gradient: a relative error $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$ around $10^{-7}$ is great, around $10^{-3}$ is worrying.
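A minimal sketch of the two-sided gradient check (function names are my own):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Approximate dJ/dtheta with the two-sided finite difference and
    compare it to the analytic gradient dtheta. Returns the relative
    error; values near 1e-7 suggest backprop is implemented correctly."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    den = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / den
```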
Mini-batch Gradient Descent
For example, if we have 10000 examples and set the batch size to 64, we get $\lceil 10000/64 \rceil = 157$ mini-batches (156 full ones plus a smaller last one).
A mini-batch of x and y is denoted $X^{\{t\}}, Y^{\{t\}}$.
- Batch Gradient Descent: use all m examples in each iteration.
- Stochastic Gradient Descent: use 1 example in each iteration. (too noisy, and never converges, and you will lose the speed advantage of computing in parallel.)
- Mini-batch Gradient Descent: use b examples in each iteration.
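A minimal sketch of splitting a dataset into mini-batches (examples stored as columns, as in the course's convention; the function name is my own):

```python
import numpy as np

def make_minibatches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples (columns of X and Y), then slice them into
    mini-batches of `batch_size`; the last batch may be smaller."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
            for t in range(0, m, batch_size)]
```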
Speeding up learning by improving the optimization algorithm and the learning rate
Exponentially weighted averages
This is used to smooth a data series: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$.
When $\beta = 0.9$, it roughly averages over the last 10 days. Why?
Expanding the recursion, $v_t = (1-\beta)\theta_t + (1-\beta)\beta\,\theta_{t-1} + (1-\beta)\beta^2\theta_{t-2} + \dots$, and $\beta^{1/(1-\beta)} \approx \frac{1}{e} \approx 0.35$.
So when a weight drops below $\frac{1}{e}$ we can treat it as 0, which means the formula above averages over roughly the last $\frac{1}{1-\beta}$ days.
The advantage of this method is that it doesn’t need to store all the data we want to average.
The problem is that it's not accurate at the beginning:
at the start the average is biased low, because $v_0 = 0$, so $v_1 \ll \theta_1$ and $v_2 \ll$ the true average. So we need bias correction: use $\frac{v_t}{1-\beta^t}$ instead of $v_t$.
The magic is that $1-\beta^t$ approaches 1 as $t$ increases, so the correction fades away once the average has warmed up.
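A minimal sketch of the exponentially weighted average with optional bias correction (names are illustrative):

```python
def ewa(data, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1-beta) * x_t,
    optionally bias-corrected by dividing by (1 - beta**t)."""
    v, out = 0.0, []
    for t, x in enumerate(data, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out
```

On a constant series, the corrected average recovers the true value from the very first step, while the uncorrected one starts far too low.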
Momentum
The Momentum algorithm is used to speed up the gradient descent algorithm.
The gradient descent algorithm is sometimes slow, because it oscillates back and forth on its way to the minimum.
In the Momentum algorithm, we compute an exponentially weighted average of the gradients and use that average to update the weights: $v_{dW} = \beta v_{dW} + (1-\beta)\,dW, \quad W := W - \alpha v_{dW}$ (and similarly for $b$). This reduces the oscillation.
Intuitively, when the cost surface is a long, narrow "bowl", plain gradient descent oscillates across the short axis. Because momentum averages the gradients, the oscillating components roughly cancel while the components pointing toward the minimum accumulate, so the updates to $w$ and $b$ speed up along the useful direction and slow down in the oscillating one.
RMSprop
RMSprop is similar to Momentum, but it keeps an exponentially weighted average of the squared gradients and divides by its square root: $s_{dW} = \beta_2 s_{dW} + (1-\beta_2)\,dW^2, \quad W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \varepsilon}$. This damps the directions with large oscillations on the "bowl"-shaped $J$.
The $\varepsilon$ is used to avoid division by zero.
Adam Optimization Algorithm
Adam is the combination of Momentum and RMSprop.
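A minimal sketch of one Adam update combining the two pieces above, plus a tiny usage example minimizing $J(w) = w^2$ (names and hyperparameter defaults follow common convention; the helper itself is my own):

```python
import numpy as np

def adam_step(w, dw, state, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (v) plus RMSprop (s), both bias-corrected."""
    state["v"] = beta1 * state["v"] + (1 - beta1) * dw
    state["s"] = beta2 * state["s"] + (1 - beta2) * dw ** 2
    v_hat = state["v"] / (1 - beta1 ** t)
    s_hat = state["s"] / (1 - beta2 ** t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)

# Usage: minimize J(w) = w^2 starting from w = 1.0 (gradient is 2w).
w = np.array(1.0)
state = {"v": 0.0, "s": 0.0}
for t in range(1, 2001):
    w = adam_step(w, 2 * w, state, t, lr=0.01)
```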
Learning rate decay
- $\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$
- exponential decay: $\alpha = 0.95^{\text{epoch\_num}}\,\alpha_0$
- $\alpha = \frac{k}{\sqrt{t}}\,\alpha_0$ (k is a constant, t is the iteration number)
- discrete staircase decay: e.g. halve $\alpha$ every few epochs.
Hyperparameter tuning
- Randomly sample hyperparameter values (rather than searching over a grid).
- Coarse to fine: first use a wide range of hyperparameters, and then use a narrower range to find the best hyperparameters.
- Use the appropriate scale to pick hyperparameters.
Batch Normalization
The normalization technique is applied not only to the input $X$ but also to the hidden layers. It normalizes the mean and variance of the $z$ in each layer. (That is exactly what normalizing to a standard distribution does; by adjusting $\gamma$ and $\beta$ below we can force the output $z$ of a neuron (called $\tilde{z}$ below) to take whatever mean and variance we want, which makes the distribution of $z$ more stable and the network easier to train. It has several other benefits, covered below.)
Normalize the $z^{[l]}$ in each layer like we normalized the input before:
$\mu = \frac{1}{m}\sum_i z^{(i)}, \quad \sigma^2 = \frac{1}{m}\sum_i (z^{(i)}-\mu)^2, \quad z_{norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\varepsilon}}, \quad \tilde{z}^{(i)} = \gamma\, z_{norm}^{(i)} + \beta$
The $\gamma$ and $\beta$ are learnable parameters.
The bias $b^{[l]}$ can be dropped, because the normalization step subtracts the mean, which cancels any constant added to $z$.
- Speed up the training, makes the output more stable.
- It makes the weights less sensitive to the initial values.
- It makes the learning rate less sensitive to the choice of hyperparameters.
- In an FC layer, the mean and variance are computed over the feature dimension; in a convolutional network, they are computed over the channel dimension.
Gradient descent with batch normalization
Forward propagation
Backward propagation
compute the gradients $dW^{[l]}$, $d\gamma^{[l]}$, and $d\beta^{[l]}$.
update parameters: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $\gamma^{[l]} := \gamma^{[l]} - \alpha\, d\gamma^{[l]}$, $\beta^{[l]} := \beta^{[l]} - \alpha\, d\beta^{[l]}$.
With batch normalization, each hidden layer depends less on the exact distribution produced by the layers before it, which makes learning in the later layers easier. It also has a slight regularization effect: regularization essentially reduces overfitting by adding some noise, and batch normalization, by normalizing with per-mini-batch statistics and rescaling with $\gamma$ and $\beta$, limits how much a change in the earlier layers' output (i.e. the later layers' input distribution) affects the later layers, which can likewise be viewed as injecting a bit of noise.
Batch normalization at test time
During training we compute $\mu$ and $\sigma^2$ on each mini-batch.
At test time we may need to process examples one at a time, so we can't compute $\mu$ and $\sigma^2$ from a batch. Instead, we keep an exponentially weighted average of the $\mu$ and $\sigma^2$ seen across the training mini-batches and use those at test time.
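A minimal sketch of the batch norm forward pass with running statistics, assuming activations are stored as (features x examples); all names are illustrative:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, running, momentum=0.9,
                      eps=1e-8, training=True):
    """Batch norm over a mini-batch. In training, normalize with the batch
    mean/variance and update exponentially weighted running statistics;
    at test time, use the running statistics instead of batch statistics."""
    if training:
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mu"], running["var"]
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta
```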
Softmax Regression
$J$ is the cost function (averaged over the batch), and $L$ is the per-example loss function.
- Softmax: $a_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
Exponentiation amplifies the larger values so the prediction stands out, and dividing by $\sum_j e^{z_j}$ keeps the outputs between 0 and 1 and makes them sum to 1.
- Loss: $L(\hat{y}, y) = -\sum_i y_i \log \hat{y}_i$
The softmax function is used for the multi-class classification problem.
For the loss function: since $y$ is usually a one-hot vector, only one $y_i$ is nonzero, so $L = -\log \hat{y}_c$, where $c$ is the index of the nonzero entry of $y$. The loss therefore singles out the true class and pushes its predicted probability as high as possible (maximum likelihood estimation).
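A minimal sketch of softmax plus the cross-entropy loss above (names are illustrative; the max-subtraction is the standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    """Stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y, y_hat):
    """L = -sum_i y_i * log(y_hat_i); with one-hot y only the true class
    contributes, so minimizing L pushes that predicted probability up."""
    return -np.sum(y * np.log(y_hat))
```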
Training Tricks
The dataset
- 7/3: training set/dev set
- 6/2/2: training set/dev set/test set
- 98/1/1: training set/dev set/test set (for 1B+ dataset)
- The three sets should come from the same distribution.
Error analysis
- Pick a baseline, which can be human-level performance (the human error rate). Compare the error rates of the training, dev, and test sets against each other, and tackle the largest gap first (e.g. if the dev-set error is much higher than the training-set error, overfitting is probably the first problem to address). This makes debugging much more efficient.
- For mislabeled examples, if the dataset isn't too large, we can manually inspect them to see which features most often cause the misclassification.
- Manually counting errors really matters.
- For the data sources we care most about (e.g. for images, we sometimes care more about phone photos than web images), we can give that kind of data a larger share in the dev and test sets than in the training set.
Training-dev set
- If the error gap between the training-dev set and the training set is large, the model is probably overfitting and failing to generalize even to other data from the same distribution.
- If the error gap between the training-dev set and the dev set is large, the dev set and the training set probably come from different distributions (a data mismatch problem).
For the data mismatch problem, we can manually inspect the dev/test sets to understand what kind of mismatch we are facing, and we can address it by artificially synthesizing training data.
Transfer Learning
- Use the pre-trained model and fine-tune it.
Multi-task learning
- Used to train multiple tasks at the same time, which can improve the model's ability to generalize.
Convolutional Neural Networks
Padding
- Valid: no padding
- Same: padding so that the output size is the same as the input size.
$$p = \frac{f-1}{2}$$
p is the padding size, f is the filter size.
Strided convolution
- Stride: the number of pixels the filter moves each step.
$$n_H^{[l]} = \left\lfloor \frac{n_H^{[l-1]} + 2p - f}{s} + 1 \right\rfloor$$
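The output-size formula can be checked with a tiny helper (a sketch; the function name is my own):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1
```

For "same" padding with f = 5, p = (5 - 1) / 2 = 2 keeps a 28-pixel input at 28 pixels.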
A convolution layer
- $l$ is the layer number.
- the input (the previous layer's activation): $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
- each filter: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
- weights: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$
- bias: $n_c^{[l]}$, i.e. $(1, 1, 1, n_c^{[l]})$
- activation (the output): $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
When multiplying $a$ by $w$, the products from all channels are summed into a single value.
A ConvNet
- The width and height of the input decrease and the channel increases as the network goes deeper.
Pooling layers
- Max pooling: take the max value in the pooling window.
- Average pooling: take the average value in the pooling window.
A NN Example
In most convolutional neural networks, the activation size gets smaller as the layer index increases.
In the figure, "parameters" means the number of weights and biases; each filter has one bias.
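The parameter count of a conv layer follows directly from the filter dimensions above; a one-line sketch (the function name is my own):

```python
def conv_layer_params(f, n_c_prev, n_c):
    """Parameters in a conv layer: each of the n_c filters has
    f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_c
```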
Why use the convolutional layer
- Parameter sharing: a feature detector that’s useful in one part of the image is probably useful in another part of the image.
- Sparsity of connections: in each layer, each output value depends only on a small number of inputs.
- Reduce the parameters: the number of parameters is reduced.
These properties make convolutional networks work very well for image recognition, e.g. translation invariance: if an object moves within the image, the model should still be able to recognize it.
Classic Networks
LeNet-5
AlexNet
VGG-16
ResNet
ResNet solves the problem of vanishing gradients and exploding gradients by using the skip connection(aka shortcut connection).
Consider the extreme case: the residual block computes $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$, so if $W^{[l+2]} = 0$ (and $b^{[l+2]} = 0$), the output is $g(a^{[l]}) = a^{[l]}$ (with ReLU and non-negative activations). The block can easily learn the identity, so adding it doesn't hurt the network.
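A minimal sketch of a fully connected residual block illustrating the identity argument (names are illustrative; real ResNets use conv layers):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(z[l+2] + a[l]): the skip connection adds a[l] before the
    second activation. If W2 and b2 are zero, the block outputs relu(a[l]),
    so learning the identity mapping is easy."""
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)
```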
DenseNet
An improved version of ResNet: instead of adding the earlier output to the later output, it concatenates them. It borrows the idea of a Taylor expansion. Suppose ResNet computes
$f(x) = x + g(x)$
or, more simply, output = input + residual.
Then DenseNet just keeps adding terms, $x \rightarrow [x, f_1(x)] \rightarrow [x, f_1(x), f_2([x, f_1(x)])] \rightarrow \dots$, using $[\cdot, \cdot]$ for concatenation.
1x1 convolution
It’s used to reduce the number of channels. So we can reduce a large number of computations.
A reasonably built bottleneck layer can not only cut a large amount of computation, but also does so without hurting the model's performance.
Inception Network
The Inception network addresses the problem of choosing filter sizes (and choosing between convolution and pooling). It applies several filter sizes in parallel and concatenates the outputs, letting the network learn which filter size works best.
Benchmarking
- Ensembling: train multiple networks independently and average their outputs.
- Multi-crop at test time: crop the image in multiple ways and average the outputs.
Object Detection
Localization -> Detection
Localization / Landmark detection
Essentially, the training set must provide the bounding box, the label, and the landmark coordinates, which are then used for training.
Sliding windows detection
After we train a classifier that checks whether an image contains the object we expect, we can use sliding windows to detect the object. The algorithm is:
- Use a small window and check whether the object is inside it.
- Move the window to the right and check again, continuing across the image.
- If we don't find the object, use a larger window and check again.
Turning FC layers into Conv layers
We use filters with the same height and width as their input, so the output has size $1 \times 1 \times n$, where n is the number of filters.
Convolutional implementation of sliding windows
If we implement sliding windows convolutionally, the computation shrinks dramatically, because many of the intermediate results are shared between windows (the red-hatched regions in the figure are the shared computations). And only a single, parallelizable convolution pass is needed.
You Only Look Once
aka YOLO, it’s a real-time object detection algorithm. It divides the image into a grid and then predicts the bounding box and the class of the object in each grid.
IoU
Intersection over Union, it’s used to measure the accuracy of the object detection algorithm.
Non-max suppression
非极大值抑制 in Chinese.
When the detection algorithm produces multiple bounding boxes for the same object, non-max suppression first discards all boxes whose probability is at or below a threshold (e.g. 0.6), then keeps the box with the highest probability and removes every remaining box that has a high IoU (e.g. >= 0.5) with the kept one.
Anchor boxes
Each object in training image is assigned to the grid that contains the object’s midpoint and the anchor box with the highest IoU with the object.
The training labels then have two parts, one per anchor box:
$ y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3, p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$
YOLO algorithm
Waiting for update…
R-CNN
Sometimes the sliding-window detector wastes time on rectangles that clearly contain no object; R-CNN addresses this by classifying only proposed regions.
R-CNN will use the segmentation algorithm to segment the image and then use the object detection algorithm to detect the object in the segmented image.
Fast R-CNN
Fast R-CNN is faster than R-CNN, because it uses the convolutional layers to extract features from the whole image once, and then runs the detection on that shared feature map.
Faster R-CNN
It uses the Region Proposal Network(A convolutional network) to generate the region proposals.
Face Recognition
One Shot Learning
A paradigm in machine learning whose goal is to let a model learn and recognize a new concept or class from only one (or very few) examples.
Siamese Network
A Siamese network runs two identical networks to compute feature encodings of two images, then uses the distance between the two encodings to decide whether the images show the same person.
Training
$d(A, B) = \|f(A) - f(B)\|_2^2$, where d is the distance between the two encodings and f is the encoding function.

$$L(A,P,N) = \max\big(\|f(A)-f(P)\|^2 - \|f(A)-f(N)\|^2 + \alpha,\; 0\big)$$

(Triplet Loss. A is the anchor, P is the positive, N is the negative.)

Face Verification and Binary Classification

$$\hat{y} = \sigma\Big(\sum_{k=1}^{n} w_k \big|f(x^{(i)})_k - f(x^{(j)})_k\big| + b\Big)$$

n is the dimension of the encoding vector.
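A minimal sketch of the triplet loss for a single (A, P, N) triple of precomputed encodings (names are illustrative):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A,P,N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0):
    push the anchor-positive distance at least `alpha` below the
    anchor-negative distance."""
    d_pos = np.sum((f_a - f_p) ** 2)
    d_neg = np.sum((f_a - f_n) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)
```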
Neural Style Transfer
Neural style transfer applies the style of one image to another image.
paper: A Neural Algorithm of Artistic Style
Its cost function can be divided into two parts: a content cost function and a style cost function.
The basic idea of this algorithm, like the previous one, is to use a pre-trained model to extract the features of an image; the difference between two images is then the distance between their features.
Content cost function
Style cost function
$k$ and $k'$ are channels of the feature map, ranging from 1 to $n_c^{[l]}$.
$\lambda^{[l]}$ is a hyperparameter that controls the weight of each layer.
$G^{[l](S)}$ and $G^{[l](G)}$ are the Gram matrices of the style image's and the generated image's feature maps; $G^{[l]}_{kk'}$ measures the correlation between channels $k$ and $k'$. The idea behind this is that the style of an image is captured by the correlations between the channels of its feature maps.
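A minimal sketch of computing the Gram matrix from one layer's feature map (the function name and channel-last layout are my own choices):

```python
import numpy as np

def gram_matrix(a):
    """a: feature map of shape (n_H, n_W, n_C). G[k, k'] is the inner
    product of (flattened) channels k and k', i.e. how strongly the two
    channels co-activate, which is what the style cost compares."""
    n_h, n_w, n_c = a.shape
    flat = a.reshape(n_h * n_w, n_c)  # each column is one flattened channel
    return flat.T @ flat              # shape (n_C, n_C), symmetric
```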
(Source: https://www.cnblogs.com/yifanrensheng/p/12862174.html)
RNN
waiting for update
GRU
waiting for update
LSTM
waiting for update
Embeddings
An embedding is a way to represent words as dense vectors.
Suppose we have a 10000-word vocabulary; we can represent each word as a 300-dimensional vector, and that vector is the word's embedding. The 300 dimensions act roughly like learned feature categories.
Analogy reasoning
We can use cosine similarity to measure how similar two vectors are.
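A one-function sketch of cosine similarity (the name is standard; the helper itself is my own):

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = u.v / (||u|| * ||v||); values near 1 mean the two
    vectors point in similar directions."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

For analogy reasoning, one looks for the word w maximizing cosine_similarity(e_w, e_king - e_man + e_woman), with the e vectors taken from the learned embedding matrix.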
Training and using word embeddings
We can use the skip-gram model: given a context word, it predicts a nearby target word, with a softmax output layer. In practice, hierarchical softmax often replaces the plain softmax: it is essentially a binary tree whose leaves are the words, with frequent words near the top of the tree and rare words deeper down.
Negative Sampling
Negative sampling speeds up training. The idea: instead of computing a softmax over all words, we compare the positive example against only a small, randomly chosen set of negative words and compute the loss from those binary comparisons.
Seq2Seq
Encoder-Decoder
The encoder encodes the input sequence into a representation, and the decoder decodes the encoder's output into the target sequence.
Attention
- key: the items we can attend to
- value: each key corresponds to one value
- query: what we are looking for
The attention mechanism finds, for a given query, the values whose keys best match it.
We can abstract the problem as a function $f(x)$: x is the query, and each key $x_i$ has an associated value $y_i$.
We want the output that best corresponds to the query.
For $f_1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$, the output is just the average of the values, so it's not a good way to answer the query: we spread our attention evenly over all the values!
For $f_2(x) = \sum_{i=1}^{n} \alpha(x, x_i)\, y_i$, we use the weights $\alpha(x, x_i)$ to weight the values, so it's better than $f_1$.
How should we choose the $\alpha(x, x_i)$?
One choice is $\alpha(x, x_i) = \mathrm{softmax}\big(-\frac{1}{2}(x - x_i)^2\big)$, where $(x - x_i)^2$ is the distance between the query and the key.
So $\alpha$ is the softmax of the (negated) distance between the query and the key, and its value increases as the distance decreases.
We can also add a learnable weight $w$ to the distance, $\alpha(x, x_i) = \mathrm{softmax}\big(-\frac{1}{2}((x - x_i)w)^2\big)$, so the model can learn how sharply to focus on the best-matching value!
attention scoring function
We use a scoring function $a(q, k)$ to calculate the similarity between the query and the key.
masked softmax
We can use a masked softmax to ignore values we don't want to attend to: for those positions we set the score to a very large negative number, so their softmax output is (effectively) zero.
Additive attention
When the query and the key are vectors of different lengths, additive attention can be used as the scoring function: $a(q, k) = w_v^\top \tanh(W_q q + W_k k)$.
The learnable parameters are $W_q$, $W_k$, and $w_v$.
Scaled dot-product attention
This scoring function is used later in self-attention.
When the query and the key are vectors of the same length $d$, the more efficient scaled dot-product attention can be used: $a(q, k) = \frac{q^\top k}{\sqrt{d}}$.
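A minimal single-head sketch of scaled dot-product attention over matrices of queries, keys, and values (names are illustrative; no masking or batching):

```python
import numpy as np

def softmax(z, axis=-1):
    """Stable row-wise softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each query-key score is their dot
    product, scaled by sqrt(d) to keep the softmax well-behaved."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V
```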
Bahdanau attention
Self-attention
We pass the input $X$ through three fully connected layers to produce the queries $Q$, keys $K$, and values $V$; the weights of the three layers are $W_q$, $W_k$, and $W_v$:
$Q = X W_q, \quad K = X W_k, \quad V = X W_v$
Then we run scaled dot-product attention on the resulting $Q$, $K$, and $V$:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V$
What happened here? The queries, keys, and values are all generated from the same input, and the dot product of the queries with the keys measures the relevance between them. In other words, the model learns the pairwise relevance between the elements of the input sequence itself; that is the core idea of self-attention. Intuitively, self-attention is the sequence "attending to itself".
Multi-head attention
The input is passed through several groups of fully connected layers (i.e., several sets of learned weights) to produce several groups of queries, keys, and values. These are fed to different attention heads, whose outputs are concatenated and passed through a final fully connected layer to produce the output.
The advantage of multi-head attention is that each head has its own weights, so different heads can learn to focus on different parts of the input sequence, which helps the model capture more kinds of relationships within it.
Positional encoding
In self-attention, all positions are processed in parallel, so the model by itself has no information about where each element sits in the sequence. To fix this, we introduce a positional encoding: in practice, the encoded position matrix $P$ is simply added element-wise to the input sequence $X$ to get the new input $X + P$.
Why does simply adding work? Because $P$ is a matrix of the same size as $X$ with a regular structure, so the model can learn that structure and thereby recover the position of each element. (?)
In the Transformer, the positional encoding is computed as
$P_{pos, 2i} = \sin\Big(\frac{pos}{10000^{2i/d}}\Big), \quad P_{pos, 2i+1} = \cos\Big(\frac{pos}{10000^{2i/d}}\Big)$
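A minimal sketch of the sinusoidal positional encoding above (assumes an even embedding dimension d; the function name is my own):

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal encoding: P[pos, 2i] = sin(pos / 10000^(2i/d)),
    P[pos, 2i+1] = cos(pos / 10000^(2i/d)). The result is added
    directly to the embedded input sequence. Assumes d is even."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000, i / d)       # (max_len, d/2)
    P = np.zeros((max_len, d))
    P[:, 0::2] = np.sin(angle)
    P[:, 1::2] = np.cos(angle)
    return P
```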
Transformer
Encoder
The raw input goes through the embedding layer, then the positional encoding, and then into the first multi-head attention layer, which produces an output matrix of the same size as the input sequence.
a:
Next come a residual connection (adding the multi-head attention layer's input) and layer normalization. Why layer normalization instead of batch normalization? Batch normalization normalizes each feature dimension across different samples, while layer normalization normalizes across the feature dimensions within a single sample; since our inputs are sequence data, we use layer normalization.
Then comes a position-wise feed-forward network: it applies the same transformation to the element at every position of the sequence, namely a fully connected layer, a ReLU activation, and another fully connected layer.
Then we apply a residual connection and layer normalization once more.
end a
Here, block a can be stacked multiple times.