Perceptron

A perceptron is a linear classifier.

Its inputs are \(x_r\), its weights are \(w_r\), and \(w_0\) is the bias term.

\(g\) is the activation function; for the perceptron, \(g\) is the unit step function.

\(\hat{y}\) is the output: \(\hat{y} = u(\sum_{r = 0}^n w_r x_r)\), where \(x_0 = 1\).

In vector form:

\(\hat{y} = u(w^\mathrm{T} x)\).
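
As a minimal sketch of this forward pass (assuming NumPy; the weight vector carries the bias \(w_0\) as its first entry and the input carries a leading \(x_0 = 1\), both my own conventions):

```python
import numpy as np

def unit_step(z):
    # u(z) = 1 for z > 0, else 0
    return np.where(z > 0, 1.0, 0.0)

def perceptron_predict(w, x):
    # x carries a leading x_0 = 1 so that w[0] acts as the bias w_0
    return unit_step(w @ x)

# Example: w = (w_0, w_1, w_2), x = (1, x_1, x_2)
w = np.array([-0.5, 1.0, 1.0])
x = np.array([1.0, 0.2, 0.7])
print(perceptron_predict(w, x))  # 1.0, since w^T x = 0.4 > 0
```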

PLA

Perceptron Learning Algorithm

Steps:

  1. Initialize the weights \(w\), either to all zeros or to random values

  2. For any instance \(i\) with features \(x^{(i)}\), compute its classification: \(\hat{y}^{(i)} = u(w^\mathrm{T} x^{(i)})\)

  3. Pick a misclassified instance and update the weights: \(w \leftarrow w + \Delta w\)

  4. Repeat steps 2 and 3 until the error falls below a threshold or the maximum number of iterations is reached

Weight update:

\[
\Delta w = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x^{(i)}
\]

where \(\eta\) is the learning rate.
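
A minimal NumPy sketch of the algorithm above (the toy data, the labels in \(\{0, 1\}\), and the learning rate `eta` are illustrative assumptions):

```python
import numpy as np

def unit_step(z):
    return np.where(z > 0, 1.0, 0.0)

def pla_train(X, y, eta=1.0, max_iter=1000):
    """Perceptron Learning Algorithm on labels in {0, 1}.
    A column of ones is prepended so that w[0] plays the role of the bias w_0."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1
    w = np.zeros(Xb.shape[1])                        # step 1: initialize weights
    for _ in range(max_iter):
        y_hat = unit_step(Xb @ w)                    # step 2: classify every instance
        wrong = np.flatnonzero(y_hat != y)
        if wrong.size == 0:                          # step 4: stop once nothing is misclassified
            break
        i = wrong[0]                                 # step 3: pick a misclassified instance
        w = w + eta * (y[i] - y_hat[i]) * Xb[i]      # w <- w + Δw
    return w

# Toy usage: learn a linearly separable AND-like function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
print(pla_train(X, y))
```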


The perceptron is a linear classifier, which means it cannot classify data that is not linearly separable (for example, XOR).

To achieve non-linear classification, we need:

  • Other activation functions

  • Multiple perceptrons: a Multi-Layer Perceptron (MLP)

Activation Functions

Unit Step

\[
u(x) = \begin{cases}+1 & x > 0 \\ 0 & x \leq 0\end{cases}
\]

Sign

\[
\mathrm{sgn}(x) = \begin{cases}+1 & x > 0 \\ -1 & x \leq 0\end{cases}
\]

Sigmoid

\[
\sigma (x) = \frac{1}{1 + e^{-x}}
\]

tanh

\[
\tanh (x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
\]

ReLU

\[
\mathrm{ReLU}(x) = \max (0, x)
\]
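
A direct NumPy transcription of these definitions (a sketch; the function names are mine):

```python
import numpy as np

def unit_step(x):
    return np.where(x > 0, 1.0, 0.0)

def sign(x):
    return np.where(x > 0, 1.0, -1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)
```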


If \(g\) is differentiable, gradient descent can be used to efficiently minimize the error \(\varepsilon\) between \(y\) and \(\hat{y}\).

Sigmoid, tanh, and ReLU are non-linear activation functions, and they are differentiable (ReLU everywhere except at \(x = 0\)).

Gradient Descent

Gradient descent updates the weights in the direction of the negative gradient of the error: \(w \leftarrow w - \eta \nabla \varepsilon\).

Weight update (single neuron):

\[
\nabla \varepsilon = \frac{\mathrm{d}\varepsilon}{\mathrm{d}w} = \begin{pmatrix} \partial\varepsilon / \partial w_1 \\ \vdots \\ \partial\varepsilon / \partial w_r \\ \vdots \\ \partial\varepsilon / \partial w_n \end{pmatrix} = \frac{\mathrm{d}\varepsilon}{\mathrm{d}g} \frac{\mathrm{d}g}{\mathrm{d}f} \, x
\]
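
For example, for a sigmoid neuron with the square error defined below, the two factors are \(\mathrm{d}\varepsilon / \mathrm{d}g = -(y - \hat{y})\) and \(\mathrm{d}g / \mathrm{d}f = (1 - \hat{y})\hat{y}\) (both listed further down), so the gradient works out to:

\[
\nabla \varepsilon = -(y - \hat{y}) \, (1 - \hat{y}) \, \hat{y} \; x
\]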

Binary Cross-Entropy (for classification)

\[
\varepsilon = -y \log \hat{y}
\]

Square Error (for regression)

\[
\varepsilon = \frac{1}{2}(y - \hat{y})^2
\]
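
Both losses are one-liners in NumPy (a sketch; the small `eps` guarding against \(\log 0\) is my own addition):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # eps guards against log(0); it is not part of the definition above
    return -y * np.log(y_hat + eps)

def square_error(y, y_hat):
    return 0.5 * (y - y_hat) ** 2
```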

Error \(\varepsilon\) and its derivative \(\mathrm{d}\varepsilon / \mathrm{d}\hat{y}\):

  • Square Error: \(-(y - \hat{y})\)

  • Binary Cross Entropy: \(-\frac{y}{\hat{y}}\)

Activation function \(\hat{y} = g(f)\) and its derivative \(\mathrm{d}g / \mathrm{d}f\):

  • Sigmoid: \((1 - g)g\)

  • tanh: \(1 - g^2\)

  • ReLU: \([f > 0]\)
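
Putting these derivatives together with gradient descent, here is a minimal sketch of training a single sigmoid neuron under the square error (the toy OR-like data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_neuron(X, y, eta=1.0, epochs=10000):
    """Gradient descent on a single sigmoid neuron with the square error.
    Uses dE/dy_hat = -(y - y_hat) and dg/df = (1 - g) g, as listed above."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        f = Xb @ w                                   # pre-activation f = w^T x
        y_hat = sigmoid(f)                           # g(f)
        dE_dyhat = -(y - y_hat)
        dg_df = (1.0 - y_hat) * y_hat
        grad = Xb.T @ (dE_dyhat * dg_df) / len(y)    # average gradient over the batch
        w -= eta * grad                              # w <- w - eta * grad(E)
    return w

# Toy usage: an OR-like function; predictions should round to [0, 1, 1, 1]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = train_neuron(X, y)
print(np.round(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w)))
```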

Neural Networks

In vector form, a layer's weights are collected into a matrix \(W^{[l]}\) and its activations into a vector \(a^{[l]}\).

Layer activation:

\[
f^{[l]} = W^{[l]} a^{[l-1]}, \qquad a^{[l]} = g^{[l]}(f^{[l]})
\]

where \(a^{[0]} = x\) and \(\hat{y} = a^{[L]}\) (bias terms can be absorbed into \(W^{[l]}\) as in the single-neuron case).
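
A sketch of this layer-by-layer forward pass in NumPy (the layer sizes and the choice of sigmoid for every layer are arbitrary assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """weights[l-1] holds W^[l]; returns the list of activations a^[0], ..., a^[L]."""
    activations = [x]                        # a^[0] = x
    for W in weights:
        f = W @ activations[-1]              # f^[l] = W^[l] a^[l-1]
        activations.append(sigmoid(f))       # a^[l] = g^[l](f^[l])
    return activations                       # activations[-1] is y_hat

# Example: a 2 -> 3 -> 1 network with random weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
print(forward(np.array([0.5, -0.2]), weights)[-1])
```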

Backpropagation

Weight update (neural network):

\[
\nabla \varepsilon = \frac{\mathrm{d}\varepsilon}{\mathrm{d}W} = \begin{pmatrix} \partial\varepsilon / \partial W^{[1]} \\ \vdots \\ \partial\varepsilon / \partial W^{[l]} \\ \vdots \\ \partial\varepsilon / \partial W^{[L]} \end{pmatrix}
\]

\[
\frac{\mathrm{d}\varepsilon}{\mathrm{d}W^{[l]}} = \frac{\mathrm{d}\varepsilon}{\mathrm{d}\hat{y}} \frac{\mathrm{d}\hat{y}}{\mathrm{d}W^{[l]}}
\]

Computing \(\mathrm{d}\hat{y} / \mathrm{d}W^{[l]}\):

We calculate the gradient \(\mathrm{d}\hat{y} / \mathrm{d}W^{[l]}\) of the current layer \(l\) as the product of the following (written out after the list):

  1. Previous layer’s activation \(a^{[l -1]}\)

  2. Current layer’s activation-function gradient \(\mathrm{d}g^{[l]} / \mathrm{d}f^{[l]}\)

  3. Next layer’s weights \(W^{[l + 1]}\)

  4. Backpropagated partial gradient from later layers \(\delta^{[l + 1]}\)
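
Written out in the usual recursive form (with \(\delta^{[l]} = \mathrm{d}\varepsilon / \mathrm{d}f^{[l]}\) and \(\odot\) denoting element-wise multiplication):

\[
\delta^{[l]} = \left( (W^{[l+1]})^{\mathrm{T}} \delta^{[l+1]} \right) \odot \frac{\mathrm{d}g^{[l]}}{\mathrm{d}f^{[l]}}, \qquad
\frac{\mathrm{d}\varepsilon}{\mathrm{d}W^{[l]}} = \delta^{[l]} \, (a^{[l-1]})^{\mathrm{T}}
\]

with \(\delta^{[L]} = \frac{\mathrm{d}\varepsilon}{\mathrm{d}\hat{y}} \odot \frac{\mathrm{d}g^{[L]}}{\mathrm{d}f^{[L]}}\) at the output layer.

A NumPy sketch of one backpropagation step for a network of sigmoid layers trained with the square error (the function names, layer sizes, and single-example update are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y, weights, eta=0.5):
    """One gradient-descent step on a sigmoid MLP with the square error."""
    # Forward pass: store every activation a^[l]
    activations = [x]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))
    y_hat = activations[-1]

    # Output layer: delta^[L] = dE/dy_hat ⊙ dg/df = -(y - y_hat) * (1 - g) g
    delta = -(y - y_hat) * (1.0 - y_hat) * y_hat
    for l in reversed(range(len(weights))):
        a_prev = activations[l]                     # activation feeding this layer
        grad = np.outer(delta, a_prev)              # dE/dW = delta (a_prev)^T
        if l > 0:
            g = activations[l]
            # Push delta back through this layer's weights and the previous
            # layer's sigmoid derivative (1 - g) g before updating the weights
            delta = (weights[l].T @ delta) * (1.0 - g) * g
        weights[l] = weights[l] - eta * grad        # gradient-descent update
    return weights

# Toy usage: a 2 -> 3 -> 1 network, one training example
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
weights = backprop_step(np.array([0.5, -0.2]), np.array([1.0]), weights)
```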