Linear Separability

Many models assume that the data features are linearly separable, but real data does not necessarily satisfy this assumption. This is especially common with unstructured data, such as images, time series, and text.

If the features are not linearly separable, a linear model cannot learn a good fit, and a more complex model is needed.

How to check for linear separability:

  • Visualization, but this only works for low-dimensional datasets

  • Computational metrics

    • Linear Soft-Margin SVM (check its training accuracy; see the sketch after this list)

    • Reduce dimensions (LDA, PCA), then visualize separability
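
As a concrete version of the SVM-based check, a minimal sketch, assuming scikit-learn is available and using illustrative synthetic data: train a linear soft-margin SVM with a large C (approximating a hard margin) and inspect its training accuracy; accuracy near 1.0 suggests the classes are (nearly) linearly separable.

```python
# Minimal sketch: check linear separability with a linear soft-margin SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # illustrative data

# Large C ~ hard margin: margin violations are heavily penalized.
clf = LinearSVC(C=1000, max_iter=10000).fit(X, y)

# Training accuracy close to 1.0 suggests (near-)linear separability.
print(f"training accuracy: {clf.score(X, y):.3f}")
```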

How to mitigate:

  • Find the truly useful features

    • Feature Extraction (collect new features of the data)
    • Feature Selection (keep fewer, more useful features)
  • Feature transformation

    • Feature Engineering (e.g., x → x²)

    • Change Basis Vectors (e.g., PCA, LDA)

    • Kernel (e.g., Kernel SVM)

    • Feature Learning (e.g., Neural Networks)
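
To make the transformation idea concrete, a minimal sketch, assuming scikit-learn (the concentric-circles data is illustrative): the raw features are not linearly separable, but an engineered radius feature, or equivalently an RBF-kernel SVM, separates them.

```python
# Minimal sketch: non-separable data becomes separable after a feature
# transformation (or, equivalently, via the kernel trick).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM on the raw features struggles on concentric circles.
print("linear:", LinearSVC(max_iter=10000).fit(X, y).score(X, y))

# Feature engineering: append the radius feature r^2 = x1^2 + x2^2.
X_eng = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])
print("engineered:", LinearSVC(max_iter=10000).fit(X_eng, y).score(X_eng, y))

# Kernel trick: an RBF-kernel SVM performs such a mapping implicitly.
print("rbf kernel:", SVC(kernel="rbf").fit(X, y).score(X, y))
```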

Principal Component Analysis (PCA)

\[
\begin{pmatrix}
x_{1} \\
x_{2} \\
\end{pmatrix}
\overset{\begin{matrix}{\mbox{PCA}}\\{\mbox{Projection}}\\ \end{matrix}}{\rightarrow}
\begin{pmatrix}
{PC}_{1} \\
{PC}_{2} \\
\end{pmatrix}
\]

\[
S_1 > S_2
\]

where S_i denotes the variance of the data projected onto PC_i: the first principal component captures the most variance.


\[
\begin{pmatrix}
{PC}_{1} \\
{PC}_{2} \\
\end{pmatrix}
\overset{\begin{matrix}{\mbox{Reduce}}\\{\mbox{Dimensions}}\\ \end{matrix}}{\rightarrow}
\begin{pmatrix}
{PC}_{1} \\
\end{pmatrix}
\]

Maximize Data Variance

Works for both supervised and unsupervised learning (PCA does not use class labels)
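
A minimal PCA sketch, assuming scikit-learn and illustrative data: project onto the principal components, check the per-component variance, then keep only PC_1.

```python
# Minimal sketch: PCA projection and dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Illustrative 2-D data with most of its variance along one direction.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_)    # S_1 > S_2

# Reduce dimensions: keep only PC_1, the direction of maximum variance.
X_1d = PCA(n_components=1).fit_transform(X)
print(X_1d.shape)                 # (200, 1)
```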

Linear Discriminant Analysis (LDA)

\[
F = \frac{|\mu_r - \mu_b|^2}{s_r^2 + s_b^2}
\]

where μ_r, μ_b are the projected means of the two classes and s_r², s_b² their within-class scatters (Fisher's criterion: far-apart means, compact classes).


\[
\begin{pmatrix}
x_{1} \\
x_{2} \\
\end{pmatrix}
\overset{\begin{matrix}{\mbox{LDA}}\\{\mbox{Projection}}\\ \end{matrix}}{\rightarrow}
\begin{pmatrix}
{LD}_{1} \\
{LD}_{2} \\
\end{pmatrix}
\]

\[
F_1 > F_2
\]

where F_i is the Fisher score along LD_i: the first discriminant direction separates the classes best.


\[
\begin{pmatrix}
{LD}_{1} \\
{LD}_{2} \\
\end{pmatrix}
\overset{\begin{matrix}{\mbox{Reduce}}\\{\mbox{Dimensions}}\\ \end{matrix}}{\rightarrow}
\begin{pmatrix}
{LD}_{1} \\
\end{pmatrix}
\]

Maximize Class Separation

Better for supervised learning, but not usable for unsupervised learning (LDA requires class labels)
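
A minimal LDA sketch, assuming scikit-learn and illustrative data; note that, unlike PCA, the fit uses the class labels.

```python
# Minimal sketch: LDA projection that maximizes class separation.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

# With 2 classes, LDA yields at most 1 discriminant direction (LD_1).
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)    # the labels y drive the projection
print(X_1d.shape)                 # (200, 1)
```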

Curse of Dimensionality

When the number of features is too large:

The data becomes too sparse to recover the true decision boundary (for classification), and overfitting becomes very likely.

Pairwise distances between data points become nearly identical, which hurts the performance of kNN, clustering, and other distance-based methods.

How to detect too many dimensions:

Visualize a histogram of pairwise distances and check their variance σ², but in practice this analysis is tedious (see the sketch below)
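
A minimal sketch of this diagnostic, assuming numpy and scipy (point counts and dimensions are arbitrary): as the dimension grows, the spread of pairwise distances shrinks relative to their mean, i.e., the points become almost equidistant.

```python
# Minimal sketch: distance concentration as dimensionality grows.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    dists = pdist(X)                      # all pairwise Euclidean distances
    # std/mean shrinks with d: distances become nearly indistinguishable.
    print(d, dists.std() / dists.mean())
```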

How to mitigate:

  • Feature Selection

  • Dimensionality Reduction

    • Linear Matrix Factorization (e.g., PCA, LDA)

    • Deep Auto-Encoders (a sketch follows this list)
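
For the deep option, a minimal auto-encoder sketch, assuming PyTorch (layer sizes, data, and the training loop are arbitrary illustrative choices, not a recommended architecture): the encoder compresses the features to a low-dimensional code, which can replace the raw features downstream.

```python
# Minimal sketch: a deep auto-encoder for dimensionality reduction.
import torch
import torch.nn as nn

in_dim, code_dim = 100, 2                 # arbitrary sizes for illustration
encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

X = torch.randn(256, in_dim)              # illustrative data
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(200):                      # train to reconstruct the input
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

codes = encoder(X).detach()               # low-dimensional representation
print(codes.shape)                        # torch.Size([256, 2])
```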

Imbalanced Data

The values of a feature are not uniformly distributed; in this situation evaluation metrics become misleading, and the model overfits to the majority class (e.g., with a 99:1 class split, always predicting the majority class already achieves 99% accuracy).

This can arise when the events themselves occur unevenly (e.g., rare diseases) or when the data was collected unevenly.

How to detect imbalanced data:

Visualize histogram or bar chart of feature values

How to mitigate:

  • Collect more data instances

  • Resample instances (e.g., Undersampling, Oversampling, SMOTE; see the sketch below)
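
A minimal resampling sketch, assuming the imbalanced-learn package is available (the data is illustrative); SMOTE itself is covered next.

```python
# Minimal sketch: rebalancing a dataset by resampling.
# Assumes the imbalanced-learn package (pip install imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))              # ~95:5 imbalance

# Random oversampling duplicates minority instances until classes balance.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_ros))

# SMOTE synthesizes new minority points instead of duplicating (next section).
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))
```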

Synthetic Minority Oversampling Technique (SMOTE)