
Principal Component Analysis (PCA)

What is PCA?

PCA is one of the most widely used unsupervised learning and dimensionality reduction algorithms. PCA reduces a $D$-dimensional data set by projecting it onto a $K$-dimensional subspace where $K < D$. As PCA reduces the dimensions, it is important to select an axis that preserves the maximum variance: when the variance is large, the differences between the data points remain clear, which can make a downstream model better.



So PCA finds the axis with the maximum variance, then finds a second axis, orthogonal to the first, that preserves the largest share of the remaining variance.
The unit vector defining the $i$-th axis is called the $i$-th principal component (PC); the first PC is $c_{1}$, the second is $c_{2}$, and so on.



PCA is used for:

  • Noise filtering
  • Visualization
  • Feature extraction
  • Stock market prediction
  • Gene data analysis

The goals of PCA are:

  • Identify patterns in data.
  • Detect the correlation between variables.

The procedure of PCA is (a code sketch follows this list):

  • Standardize the data.
  • Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
  • Sort the eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the number of dimensions of the new feature subspace.
  • Construct the projection matrix $W$ from the selected $k$ eigenvectors.
  • Transform the original data set $X$ via $W$ to obtain a $k$-dimensional feature subspace $Y$.
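A minimal NumPy sketch of these steps; the helper name pca_eig and the manual standardization are our own choices for illustration, not from any library:

import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix (illustrative sketch).

    X : (n_samples, n_features) data matrix
    k : number of principal components to keep
    """
    # Standardize the data: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)
    # eigh returns eigenvalues of a symmetric matrix in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort eigenvalues (and matching eigenvectors) in descending order
    order = np.argsort(eigvals)[::-1]
    # Projection matrix W: the k eigenvectors with the largest eigenvalues
    W = eigvecs[:, order[:k]]
    # Transform X into the k-dimensional feature subspace Y
    Y = X_std @ W
    return Y, W, eigvals[order]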

But PCA has a weakness: it is highly affected by outliers in the data, because a single extreme point can dominate the variance and pull the principal axes toward it, as the short demo below shows.
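A small synthetic demonstration of this sensitivity (the data, the helper first_axis, and the outlier value are our own, purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D cloud; its first principal axis lies near the diagonal
X = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])

def first_axis(X):
    # First right-singular vector of the centered data = first principal axis
    X_tilde = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    return Vt[0]

print(first_axis(X))                     # axis of the clean data
X_out = np.vstack([X, [30.0, -30.0]])    # add one extreme outlier
print(first_axis(X_out))                 # axis pulled toward the outlier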

PCA can be expressed as an optimization problem: the first principal component is the unit vector along which the variance of the projected (centered) data is maximal,

$c_{1}\ =\ \underset{\lVert w \rVert = 1}{\operatorname{arg\,max}}\ \frac{1}{n}\sum_{i=1}^{n}\left(w^{T}\tilde{x}_{i}\right)^{2}$

and each subsequent component maximizes the same objective subject to being orthogonal to all earlier components.

Steps of PCA



  1. Mean centering: compute $\tilde{X}$ by setting the average of each random variable (each column) of the $X$ matrix to 0 (a NumPy sketch of steps 1–3 follows this list).

    • Calculate the average of each column (the data of one random variable) of the $X$ matrix:
      $\bar{x}_{j}\ =\ \frac{1}{n}(x_{1,j}\ +\ x_{2,j}\ +\ \cdots\ +\ x_{n,j}),\quad j = 1,\ \dots,\ D$

    • For each column, subtract the (column’s) mean from the data. This matrix is called the centered matrix $\tilde{X}$.
      $\tilde{X}\ =\ X\ -\ \begin{bmatrix} \bar{x}_{1}&\bar{x}_{2}&\cdots&\bar{x}_{D}\\ \vdots&&&\vdots\\ \bar{x}_{1}&\bar{x}_{2}&\cdots&\bar{x}_{D} \end{bmatrix}\ =\ \begin{bmatrix} x_{1,1}&x_{1,2}&\cdots&x_{1,D}\\ \vdots&&&\vdots\\ x_{n,1}&x_{n,2}&\cdots&x_{n,D} \end{bmatrix}\ -\ \begin{bmatrix} \bar{x}_{1}&\bar{x}_{2}&\cdots&\bar{x}_{D}\\ \vdots&&&\vdots\\ \bar{x}_{1}&\bar{x}_{2}&\cdots&\bar{x}_{D} \end{bmatrix}$

  2. Get the SVD (Singular Value Decomposition) of $\tilde{X}$

    $\tilde{X} = U \Sigma V^{T}$ where, for an $n \times D$ centered matrix $\tilde{X}$,
    $U\ :\ n \times n\ \text{orthogonal matrix whose columns are eigenvectors of}\ \tilde{X}\tilde{X}^{T}$
    $\Sigma\ :\ n \times D\ \text{rectangular diagonal matrix of singular values}$
    $V\ :\ D \times D\ \text{orthogonal matrix whose columns are eigenvectors of}\ \tilde{X}^{T}\tilde{X}$

    In this stage we obtain the principal axes, which are the columns of $V$.

  3. Get the PC scores

    The PC scores are $U\Sigma$ from step 2 (equivalently $\tilde{X}V$, since $V$ is orthogonal); keeping the first $K$ columns gives the coordinates of the data in the $K$-dimensional subspace.
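A minimal sketch of these three steps with NumPy (the function name pca_svd is our own, for illustration):

import numpy as np

def pca_svd(X, k):
    """PCA of X (n_samples x n_features) via SVD, keeping k components."""
    # Step 1: mean centering
    X_tilde = X - X.mean(axis=0)
    # Step 2: thin SVD; the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    # Step 3: PC scores = U @ Sigma (equivalently X_tilde @ Vt.T)
    scores = U * S   # broadcasting scales each column of U by its singular value
    return scores[:, :k], Vt[:k]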

Example


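As a small usage example of the pca_svd sketch above (the toy data is synthetic, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))   # toy data: 100 samples, 3 features
scores, axes = pca_svd(X, k=2)
print(scores.shape)             # (100, 2): PC scores of the first two components
print(axes.shape)               # (2, 3): the first two principal axes (as rows)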

Code



from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Standardize the features (fit on the training set only)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Project the data onto the first 2 principal components
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Fit a classifier in the reduced 2-D feature space
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
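The snippet above assumes X_train, X_test, and y_train are already defined. One way to produce them (the wine dataset and the 80/20 split here are our assumptions, not necessarily what the original post used):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Hypothetical data setup; any labeled tabular dataset would do
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)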



Result







Implementation

This post is licensed under CC BY 4.0 by the author.