I find that a lot of people are confused about whether they should use PCA or LDA for their application, and many don’t understand the fundamental difference between the two. Hopefully, by the end of this blog post, that difference will be clear to the reader. (I also learnt the exact differences only while trying to implement both of them.)
Both PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) are widely used as dimensionality reduction techniques in the pre-processing step of Machine Learning and Pattern Recognition problems. The desired outcome of both techniques is to reduce the dimension of the dataset with minimal loss of information. Projecting the dataset onto a lower-dimensional space that describes the data well reduces the computational cost, speeds up training, and, most importantly, can reduce overfitting.
The main difference between the two is that PCA is an “unsupervised learning” algorithm: it “ignores” the class labels and finds the directions (the so-called principal components) that maximize the variance in a dataset. In contrast, LDA is a “supervised learning” technique: it uses the class labels to compute the directions (“linear discriminants”) that maximize the separation between multiple classes.
In the case of LDA, rather than just finding the axes (eigenvectors) that maximize the variance of our data, we are interested in the axes that maximize the separation between multiple classes. This gives good class separability in the projected data, which PCA does not take into account.
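To make the "axes (eigenvectors)" idea concrete, here is a minimal from-scratch sketch on the Iris data. It assumes the textbook formulations: PCA as the eigenvectors of the covariance matrix, and LDA as the generalized eigenvectors of the between-class scatter against the within-class scatter. The variable names (`S_W`, `S_B`, `pca_axes`, `lda_axes`) are my own, and a library implementation may compute the same directions differently.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_centered = X - X.mean(axis=0)

# PCA: eigenvectors of the covariance matrix. The labels y are never used.
cov = np.cov(X_centered, rowvar=False)
pca_vals, pca_vecs = eigh(cov)            # eigenvalues come back in ascending order
pca_axes = pca_vecs[:, ::-1][:, :2]       # two directions of largest variance

# LDA: build within-class (S_W) and between-class (S_B) scatter from the labels.
n_features = X.shape[1]
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
overall_mean = X.mean(axis=0)
for label in np.unique(y):
    X_c = X[y == label]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# Generalized eigenproblem S_B w = lambda * S_W w (S_W must be positive definite).
lda_vals, lda_vecs = eigh(S_B, S_W)
lda_axes = lda_vecs[:, ::-1][:, :2]       # two directions of largest class separation

# Project the data onto each pair of axes.
X_pca = X_centered @ pca_axes
X_lda = X_centered @ lda_axes
print(X_pca.shape, X_lda.shape)
```

Note that the labels only enter through the `for label in np.unique(y)` loop. With 3 classes, at most 2 of the LDA eigenvalues are non-zero, so LDA can give at most 2 useful axes here.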
Another difference is that PCA makes no assumption that the data points are normally distributed. However, PCA only describes the data through its first two moments (mean and covariance), so if the data points come from some other distribution, it is not really optimal. LDA, on the other hand, explicitly assumes that the data points of each class come from a multivariate normal distribution, with a different mean per class but the same covariance matrix for all classes (two such distributions in the two-class case).
This makes LDA a less generalized method compared to PCA.
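Written out, the LDA assumption looks roughly like this. The symbols are my own notation: K classes, d features, class means \mu_k, a shared covariance matrix \Sigma, and class priors \pi_k.

```latex
% Class-conditional density: one Gaussian per class, same covariance for all classes
p(x \mid y = k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}}
    \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^{\top} \Sigma^{-1} (x - \mu_k) \right),
    \qquad k = 1, \dots, K

% Because \Sigma is shared, the quadratic term in x cancels between classes,
% leaving a discriminant function that is linear in x
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
    - \tfrac{1}{2} \mu_k^{\top} \Sigma^{-1} \mu_k + \log \pi_k
```

The shared \Sigma is what makes the class boundaries linear. If each class had its own covariance matrix, the quadratic term would not cancel.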
Visualizing PCA and LDA plots:-
The plots were generated using the scikit-learn ML library on the Iris dataset. The Iris dataset consists of 150 samples from 3 classes of flowers (50 per class), each sample described by 4 features.
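For reference, here is a minimal sketch of how such plots could be produced with scikit-learn and matplotlib. It is not necessarily the exact script behind the figures in this post. The helper `plot_projections` and the panel titles are my own choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y, class_names = iris.data, iris.target, iris.target_names

# PCA is fitted on the features only; LDA also needs the class labels.
X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)


def plot_projections(X_pca, X_lda, y, class_names):
    """Draw the PCA and LDA projections side by side, one color per class."""
    # Imported here so the projections above can be computed without matplotlib.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    panels = [(X_pca, "PCA of Iris dataset"), (X_lda, "LDA of Iris dataset")]
    for ax, (X_proj, title) in zip(axes, panels):
        for label, name in enumerate(class_names):
            mask = y == label
            ax.scatter(X_proj[mask, 0], X_proj[mask, 1], label=name, alpha=0.8)
        ax.set_title(title)
        ax.legend()
    plt.show()
```

Calling `plot_projections(X_pca, X_lda, y, class_names)` should open one figure with the two scatter plots next to each other.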
The above two images make it clear that, whereas PCA finds the axes that account for the most variance in the whole dataset, LDA gives us the axes that account for the most variance between the individual classes.
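The same point can be checked numerically. The sketch below compares the first PCA axis and the first LDA axis on two measures: the total variance along the axis, and a Fisher-style ratio of between-class to within-class scatter. The helper `fisher_ratio` is my own, and rescaling the LDA direction to unit length is an assumption I make so that the two variances are comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)


def fisher_ratio(z, y):
    """Between-class scatter divided by within-class scatter of a 1-D projection z."""
    overall_mean = z.mean()
    classes = np.unique(y)
    between = sum(np.sum(y == c) * (z[y == c].mean() - overall_mean) ** 2 for c in classes)
    within = sum(np.sum((z[y == c] - z[y == c].mean()) ** 2) for c in classes)
    return between / within


pca = PCA(n_components=1).fit(X)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)

z_pca = pca.transform(X)[:, 0]
z_lda = lda.transform(X)[:, 0]

# Rescale the first LDA direction to unit length so the variances are comparable.
w_lda = lda.scalings_[:, 0] / np.linalg.norm(lda.scalings_[:, 0])
var_pca = np.var(z_pca, ddof=1)
var_lda = np.var(X @ w_lda, ddof=1)

fisher_pca = fisher_ratio(z_pca, y)
fisher_lda = fisher_ratio(z_lda, y)

print(f"variance along axis:  PCA={var_pca:.3f}  LDA={var_lda:.3f}")
print(f"class separation:     PCA={fisher_pca:.3f}  LDA={fisher_lda:.3f}")
```

The PCA axis should come out ahead on variance and the LDA axis ahead on class separation, since each is the maximizer of its own criterion.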
When to use which technique?
It might seem like LDA is always the better technique to go with, but that is not the case. Comparisons have shown that PCA can outperform LDA when the number of samples per class is relatively small (as in this Iris dataset). However, when you have a large dataset with multiple classes, it is usually better to use LDA, because class separability becomes an important factor when reducing dimensionality.
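The safest way to decide for your own data is to try both. Below is a minimal sketch that puts each reduction in front of the same classifier and compares cross-validated accuracy. The choice of a 5-nearest-neighbours classifier, 5 folds and 2 components is arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Fixed random_state so that both reducers are scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

reducers = {
    "PCA": PCA(n_components=2),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
}

mean_accuracy = {}
for name, reducer in reducers.items():
    # The reducer is fitted inside each fold, so no test data leaks into it.
    model = make_pipeline(reducer, KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, X, y, cv=cv)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```

Which one wins depends on the dataset, the classifier and the number of components, so treat the numbers as a starting point, not a verdict.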
NOTE: PCA can also be used together with LDA. I will leave it to the reader to explore that scenario.
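As a starting point for that exploration, here is a minimal sketch of chaining the two in a scikit-learn pipeline: PCA first, then LDA on the PCA output. The component counts (3, then 2) are arbitrary choices for the 4-feature Iris data. On real problems this combination is usually considered when there are many more features than here.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Step 1 (unsupervised): PCA compresses the 4 features to 3.
# Step 2 (supervised): LDA finds 2 discriminant axes in that 3-D space.
pca_then_lda = make_pipeline(
    PCA(n_components=3),
    LinearDiscriminantAnalysis(n_components=2),
)

X_reduced = pca_then_lda.fit_transform(X, y)
print(X_reduced.shape)  # (150, 2)
```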
Other useful links:-