Principal Component Analysis | Scikit-Learn Implementation
Principal Component Analysis (PCA) is a technique used to reduce the dimensions, or in simple words, the attributes of a dataset, to a lower number while retaining as much of the variance in the data as possible. Some information is usually lost, but the new dimensions are chosen so that the loss is as small as possible. The new dimensions generated by the process are called Principal Components.
Let us take an example. Suppose that we have the iris dataset. This dataset has 4 features per sample. We have no direct way of drawing a scatter plot for a dataset of 4 dimensions, but if we reduce the dimensions to 2 or 3, we can create a 2D or 3D scatter plot.
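To make the idea of "new dimensions" concrete, here is a minimal NumPy sketch of what PCA does under the hood, assuming the textbook formulation (eigen-decomposition of the covariance matrix). The variable names are illustrative, and this is not how scikit-learn implements PCA internally.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                      # shape (150, 4)

# Center each feature, then compute the 4x4 covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
top2 = eigenvectors[:, order[:2]]         # the two directions of largest variance

# Project the centered data onto those directions
X_2d = X_centered @ top2
print(X_2d.shape)                         # (150, 2)
```

Up to the sign of each column, the result should match what the scikit-learn PCA class returns for 2 components.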
Scikit-Learn Implementation
We will use the PCA class of the sklearn.decomposition Python module to reduce the dimensionality of the iris dataset. We will then create a 3D plot of the data projected onto the resulting principal components.
Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
Loading the dataset
Code
iris = load_iris()
Extracting the target vector
Code
target_species = iris.target
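Before reducing anything, it can help to confirm what was loaded. The short check below is only a sketch for inspection; it uses attributes of the object returned by load_iris.

```python
from sklearn.datasets import load_iris

iris = load_iris()
target_species = iris.target

print(iris.data.shape)       # (150, 4): 150 samples, 4 features
print(iris.feature_names)    # names of the 4 measured features
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']
print(target_species[:5])    # integer class labels, one per sample
```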
Reducing Dimensions
Now, this is the step where we reduce the 4-dimensional iris dataset to 3 dimensions.
Code
x_reduced = PCA(n_components=3).fit_transform(iris.data)
Here, the number of principal components to keep is set by the “n_components” parameter.
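To see how much of the original variance survives the reduction, one option is to keep the fitted PCA object and read its explained_variance_ratio_ attribute. The sketch below assumes the same iris data; the name pca is introduced here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=3)
x_reduced = pca.fit_transform(iris.data)

print(x_reduced.shape)                       # (150, 3)
print(pca.explained_variance_ratio_)         # share of variance per component
print(pca.explained_variance_ratio_.sum())   # total share retained by the 3 components
```

The ratios are sorted from largest to smallest, so the first component always accounts for the largest share of the variance.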
Creating a 3D scatterplot of the new components
Code
fig = plt.figure()
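# In recent Matplotlib versions, Axes3D(fig) no longer attaches itself to the figure
axes = fig.add_subplot(projection='3d')
axes.set_title('Iris Dataset by PCA', size=14)
axes.set_xlabel('First eigenvector')
axes.set_ylabel('Second eigenvector')
axes.set_zlabel('Third eigenvector')
# Hide the tick labels on all three axes (the old w_xaxis-style aliases were removed from Matplotlib)
axes.xaxis.set_ticklabels([])
axes.yaxis.set_ticklabels([])
axes.zaxis.set_ticklabels([])
# Color each point by its species label
axes.scatter(x_reduced[:, 0], x_reduced[:, 1], x_reduced[:, 2], c=target_species)
plt.show()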
Output

As we can see from the above graph, the 4 dimensions of the iris dataset have been reduced to 3.
The 3 colors in the scatter plot correspond to the 3 categorical classes (species) to be predicted in the dataset.
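The same approach works for a 2D plot. The sketch below reduces the data to 2 components and draws one scatter series per species so that a legend can name each color. The output filename is a hypothetical choice; in an interactive session, plt.show() could be used instead.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
x_2d = PCA(n_components=2).fit_transform(iris.data)

fig, ax = plt.subplots()
# One scatter call per class so each species gets its own legend entry
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(x_2d[mask, 0], x_2d[mask, 1], label=name)

ax.set_title('Iris Dataset by PCA (2 components)')
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
ax.legend(title='Species')
fig.savefig('iris_pca_2d.png')   # hypothetical filename
```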