We tutor students in a variety of statistics, data analysis, and data modeling classes. Principal component analysis is one of the topics our statistics tutors cover. It is a complex topic, and there are numerous resources on principal component analysis. But, students get lost in the vast quantity of material. So in this brief article, we:
- Break down the essential PCA concepts students need to understand at the graduate level;
- Provide the R code you need to perform a principal component analysis;
- Select the principal components to use; and
- Interpret the output of your principal component analysis.
We tackle these PCA topics by answering the following questions as directly as we can.
- What is PCA or Principal Component Analysis?
- Why PCA?
- What do the New Variables (Principal Components) Indicate?
- How are the Principal Components Constructed?
- How many New Variables (Principal Components) are created in a PCA?
- What do the PCs mean?
- What type of data is PCA best suited for?
- Is PCA sensitive to scaling?
- Should you scale your data in PCA?
- Why is Variance Prized in PCA?
- What is the Secret of PCA?
- Using PCA for Prediction
- The Mechanics of PCA – Step by Step
- How many Principal Components should I use?
- The essential R Code you need to run PCA
- Interpreting the PCA Graphs
What is PCA or Principal Component Analysis?
PCA stands for principal component analysis. It is primarily an exploratory data analysis technique but can also be used selectively for predictive analysis. Applications of PCA include data compression, blind source separation, de-noising signals, multivariate analysis, and prediction. PCA helps you understand data better by modeling and visualizing selective combinations of the various independent variables that impact a variable of interest.
There is plenty of data available today. We have a problem of too much data! PCA helps you narrow down the influencing variables so you can better understand and model data. PCA can suggest linear combinations of the independent variables with the highest impact.
- Reduction: PCA helps you ‘collapse’ the number of independent variables from dozens to as few as you like and often just two variables. Visualizing data in 2 dimensions is easier to understand than three or more dimensions.
- New information in Principal Components: PCA creates new variables from the existing variables in different proportions. These new variables are simply named Principal Components (‘PC’) and referred to as PC1, PC2, PC3, etc.
- Independent variables: PCA not only creates new variables but creates them in such a manner that they are not correlated. This independence helps avoid multicollinearity in the variables.
- Outliers: When working with many variables, it is challenging to spot outliers, errors, or other suspicious data points. Reducing a large number of variables and visualizing them help you spot outliers. Spotting outliers is a significant benefit and application of PCA.
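As a quick sketch of the reduction idea, here is a minimal run on R's built-in USArrests dataset (four numeric variables; any small numeric dataset would do):

```r
# Collapse 4 variables into principal components
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)        # PC1 and PC2 together explain most of the variance here
head(pca$x[, 1:2])  # each observation is now described by just 2 coordinates
```

Plotting those two columns gives a 2-dimensional view of a 4-dimensional dataset, which is exactly where clusters and outliers become visible.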
What do the New Variables (Principal Components) Indicate?
These new variables or Principal Components indicate new coordinates or planes. The Principal Components are combinations of the old variables at different weights, called “Loadings”. You can see what the principal components mean visually on this page.
How are the Principal Components Constructed?
PCA methodology builds principal components in a manner such that:
- Each principal component is the vector (direction) that carries the most information.
- Information is measured by variability: an independent variable with little variability has little information, whereas higher variance indicates more information.
- Maximum information (variance) is placed in the first principal component (PC1).
- The second principal component is then selected, again maximizing the remaining variance. Principal components must be uncorrelated; this is achieved by requiring each PC to be orthogonal to the ones before it.
- The remaining information is squeezed into PC3, PC4, and so on. Principal components are driven by variance.
- Taken together, the principal components capture as much information as the original dataset.
This selection process is why scree plots drop off from left to right. It is also why you can work with a few variables or PCs. The PCA methodology is why you can drop most of the PCs without losing too much information.
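You can see this drop-off yourself with a scree plot; a minimal sketch in base R, again using the built-in USArrests data:

```r
pca <- prcomp(USArrests, scale. = TRUE)
pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per PC, largest first
screeplot(pca, type = "lines") # the scree plot drops off from left to right
```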
How many Principal Components are created in a PCA?
There will be as many principal components as there are independent variables.
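For example, R's built-in USArrests dataset has four variables, so prcomp returns exactly four principal components:

```r
pca <- prcomp(USArrests, scale. = TRUE)
ncol(USArrests)    # 4 original variables
ncol(pca$rotation) # 4 principal components
```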
What do the PCs mean?
PCs, geometrically speaking, represent the directions that have the most variance (maximal variance). Remember, the PCs were selected to maximize information gain by maximizing variance. A great way to think about this is the relative positions of the independent variables. Ed Hagen, a biological anthropologist at Washington State University beautifully captures the positioning and vectors here.
What type of data is PCA best suited for?
Many Independent variables: PCA is ideal for data sets with many variables. Three or, ideally, many more dimensions is where PCA makes a significant contribution, because datasets with more variables (higher dimensions) are harder to understand and interpret. PCA boils the information embedded in the many variables down into a small number of Principal Components.
Numeric Variables: PCA can be applied only on quantitative data sets. It cannot be used on categorical data sets. This can be considered one of the drawbacks of PCA.
Is PCA sensitive to scaling?
Yes, PCA is sensitive to scaling, because scaling changes the variances of the original variables and therefore how much each variable contributes to the components. Scaling is mostly beneficial, but it makes using PCA for prediction more complicated, since any new data must be scaled the same way.
Should you scale your data in PCA?
Scaling is an act of unifying the scale or metric. Scaling is the process of dividing each value in your independent variables matrix by the column’s standard deviation. You essentially change the units/metrics into units of z values or standard deviations from the mean. So you may have been working with miles, lbs, #of ratings, etc. But once scaled, you are working with z scores or standard deviations from the mean.
So should you scale your data in PCA before doing the analysis?
If your independent variables have different units/metrics, you should ideally scale them. Scaling them will help you compare the independent variables with different units more efficiently.
If your independent variables have the same units/metrics, you do not have to scale them. However, if they have different variances, you have to decide if you still want to scale your independent variables. Be aware that independent variables with higher variances will dominate the variables with lower variances if you do not scale them. Remember that you are trying to understand what contributes to the dependent variable. So if the significance of an independent variable is dependent on the variance, you actually lose clarity by scaling.
There is another benefit of scaling and normalizing your data. If your dataset is very large, scaling may speed up your analysis.
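A small illustration of the dominance problem, using the built-in USArrests data, where Assault has a far larger variance than the other variables:

```r
apply(USArrests, 2, var)  # Assault's variance dwarfs the others
p_raw    <- prcomp(USArrests, scale. = FALSE)
p_scaled <- prcomp(USArrests, scale. = TRUE)
round(p_raw$rotation[, 1], 2)    # unscaled: PC1 is dominated by Assault
round(p_scaled$rotation[, 1], 2) # scaled: the loadings are more balanced
```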
Why is variance prized in PCA?
When a variable (principal component in our case) has a high degree of variance, it indicates the data is spread out. When the data is widely dispersed, it is easier to see and identify differences and categorize the variables into different segments.
What is the secret of PCA?
Eigenvectors: Eigenvectors indicate the directions of the new variables. They are computed from the covariance matrix and point along the axes with the most information (variance). These directions become our Principal Components. The weights of each original variable within each eigenvector are called loadings; they show how much each original variable contributes to each PC.
Eigenvalues: Each eigenvalue is the scalar attached to an eigenvector; it indicates the variance accounted for by the corresponding Principal Component. (The computation is the sum of the squared distances of each projected value along the eigenvector/PC direction, divided by n − 1.)
The variance explained by each PC is therefore the sum of squared distances along that PC's vector divided by n − 1 (where n is the sample size).
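These quantities can be computed directly; a short sketch in base R that matches what prcomp reports:

```r
X <- scale(USArrests)    # center and scale the data
C <- cov(X)              # covariance matrix of the scaled data
e <- eigen(C)            # eigenvalues and eigenvectors
e$values                 # variance accounted for by each PC
e$values / sum(e$values) # proportion of variance explained
# These eigenvalues equal prcomp(USArrests, scale. = TRUE)$sdev^2
```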
Using PCA for Prediction
We can use PCA for prediction by projecting data onto the principal components: multiply the transpose of the feature vector (the matrix of chosen PCs) by the transpose of the centered, scaled data set.
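In R, prcomp handles this projection for you via predict, which scales new observations using the training parameters and multiplies by the rotation (eigenvector) matrix:

```r
pca <- prcomp(USArrests, scale. = TRUE)
new_obs <- USArrests[1:3, ]     # stand-in for genuinely new data
predict(pca, newdata = new_obs) # coordinates of the new rows in PC space
```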
The Mechanics of PCA – Step by Step
Here are the steps you will follow if you are going to do a PCA analysis by hand. Please be kind to yourself and take a small data set.
- Gather the data. Sort out the independent variables separately. The independent variables are what we are studying now. Your independent variables are now a matrix of independent variables arranged in columns.
- Decide if you want to center and scale your data. There are advantages and disadvantages to doing this.
- Centering your data: Subtract the column average from each value. If you have done this correctly, the average of each column will now be zero.
- Scaling your data: Divide each value by the column standard deviation. You remove the metrics and make the units z values or standard deviations from the mean.
- Once you have scaled and centered your independent variables, you have a new matrix – your second matrix.
- Transpose the new matrix to form a third matrix.
- Compute the covariance matrix by multiplying the third matrix (the transpose) by the second matrix and dividing by n − 1. This is your fourth matrix.
- Calculate the eigenvectors and eigenvalues. There are multiple ways this can be done.
- Sort the eigenvalues from the largest to the smallest. Reorder the eigenvectors in the corresponding order. You now have your fifth matrix.
- Finally, multiply the second matrix (your centered and scaled data) by the sorted eigenvectors (the fifth matrix). The result is your data expressed in principal component coordinates.
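The steps above can be sketched in a few lines of base R; the by-hand scores should match prcomp's output up to a possible sign flip per column:

```r
X  <- as.matrix(USArrests)
Xc <- scale(X, center = TRUE, scale = TRUE) # center and scale (second matrix)
C  <- t(Xc) %*% Xc / (nrow(Xc) - 1)         # covariance matrix (fourth matrix)
e  <- eigen(C)                              # eigen() sorts eigenvalues largest first
scores <- Xc %*% e$vectors                  # project the data onto the eigenvectors
head(scores)
head(prcomp(USArrests, scale. = TRUE)$x)    # same values; signs may differ
```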
The essential R Code you need to run PCA
R has prcomp and princomp built in. In addition, there are a number of packages that you can use to run your PCA analysis. Some of these include AMR, FactoMineR, and factoextra. We have chosen the factoextra package for this article; load it with library(factoextra) before calling the fviz_ functions below.
- name <- prcomp(data, scale = TRUE) #R code to run your PCA analysis and define the PCA output/model with a name
- name #R code to see the entire output of your PCA analysis
- summary(name) #R code to get the summary – the standard deviations, the proportion of variance explained by each PC, and the cumulative proportion of variance explained
- fviz_pca_var(name) #R code to graph the variables and indicate their directions
- fviz_pca_ind(name) #R code to plot individual values
- fviz_pca_biplot(name) #R code to plot both individual points and variable directions
These are the basic R functions you need. You can do a lot more in terms of formatting and deep dives, but this is all you need to run and interpret a PCA!
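Putting the pieces together, here is a minimal end-to-end run on the built-in USArrests data (this assumes the factoextra package is installed):

```r
# install.packages("factoextra")  # once, if needed
library(factoextra)

pca <- prcomp(USArrests, scale = TRUE)
summary(pca)         # variance explained by each PC
fviz_eig(pca)        # scree plot
fviz_pca_var(pca)    # variable directions
fviz_pca_ind(pca)    # individual observations
fviz_pca_biplot(pca) # both together
```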
Interpreting the PCA Graphs of the Dimensions/Variables
In the factoextra PCA package, fviz_pca_var(name) gives you the graph of the variables indicating the direction. You will see that:
- Variables that appear together are positively correlated.
- Variables that are opposite to each other are negatively correlated.
- Variables near the center contribute less than variables far away from the center point.
Graphing the original variables in the PCA graphs may reveal new information; a visual examination is all you need. You may be able to see clusters and visually segment variables. In the factoextra package, fviz_pca_ind(name) is used to plot individual values.
We hope these brief answers to your PCA questions make the topic easier to understand. This is a deep topic, so please continue to explore more resources and books. The best way to understand PCA is to apply it as you read and study the theory. Do let us know if we can be of assistance.
Some Additional Resources on the topic include: