Theory of PCA - Answer Key

Author

Will Gammerdinger, Noor Sohaili

Published

September 5, 2025

Exercise 1

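There are a couple of properties of variance and covariance that we can verify:

cov(X,X) is equal to var(X). This follows from the definitions: the covariance of X with itself averages the squared deviations of X from its mean, which is exactly the variance of X.

As a result, you will sometimes see covariance matrices written with var(X) on the diagonal in place of cov(X,X).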
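Written out with the sample definitions that R's var() and cov() use (n observations and sample mean \bar{x}; the symbols here are chosen for illustration), the identity and the resulting two-variable covariance matrix can be sketched as:

```latex
\begin{align*}
\operatorname{cov}(X, X)
  &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) \\
  &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
   = \operatorname{var}(X)
\end{align*}

\[
\begin{bmatrix}
\operatorname{cov}(X, X) & \operatorname{cov}(X, Y) \\
\operatorname{cov}(Y, X) & \operatorname{cov}(Y, Y)
\end{bmatrix}
=
\begin{bmatrix}
\operatorname{var}(X) & \operatorname{cov}(X, Y) \\
\operatorname{cov}(Y, X) & \operatorname{var}(Y)
\end{bmatrix}
\]
```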

  1. Confirm this property by estimating the variance for Gene A.
# Estimate the variance for Gene A
var(recentered_expression_matrix[, "Gene_A"])
[1] 799.3333
  2. Now estimate the covariance for Gene A and Gene A.
# Estimate the covariance for Gene A and Gene A
cov(recentered_expression_matrix[, "Gene_A"], recentered_expression_matrix[, "Gene_A"])
[1] 799.3333
  3. Is the value the same? Does it match the value in the covariance matrix for Gene A and Gene A?
# Extract the covariance estimate of Gene A and Gene A from the covariance matrix
cov_matrix["Gene_A", "Gene_A"]
[1] 799.3333

Yes, all three values are 799.3333.

cov(X,Y) is equal to cov(Y,X). This follows from the definition of covariance, because swapping the order of the two factors in each product does not change the result.
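Using the same sample definition of covariance (n observations, sample means \bar{x} and \bar{y}; notation assumed for illustration), the symmetry can be written as:

```latex
\begin{align*}
\operatorname{cov}(X, Y)
  &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
  &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})
   = \operatorname{cov}(Y, X)
\end{align*}
```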

Estimate the covariance of Gene B and Gene A.

# Estimate the covariance of Gene B and Gene A
cov(recentered_expression_tibble[, "Gene_B"], recentered_expression_tibble[, "Gene_A"])
         Gene_A
Gene_B 584.6667
  1. How does this compare to the covariance that we estimated by hand?

It is the same.
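Both properties can also be checked programmatically instead of by eye. The following is a minimal sketch that assumes a small made-up matrix (toy_matrix and its values are hypothetical, not the lesson's data) so that it runs on its own:

```r
# Hypothetical toy expression values for four cells and two genes
# (illustrative numbers only, not the lesson's data)
toy_matrix <- matrix(
  c(10, 20, 30, 40,   # Gene_A
     5, 12, 14, 25),  # Gene_B
  ncol = 2,
  dimnames = list(paste0("Cell_", 1:4), c("Gene_A", "Gene_B"))
)

# Covariance matrix of the two genes
toy_cov_matrix <- cov(toy_matrix)

# Property 1: cov(X, X) equals var(X)
property_1 <- all.equal(
  var(toy_matrix[, "Gene_A"]),
  cov(toy_matrix[, "Gene_A"], toy_matrix[, "Gene_A"])
)

# The diagonal of the covariance matrix holds the per-gene variances
diagonal_is_variance <- all.equal(
  unname(diag(toy_cov_matrix)),
  unname(apply(toy_matrix, 2, var))
)

# Property 2: cov(X, Y) equals cov(Y, X), so the matrix is symmetric
property_2 <- isSymmetric(toy_cov_matrix)

property_1
diagonal_is_variance
property_2
```

all.equal() is used rather than == because floating-point results that are mathematically identical can differ in the last few digits.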

Exercise 2

When looking at the percent of variance explained by each principal component, the first principal component should explain the most, and each following principal component should explain less than the one before it. Let’s have a look at our pct_var_explained object. Are our results congruent with this expectation?

pct_var_explained
    PC_1     PC_2 
96.16723  3.83277 

Yes, PC_1 explains 96.17% of the variance and PC_2 explains 3.83%, so the first principal component explains the most.
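The expected ordering can be verified in code as well. This sketch assumes the percentages come from the eigenvalues of the covariance matrix; toy_matrix, toy_eigenvalues, and toy_pct_var_explained are hypothetical names with made-up values, not the lesson's objects:

```r
# Hypothetical toy expression values (illustrative only)
toy_matrix <- matrix(
  c(10, 20, 30, 40,   # Gene_A
     5, 12, 14, 25),  # Gene_B
  ncol = 2,
  dimnames = list(paste0("Cell_", 1:4), c("Gene_A", "Gene_B"))
)

# Eigenvalues of the covariance matrix; eigen() returns them in decreasing order
toy_eigenvalues <- eigen(cov(toy_matrix))$values
names(toy_eigenvalues) <- paste0("PC_", seq_along(toy_eigenvalues))

# Percent of the total variance explained by each principal component
toy_pct_var_explained <- toy_eigenvalues / sum(toy_eigenvalues) * 100
toy_pct_var_explained

# The percentages should sum to 100 and never increase from one PC to the next
total_pct <- sum(toy_pct_var_explained)
is_non_increasing <- all(diff(toy_pct_var_explained) <= 0)

total_pct
is_non_increasing
```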

Exercise 3

Create a plot of the Principal Components Analysis derived from prcomp(). Is it the same as the plot we derived, just rotated 180°?

# Create a tibble to hold the PC scores prcomp() found and also make the Cell IDs into a column
prcomp_pc_scores_tibble <- prcomp_PCA$x %>% 
  as.data.frame() %>% 
  rownames_to_column("cells") %>% 
  as_tibble()

# Plot the PC scores found by prcomp()
ggplot(prcomp_pc_scores_tibble, aes(x = PC1, y = PC2, label = cells)) +
  geom_point(color = "cornflowerblue") +
  geom_text(hjust = 0, vjust = -1) +
  theme_bw() +
  xlim(-50, 50) +
  ylim(-12, 12) +
  xlab(paste0("PC 1 (Variance Explained ", round(prcomp_eigenvalues["PC_1"]/sum(prcomp_eigenvalues) * 100, digits = 2),"%)")) +
  ylab(paste0("PC 2 (Variance Explained ", round(prcomp_eigenvalues["PC_2"]/sum(prcomp_eigenvalues) * 100, digits = 2),"%)")) +
  ggtitle("PCA of Expression Values from Four Cells") +
  theme(plot.title = element_text(hjust = 0.5))

Yes, it is the same plot, just rotated 180°.
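A 180° rotation is what we would expect when both principal components have their signs flipped: the sign of an eigenvector is arbitrary, so two correct PCA implementations can return scores that differ only in sign. The sketch below illustrates this on made-up data (toy_matrix and the other toy_ objects are hypothetical, not the lesson's objects) by comparing scores from an eigendecomposition of the covariance matrix with those from prcomp():

```r
# Hypothetical toy expression values (illustrative only)
toy_matrix <- matrix(
  c(10, 20, 30, 40,   # Gene_A
     5, 12, 14, 25),  # Gene_B
  ncol = 2,
  dimnames = list(paste0("Cell_", 1:4), c("Gene_A", "Gene_B"))
)

# Recenter each gene on zero without rescaling
toy_centered <- scale(toy_matrix, center = TRUE, scale = FALSE)

# PC scores from the eigenvectors of the covariance matrix
toy_eigenvectors <- eigen(cov(toy_matrix))$vectors
toy_eigen_scores <- toy_centered %*% toy_eigenvectors

# PC scores from prcomp(), which centers the data by default
toy_prcomp_scores <- prcomp(toy_matrix)$x

# Ignoring sign, the two sets of scores should agree
scores_match_up_to_sign <- all.equal(
  abs(unname(toy_eigen_scores)),
  abs(unname(toy_prcomp_scores))
)
scores_match_up_to_sign
```

If only one of the two components had its sign flipped, the plot would appear mirrored along that axis rather than rotated.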