# Load tidyverse
library(tidyverse)Number of PCs
This lesson explores how to determine the number of Principal Components (PCs) in PCA with a focus on bulk RNA-seq data. Participants will learn why the number of PCs is limited to the smaller of the number of samples or genes, how gene-wise and sample-wise covariance matrices relate to this and how data characteristics, such as duplicated genes, scaling, linear dependence, and constants, reduce the effective dimensionality of the data.
Principal Components Analysis, Covariance matrix, Eigenvectors, Eigenvalues, Dimensionality reduction, Rank deficiency
Approximate time: 45 minutes
Learning Objectives
- Derive the number of PCs for a PCA
Number of PCs
When doing a Principal Components Analysis (PCA), you will be reducing the dimensionality of your data into Principal Components (PCs). The maximum number of PCs, sometimes called rank of the PCA, for a bulk RNA-seq analysis will at most either be the the number of genes or the number of sample, whichever is fewer. In other words, the number of PCs ≤ minimum(number_of_genes, number_of_samples). Since the number for samples is almost always fewer than the number of genes for bulk RNA-seq experiments, then the number of PCs is usually at most the number of samples. In the sections below we will discuss the reasons for this. For this lesson we will need to load tidyverse.
Creation of the Covariance Matrix
Let’s consider the expression_tibble that we created in the Theory of PCA lesson.
# Print the expression_tibble
expression_tibble# A tibble: 4 × 3
samples Gene_A Gene_B
<chr> <dbl> <dbl>
1 Sample_1 0 4
2 Sample_2 12 30
3 Sample_3 65 57
4 Sample_4 23 18
Next, we re-centered the data using:
# Determine the center of the data by:
# Finding the average expression of gene A
Gene_A_mean <- mean(expression_tibble$Gene_A)
# Finding the average expression of gene B
Gene_B_mean <- mean(expression_tibble$Gene_B)
# Shift the data points so that they data is centered on the origin
recentered_expression_tibble <- expression_tibble %>%
mutate(
Gene_A = Gene_A - Gene_A_mean,
Gene_B = Gene_B - Gene_B_mean
)
# Print out the re-centered data
recentered_expression_tibble# A tibble: 4 × 3
samples Gene_A Gene_B
<chr> <dbl> <dbl>
1 Sample_1 -25 -23.2
2 Sample_2 -13 2.75
3 Sample_3 40 29.8
4 Sample_4 -2 -9.25
We followed this up by creating a gene-wise covariance matrix:
# Move the samples to the rownames and convert the tibble to a matrix
recentered_expression_matrix <- recentered_expression_tibble %>%
column_to_rownames("samples") %>%
as.matrix()
# Create a covariance matrix
cov_matrix <- cov(recentered_expression_matrix)
# Print out the covariance matrix
cov_matrix Gene_A Gene_B
Gene_A 799.3333 584.6667
Gene_B 584.6667 506.2500
From this we can see that it is a 2x2 gene-wise covariance matrix. However, we could have alternatively created a 4x4 sample-wise covariance matrix using. Unfortunately, we will need to do this using some for loops. One small note before we start, since we re-centered the data based on gene, the mean of Gene_A and Gene_B are both 0, so those gene mean terms can drop out of the covariance equation. We can confirm that with:
# Print the mean for Gene_A once it has been re-centered
mean(recentered_expression_matrix[,"Gene_A"])[1] 0
# Print the mean for Gene_B once it has been re-centered
mean(recentered_expression_matrix[,"Gene_B"])[1] 0
So now our covariance formula will look like:
# Obtain the number of samples
n_samples <- nrow(recentered_expression_matrix)
# Obtain the number of genes
n_genes <- ncol(recentered_expression_matrix)
# Create an empty n_samples x n_samples matrix to hold sample-wise covariance values
cov_matrix_samples <- matrix(NA, nrow = n_samples, ncol = n_samples)
# Put the names of the samples as the row names of the sample-wise covariance matrix
rownames(cov_matrix_samples) <- rownames(recentered_expression_matrix)
# Put the names of the samples as the column names of the sample-wise covariance matrix
colnames(cov_matrix_samples) <- rownames(recentered_expression_matrix)
# Estimate the sample-wise covariance matrix
for (i in 1:n_samples) {
for (j in 1:n_samples) {
cov_matrix_samples[i, j] <- (1 / (n_samples - 1)) * sum(recentered_expression_matrix[i, ] * recentered_expression_matrix[j, ])
}
}
# Print out the sample-wise covariance matrix
cov_matrix_samples Sample_1 Sample_2 Sample_3 Sample_4
Sample_1 388.52083 87.02083 -563.8958 88.35417
Sample_2 87.02083 58.85417 -146.0625 0.18750
Sample_3 -563.89583 -146.06250 828.3542 -118.39583
Sample_4 88.35417 0.18750 -118.3958 29.85417
In order to avoid confusion, let’s go ahead and rename our gene-wise covariance matrix:
# Rename the gene-wise covariance matrix to avoid confusion
cov_matrix_genes <- cov_matrixComparing the gene-wise and sample-wise covariance matrices
Let’s remind ourselves of the eigenvalues returned by our gene-wise covariance matrix:
# Obtain the eigenvalues for the gene-wise covariance matrix
round(eigen(cov_matrix_genes)$values, digits = 3)[1] 1255.543 50.040
And we can compare those eigenvalues to the eigenvalues returned by our sample-wise covariance matrix:
# Obtain the eigenvalues for the sample-wise covariance matrix and compare them to the eigenvalues for the gene-wise covariance matrix
round(eigen(cov_matrix_samples)$values, digits = 3)[1] 1255.543 50.040 0.000 0.000
The eigenvalues are the same except the sample-wise covariance matrix returns two additional zero value eigenvalues. We can remind ourselves that eigenvalues can be though of as influence, so eigenvalues with a value of zero have no influence and are not considered Principal Components. We can solve the cov_matrix_samples’s \(\lambda\) values by hand and it will be a bit more clear why zero terms emerge.
Simplifying this would yield:
Therefore, the values, which satisfy \(\lambda\) would be 0, 1255.54 and 50.04.
If we go back to our number of genes, 2, and numbers of samples, 4, we can see that the number of PCs was equal to the whichever was fewer, n_samples or n_genes. However, we stated that the number of PCs ≤ minimum(number_of_genes, number_of_samples), not just equal to it. In the next section we will explore why this is the case.
In practice, we usually use gene-wise covariance in bulk RNA-seq PCA, but the sample-wise perspective helps us see why zeros eigenvalues emerge and how the PCA is fundamentally about the geometry of the data rather than its orientation.
Sources of Reducing the Number of PCs for a PCA
Now that we have demonstrated how the number of PCs = minimum(number_of_genes, number_of_samples), let’s understand sources of data that reduce the number of PCs so that the number of PCs ≤ minimum(number_of_genes, number_of_samples). Four major sources of this come from:
We will provide examples of each below, but first let’s make the expression matrix that we will be using:
# Create an expression matrix to use
expression_matrix <- expression_tibble %>%
as.data.frame() %>%
column_to_rownames("samples") %>%
as.matrix()Duplicated Data
In the first case we will duplicate Gene B and create a new column called Gene_C. We will use the prcomp() function to expedite our PCA:
# Modify the expression matrix to have duplicated data in it
expression_matrix_duplicate <- expression_matrix %>%
cbind(Gene_C = expression_matrix[,"Gene_B"])
# Run prcomp to get PCA results on the expression matrix with duplicated data
duplicate_PCA <- prcomp(expression_matrix_duplicate)
# Print the eigenvalues from the PCA derived from the expression matrix with duplicated data
round(duplicate_PCA$sdev ** 2, digits = 3)[1] 1739.601 72.232 0.000
We note that we are returned 2 non-zero eigenvalues and an eigenvalue with the value of 0. We started with 4 rows (samples) and 3 columns (genes), but we were only returned a PCA with rank 2, because the third column was a duplicate of the second column.
Scaled Data
Another method of reduction is if a column of our data is scaled from data in another column. Let’s go ahead and see an example of this by adding a column for Gene_C that is double the value of Gene_B.
# Modify the expression matrix to have scaled data in it
expression_matrix_scaled <- expression_matrix %>%
cbind(Gene_C = expression_matrix[,"Gene_B"] * 2)
# Run prcomp to get PCA results on the expression matrix with scaled data
scaled_PCA <- prcomp(expression_matrix_scaled)
# Print the eigenvalues from the PCA derived from the expression matrix with scaled data
round(scaled_PCA$sdev ** 2, digits = 3)[1] 3233.430 97.153 0.000
Once again, we can see that we are returned 2 non-zero eigenvalues and one eigenvalue that has a value of 0. Thus, the rank of the PCA is less than the minimum(number_of_genes, number_of_samples).
Linear Combinations of Data
Linear combinations of the data can also reduce the rank of a PCA. Now let’s consider a case where Gene_C is the sum of Gene_A and two times Gene_B.
# Modify the expression matrix to have data that is linear combinations of Gene_A and Gene_B in it
expression_matrix_linear_combinations <- expression_matrix %>%
cbind(Gene_C = expression_matrix[,"Gene_A"] + expression_matrix[,"Gene_B"] * 2)
# Run prcomp to get PCA results on the expression matrix with data containing linear combinations of Gene_A and Gene_B
linear_combinations_PCA <- prcomp(expression_matrix_linear_combinations)
# Print the eigenvalues from the PCA derived from the expression matrix with data containing linear combinations of Gene_A and Gene_B
round(linear_combinations_PCA$sdev ** 2, digits = 3)[1] 6409.772 58.811 0.000
Again, we see a case where we have two non-zero eigenvalues and an eigenvalue with the value of 0.
Constants
The last case we will consider is a column of constants. Let’s consider the case where Gene_C has the value of 16 in every sample.
# Modify the expression matrix to have constant data in it
expression_matrix_constant <- expression_matrix %>%
cbind(Gene_C = c(16, 16, 16, 16))
# Run prcomp to get PCA results on the expression matrix with constant data
constant_PCA <- prcomp(expression_matrix_constant)
# Print the eigenvalues from the PCA derived from the expression matrix with constant data
round(constant_PCA$sdev ** 2, digits = 3)[1] 1255.543 50.040 0.000
This result can be a quite intuitive because let’s reflect on the process that got us here. First, we recentered our expression matrix:
# Determine the center of the data by:
# Finding the average expression of gene A
Gene_A_mean_constant <- mean(expression_matrix_constant[, "Gene_A"])
# Finding the average expression of gene B
Gene_B_mean_constant <- mean(expression_matrix_constant[, "Gene_B"])
# Finding the average expression of gene C
Gene_C_mean_constant <- mean(expression_matrix_constant[, "Gene_C"])
# Shift the data points so that they data is centered on the origin
recentered_expression_matrix_constant <- expression_matrix_constant %>%
as.data.frame() %>%
mutate(
Gene_A = Gene_A - Gene_A_mean_constant,
Gene_B = Gene_B - Gene_B_mean_constant,
Gene_C = Gene_C - Gene_C_mean_constant,
) %>%
as.matrix()
# Print out the re-centered data
recentered_expression_matrix_constant Gene_A Gene_B Gene_C
Sample_1 -25 -23.25 0
Sample_2 -13 2.75 0
Sample_3 40 29.75 0
Sample_4 -2 -9.25 0
From here, we can see that the re-centered value of Gene_C is just 0 for all samples and thus it offers no variance.
Conclusion
The number of PCs, or rank, of a PCA is at most the fewer of the number of samples or the number of genes.