# Load tidyverse
library(tidyverse)
Number of PCs
Approximate time: 45 minutes
Learning Objectives
- Derive the number of PCs for a PCA
Number of PCs
When doing a Principal Components Analysis (PCA), you will be reducing the dimensionality of your data into Principal Components (PCs). The maximum number of PCs, sometimes called rank of the PCA, for scRNA-seq analysis will at most either be the the number of genes or the number of cells, whichever is fewer. In other words, the number of PCs ≤ minimum(number_of_genes, number_of_cells). In the sections below we will discuss the reasons for this. For this lesson we will need to load tidyverse
.
Creation of the Covariance Matrix
Let’s consider the expression_tibble
that we created in the Theory of PCA lesson.
# Print the expression_tibble
expression_tibble
# A tibble: 4 × 3
cells Gene_A Gene_B
<chr> <dbl> <dbl>
1 Cell_1 0 4
2 Cell_2 12 30
3 Cell_3 65 57
4 Cell_4 23 18
Next, we re-centered the data using:
# Determine the center of the data by:
# Finding the average expression of gene A
<- mean(expression_tibble$Gene_A)
Gene_A_mean # Finding the average expression of gene B
<- mean(expression_tibble$Gene_B)
Gene_B_mean
# Shift the data points so that they data is centered on the origin
<- expression_tibble %>%
recentered_expression_tibble mutate(
Gene_A = Gene_A - Gene_A_mean,
Gene_B = Gene_B - Gene_B_mean
)
# Print out the re-centered data
recentered_expression_tibble
# A tibble: 4 × 3
cells Gene_A Gene_B
<chr> <dbl> <dbl>
1 Cell_1 -25 -23.2
2 Cell_2 -13 2.75
3 Cell_3 40 29.8
4 Cell_4 -2 -9.25
We followed this up by creating a gene-wise covariance matrix:
# Move the cell IDs to the rownames and convert the tibble to a matrix
<- recentered_expression_tibble %>%
recentered_expression_matrix column_to_rownames("cells") %>%
as.matrix()
# Create a covariance matrix
<- cov(recentered_expression_matrix)
cov_matrix
# Print out the covariance matrix
cov_matrix
Gene_A Gene_B
Gene_A 799.3333 584.6667
Gene_B 584.6667 506.2500
From this we can see that it is a 2x2 gene-wise covariance matrix. However, we could have alternatively created a 4x4 cell-wise covariance matrix using. Unfortunately, we will need to do this using some for loops. One small note before we start, since we re-centered the data on based on gene, the mean of Gene_A and Gene_B are both 0, so those gene mean terms can drop out of the covariance equation. We can confirm that with:
# Print the mean for Gene_A once it has been re-centered
mean(recentered_expression_matrix[,"Gene_A"])
[1] 0
# Print the mean for Gene_B once it has been re-centered
mean(recentered_expression_matrix[,"Gene_B"])
[1] 0
So now our covariance formula will look like:
# Obtain the number of cells
<- nrow(recentered_expression_matrix)
n_cells # Obtain the number of genes
<- ncol(recentered_expression_matrix)
n_genes
# Create an empty n_cells x n_cells matrix to hold cell-wise covariance values
<- matrix(NA, nrow = n_cells, ncol = n_cells)
cov_matrix_cells # Put the names of the cells as the row names of the cell-wise covariance matrix
rownames(cov_matrix_cells) <- rownames(recentered_expression_matrix)
# Put the names of the cells as the column names of the cell-wise covariance matrix
colnames(cov_matrix_cells) <- rownames(recentered_expression_matrix)
# Estimate the cell-wise covariance matrix
for (i in 1:n_cells) {
for (j in 1:n_cells) {
<- (1 / (n_cells - 1)) * sum(recentered_expression_matrix[i, ] * recentered_expression_matrix[j, ])
cov_matrix_cells[i, j]
}
}
# Print out the cell-wise covariance matrix
cov_matrix_cells
Cell_1 Cell_2 Cell_3 Cell_4
Cell_1 388.52083 87.02083 -563.8958 88.35417
Cell_2 87.02083 58.85417 -146.0625 0.18750
Cell_3 -563.89583 -146.06250 828.3542 -118.39583
Cell_4 88.35417 0.18750 -118.3958 29.85417
In order to avoid confusion, let’s go ahead and rename our gene-wise covariance matrix:
# Rename the gene-wise covariance matrix to avoid confusion
<- cov_matrix cov_matrix_genes
Comparing the gene-wise and cell-wise covariance matrices
Let’s remind ourselves of the eigenvalues returned by our gene-wise covariance matrix:
# Obtain the eigenvalues for the gene-wise covariance matrix
round(eigen(cov_matrix_genes)$values, digits = 3)
[1] 1255.543 50.040
And we can compare those eigenvalues to the eigenvalues returned by our cell-wise covariance matrix:
# Obtain the eigenvalues for the cell-wise covariance matrix and compare them to the eigenvalues for the gene-wise covariance matrix
round(eigen(cov_matrix_cells)$values, digits = 3)
[1] 1255.543 50.040 0.000 0.000
The eigenvalues are the same except the cell-wise covariance matrix returns two additonal zero value eigenvalues. We can remind ourselves that eigenvalues can be though of as influence, so eigenvalues with a value of zero have no influence and are not considered Principal Components. We can solve the cov_matrix_cells
’s \(\lambda\) values by hand and it will be a bit more clear why zero terms emerge.
Simplifying this would yield:
Therefore, the values, which satisfy \(\lambda\) would be 0
, 1255.54
and 50.04
.
If we go back to our number of genes, 2, and numbers of cells, 4, we can see that the number of PCs was equal to the whichever was fewer, n_cells
or n_genes
. However, we stated that the number of PCs ≤ minimum(number_of_genes, number_of_cells), not just equal to it. In the next section we will explore why this is the case.
In practice, we usually use gene-wise covariance in scRNA-seq PCA, but the cell-wise perspective helps us see why zeros eigenvalues emerge and how the PCA is fundamentally about the geometry of the data rather than its orientation.
Sources of Reducing the Number of PCs for a PCA
Now that we have demonstrated how number of PCs = minimum(number_of_genes, number_of_cells), let’s understand sources of data that reduce the number of PCs so that the number of PCs ≤ minimum(number_of_genes, number_of_cells). Four major sources of this come from:
We will provide examples of each below, but first let’s make the expression matrix that we will be using:
# Create an expression matrix to use
<- expression_tibble %>%
expression_matrix as.data.frame() %>%
column_to_rownames("cells") %>%
as.matrix()
Duplicated Data
In the first case we will duplicate Gene B
and create a new column called Gene_C
. We will use the prcomp()
function to expedite our PCA:
# Modify the expression matrix to have duplicated data in it
<- expression_matrix %>%
expression_matrix_duplicate cbind(Gene_C = expression_matrix[,"Gene_B"])
# Run prcomp to get PCA results on the expression matrix with duplicated data
<- prcomp(expression_matrix_duplicate)
duplicate_PCA
# Print the eigenvalues from the PCA derived from the expression matrix with duplicated data
round(duplicate_PCA$sdev ** 2, digits = 3)
[1] 1739.601 72.232 0.000
We note that we are returned 2 non-zero eigenvalues and an eigenvalue with the value of 0. We started with 4 rows (cells) and 3 columns (genes), but we were only returned a PCA with rank 2, because the third column was a duplicate of the second column.
Scaled Data
Another method of reduction is if a column of our data is scaled from data in another column. Let’s go ahead and see an example of this by adding a column for Gene_C that is double the value of Gene_B.
# Modify the expression matrix to have scaled data in it
<- expression_matrix %>%
expression_matrix_scaled cbind(Gene_C = expression_matrix[,"Gene_B"] * 2)
# Run prcomp to get PCA results on the expression matrix with scaled data
<- prcomp(expression_matrix_scaled)
scaled_PCA
# Print the eigenvalues from the PCA derived from the expression matrix with scaled data
round(scaled_PCA$sdev ** 2, digits = 3)
[1] 3233.430 97.153 0.000
Once again, we can see that we are returned 2 non-zero eigenvalues and one eigenvalue that has a value of 0. Thus, the rank of the PCA is less than the minimum(number_of_genes, number_of_cells).
Linear Combinations of Data
Linear combinations of the data can also reduce the rank of a PCA. Now let’s consider a case where Gene_C is the sum of Gene_A and two times Gene_B.
# Modify the expression matrix to have data that is linear combinations of Gene_A and Gene_B in it
<- expression_matrix %>%
expression_matrix_linear_combinations cbind(Gene_C = expression_matrix[,"Gene_A"] + expression_matrix[,"Gene_B"] * 2)
# Run prcomp to get PCA results on the expression matrix with data containing linear combinations of Gene_A and Gene_B
<- prcomp(expression_matrix_linear_combinations)
linear_combinations_PCA
# Print the eigenvalues from the PCA derived from the expression matrix with data containing linear combinations of Gene_A and Gene_B
round(linear_combinations_PCA$sdev ** 2, digits = 3)
[1] 6409.772 58.811 0.000
Again, we see a case where we have two non-zero eigenvalues and an eigenvalue with the value of 0.
Constants
The last case we will consider is a column of constants. Let’s consider the case where Gene_C
has the value of 16 in every cell.
# Modify the expression matrix to have constant data in it
<- expression_matrix %>%
expression_matrix_constant cbind(Gene_C = c(16, 16, 16, 16))
# Run prcomp to get PCA results on the expression matrix with constant data
<- prcomp(expression_matrix_constant)
constant_PCA
# Print the eigenvalues from the PCA derived from the expression matrix with constant data
round(constant_PCA$sdev ** 2, digits = 3)
[1] 1255.543 50.040 0.000
This result can be a quite intuitive because let’s reflect on the process that got us here. First, we recentered our expression matrix:
# Determine the center of the data by:
# Finding the average expression of gene A
<- mean(expression_matrix_constant[, "Gene_A"])
Gene_A_mean_constant # Finding the average expression of gene B
<- mean(expression_matrix_constant[, "Gene_B"])
Gene_B_mean_constant # Finding the average expression of gene C
<- mean(expression_matrix_constant[, "Gene_C"])
Gene_C_mean_constant
# Shift the data points so that they data is centered on the origin
<- expression_matrix_constant %>%
recentered_expression_matrix_constant as.data.frame() %>%
mutate(
Gene_A = Gene_A - Gene_A_mean_constant,
Gene_B = Gene_B - Gene_B_mean_constant,
Gene_C = Gene_C - Gene_C_mean_constant,
%>%
) as.matrix()
# Print out the re-centered data
recentered_expression_matrix_constant
Gene_A Gene_B Gene_C
Cell_1 -25 -23.25 0
Cell_2 -13 2.75 0
Cell_3 40 29.75 0
Cell_4 -2 -9.25 0
From here, we can see that the re-centered value of Gene_C
is just 0 for all cells and thus it offers no variance.
Conclusion
The number of PCs, or rank, of a PCA is at most the fewer of the number of cells or the number of genes.