Number of PCs

RNA-seq

Statistical methods

Data analysis

This lesson explores how to determine the number of Principal Components (PCs) in PCA with a focus on bulk RNA-seq data. Participants will learn why the number of PCs is limited to the smaller of the number of samples or genes, how gene-wise and sample-wise covariance matrices relate to this and how data characteristics, such as duplicated genes, scaling, linear dependence, and constants, reduce the effective dimensionality of the data.

Authors

Will Gammerdinger

Noor Sohail

Published

October 22, 2025

Keywords

Principal Components Analysis, Covariance matrix, Eigenvectors, Eigenvalues, Dimensionality reduction, Rank deficiency

Approximate time: 45 minutes

Learning Objectives

Derive the number of PCs for a PCA

Number of PCs

When doing a Principal Components Analysis (PCA), you will be reducing the dimensionality of your data into Principal Components (PCs). The maximum number of PCs, sometimes called rank of the PCA, for a bulk RNA-seq analysis will at most either be the the number of genes or the number of sample, whichever is fewer. In other words, the number of PCs ≤ minimum(number_of_genes, number_of_samples). Since the number for samples is almost always fewer than the number of genes for bulk RNA-seq experiments, then the number of PCs is usually at most the number of samples. In the sections below we will discuss the reasons for this. For this lesson we will need to load tidyverse.

# Load tidyverse
library(tidyverse)

Creation of the Covariance Matrix

Let’s consider the expression_tibble that we created in the Theory of PCA lesson.

# Print the expression_tibble
expression_tibble

# A tibble: 4 × 3
  samples  Gene_A Gene_B
  <chr>     <dbl>  <dbl>
1 Sample_1      0      4
2 Sample_2     12     30
3 Sample_3     65     57
4 Sample_4     23     18

Next, we re-centered the data using:

# Determine the center of the data by:
# Finding the average expression of gene A
Gene_A_mean <- mean(expression_tibble$Gene_A)
# Finding the average expression of gene B
Gene_B_mean <- mean(expression_tibble$Gene_B)

# Shift the data points so that they data is centered on the origin
recentered_expression_tibble <- expression_tibble %>%
  mutate(
    Gene_A = Gene_A - Gene_A_mean,
    Gene_B = Gene_B - Gene_B_mean
  )

# Print out the re-centered data
recentered_expression_tibble

# A tibble: 4 × 3
  samples  Gene_A Gene_B
  <chr>     <dbl>  <dbl>
1 Sample_1    -25 -23.2 
2 Sample_2    -13   2.75
3 Sample_3     40  29.8 
4 Sample_4     -2  -9.25

We followed this up by creating a gene-wise covariance matrix:

# Move the samples to the rownames and convert the tibble to a matrix
recentered_expression_matrix <- recentered_expression_tibble %>% 
  column_to_rownames("samples") %>% 
  as.matrix()

# Create a covariance matrix
cov_matrix <- cov(recentered_expression_matrix)

# Print out the covariance matrix
cov_matrix

         Gene_A   Gene_B
Gene_A 799.3333 584.6667
Gene_B 584.6667 506.2500

From this we can see that it is a 2x2 gene-wise covariance matrix. However, we could have alternatively created a 4x4 sample-wise covariance matrix using. Unfortunately, we will need to do this using some for loops. One small note before we start, since we re-centered the data based on gene, the mean of Gene_A and Gene_B are both 0, so those gene mean terms can drop out of the covariance equation. We can confirm that with:

# Print the mean for Gene_A once it has been re-centered
mean(recentered_expression_matrix[,"Gene_A"])

[1] 0

# Print the mean for Gene_B once it has been re-centered
mean(recentered_expression_matrix[,"Gene_B"])

[1] 0

So now our covariance formula will look like:

$\mathrm{cov}(X,Y)=\frac{1}{n_\text{samples}-1}\sum_{i=1}^{n_\text{samples}} x_i y_i$

# Obtain the number of samples
n_samples <- nrow(recentered_expression_matrix)
# Obtain the number of genes
n_genes <- ncol(recentered_expression_matrix)

# Create an empty n_samples x n_samples matrix to hold sample-wise covariance values
cov_matrix_samples <- matrix(NA, nrow = n_samples, ncol = n_samples)
# Put the names of the samples as the row names of the sample-wise covariance matrix
rownames(cov_matrix_samples) <- rownames(recentered_expression_matrix)
# Put the names of the samples as the column names of the sample-wise covariance matrix
colnames(cov_matrix_samples) <- rownames(recentered_expression_matrix)

# Estimate the sample-wise covariance matrix
for (i in 1:n_samples) {
  for (j in 1:n_samples) {
    cov_matrix_samples[i, j] <- (1 / (n_samples - 1)) * sum(recentered_expression_matrix[i, ] * recentered_expression_matrix[j, ])
  }
}

# Print out the sample-wise covariance matrix
cov_matrix_samples

           Sample_1   Sample_2  Sample_3   Sample_4
Sample_1  388.52083   87.02083 -563.8958   88.35417
Sample_2   87.02083   58.85417 -146.0625    0.18750
Sample_3 -563.89583 -146.06250  828.3542 -118.39583
Sample_4   88.35417    0.18750 -118.3958   29.85417

In order to avoid confusion, let’s go ahead and rename our gene-wise covariance matrix:

# Rename the gene-wise covariance matrix to avoid confusion
cov_matrix_genes <- cov_matrix

Comparing the gene-wise and sample-wise covariance matrices

Let’s remind ourselves of the eigenvalues returned by our gene-wise covariance matrix:

# Obtain the eigenvalues for the gene-wise covariance matrix
round(eigen(cov_matrix_genes)$values, digits = 3)

[1] 1255.543   50.040

And we can compare those eigenvalues to the eigenvalues returned by our sample-wise covariance matrix:

# Obtain the eigenvalues for the sample-wise covariance matrix and compare them to the eigenvalues for the gene-wise covariance matrix
round(eigen(cov_matrix_samples)$values, digits = 3)

[1] 1255.543   50.040    0.000    0.000

The eigenvalues are the same except the sample-wise covariance matrix returns two additional zero value eigenvalues. We can remind ourselves that eigenvalues can be though of as influence, so eigenvalues with a value of zero have no influence and are not considered Principal Components. We can solve the cov_matrix_samples’s $\lambda$ values by hand and it will be a bit more clear why zero terms emerge.

$\det\left( \begin{bmatrix}388.52083&87.02083&-563.8958&88.35417\\87.02083&58.85417&-146.0625&0.18750\\-563.89583&-146.06250&828.3542&-118.39583\\88.35417&0.18750&-118.3958&29.85417\end{bmatrix}- \lambda\begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}\right)=0$

Simplifying this would yield:

$\lambda^2(\lambda-1255.54)(\lambda-50.04)=0$

Therefore, the values, which satisfy $\lambda$ would be 0, 1255.54 and 50.04.

If we go back to our number of genes, 2, and numbers of samples, 4, we can see that the number of PCs was equal to the whichever was fewer, n_samples or n_genes. However, we stated that the number of PCs ≤ minimum(number_of_genes, number_of_samples), not just equal to it. In the next section we will explore why this is the case.

Note

In practice, we usually use gene-wise covariance in bulk RNA-seq PCA, but the sample-wise perspective helps us see why zeros eigenvalues emerge and how the PCA is fundamentally about the geometry of the data rather than its orientation.

Sources of Reducing the Number of PCs for a PCA

Now that we have demonstrated how the number of PCs = minimum(number_of_genes, number_of_samples), let’s understand sources of data that reduce the number of PCs so that the number of PCs ≤ minimum(number_of_genes, number_of_samples). Four major sources of this come from:

Duplicated Data
Scaled Data
Linear Combinations of Data
Constants

We will provide examples of each below, but first let’s make the expression matrix that we will be using:

# Create an expression matrix to use
expression_matrix <- expression_tibble %>% 
  as.data.frame() %>% 
  column_to_rownames("samples") %>% 
  as.matrix()

Duplicated Data

In the first case we will duplicate Gene B and create a new column called Gene_C. We will use the prcomp() function to expedite our PCA:

# Modify the expression matrix to have duplicated data in it
expression_matrix_duplicate <- expression_matrix %>% 
  cbind(Gene_C = expression_matrix[,"Gene_B"])

# Run prcomp to get PCA results on the expression matrix with duplicated data
duplicate_PCA <- prcomp(expression_matrix_duplicate)

# Print the eigenvalues from the PCA derived from the expression matrix with duplicated data
round(duplicate_PCA$sdev ** 2, digits = 3)

[1] 1739.601   72.232    0.000

We note that we are returned 2 non-zero eigenvalues and an eigenvalue with the value of 0. We started with 4 rows (samples) and 3 columns (genes), but we were only returned a PCA with rank 2, because the third column was a duplicate of the second column.

Scaled Data

Another method of reduction is if a column of our data is scaled from data in another column. Let’s go ahead and see an example of this by adding a column for Gene_C that is double the value of Gene_B.

# Modify the expression matrix to have scaled data in it
expression_matrix_scaled <- expression_matrix %>% 
  cbind(Gene_C = expression_matrix[,"Gene_B"] * 2)

# Run prcomp to get PCA results on the expression matrix with scaled data
scaled_PCA <- prcomp(expression_matrix_scaled)

# Print the eigenvalues from the PCA derived from the expression matrix with scaled data
round(scaled_PCA$sdev ** 2, digits = 3)

[1] 3233.430   97.153    0.000

Once again, we can see that we are returned 2 non-zero eigenvalues and one eigenvalue that has a value of 0. Thus, the rank of the PCA is less than the minimum(number_of_genes, number_of_samples).

Linear Combinations of Data

Linear combinations of the data can also reduce the rank of a PCA. Now let’s consider a case where Gene_C is the sum of Gene_A and two times Gene_B.

# Modify the expression matrix to have data that is linear combinations of Gene_A and Gene_B in it
expression_matrix_linear_combinations <- expression_matrix %>% 
  cbind(Gene_C = expression_matrix[,"Gene_A"] + expression_matrix[,"Gene_B"] * 2)

# Run prcomp to get PCA results on the expression matrix with data containing linear combinations of Gene_A and Gene_B
linear_combinations_PCA <- prcomp(expression_matrix_linear_combinations)

# Print the eigenvalues from the PCA derived from the expression matrix with data containing linear combinations of Gene_A and Gene_B
round(linear_combinations_PCA$sdev ** 2, digits = 3)

[1] 6409.772   58.811    0.000

Again, we see a case where we have two non-zero eigenvalues and an eigenvalue with the value of 0.

Constants

The last case we will consider is a column of constants. Let’s consider the case where Gene_C has the value of 16 in every sample.

# Modify the expression matrix to have constant data in it
expression_matrix_constant <- expression_matrix %>% 
  cbind(Gene_C = c(16, 16, 16, 16))

# Run prcomp to get PCA results on the expression matrix with constant data
constant_PCA <- prcomp(expression_matrix_constant)

# Print the eigenvalues from the PCA derived from the expression matrix with constant data
round(constant_PCA$sdev ** 2, digits = 3)

[1] 1255.543   50.040    0.000

This result can be a quite intuitive because let’s reflect on the process that got us here. First, we recentered our expression matrix:

# Determine the center of the data by:
# Finding the average expression of gene A
Gene_A_mean_constant <- mean(expression_matrix_constant[, "Gene_A"])
# Finding the average expression of gene B
Gene_B_mean_constant <- mean(expression_matrix_constant[, "Gene_B"])
# Finding the average expression of gene C
Gene_C_mean_constant <- mean(expression_matrix_constant[, "Gene_C"])

# Shift the data points so that they data is centered on the origin
recentered_expression_matrix_constant <- expression_matrix_constant %>%
  as.data.frame() %>% 
  mutate(
    Gene_A = Gene_A - Gene_A_mean_constant,
    Gene_B = Gene_B - Gene_B_mean_constant,
    Gene_C = Gene_C - Gene_C_mean_constant,
  ) %>% 
  as.matrix()

# Print out the re-centered data
recentered_expression_matrix_constant

         Gene_A Gene_B Gene_C
Sample_1    -25 -23.25      0
Sample_2    -13   2.75      0
Sample_3     40  29.75      0
Sample_4     -2  -9.25      0

From here, we can see that the re-centered value of Gene_C is just 0 for all samples and thus it offers no variance.

Conclusion

The number of PCs, or rank, of a PCA is at most the fewer of the number of samples or the number of genes.

Reuse

CC-BY-4.0

--- title: "Number of PCs" description: | This lesson explores how to determine the number of Principal Components (PCs) in PCA with a focus on bulk RNA-seq data. Participants will learn why the number of PCs is limited to the smaller of the number of samples or genes, how gene-wise and sample-wise covariance matrices relate to this and how data characteristics, such as duplicated genes, scaling, linear dependence, and constants, reduce the effective dimensionality of the data. author: - Will Gammerdinger - Noor Sohail date: "2025-10-22" categories: - RNA-seq - Statistical methods - Data analysis keywords: - Principal Components Analysis - Covariance matrix - Eigenvectors - Eigenvalues - Dimensionality reduction - Rank deficiency license: "CC-BY-4.0" editor_options: markdown: wrap: 72 --- ```{r} #| label: load_libraries #| echo: false library(tidyverse) # Create a vector for samples samples <- c("Sample_1", "Sample_2", "Sample_3", "Sample_4") # Create a vector to hold expression values for Gene A across all of the samples Gene_A <- c(0, 12, 65, 23) # Create a vector to hold expression values for Gene B across all of the samples Gene_B <- c(4, 30, 57, 18) # Create a tibble to hold the sample names and expression values expression_tibble <- tibble(samples, Gene_A, Gene_B) ``` Approximate time: 45 minutes # Learning Objectives - Derive the number of PCs for a PCA # Number of PCs When doing a Principal Components Analysis (PCA), you will be reducing the dimensionality of your data into Principal Components (PCs). The maximum number of PCs, sometimes called rank of the PCA, for a bulk RNA-seq analysis will at most either be the the number of genes or the number of sample, whichever is fewer. In other words, **the number of PCs ≤ minimum(number_of_genes, number_of_samples)**. Since the number for samples is almost always fewer than the number of genes for bulk RNA-seq experiments, then the number of PCs is usually at most the number of samples. In the sections below we will discuss the reasons for this. For this lesson we will need to load `tidyverse`. ```{r} #| label: load_tidyverse # Load tidyverse library(tidyverse) ``` # Creation of the Covariance Matrix Let's consider the `expression_tibble` that we created in the [Theory of PCA](Aside_theory_of_PCA.qmd#Creating-an-example-data-set) lesson. ```{r} #| label: print_expression_tibble # Print the expression_tibble expression_tibble ``` Next, we re-centered the data using: ```{r} #| label: center_of_data # Determine the center of the data by: # Finding the average expression of gene A Gene_A_mean <- mean(expression_tibble$Gene_A) # Finding the average expression of gene B Gene_B_mean <- mean(expression_tibble$Gene_B) # Shift the data points so that they data is centered on the origin recentered_expression_tibble <- expression_tibble %>% mutate( Gene_A = Gene_A - Gene_A_mean, Gene_B = Gene_B - Gene_B_mean ) # Print out the re-centered data recentered_expression_tibble ``` We followed this up by creating a gene-wise covariance matrix: ```{r} #| label: create_gene_covariance_matrix # Move the samples to the rownames and convert the tibble to a matrix recentered_expression_matrix <- recentered_expression_tibble %>% column_to_rownames("samples") %>% as.matrix() # Create a covariance matrix cov_matrix <- cov(recentered_expression_matrix) # Print out the covariance matrix cov_matrix ``` From this we can see that it is a **2x2 gene-wise** covariance matrix. However, we could have alternatively created a **4x4 sample-wise** covariance matrix using. Unfortunately, we will need to do this using some for loops. One small note before we start, since we re-centered the data based on gene, the mean of Gene_A and Gene_B are both 0, so those gene mean terms can drop out of the covariance equation. We can confirm that with: ```{r} #| label: confirm_recented_means # Print the mean for Gene_A once it has been re-centered mean(recentered_expression_matrix[,"Gene_A"]) # Print the mean for Gene_B once it has been re-centered mean(recentered_expression_matrix[,"Gene_B"]) ``` So now our covariance formula will look like: <img src="https://latex.codecogs.com/svg.image?\mathrm{cov}(X,Y)=\frac{1}{n_\text{samples}-1}\sum_{i=1}^{n_\text{samples}} x_i y_i" width="300"/> ```{r} #| label: create_cell_covariance_matrix # Obtain the number of samples n_samples <- nrow(recentered_expression_matrix) # Obtain the number of genes n_genes <- ncol(recentered_expression_matrix) # Create an empty n_samples x n_samples matrix to hold sample-wise covariance values cov_matrix_samples <- matrix(NA, nrow = n_samples, ncol = n_samples) # Put the names of the samples as the row names of the sample-wise covariance matrix rownames(cov_matrix_samples) <- rownames(recentered_expression_matrix) # Put the names of the samples as the column names of the sample-wise covariance matrix colnames(cov_matrix_samples) <- rownames(recentered_expression_matrix) # Estimate the sample-wise covariance matrix for (i in 1:n_samples) { for (j in 1:n_samples) { cov_matrix_samples[i, j] <- (1 / (n_samples - 1)) * sum(recentered_expression_matrix[i, ] * recentered_expression_matrix[j, ]) } } # Print out the sample-wise covariance matrix cov_matrix_samples ``` In order to avoid confusion, let's go ahead and rename our gene-wise covariance matrix: ```{r} #| label: relabel_gene_covariance_matrix # Rename the gene-wise covariance matrix to avoid confusion cov_matrix_genes <- cov_matrix ``` # Comparing the gene-wise and sample-wise covariance matrices Let's remind ourselves of the eigenvalues returned by our gene-wise covariance matrix: ```{r} #| label: gene-wise_eigenvalues # Obtain the eigenvalues for the gene-wise covariance matrix round(eigen(cov_matrix_genes)$values, digits = 3) ``` And we can compare those eigenvalues to the eigenvalues returned by our sample-wise covariance matrix: ```{r} #| label: cell-wise_eigenvalues # Obtain the eigenvalues for the sample-wise covariance matrix and compare them to the eigenvalues for the gene-wise covariance matrix round(eigen(cov_matrix_samples)$values, digits = 3) ``` The eigenvalues are the same except the sample-wise covariance matrix returns two additional zero value eigenvalues. We can remind ourselves that eigenvalues can be though of as influence, so eigenvalues with a value of zero have no influence and are not considered Principal Components. We can solve the `cov_matrix_samples`'s $\lambda$ values by hand and it will be a bit more clear why zero terms emerge. <img src="https://latex.codecogs.com/svg.image?\det\left(%0A\begin{bmatrix}388.52083&87.02083&-563.8958&88.35417\\87.02083&58.85417&-146.0625&0.18750\\-563.89583&-146.06250&828.3542&-118.39583\\88.35417&0.18750&-118.3958&29.85417\end{bmatrix}-%20\lambda\begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}\right)=0" width="600"/> Simplifying this would yield: <img src="https://latex.codecogs.com/svg.image?\lambda^2(\lambda-1255.54)(\lambda-50.04)=0"/> Therefore, the values, which satisfy $\lambda$ would be `0`, `1255.54` and `50.04`. If we go back to our number of genes, `r n_genes`, and numbers of samples, `r n_samples`, we can see that the number of PCs was equal to the whichever was fewer, `n_samples` or `n_genes`. However, we stated that the number of PCs ≤ minimum(number_of_genes, number_of_samples), not just equal to it. In the next section we will explore why this is the case. :::callout-note In practice, we usually use gene-wise covariance in bulk RNA-seq PCA, but the sample-wise perspective helps us see why zeros eigenvalues emerge and how the PCA is fundamentally about the geometry of the data rather than its orientation. ::: # Sources of Reducing the Number of PCs for a PCA Now that we have demonstrated how the number of PCs = minimum(number_of_genes, number_of_samples), let's understand sources of data that reduce the number of PCs so that the number of PCs ≤ minimum(number_of_genes, number_of_samples). Four major sources of this come from: - [Duplicated Data](#duplicated-data) - [Scaled Data](#scaled-data) - [Linear Combinations of Data](#linear-combinations-of-data) - [Constants](#constants) We will provide examples of each below, but first let's make the expression matrix that we will be using: ```{r} #| label: create_expression_matrix_for_reduced_rank # Create an expression matrix to use expression_matrix <- expression_tibble %>% as.data.frame() %>% column_to_rownames("samples") %>% as.matrix() ``` ## Duplicated Data In the first case we will duplicate `Gene B` and create a new column called `Gene_C`. We will use the `prcomp()` function to expedite our PCA: ```{r} #| label: duplicated_data # Modify the expression matrix to have duplicated data in it expression_matrix_duplicate <- expression_matrix %>% cbind(Gene_C = expression_matrix[,"Gene_B"]) # Run prcomp to get PCA results on the expression matrix with duplicated data duplicate_PCA <- prcomp(expression_matrix_duplicate) # Print the eigenvalues from the PCA derived from the expression matrix with duplicated data round(duplicate_PCA$sdev ** 2, digits = 3) ``` We note that we are returned 2 non-zero eigenvalues and an eigenvalue with the value of 0. We started with 4 rows (samples) and 3 columns (genes), but we were only returned a PCA with rank 2, because the third column was a duplicate of the second column. ## Scaled Data Another method of reduction is if a column of our data is scaled from data in another column. Let's go ahead and see an example of this by adding a column for Gene_C that is double the value of Gene_B. ```{r} #| label: scaled_data # Modify the expression matrix to have scaled data in it expression_matrix_scaled <- expression_matrix %>% cbind(Gene_C = expression_matrix[,"Gene_B"] * 2) # Run prcomp to get PCA results on the expression matrix with scaled data scaled_PCA <- prcomp(expression_matrix_scaled) # Print the eigenvalues from the PCA derived from the expression matrix with scaled data round(scaled_PCA$sdev ** 2, digits = 3) ``` Once again, we can see that we are returned 2 non-zero eigenvalues and one eigenvalue that has a value of 0. Thus, the rank of the PCA is less than the minimum(number_of_genes, number_of_samples). ## Linear Combinations of Data Linear combinations of the data can also reduce the rank of a PCA. Now let's consider a case where Gene_C is the sum of Gene_A and two times Gene_B. ```{r} #| label: linear_combinations_data # Modify the expression matrix to have data that is linear combinations of Gene_A and Gene_B in it expression_matrix_linear_combinations <- expression_matrix %>% cbind(Gene_C = expression_matrix[,"Gene_A"] + expression_matrix[,"Gene_B"] * 2) # Run prcomp to get PCA results on the expression matrix with data containing linear combinations of Gene_A and Gene_B linear_combinations_PCA <- prcomp(expression_matrix_linear_combinations) # Print the eigenvalues from the PCA derived from the expression matrix with data containing linear combinations of Gene_A and Gene_B round(linear_combinations_PCA$sdev ** 2, digits = 3) ``` Again, we see a case where we have two non-zero eigenvalues and an eigenvalue with the value of 0. ## Constants The last case we will consider is a column of constants. Let's consider the case where `Gene_C` has the value of 16 in every sample. ```{r} #| label: constants_data # Modify the expression matrix to have constant data in it expression_matrix_constant <- expression_matrix %>% cbind(Gene_C = c(16, 16, 16, 16)) # Run prcomp to get PCA results on the expression matrix with constant data constant_PCA <- prcomp(expression_matrix_constant) # Print the eigenvalues from the PCA derived from the expression matrix with constant data round(constant_PCA$sdev ** 2, digits = 3) ``` This result can be a quite intuitive because let's reflect on the process that got us here. First, we recentered our expression matrix: ```{r} #| label: constant_center_of_data # Determine the center of the data by: # Finding the average expression of gene A Gene_A_mean_constant <- mean(expression_matrix_constant[, "Gene_A"]) # Finding the average expression of gene B Gene_B_mean_constant <- mean(expression_matrix_constant[, "Gene_B"]) # Finding the average expression of gene C Gene_C_mean_constant <- mean(expression_matrix_constant[, "Gene_C"]) # Shift the data points so that they data is centered on the origin recentered_expression_matrix_constant <- expression_matrix_constant %>% as.data.frame() %>% mutate( Gene_A = Gene_A - Gene_A_mean_constant, Gene_B = Gene_B - Gene_B_mean_constant, Gene_C = Gene_C - Gene_C_mean_constant, ) %>% as.matrix() # Print out the re-centered data recentered_expression_matrix_constant ``` From here, we can see that the re-centered value of `Gene_C` is just 0 for all samples and thus it offers no variance. # Conclusion The number of PCs, or rank, of a PCA is at most the fewer of the number of samples or the number of genes.