scRNA-seq Data Normalization Methods Comparison & Derivation on General scRNA-seq Model

Author

Jiachen Ai

# load the necessary libraries
library(readr)
library(dplyr)
library(tidyverse)
library(Seurat)

Warning: package 'Seurat' was built under R version 4.3.3

Warning: package 'SeuratObject' was built under R version 4.3.3

library(patchwork)
library(ggVennDiagram)
library(ggplot2)
library(sctransform)
library(matrixStats)
library(purrr)
library(BayesTools)
library(VGAM)

Warning: package 'VGAM' was built under R version 4.3.3

library(knitr)

Warning: package 'knitr' was built under R version 4.3.3

Q1.

On the given count data, filter out cells with three different criteria respectively:

1) cells with total UMI counts smaller than 300 and larger than 10,000;

2) cells with the number of features smaller than 200;

3) cells with a percent of mitochondrial reads higher than 15%. Report how many cells are filtered out using different metrics.

Answer:

Let’s first load the data and take a look at the first few rows.

# load the data
pbmc_org <- read_csv("hw_data/pbmc.csv")
pbmc_meta <- pbmc_org[, -1, drop = FALSE]
rownames(pbmc_meta) <- pbmc_org$...1

Warning: Setting row names on a tibble is deprecated.

# create a Seurat object
if (file.exists("pbmc.rds")) {
  pbmc <- readRDS("pbmc.rds")
} else {
  pbmc <- CreateSeuratObject(counts = pbmc_meta)
  saveRDS(pbmc, "pbmc.rds")
}

# find the mitochondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

VlnPlot(pbmc, features = c("nFeature_RNA", 
                           "nCount_RNA", 
                           "percent.mt"), 
        ncol = 3)

Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

# each column represents a cell and each row represents a gene

collect the cells with according to the three criteria respectively.

# Criterion 1: Total UMI Counts
# Filter cells with total UMI counts smaller than 300
# or larger than 10000
pbmc_1 <- pbmc@meta.data |>
  subset(nCount_RNA >= 300 & nCount_RNA <= 10000) |> 
  as.data.frame()

Criterion_1_cell_num <- nrow(pbmc_1)
Criterion_1_cell_name <- rownames(pbmc_1)

  
# Criterion 2: Number of Features
# Filter cells with the number of features smaller than 200
pbmc_2 <- pbmc@meta.data |>
  subset(nFeature_RNA >= 200) |> 
  as.data.frame()

Criterion_2_cell_num <- nrow(pbmc_2)
Criterion_2_cell_name <-rownames(pbmc_2)

# Criterion 3: Percent of Mitochondrial Reads
# Filter cells with a percent of mitochondrial reads higher than 15%
pbmc_3 <- pbmc@meta.data |>
  subset(percent.mt <= 15) |> 
  as.data.frame()

Criterion_3_cell_num <- nrow(pbmc_3)
Criterion_3_cell_name <-rownames(pbmc_3)


# print the number of cells filtered out using different metrics
cbind(Criterion_1_cell_num, 
      Criterion_2_cell_num, 
      Criterion_3_cell_num) |> 
  as_tibble() |>
  rename("Criterion 1" = Criterion_1_cell_num, 
         "Criterion 2" = Criterion_2_cell_num, 
         "Criterion 3" = Criterion_3_cell_num) |>
  knitr::kable(format = "markdown")

Criterion 1	Criterion 2	Criterion 3
4844	4900	4944

Q2.

From question1, you get three different sets of filtered cells. Plot the Venn diagram of these three sets. From the diagram, do the three different quality control methods filter out similar cells or not? Explain the reason why they have similar or dissimilar results.

Answer:

# plot the Venn diagram

# create a list of cell names
cell_names <- list(Criterion1 = Criterion_1_cell_name,
                   Criterion2 = Criterion_2_cell_name,
                   Criterion3 = Criterion_3_cell_name)

# create a Venn diagram
ggVennDiagram(cell_names, set_size = 3.5, edge_size = 0.5) +
  scale_fill_gradient(low = "grey90", high = "red") +
  labs(title = 
         "Venn Diagram of Three Different Quality Control Methods")

From the Venn diagram above, we can see that the three different quality control methods have 4792 cells in common. Besides those common cells of three methods, there are 0, 0, and 92 cells that are unique to criterion 1, criterion 2, and criterion 3, respectively. And there are 50 cells that are unique to criterion 1 and criterion 2, 2 cells that are unique to criterion 1 and criterion 3, and 58 cells that are unique to criterion 2 and criterion 3. The criterion 1 filters out cells with low or high UMI counts. The criterion 2 filters out cells with low feature counts (gene counts). The criterion 3 filters out cells with a high percentage of mitochondrial reads.

The reason why the three different quality control methods filter out similar cells is that the three criteria all filter out low-quality cells. nFeature_RNA is the number of genes detected in each cell. nCount_RNA is the total number of molecules detected within a cell. Low nFeature_RNA for a cell indicates that it may be dead/dying or an empty droplet. High nCount_RNA and/or nFeature _RNA indicates that the “cell” may in fact be a doublet (or multiplet).In combination with mitochondrial reads, removing outliers from these groups removes most doublets/dead cells/empty droplets. Therefore, the three different quality control methods are used to filter out low-quality cells, which is why they have similar results.

Criterion 1 and criterion 2 have 50 cells in common, which may be because cells with low total UMI counts are likely to be low-quality cells, and criterion 2 filters out cells with low feature counts, which are also likely to be low-quality cells. Criterion 3 has 92 unique cells, which may be because cells with a high percentage of mitochondrial reads are likely to be low-quality cells.

Q3.

Filtered out the cells using all three criteria. Then, also filtered out genes expressed in less than 10 cells. On the filtered data, perform four different normalization methods: Seurat default log-normalization (log1pCP10k), log1pCPM, SCTransform [you need to set variable.features.n = the number genes after filtering so that SCTransform doesn’t select highly variable genes], and log1pPF (Booeshaghi et al., 2022 and Github)

Answer:

# create a Seurat object
pbmc_small <- CreateSeuratObject(counts = pbmc_meta,
                                 
                                 # filter out genes expressed 
                                 # in less than 10 cells
                                 min.cells = 10,
                                 min.features = 200)

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

Warning: Setting row names on a tibble is deprecated.

Warning: Data is of class tbl_df. Coercing to dgCMatrix.

# Filter out the cells using all three criteria
pbmc_small <- subset(pbmc_small, 
                     cells = intersect(Criterion_1_cell_name, 
                                       intersect(
                                         Criterion_2_cell_name, 
                                         Criterion_3_cell_name)))


# perform the normalization

# Seurat default log-normalization (log1pCP10k)
pbmc_log1pCP10k <- NormalizeData(pbmc_small,
                                 normalization.method = 
                                   "LogNormalize",
                                 scale.factor = 10000)


# log1pCPM
pbmc_log1pCPM <- NormalizeData(pbmc_small, 
                               normalization.method = "LogNormalize", 
                               scale.factor = 1000000)

# SCTransform
pbmc_SCTransform <- Seurat::SCTransform(pbmc_small,
    variable.features.n = nrow(pbmc_small@assays$RNA@layers$counts), 
    return.only.var.genes = TRUE)

# log1pPF
log1pPF <- function(counts) {
  # count data is the #genes by #cells matrix
  ls <- colSums(counts) # library size for each cell
  meanLS <- mean(ls) # mean library size
  mat <- t(log1p(apply(counts, 1, function(x) x / ls * meanLS)))
  return(mat)
}

pbmc_log1pPF <- log1pPF(pbmc_small@assays$RNA@layers$counts)

Q4.

Calculate gene variances on (1) filtered count data and (2) four normalized data, respectively. Also, calculate the gene means before normalization. Draw five different plots showing the gene means vs. different variances. One of the plots should use the variances from the filtered count data. Each of the other four plots shows the variances derived from each normalized data. Compare the plots before and after normalization and explain whether the normalization methods could stabilize the variance.

Answer:

# Calculate gene variances on (1) filtered count data and
gene_var_filtered <- pbmc_small@assays$RNA@layers$counts |> 
  apply(1, var)

# (2) four normalized data, respectively.
gene_var_log1pCP10k <- pbmc_log1pCP10k@assays$RNA@layers$data |> 
  apply(1, var)

gene_var_log1pCPM <- pbmc_log1pCPM@assays$RNA@layers$data |>
  apply(1, var)

gene_var_SCTransform <- pbmc_SCTransform[["SCT"]]@scale.data |>
  apply(1, var)

gene_var_log1pPF <- pbmc_log1pPF |> 
  apply(1, var)

# Calculate the gene means before normalization
gene_mean_before <- pbmc_small@assays$RNA@layers$counts |>
  apply(1, mean)

# Calculate the gene means after normalization
gene_mean_log1pCP10k <- pbmc_log1pCP10k@assays$RNA@layers$data |>
  apply(1, mean)

gene_mean_log1pCPM <- pbmc_log1pCPM@assays$RNA@layers$data |>
  apply(1, mean)

gene_mean_SCTransform <- pbmc_SCTransform[["SCT"]]@scale.data |>
  apply(1, mean)


gene_mean_log1pPF <- pbmc_log1pPF |>
  apply(1, mean)


# Draw five different plots showing the 
# gene means vs. different variances

# Create a data frame for plotting
gene_var_mean_df <- 
  data.frame(Gene_Variance = c(gene_var_filtered, 
                               gene_var_log1pCP10k, 
                               gene_var_log1pCPM, 
                               gene_var_SCTransform, 
                               gene_var_log1pPF),
             Gene_Mean = c(gene_mean_before, 
                           gene_mean_log1pCP10k, 
                           gene_mean_log1pCPM, 
                           gene_mean_SCTransform, 
                           gene_mean_log1pPF),
             Normalization = c(rep("Filtered", 
                                   length(gene_var_filtered)), 
                               rep("log1pCP10k", 
                                   length(gene_var_log1pCP10k)), 
                               rep("log1pCPM", 
                                   length(gene_var_log1pCPM)), 
                               rep("SCTransform", 
                                   length(gene_var_SCTransform)), 
                               rep("log1pPF", 
                                   length(gene_var_log1pPF)))) 
               
# Plot the gene means vs. different variances
ggplot(gene_var_mean_df, aes(x = Gene_Mean, 
                             y = Gene_Variance, 
                             color = Normalization)) +
  geom_point(alpha = 0.3) +
  theme_grey() +
  labs(title = "Gene Means vs. Variances",
       x = "Gene Mean",
       y = "Gene Variance",
       color = "Normalization") +
  facet_wrap(~ Normalization, scales = "free") +
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5))

# only plot for SCTransform

gene_var_mean_df |>
  filter(Normalization == "SCTransform") |>
  ggplot(aes(x = Gene_Mean, 
             y = Gene_Variance,
             color = Normalization)) +
  geom_point(alpha = 0.3) +
  theme_grey() +
  labs(title = "Gene Means vs. Variances (SCTransform)",
       x = "Gene Mean",
       y = "Gene Variance") +
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 7))

Compare: The plot using the filtered count data shows a positive nonlinear correlation between gene means and variances. This is consistent with the biological phenomenon that highly expressed genes tend to have higher variances. After normalization, the nonlinear correlation (looks like quadratic) between gene means and variances still exists, but the scales of variances and means across four normalization methods are different and reduced. So, the normalization methods could stabilize the variance to some extent. Among the four normalization methods, the SCTransform method has the best performance in stabilizing the variance.

Explanation: The reason why the normalization methods could stabilize the variance is that the normalization methods employ techniques such as log transformation, scaling by normalization factors, variance stabilizing transformations, and so on to adjust the library size and sequencing depth across samples, which could possibly reduce the technical noise and make the gene expression data more comparable across samples. As a result, the variance of gene expression data could be stabilized after normalization.

Q5.

Use the Spearman correlation between gene means and variances to evaluate the variance stabilization for each normalization method. And compare it to the Spearman correlation between gene means and variances using count data. Do you think normalization methods could stabilize variance, and which method has the best performance?

Answer:

# Calculate the Spearman correlation
cor_filtered <- cor(gene_mean_before, gene_var_filtered, 
                    method = "spearman")
cor_log1pCP10k <- cor(gene_mean_before, gene_var_log1pCP10k, 
                      method = "spearman")
cor_log1pCPM <- cor(gene_mean_before, gene_var_log1pCPM, 
                    method = "spearman")
cor_SCTransform <- cor(gene_mean_before, gene_var_SCTransform, 
                       method = "spearman")
cor_log1pPF <- cor(gene_mean_before, gene_var_log1pPF, 
                   method = "spearman")

# print the Spearman correlation results

cbind("Filtered Count Data" = cor_filtered, 
      "log1pCP10k" = cor_log1pCP10k, 
      "log1pCPM" = cor_log1pCPM, 
      "SCTransform" = cor_SCTransform, 
      "log1pPF" = cor_log1pPF) |>
  as_tibble() |>
  knitr::kable(format = "markdown")

Filtered Count Data	log1pCP10k	log1pCPM	SCTransform	log1pPF
0.9951933	0.995935	0.9973862	0.955483	0.9910752

# Create a data frame for plotting
cor_df <- data.frame(
  method = c("Filtered Count Data", "log1pCP10k", 
             "log1pCPM", "SCTransform", "log1pPF"),
  cor = c(cor_filtered, cor_log1pCP10k, 
          cor_log1pCPM, cor_SCTransform, cor_log1pPF)
)

# Plot the Spearman correlation
cor_df |> 
  ggplot(aes(x = method, y = cor)) +
  geom_bar(stat = "identity") +
  labs(title = "Spearman Correlation Between Gene Means and Variances",
       x = "Normalization Method",
       y = "Spearman Correlation") +
  theme_minimal()

Based on the Spearman correlation results above, I don’t think normalization methods could stabilize variance. The normalization method with the best performance is SCTransform.

Part 2: Simulated data

Q6.

1)

In the article Sarkar and Stephens, 2021, they proposed a general scRNA-seq model using the measurement model and expression model, i.e.,

Assume the expression model is the point mass distribution, i.e., [this means that is a constant of ; you could check the definition on page 7 of the slides]. Use the point-mass.csv [stores the point mass parameter for each gene] and library-size.csv files to simulate the count data. If the expression model is a point mass distribution, what is the distribution of simulated genes? [Hint: R function: rpois]

Answer:

create a function to simulate count data with different expression

set.seed(123)
ngenes <- 2000
ncells <- 1000

# library_size is a vector of 1000 cells; 
# rate is a vector of 2000 genes

# library_size = Xc+
library_size <- read.csv(
  "hw_data/library-size.csv", row.names = 1)[, 1]

# distribution of expression model is point mass
point_mass <- read.csv("hw_data/point-mass.csv", row.names = 1)[, 1]

# Simulate the count data
Simulate_counts <- matrix(0, ngenes, ncells)
for (i in 1:ncells) {
  Expression_model <- point_mass
  Simulate_counts[, i] <- rpois(ngenes,
                                lambda = 
                                  Expression_model * library_size[i])
}

The distribution of simulated genes is poisson distribution. The prove is as follows:

# Find the distribution of simulated genes
ggplot(Simulate_counts[9, ] |>
         as_tibble() |>
         rename(gene_counts = value),
       aes(x = gene_counts)) +
  geom_bar(fill = "skyblue", alpha = 0.7) +
  labs(title = "Distribution of Poisson Genes",
       x = "Gene Counts",
       y = "Frequency") +
  theme_minimal()

The reason why the distribution of simulated genes is Poisson distribution is that the expression model is a point mass distribution which means that the expression of gene j in cell i is a constant. And a constant times the library size of measurement model is the true expression of gene j in cell i, which still a constant. And the constant is the parameter of the Poisson distribution. Therefore, the distribution of simulated genes is Poisson distribution.

The reason why measurement model is Poisson distribution is: If we assume that the measured molecules in each cell are a simple random sample of all available molecules, then the observed molecule counts for each gene in sample i, \(x_{i1}, ..., x_{ip}\), follows a multivariate hypergeometric distribution. When the number of molecules sampled is much smaller than the total number of molecules, the multivariate hypergeometric distribution can be approximated by a multinomial distribution. And the multinomial distribution can be converted to a Poisson distribution since it does not affect either maximum likelihood or Bayesian inference. Therefore, the distribution of the measurement model is Poisson distribution.

2)

Assume the expression model is the Gamma distribution, i.e., . Use the shape+scale.csv [stores the shape and scale parameters of Gamma distribution for each gene] and library-size.csv files to simulate the count data. If the expression model is a Gamma distribution, what is the distribution of simulated genes? [Hint: R function: rgamma]

Answer:

create a function to simulate count data with gamma distribution

set.seed(123)
ngenes <- 2000
ncells <- 1000

# library_size is a vector of 1000 cells; 
# shape is a vector of 2000 genes; 
# scale is a vector of 2000 genes

# library_size = Xc+
library_size_2 <- read.csv(
  "hw_data/library-size.csv", row.names = 1)[, 1]

# shape = alpha; scale = beta
shape <- read.csv("hw_data/shape+scale.csv", row.names = 1)[, 2]
scale <- read.csv("hw_data/shape+scale.csv", row.names = 1)[, 1]

# Simulate the count data
Simulate_counts_2 <- matrix(0, ngenes, ncells)
for (i in 1:ncells) {
  Expression_model <- rgamma(ngenes, 
                             shape = shape,
                             scale = scale)
  
  Simulate_counts_2[, i] <- 
    rpois(ngenes,
          lambda = 
            Expression_model * library_size_2[i])
}

The distribution of simulated genes is Negative Binomial distribution. The prove is as follows:

# Find the distribution of simulated genes
Simulate_counts_2[1800, ] |>
  as_tibble() |>
  rename(gene_counts = value) |>
  ggplot(aes(x = gene_counts)) +
  geom_bar(fill = "skyblue", alpha = 0.7) +
  labs(title = "Distribution of Negative Binomial Genes",
       x = "Gene Counts",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 20))

Warning: Removed 98 rows containing non-finite outside the scale range
(`stat_count()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

The reason why the distribution of simulated genes is Negative Binomial distribution is that the expression model is a Gamma distribution. The expression of gene j in cell i is \(\lambda_{ij} = m_{ij} / m_{i+}\), where \(m_{ij}\) is the expression of gene j in cell i, and \(m_{i}^{+}\) is the library size of cell i. And the expression of gene j in cell i is a Gamma distribution. Also, the measurement of gene expression is a poisson distribution. Therefore, the observation of gene expression is a negative binomial distribution. Because the Gamma-Poisson mixture is a Negative Binomial distribution. In conclusion, the distribution of simulated genes is Negative Binomial distribution.

Here is the proof:

Q7:

Answering the following questions

1)

Apply four different normalization methods in question 3 on each simulated dataset.

Answer:

normalize the simulated poission dataset

# create Seurat objects for the simulated poission dataset
poission <- CreateSeuratObject(counts = Simulate_counts,
                               
                               # filter out genes expressed 
                               # in less than 10 cells
                               min.cells = 10,
                               min.features = 200)

Warning: Data is of class matrix. Coercing to dgCMatrix.

# perform the normalization

# Seurat default log-normalization (log1pCP10k)
poission_log1pCP10k <- NormalizeData(
  poission@assays$RNA@layers$counts,
  normalization.method = "LogNormalize", 
  scale.factor = 10000)

# log1pCPM
poission_log1pCPM <- NormalizeData(poission@assays$RNA@layers$counts, 
                                   normalization.method = 
                                     "LogNormalize", 
                                   scale.factor = 1e6)


# SCTransform
poission_SCTransform <- Seurat::SCTransform(
  poission,
  variable.features.n = nrow(poission@assays$RNA@layers$counts),
  return.only.var.genes = TRUE)

# log1pPF
poission_log1pPF <- log1pPF(poission@assays$RNA@layers$counts)

normalize the simulated gamma dataset

# create Seurat objects for the simulated gamma dataset
gamma <- CreateSeuratObject(counts = Simulate_counts_2,
                            
                            # filter out genes expressed 
                            # in less than 10 cells
                            min.cells = 10,
                            min.features = 200)

Warning: Data is of class matrix. Coercing to dgCMatrix.

# Seurat default log-normalization (log1pCP10k)
gamma_log1pCP10k <- NormalizeData(gamma@assays$RNA@layers$counts, 
                                  normalization.method = 
                                    "LogNormalize", 
                                  scale.factor = 10000)

# log1pCPM
gamma_log1pCPM <- NormalizeData(gamma@assays$RNA@layers$counts, 
                                normalization.method = 
                                  "LogNormalize", 
                                scale.factor = 1e6)

# SCTransform
gamma_SCTransform <- SCTransform(
  gamma,
  variable.features.n = nrow(gamma@assays$RNA@layers$counts),
  return.only.var.genes = TRUE)

# log1pPF
gamma_log1pPF <- log1pPF(gamma@assays$RNA@layers$counts)

2)

Draw 5 figures with gene means vs variances as in question 4 for each simulated dataset.

Answer:

draw the figures for the poission dataset

# Calculate the gene variance on the original data
poission_gene_variances <- poission@assays$RNA@layers$counts |>
  apply(1, var)

# Calculate the gene variance on the normalized data
poission_gene_variances_log1pCP10k <- poission_log1pCP10k |>
  apply(1, var) 

poission_gene_variances_log1pCPM <- poission_log1pCPM |>
  apply(1, var)

poission_gene_variances_SCTransform <- 
  poission_SCTransform[["SCT"]]@scale.data |>
  apply(1, var)

poission_gene_variances_log1pPF <- poission_log1pPF |>
  apply(1, var)

# Calculate the gene means on the original data
poission_gene_means <- poission@assays$RNA@layers$counts |>
  apply(1, mean)

# calculate the gene means on the normalized data
poission_gene_means_log1pCP10k <- poission_log1pCP10k |>
  apply(1, mean)

poission_gene_means_log1pCPM <- poission_log1pCPM |>
  apply(1, mean)

poission_gene_means_SCTransform <-
  poission_SCTransform[["SCT"]]@scale.data |>
  apply(1, mean)

poission_gene_means_log1pPF <- poission_log1pPF |>
  apply(1, mean)


# Draw five different plots showing the 
# gene means vs. different variances

# Create a data frame for plotting
poission_gene_variances_df <- data.frame(
  Gene_Variance = c(poission_gene_variances,
                    poission_gene_variances_log1pCP10k,
                    poission_gene_variances_log1pCPM,
                    poission_gene_variances_SCTransform,
                    poission_gene_variances_log1pPF),
  poission_gene_means = c(poission_gene_means,
                          poission_gene_means_log1pCP10k,
                          poission_gene_means_log1pCPM,
                          poission_gene_means_SCTransform,
                          poission_gene_means_log1pPF),
  Normalization_Method = rep(c("Original", "log1pCP10k", 
                               "log1pCPM", "SCTransform", 
                               "log1pPF"),
                             each = length(poission_gene_means)))

# Plot the gene means vs. variances
poission_gene_variances_df |>
  ggplot(aes(x = poission_gene_means, 
             y = Gene_Variance, 
             color = Normalization_Method)) +
  geom_point(alpha = 0.3) +
  labs(title = "Gene Means vs. Variances for the Poission Dataset",
       x = "Gene Means",
       y = "Gene Variances") +
  facet_wrap(~ Normalization_Method, scales = "free") +
  theme_gray()+
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 7))

# only plot for SCTransform
poission_gene_variances_df |>
  filter(Normalization_Method == "SCTransform") |>
  ggplot(aes(x = poission_gene_means, 
             y = Gene_Variance, 
             color = Normalization_Method)) +
  geom_point(alpha = 0.3) +
  labs(title = "Gene Means vs. Variances for the Poission Dataset",
       x = "Gene Means",
       y = "Gene Variances") +
  theme_gray()+
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 7))

draw the figures for the gamma dataset

# Calculate the gene variance on the original data
gamma_gene_variances <- gamma@assays$RNA@layers$counts |>
  apply(1, var)

# Calculate the gene variance on the normalized data
gamma_gene_variances_log1pCP10k <- gamma_log1pCP10k |>
  apply(1, var)

gamma_gene_variances_log1pCPM <- gamma_log1pCPM |>
  apply(1, var)

gamma_gene_variances_SCTransform <-
  gamma_SCTransform[["SCT"]]@scale.data |>
  apply(1, var)

gamma_gene_variances_log1pPF <- gamma_log1pPF |>
  apply(1, var)


# Calculate the gene means on the original data
gamma_gene_means <- gamma@assays$RNA@layers$counts |>
  apply(1, mean)

# calculate the gene means on the normalized data
gamma_gene_means_log1pCP10k <- gamma_log1pCP10k |>
  apply(1, mean)

gamma_gene_means_log1pCPM <- gamma_log1pCPM |>
  apply(1, mean)

gamma_gene_means_SCTransform <-
  gamma_SCTransform[["SCT"]]@scale.data |>
  apply(1, mean)

gamma_gene_means_log1pPF <- gamma_log1pPF |>
  apply(1, mean)


# Draw five different plots showing the
# gene means vs. different variances

# Create a data frame for plotting

gamma_gene_variances_df <- data.frame(
  Gene_Variance = c(gamma_gene_variances,
                    gamma_gene_variances_log1pCP10k,
                    gamma_gene_variances_log1pCPM,
                    gamma_gene_variances_SCTransform,
                    gamma_gene_variances_log1pPF),
  gamma_gene_means = c(gamma_gene_means,
                       gamma_gene_means_log1pCP10k,
                       gamma_gene_means_log1pCPM,
                       gamma_gene_means_SCTransform,
                       gamma_gene_means_log1pPF),
  Normalization_Method = rep(c("Original", "log1pCP10k",
                               "log1pCPM", "SCTransform", 
                               "log1pPF"),
                             each = length(gamma_gene_means)))

# Plot the gene means vs. variances
gamma_gene_variances_df |>
  ggplot(aes(x = gamma_gene_means, 
             y = Gene_Variance,
             color = Normalization_Method)) +
  geom_point(alpha = 0.3) +
  labs(title = "Gene Means vs. Variances for the Gamma Dataset",
       x = "Gene Means",
       y = "Gene Variances") +
  facet_wrap(~ Normalization_Method, scales = "free") +
  theme_gray() +
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 7))

# only plot for SCTransform
gamma_gene_variances_df |>
  filter(Normalization_Method == "SCTransform") |>
  ggplot(aes(x = gamma_gene_means, 
             y = Gene_Variance,
             color = Normalization_Method)) +
  geom_point(alpha = 0.3) +
  labs(title = "Gene Means vs. Variances for the Gamma Dataset",
       x = "Gene Means",
       y = "Gene Variances") +
  theme_gray()+
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 7))

3)

Use Spearman correlation to evaluate normalization methods as in question 5 for each simulated dataset.

Answer:

calculate the Spearman correlation for the poission dataset

# Calculate the Spearman correlation

cor_poission <- cor(poission_gene_variances, poission_gene_means,
                    method = "spearman")

cor_poission_log1pCP10k <- cor(poission_gene_variances_log1pCP10k,
                               poission_gene_means,
                               method = "spearman")

cor_poission_log1pCPM <- cor(poission_gene_variances_log1pCPM,
                             poission_gene_means,
                             method = "spearman")

cor_poission_SCTransform <- cor(poission_gene_variances_SCTransform,
                                poission_gene_means,
                                method = "spearman")

cor_poission_log1pPF <- cor(poission_gene_variances_log1pPF,
                            poission_gene_means,
                            method = "spearman")

# print the Spearman correlation results

cbind("Original" = cor_poission,
      "log1pCP10k" = cor_poission_log1pCP10k,
      "log1pCPM" = cor_poission_log1pCPM,
      "SCTransform" = cor_poission_SCTransform,
      "log1pPF" = cor_poission_log1pPF) |>
  as_tibble() |>
  kable(format = "markdown")

Original	log1pCP10k	log1pCPM	SCTransform	log1pPF
0.9992452	0.9063637	0.8364113	0.6539515	0.9675133

# Create a data frame for plotting

poission_correlation_df <- data.frame(
  Normalization_Method = c("Original", "log1pCP10k",
                           "log1pCPM", "SCTransform", "log1pPF"),
  
  Spearman_Correlation
  = c(cor_poission, cor_poission_log1pCP10k,
      cor_poission_log1pCPM, cor_poission_SCTransform,
      cor_poission_log1pPF))

# Plot the Spearman correlation

poission_correlation_df |>
  ggplot(aes(x = Normalization_Method, 
             y = Spearman_Correlation
             )) +
  geom_bar(stat = "identity") +
  labs(title = "Spearman Correlation for Poission Dataset",
       x = "Normalization Method",
       y = "Spearman Correlation") +
  theme_minimal()

calculate the Spearman correlation for the gamma dataset

# Calculate the Spearman correlation
cor_gamma <- cor(gamma_gene_variances, gamma_gene_means,
                 method = "spearman")

cor_gamma_log1pCP10k <- cor(gamma_gene_variances_log1pCP10k,
                            gamma_gene_means,
                            method = "spearman")

cor_gamma_log1pCPM <- cor(gamma_gene_variances_log1pCPM,
                          gamma_gene_means,
                          method = "spearman")

cor_gamma_SCTransform <- cor(gamma_gene_variances_SCTransform,
                             gamma_gene_means,
                             method = "spearman")

cor_gamma_log1pPF <- cor(gamma_gene_variances_log1pPF,
                         gamma_gene_means,
                         method = "spearman")

# print the Spearman correlation results
cbind("Original" = cor_gamma,
      "log1pCP10k" = cor_gamma_log1pCP10k,
      "log1pCPM" = cor_gamma_log1pCPM,
      "SCTransform" = cor_gamma_SCTransform,
      "log1pPF" = cor_gamma_log1pPF) |>
  as_tibble() |>
  kable(format = "markdown")

Original	log1pCP10k	log1pCPM	SCTransform	log1pPF
0.967265	0.9895215	0.9564906	0.9596552	0.9650282

# Create a data frame for plotting
gamma_correlation_df <- data.frame(
  Normalization_Method = c("Original", "log1pCP10k",
                           "log1pCPM", "SCTransform", 
                           "log1pPF"),
  
  Spearman_Correlation
  = c(cor_gamma, cor_gamma_log1pCP10k,
      cor_gamma_log1pCPM, cor_gamma_SCTransform,
      cor_gamma_log1pPF))

# Plot the Spearman correlation
gamma_correlation_df |>
  ggplot(aes(x = Normalization_Method, 
             y = Spearman_Correlation
             )) +
  geom_bar(stat = "identity") +
  labs(title = "Spearman Correlation for Gamma Dataset",
       x = "Normalization Method",
       y = "Spearman Correlation") +
  theme_minimal()