Warning: package 'VGAM' was built under R version 4.3.3
library(knitr)
Warning: package 'knitr' was built under R version 4.3.3
Q1.
On the given count data, filter out cells with three different criteria respectively:
1) cells with total UMI counts smaller than 300 and larger than 10,000;
2) cells with the number of features smaller than 200;
3) cells with a percent of mitochondrial reads higher than 15%. Report how many cells are filtered out using different metrics.
Answer:
Let’s first load the data and take a look at the first few rows.
# load the datapbmc_org <-read_csv("hw_data/pbmc.csv")pbmc_meta <- pbmc_org[, -1, drop =FALSE]rownames(pbmc_meta) <- pbmc_org$...1
Warning: Setting row names on a tibble is deprecated.
# create a Seurat objectif (file.exists("pbmc.rds")) { pbmc <-readRDS("pbmc.rds")} else { pbmc <-CreateSeuratObject(counts = pbmc_meta)saveRDS(pbmc, "pbmc.rds")}# find the mitochondrial genespbmc[["percent.mt"]] <-PercentageFeatureSet(pbmc, pattern ="^MT-")VlnPlot(pbmc, features =c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol =3)
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.
# each column represents a cell and each row represents a gene
collect the cells with according to the three criteria respectively.
# Criterion 1: Total UMI Counts# Filter cells with total UMI counts smaller than 300# or larger than 10000pbmc_1 <- pbmc@meta.data |>subset(nCount_RNA >=300& nCount_RNA <=10000) |>as.data.frame()Criterion_1_cell_num <-nrow(pbmc_1)Criterion_1_cell_name <-rownames(pbmc_1)# Criterion 2: Number of Features# Filter cells with the number of features smaller than 200pbmc_2 <- pbmc@meta.data |>subset(nFeature_RNA >=200) |>as.data.frame()Criterion_2_cell_num <-nrow(pbmc_2)Criterion_2_cell_name <-rownames(pbmc_2)# Criterion 3: Percent of Mitochondrial Reads# Filter cells with a percent of mitochondrial reads higher than 15%pbmc_3 <- pbmc@meta.data |>subset(percent.mt <=15) |>as.data.frame()Criterion_3_cell_num <-nrow(pbmc_3)Criterion_3_cell_name <-rownames(pbmc_3)# print the number of cells filtered out using different metricscbind(Criterion_1_cell_num, Criterion_2_cell_num, Criterion_3_cell_num) |>as_tibble() |>rename("Criterion 1"= Criterion_1_cell_num, "Criterion 2"= Criterion_2_cell_num, "Criterion 3"= Criterion_3_cell_num) |> knitr::kable(format ="markdown")
Criterion 1
Criterion 2
Criterion 3
4844
4900
4944
Q2.
From question1, you get three different sets of filtered cells. Plot the Venn diagram of these three sets. From the diagram, do the three different quality control methods filter out similar cells or not? Explain the reason why they have similar or dissimilar results.
Answer:
# plot the Venn diagram# create a list of cell namescell_names <-list(Criterion1 = Criterion_1_cell_name,Criterion2 = Criterion_2_cell_name,Criterion3 = Criterion_3_cell_name)# create a Venn diagramggVennDiagram(cell_names, set_size =3.5, edge_size =0.5) +scale_fill_gradient(low ="grey90", high ="red") +labs(title ="Venn Diagram of Three Different Quality Control Methods")
From the Venn diagram above, we can see that the three different quality control methods have 4792 cells in common. Besides those common cells of three methods, there are 0, 0, and 92 cells that are unique to criterion 1, criterion 2, and criterion 3, respectively. And there are 50 cells that are unique to criterion 1 and criterion 2, 2 cells that are unique to criterion 1 and criterion 3, and 58 cells that are unique to criterion 2 and criterion 3. The criterion 1 filters out cells with low or high UMI counts. The criterion 2 filters out cells with low feature counts (gene counts). The criterion 3 filters out cells with a high percentage of mitochondrial reads.
The reason why the three different quality control methods filter out similar cells is that the three criteria all filter out low-quality cells. nFeature_RNA is the number of genes detected in each cell. nCount_RNA is the total number of molecules detected within a cell. Low nFeature_RNA for a cell indicates that it may be dead/dying or an empty droplet. High nCount_RNA and/or nFeature _RNA indicates that the “cell” may in fact be a doublet (or multiplet).In combination with mitochondrial reads, removing outliers from these groups removes most doublets/dead cells/empty droplets. Therefore, the three different quality control methods are used to filter out low-quality cells, which is why they have similar results.
Criterion 1 and criterion 2 have 50 cells in common, which may be because cells with low total UMI counts are likely to be low-quality cells, and criterion 2 filters out cells with low feature counts, which are also likely to be low-quality cells. Criterion 3 has 92 unique cells, which may be because cells with a high percentage of mitochondrial reads are likely to be low-quality cells.
Q3.
Filtered out the cells using all three criteria. Then, also filtered out genes expressed in less than 10 cells. On the filtered data, perform four different normalization methods: Seurat default log-normalization (log1pCP10k), log1pCPM, SCTransform [you need to set variable.features.n = the number genes after filtering so that SCTransform doesn’t select highly variable genes], and log1pPF (Booeshaghi et al., 2022 and Github)
Answer:
# create a Seurat objectpbmc_small <-CreateSeuratObject(counts = pbmc_meta,# filter out genes expressed # in less than 10 cellsmin.cells =10,min.features =200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
Warning: Setting row names on a tibble is deprecated.
Warning: Data is of class tbl_df. Coercing to dgCMatrix.
# Filter out the cells using all three criteriapbmc_small <-subset(pbmc_small, cells =intersect(Criterion_1_cell_name, intersect( Criterion_2_cell_name, Criterion_3_cell_name)))# perform the normalization# Seurat default log-normalization (log1pCP10k)pbmc_log1pCP10k <-NormalizeData(pbmc_small,normalization.method ="LogNormalize",scale.factor =10000)# log1pCPMpbmc_log1pCPM <-NormalizeData(pbmc_small, normalization.method ="LogNormalize", scale.factor =1000000)# SCTransformpbmc_SCTransform <- Seurat::SCTransform(pbmc_small,variable.features.n =nrow(pbmc_small@assays$RNA@layers$counts), return.only.var.genes =TRUE)# log1pPFlog1pPF <-function(counts) {# count data is the #genes by #cells matrix ls <-colSums(counts) # library size for each cell meanLS <-mean(ls) # mean library size mat <-t(log1p(apply(counts, 1, function(x) x / ls * meanLS)))return(mat)}pbmc_log1pPF <-log1pPF(pbmc_small@assays$RNA@layers$counts)
Q4.
Calculate gene variances on (1) filtered count data and (2) four normalized data, respectively. Also, calculate the gene means before normalization. Draw five different plots showing the gene means vs. different variances. One of the plots should use the variances from the filtered count data. Each of the other four plots shows the variances derived from each normalized data. Compare the plots before and after normalization and explain whether the normalization methods could stabilize the variance.
Answer:
# Calculate gene variances on (1) filtered count data andgene_var_filtered <- pbmc_small@assays$RNA@layers$counts |>apply(1, var)# (2) four normalized data, respectively.gene_var_log1pCP10k <- pbmc_log1pCP10k@assays$RNA@layers$data |>apply(1, var)gene_var_log1pCPM <- pbmc_log1pCPM@assays$RNA@layers$data |>apply(1, var)gene_var_SCTransform <- pbmc_SCTransform[["SCT"]]@scale.data |>apply(1, var)gene_var_log1pPF <- pbmc_log1pPF |>apply(1, var)# Calculate the gene means before normalizationgene_mean_before <- pbmc_small@assays$RNA@layers$counts |>apply(1, mean)# Calculate the gene means after normalizationgene_mean_log1pCP10k <- pbmc_log1pCP10k@assays$RNA@layers$data |>apply(1, mean)gene_mean_log1pCPM <- pbmc_log1pCPM@assays$RNA@layers$data |>apply(1, mean)gene_mean_SCTransform <- pbmc_SCTransform[["SCT"]]@scale.data |>apply(1, mean)gene_mean_log1pPF <- pbmc_log1pPF |>apply(1, mean)# Draw five different plots showing the # gene means vs. different variances# Create a data frame for plottinggene_var_mean_df <-data.frame(Gene_Variance =c(gene_var_filtered, gene_var_log1pCP10k, gene_var_log1pCPM, gene_var_SCTransform, gene_var_log1pPF),Gene_Mean =c(gene_mean_before, gene_mean_log1pCP10k, gene_mean_log1pCPM, gene_mean_SCTransform, gene_mean_log1pPF),Normalization =c(rep("Filtered", length(gene_var_filtered)), rep("log1pCP10k", length(gene_var_log1pCP10k)), rep("log1pCPM", length(gene_var_log1pCPM)), rep("SCTransform", length(gene_var_SCTransform)), rep("log1pPF", length(gene_var_log1pPF)))) # Plot the gene means vs. different variancesggplot(gene_var_mean_df, aes(x = Gene_Mean, y = Gene_Variance, color = Normalization)) +geom_point(alpha =0.3) +theme_grey() +labs(title ="Gene Means vs. Variances",x ="Gene Mean",y ="Gene Variance",color ="Normalization") +facet_wrap(~ Normalization, scales ="free") +theme(legend.position ="bottom") +scale_x_continuous(breaks = scales::pretty_breaks(n =5))
# only plot for SCTransformgene_var_mean_df |>filter(Normalization =="SCTransform") |>ggplot(aes(x = Gene_Mean, y = Gene_Variance,color = Normalization)) +geom_point(alpha =0.3) +theme_grey() +labs(title ="Gene Means vs. Variances (SCTransform)",x ="Gene Mean",y ="Gene Variance") +theme(legend.position ="bottom") +scale_x_continuous(breaks = scales::pretty_breaks(n =7))
Compare: The plot using the filtered count data shows a positive nonlinear correlation between gene means and variances. This is consistent with the biological phenomenon that highly expressed genes tend to have higher variances. After normalization, the nonlinear correlation (looks like quadratic) between gene means and variances still exists, but the scales of variances and means across four normalization methods are different and reduced. So, the normalization methods could stabilize the variance to some extent. Among the four normalization methods, the SCTransform method has the best performance in stabilizing the variance.
Explanation: The reason why the normalization methods could stabilize the variance is that the normalization methods employ techniques such as log transformation, scaling by normalization factors, variance stabilizing transformations, and so on to adjust the library size and sequencing depth across samples, which could possibly reduce the technical noise and make the gene expression data more comparable across samples. As a result, the variance of gene expression data could be stabilized after normalization.
Q5.
Use the Spearman correlation between gene means and variances to evaluate the variance stabilization for each normalization method. And compare it to the Spearman correlation between gene means and variances using count data. Do you think normalization methods could stabilize variance, and which method has the best performance?
# Create a data frame for plottingcor_df <-data.frame(method =c("Filtered Count Data", "log1pCP10k", "log1pCPM", "SCTransform", "log1pPF"),cor =c(cor_filtered, cor_log1pCP10k, cor_log1pCPM, cor_SCTransform, cor_log1pPF))# Plot the Spearman correlationcor_df |>ggplot(aes(x = method, y = cor)) +geom_bar(stat ="identity") +labs(title ="Spearman Correlation Between Gene Means and Variances",x ="Normalization Method",y ="Spearman Correlation") +theme_minimal()
Based on the Spearman correlation results above, I don’t think normalization methods could stabilize variance. The normalization method with the best performance is SCTransform.
Part 2: Simulated data
Q6.
1)
In the article Sarkar and Stephens, 2021, they proposed a general scRNA-seq model using the measurement model and expression model, i.e.,
Assume the expression model is the point mass distribution, i.e., [this means that is a constant of ; you could check the definition on page 7 of the slides]. Use the point-mass.csv [stores the point mass parameter for each gene] and library-size.csv files to simulate the count data. If the expression model is a point mass distribution, what is the distribution of simulated genes? [Hint: R function: rpois]
Answer:
create a function to simulate count data with different expression
set.seed(123)ngenes <-2000ncells <-1000# library_size is a vector of 1000 cells; # rate is a vector of 2000 genes# library_size = Xc+library_size <-read.csv("hw_data/library-size.csv", row.names =1)[, 1]# distribution of expression model is point masspoint_mass <-read.csv("hw_data/point-mass.csv", row.names =1)[, 1]# Simulate the count dataSimulate_counts <-matrix(0, ngenes, ncells)for (i in1:ncells) { Expression_model <- point_mass Simulate_counts[, i] <-rpois(ngenes,lambda = Expression_model * library_size[i])}
The distribution of simulated genes is poisson distribution. The prove is as follows:
# Find the distribution of simulated genesggplot(Simulate_counts[9, ] |>as_tibble() |>rename(gene_counts = value),aes(x = gene_counts)) +geom_bar(fill ="skyblue", alpha =0.7) +labs(title ="Distribution of Poisson Genes",x ="Gene Counts",y ="Frequency") +theme_minimal()
The reason why the distribution of simulated genes is Poisson distribution is that the expression model is a point mass distribution which means that the expression of gene j in cell i is a constant. And a constant times the library size of measurement model is the true expression of gene j in cell i, which still a constant. And the constant is the parameter of the Poisson distribution. Therefore, the distribution of simulated genes is Poisson distribution.
The reason why measurement model is Poisson distribution is: If we assume that the measured molecules in each cell are a simple random sample of all available molecules, then the observed molecule counts for each gene in sample i, \(x_{i1}, ..., x_{ip}\), follows a multivariate hypergeometric distribution. When the number of molecules sampled is much smaller than the total number of molecules, the multivariate hypergeometric distribution can be approximated by a multinomial distribution. And the multinomial distribution can be converted to a Poisson distribution since it does not affect either maximum likelihood or Bayesian inference. Therefore, the distribution of the measurement model is Poisson distribution.
2)
Assume the expression model is the Gamma distribution, i.e., . Use the shape+scale.csv [stores the shape and scale parameters of Gamma distribution for each gene] and library-size.csv files to simulate the count data. If the expression model is a Gamma distribution, what is the distribution of simulated genes? [Hint: R function: rgamma]
Answer:
create a function to simulate count data with gamma distribution
set.seed(123)ngenes <-2000ncells <-1000# library_size is a vector of 1000 cells; # shape is a vector of 2000 genes; # scale is a vector of 2000 genes# library_size = Xc+library_size_2 <-read.csv("hw_data/library-size.csv", row.names =1)[, 1]# shape = alpha; scale = betashape <-read.csv("hw_data/shape+scale.csv", row.names =1)[, 2]scale <-read.csv("hw_data/shape+scale.csv", row.names =1)[, 1]# Simulate the count dataSimulate_counts_2 <-matrix(0, ngenes, ncells)for (i in1:ncells) { Expression_model <-rgamma(ngenes, shape = shape,scale = scale) Simulate_counts_2[, i] <-rpois(ngenes,lambda = Expression_model * library_size_2[i])}
The distribution of simulated genes is Negative Binomial distribution. The prove is as follows:
# Find the distribution of simulated genesSimulate_counts_2[1800, ] |>as_tibble() |>rename(gene_counts = value) |>ggplot(aes(x = gene_counts)) +geom_bar(fill ="skyblue", alpha =0.7) +labs(title ="Distribution of Negative Binomial Genes",x ="Gene Counts",y ="Frequency") +theme_minimal() +scale_x_continuous(limits =c(0, 10)) +scale_y_continuous(limits =c(0, 20))
Warning: Removed 98 rows containing non-finite outside the scale range
(`stat_count()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
The reason why the distribution of simulated genes is Negative Binomial distribution is that the expression model is a Gamma distribution. The expression of gene j in cell i is \(\lambda_{ij} = m_{ij} / m_{i+}\), where \(m_{ij}\) is the expression of gene j in cell i, and \(m_{i}^{+}\) is the library size of cell i. And the expression of gene j in cell i is a Gamma distribution. Also, the measurement of gene expression is a poisson distribution. Therefore, the observation of gene expression is a negative binomial distribution. Because the Gamma-Poisson mixture is a Negative Binomial distribution. In conclusion, the distribution of simulated genes is Negative Binomial distribution.
Here is the proof:
Q7:
Answering the following questions
1)
Apply four different normalization methods in question 3 on each simulated dataset.
Answer:
normalize the simulated poission dataset
# create Seurat objects for the simulated poission datasetpoission <-CreateSeuratObject(counts = Simulate_counts,# filter out genes expressed # in less than 10 cellsmin.cells =10,min.features =200)
Warning: Data is of class matrix. Coercing to dgCMatrix.
# create Seurat objects for the simulated gamma datasetgamma <-CreateSeuratObject(counts = Simulate_counts_2,# filter out genes expressed # in less than 10 cellsmin.cells =10,min.features =200)
Warning: Data is of class matrix. Coercing to dgCMatrix.
Draw 5 figures with gene means vs variances as in question 4 for each simulated dataset.
Answer:
draw the figures for the poission dataset
# Calculate the gene variance on the original datapoission_gene_variances <- poission@assays$RNA@layers$counts |>apply(1, var)# Calculate the gene variance on the normalized datapoission_gene_variances_log1pCP10k <- poission_log1pCP10k |>apply(1, var) poission_gene_variances_log1pCPM <- poission_log1pCPM |>apply(1, var)poission_gene_variances_SCTransform <- poission_SCTransform[["SCT"]]@scale.data |>apply(1, var)poission_gene_variances_log1pPF <- poission_log1pPF |>apply(1, var)# Calculate the gene means on the original datapoission_gene_means <- poission@assays$RNA@layers$counts |>apply(1, mean)# calculate the gene means on the normalized datapoission_gene_means_log1pCP10k <- poission_log1pCP10k |>apply(1, mean)poission_gene_means_log1pCPM <- poission_log1pCPM |>apply(1, mean)poission_gene_means_SCTransform <- poission_SCTransform[["SCT"]]@scale.data |>apply(1, mean)poission_gene_means_log1pPF <- poission_log1pPF |>apply(1, mean)# Draw five different plots showing the # gene means vs. different variances# Create a data frame for plottingpoission_gene_variances_df <-data.frame(Gene_Variance =c(poission_gene_variances, poission_gene_variances_log1pCP10k, poission_gene_variances_log1pCPM, poission_gene_variances_SCTransform, poission_gene_variances_log1pPF),poission_gene_means =c(poission_gene_means, poission_gene_means_log1pCP10k, poission_gene_means_log1pCPM, poission_gene_means_SCTransform, poission_gene_means_log1pPF),Normalization_Method =rep(c("Original", "log1pCP10k", "log1pCPM", "SCTransform", "log1pPF"),each =length(poission_gene_means)))# Plot the gene means vs. variancespoission_gene_variances_df |>ggplot(aes(x = poission_gene_means, y = Gene_Variance, color = Normalization_Method)) +geom_point(alpha =0.3) +labs(title ="Gene Means vs. Variances for the Poission Dataset",x ="Gene Means",y ="Gene Variances") +facet_wrap(~ Normalization_Method, scales ="free") +theme_gray()+theme(legend.position ="bottom") +scale_x_continuous(breaks = scales::pretty_breaks(n =7))
# only plot for SCTransformpoission_gene_variances_df |>filter(Normalization_Method =="SCTransform") |>ggplot(aes(x = poission_gene_means, y = Gene_Variance, color = Normalization_Method)) +geom_point(alpha =0.3) +labs(title ="Gene Means vs. Variances for the Poission Dataset",x ="Gene Means",y ="Gene Variances") +theme_gray()+theme(legend.position ="bottom") +scale_x_continuous(breaks = scales::pretty_breaks(n =7))
draw the figures for the gamma dataset
# Calculate the gene variance on the original datagamma_gene_variances <- gamma@assays$RNA@layers$counts |>apply(1, var)# Calculate the gene variance on the normalized datagamma_gene_variances_log1pCP10k <- gamma_log1pCP10k |>apply(1, var)gamma_gene_variances_log1pCPM <- gamma_log1pCPM |>apply(1, var)gamma_gene_variances_SCTransform <- gamma_SCTransform[["SCT"]]@scale.data |>apply(1, var)gamma_gene_variances_log1pPF <- gamma_log1pPF |>apply(1, var)# Calculate the gene means on the original datagamma_gene_means <- gamma@assays$RNA@layers$counts |>apply(1, mean)# calculate the gene means on the normalized datagamma_gene_means_log1pCP10k <- gamma_log1pCP10k |>apply(1, mean)gamma_gene_means_log1pCPM <- gamma_log1pCPM |>apply(1, mean)gamma_gene_means_SCTransform <- gamma_SCTransform[["SCT"]]@scale.data |>apply(1, mean)gamma_gene_means_log1pPF <- gamma_log1pPF |>apply(1, mean)# Draw five different plots showing the# gene means vs. different variances# Create a data frame for plottinggamma_gene_variances_df <-data.frame(Gene_Variance =c(gamma_gene_variances, gamma_gene_variances_log1pCP10k, gamma_gene_variances_log1pCPM, gamma_gene_variances_SCTransform, gamma_gene_variances_log1pPF),gamma_gene_means =c(gamma_gene_means, gamma_gene_means_log1pCP10k, gamma_gene_means_log1pCPM, gamma_gene_means_SCTransform, gamma_gene_means_log1pPF),Normalization_Method =rep(c("Original", "log1pCP10k","log1pCPM", "SCTransform", "log1pPF"),each =length(gamma_gene_means)))# Plot the gene means vs. variancesgamma_gene_variances_df |>ggplot(aes(x = gamma_gene_means, y = Gene_Variance,color = Normalization_Method)) +geom_point(alpha =0.3) +labs(title ="Gene Means vs. Variances for the Gamma Dataset",x ="Gene Means",y ="Gene Variances") +facet_wrap(~ Normalization_Method, scales ="free") +theme_gray() +theme(legend.position ="bottom") +scale_x_continuous(breaks = scales::pretty_breaks(n =7))
# only plot for SCTransformgamma_gene_variances_df |>filter(Normalization_Method =="SCTransform") |>ggplot(aes(x = gamma_gene_means, y = Gene_Variance,color = Normalization_Method)) +geom_point(alpha =0.3) +labs(title ="Gene Means vs. Variances for the Gamma Dataset",x ="Gene Means",y ="Gene Variances") +theme_gray()+theme(legend.position ="bottom") +scale_x_continuous(breaks = scales::pretty_breaks(n =7))
3)
Use Spearman correlation to evaluate normalization methods as in question 5 for each simulated dataset.
Answer:
calculate the Spearman correlation for the poission dataset