Model Comparison with Microbiome High-Throughput Sequencing Data

Author

Jiachen Ai

Published

September 19, 2024

1 Introduction

This data package contains the information used to run the analyses found in “Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition”.
Measurements are the number of reads annotated for a particular cluster within a given sample followed by filtering. Sequencing was performed on the 454 Flex platform.
Data is stored as an MRexperiment-class object. The count matrix was generated using DNAclust (http://dnaclust.sourceforge.net/). For more details please refer to the paper. Included in the MRexperiment object are the counts, phenotype, and feature information.

2 Microbiome data background

The human microbiome is the collection of all the microorganisms living in association with the human body. These communities consist of a variety of microorganisms including eukaryotes, archaea, bacteria, and viruses.
The human body, consisting of about 10 trillion cells, carries about ten times as many microorganisms in the intestines. The metabolic activities performed by these bacteria resemble those of an organ, leading some to liken gut bacteria to a “forgotten” organ.
High-throughput sequencing technologies have enabled the study of the human microbiome through the use of metagenomic data.
- OTU (Operational Taxonomic Unit) is a cluster of sequences that are at least 97% similar. OTUs are used to classify sequences into groups that are similar to each other.
- OTU counts are used to represent the abundance of each OTU in each sample.
- OTU counts can be aggregated at higher phylogentic levels, e.g., species, genus, etc.
- The higher the level the less zero-inflated the data is.

Features of microbiome data include
- Zero-inflated: many OTUs are not observed in many samples
- Overdispersed: the variance is larger than the mean
- Compositional: the sum of all OTUs in a sample is constant
- High-dimensional: large number of OTUs

3 Goal of this excercise

One goal of microbiome data analysis is to identify the association between microbiome composition and clinical outcomes. In this study, it is case-control status: diarrhea vs non-diarrhea.
Try the techniques we learned so far to analyze the microbiome data, at genus and species levels, to identify the association between microbiome composition and case-control status. Use the following models:
- Quasi-Poisson
- Negative Binomial
- Zero Inflated Poisson
- Zero Inflated Negative Binomial

3.1 Load required packages

# if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install("msd16s")

# BiocManager::install("metagenomeSeq")
suppressMessages(library(metagenomeSeq))
library(msd16s)
library(tidyverse)
library(broom)

Warning: package 'broom' was built under R version 4.3.3

library(pscl)
library(VennDiagram)
library(ggVennDiagram)
library(MASS)

3.2 Load data and process

data(msd16s)
msd16s

MRexperiment (storageMode: environment)
assayData: 26044 features, 992 samples 
  element names: counts 
protocolData: none
phenoData
  sampleNames: 100259 100262 ... 602385 (992 total)
  varLabels: Type Country ... Dysentery (5 total)
  varMetadata: labelDescription
featureData
  featureNames: 54 94 ... 276421 (26044 total)
  fvarLabels: superkingdom phylum ... clusterCenter (10 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

The phenotype information can be accessed with the phenoData and pData methods:

phenoData(msd16s)

An object of class 'AnnotatedDataFrame'
  sampleNames: 100259 100262 ... 602385 (992 total)
  varLabels: Type Country ... Dysentery (5 total)
  varMetadata: labelDescription

pheno_tbl <- rownames_to_column(pData(msd16s), var = "ID") %>% as_tibble()
pheno_tbl

# A tibble: 992 × 6
   ID     Type    Country   Age AgeFactor Dysentery
   <chr>  <chr>   <fct>   <dbl> <fct>         <dbl>
 1 100259 Case    Gambia     14 [12,18)           1
 2 100262 Control Gambia     24 [24,60)           0
 3 100267 Case    Gambia     17 [12,18)           0
 4 100274 Case    Gambia     36 [24,60)           0
 5 100275 Case    Gambia     29 [24,60)           0
 6 100277 Case    Gambia     29 [24,60)           0
 7 100291 Control Gambia     27 [24,60)           0
 8 100292 Control Gambia     30 [24,60)           0
 9 100293 Case    Gambia     20 [18,24)           1
10 100294 Control Gambia     33 [24,60)           0
# ℹ 982 more rows

p_case_control = pheno_tbl %>% 
  count(Type) %>% 
  mutate(prop = n/sum(n))

p_case <- with(p_case_control, prop[Type == "Case"])
p_control <- with(p_case_control, prop[Type == "Control"])

The feature information including cluster representative (e.g., OTUs) sequence can be accessed with the featureData and fData methods:

featureData(msd16s)

An object of class 'AnnotatedDataFrame'
  featureNames: 54 94 ... 276421 (26044 total)
  varLabels: superkingdom phylum ... clusterCenter (10 total)
  varMetadata: labelDescription

features <- fData(msd16s)

counts <- MRcounts(msd16s, norm = TRUE)
dim(counts)

[1] 26044   992

The raw or normalized counts matrix can be accessed with the MRcounts function. We get normalized counts. A normalization method avoids biases due to uneven sequencing depth.

otu_id <- rownames(counts)
counts_tbl <- bind_cols(otu_id = otu_id, counts %>% as_tibble())
counts_tbl

# A tibble: 26,044 × 993
   otu_id `100259` `100262` `100267` `100274` `100275` `100277` `100291`
   <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 54            0        0        0        0        0        0        0
 2 94            0        0        0        0        0        0        0
 3 113           0        0        0        0        0        0        0
 4 117           0        0        0        0        0        0        0
 5 145           0        0        0        0        0        0        0
 6 202           0        0        0        0        0        0        0
 7 217           0        0        0        0        0        0        0
 8 237           0        0        0        0        0        0        0
 9 245           0        0        0        0        0        0        0
10 248           0        0        0        0        0        0        0
# ℹ 26,034 more rows
# ℹ 985 more variables: `100292` <dbl>, `100293` <dbl>, `100294` <dbl>,
#   `100298` <dbl>, `100303` <dbl>, `100317` <dbl>, `100320` <dbl>,
#   `100322` <dbl>, `100341` <dbl>, `100353` <dbl>, `100356` <dbl>,
#   `100361` <dbl>, `100365` <dbl>, `100395` <dbl>, `100401` <dbl>,
#   `100403` <dbl>, `100437` <dbl>, `100457` <dbl>, `100462` <dbl>,
#   `100470` <dbl>, `100476` <dbl>, `100489` <dbl>, `100491` <dbl>, …

3.2.1 Filter data at OTU level

OTU was abundant (≥12 normalized counts per sample) in cases or controls;

control_to_select <- pheno_tbl %>% 
  filter(Type == "Control") %>% 
  dplyr::select(ID) %>% 
  pull()

con_sum <- counts_tbl %>%
  dplyr::select(all_of(control_to_select)) %>% 
  rowSums() 

case_to_select <- pheno_tbl %>% 
  filter(Type == "Case") %>% 
  dplyr::select(ID) %>% 
  pull()

case_sum <- counts_tbl %>%
  dplyr::select(all_of(case_to_select)) %>% 
  rowSums() 

counts_tbl_filt <- bind_cols(counts_tbl, 
                             con_sum = con_sum/length(control_to_select), 
                             case_sum = case_sum/length(case_to_select)) %>%
  filter(con_sum >= 12 | case_sum >= 12)

OTU was prevalent (present in ≥10 cases and controls);

otu_prevalence <- counts_tbl_filt %>%
  dplyr:: select(-otu_id) %>%
  mutate(across(everything(), ~ as.integer(. > 0))) %>%
  rowSums()
  
counts_tbl <- bind_cols(counts_tbl_filt, otu_prevalence = otu_prevalence) %>%
  filter(otu_prevalence >= 10) %>%
  dplyr:: select(-con_sum, -case_sum, -otu_prevalence)

counts_tbl_t <- counts_tbl %>% 
  pivot_longer(cols= -1) %>% 
  pivot_wider(names_from = "otu_id",values_from = "value") %>%
  rename(ID = name) %>% 
  left_join(pheno_tbl, by = "ID")

3.2.2 Filter data at Genus level

Aggregated counts at genus level

genus_counts = aggTax(msd16s, lvl = "genus", out = "matrix", norm = T)
genus_id <- rownames(genus_counts)
genus_counts_tbl <- bind_cols(genus_id = genus_id, genus_counts %>% as_tibble())
genus_counts_tbl

# A tibble: 164 × 993
   genus_id       `100259` `100262` `100267` `100274` `100275` `100277` `100291`
   <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Abiotrophia         0          0     0         0          0     0        1.80
 2 Acetobacter         0          0     0         0          0     0        0   
 3 Acetonema           0          0     0         0          0     0        0   
 4 Acidaminococc…     10.6        0     0        73.9        0    80.7      0   
 5 Acidithiobaci…      0          0   141.        0          0     0        0   
 6 Acidovorax          0          0     0         0          0     0        0   
 7 Acinetobacter       0          0     0         0          0     2.60     0   
 8 Actinobacillus      0          0     0         0          0     0        0   
 9 Actinomyces        19.4        0     1.79      0          0     0        0   
10 Aeromonas           0          0     0         0          0     0        0   
# ℹ 154 more rows
# ℹ 985 more variables: `100292` <dbl>, `100293` <dbl>, `100294` <dbl>,
#   `100298` <dbl>, `100303` <dbl>, `100317` <dbl>, `100320` <dbl>,
#   `100322` <dbl>, `100341` <dbl>, `100353` <dbl>, `100356` <dbl>,
#   `100361` <dbl>, `100365` <dbl>, `100395` <dbl>, `100401` <dbl>,
#   `100403` <dbl>, `100437` <dbl>, `100457` <dbl>, `100462` <dbl>,
#   `100470` <dbl>, `100476` <dbl>, `100489` <dbl>, `100491` <dbl>, …

Genus was abundant (≥12 normalized counts per sample) in cases or controls;

con_mean <- genus_counts_tbl %>%
  dplyr:: select(all_of(control_to_select)) %>% 
  rowMeans() 

case_mean <- genus_counts_tbl %>%
  dplyr:: select(all_of(case_to_select)) %>% 
  rowMeans() 

genus_tbl_filt <- bind_cols(genus_counts_tbl, 
                             con_mean = con_mean, 
                             case_mean = case_mean) %>%
  filter(con_mean >= 12 | case_mean >= 12)

Genus was prevalent (present in ≥10 cases and controls);

otu_prevalence <- genus_tbl_filt %>%
  dplyr:: select(-genus_id) %>%
  mutate(across(everything(), ~ as.integer(. > 0))) %>%
  rowSums()
  
genus_tbl <- bind_cols(genus_tbl_filt, otu_prevalence = otu_prevalence) %>%
  filter(otu_prevalence >= 10) %>%
  dplyr:: select(-con_mean, -case_mean, -otu_prevalence) %>% 
  filter(genus_id != "NA")

rm(genus_tbl_filt)

genusnames = genus_tbl$genus_id 

# save the genus names as txt file
write.table(genusnames, file = "genusnames.txt", 
            row.names = FALSE, col.names = FALSE)

Create a tidy table with pheno information

genus_tbl_t <- genus_tbl %>% 
  pivot_longer(cols= -1) %>% 
  pivot_wider(names_from = "genus_id",values_from = "value") %>%
  rename(ID = name) %>% 
  left_join(pheno_tbl, by = "ID")  # %>%
  #dplyr::select(-`NA`)

write.table(genus_tbl_t, file = "genus_tbl_t.txt", 
            row.names = FALSE, col.names = TRUE)

3.2.3 Filter data at Species level

Aggregated counts at genus level

species_counts = aggTax(msd16s, lvl = "species", out = "matrix", norm = T)
species_id <- rownames(species_counts)
species_counts_tbl <- bind_cols(species_id = species_id, species_counts %>% as_tibble())
species_counts_tbl

# A tibble: 754 × 993
   species_id     `100259` `100262` `100267` `100274` `100275` `100277` `100291`
   <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 [Leptotrichia…      0          0       0       0          0      0       0   
 2 Abiotrophia d…      0          0       0       0          0      0       1.80
 3 Acetobacter i…      0          0       0       0          0      0       0   
 4 Acetobacter s…      0          0       0       0          0      0       0   
 5 Acetonema lon…      0          0       0       0          0      0       0   
 6 Acidaminococc…     10.6        0       0      73.9        0     80.7     0   
 7 Acidaminococc…      0          0       0       0          0      0       0   
 8 Acidithiobaci…      0          0     141.      0          0      0       0   
 9 Acidithiobaci…      0          0       0       0          0      0       0   
10 Acidovorax ca…      0          0       0       0          0      0       0   
# ℹ 744 more rows
# ℹ 985 more variables: `100292` <dbl>, `100293` <dbl>, `100294` <dbl>,
#   `100298` <dbl>, `100303` <dbl>, `100317` <dbl>, `100320` <dbl>,
#   `100322` <dbl>, `100341` <dbl>, `100353` <dbl>, `100356` <dbl>,
#   `100361` <dbl>, `100365` <dbl>, `100395` <dbl>, `100401` <dbl>,
#   `100403` <dbl>, `100437` <dbl>, `100457` <dbl>, `100462` <dbl>,
#   `100470` <dbl>, `100476` <dbl>, `100489` <dbl>, `100491` <dbl>, …

Species was abundant (≥12 normalized counts per sample) in cases or controls;

con_mean <- species_counts_tbl %>%
  dplyr:: select(all_of(control_to_select)) %>% 
  rowMeans() 

case_mean <- species_counts_tbl %>%
  dplyr:: select(all_of(case_to_select)) %>% 
  rowMeans() 

species_tbl_filt <- bind_cols(species_counts_tbl, 
                             con_mean = con_mean, 
                             case_mean = case_mean) %>%
  filter(con_mean >= 12 | case_mean >= 12)

Genus was prevalent (present in ≥10 cases and controls);

otu_prevalence <- species_tbl_filt %>%
  dplyr:: select(-species_id) %>%
  mutate(across(everything(), ~ as.integer(. > 0))) %>%
  rowSums()
  
species_tbl <- bind_cols(species_tbl_filt, otu_prevalence = otu_prevalence) %>%
  filter(otu_prevalence >= 10) %>%
  dplyr:: select(-con_mean, -case_mean, -otu_prevalence)

rm(species_tbl_filt)
speciesnames = species_tbl$species_id[-1*which(species_tbl$species_id == "NA")]
speciesnames |> as_tibble() |> 
  write.table("speciesnames.txt", row.names = FALSE, col.names = FALSE)

Create a tidy table with pheno information

species_tbl_t <- species_tbl %>% 
  pivot_longer(cols= -1) %>% 
  pivot_wider(names_from = "species_id",values_from = "value") %>%
  rename(ID = name) %>% 
  left_join(pheno_tbl, by = "ID") %>%
  dplyr:: select(-`NA`)

# save the results as txt file
write.table(species_tbl_t, 
            file = "species_tbl_t.txt", 
            sep = "\t")

3.3 Differential abundance analysis

3.3.1 Genus level

3.3.1.1 Quasipoisson regression model

genusnames <- read.table("genusnames.txt", header = FALSE)$V1
genus_tbl_t <- read.table("genus_tbl_t.txt", header = TRUE)

results_genus_qpoisson <- 
  map_df(genusnames, function(response) {
    model <- glm(as.formula(paste(response, "~ Type")), 
                 data = genus_tbl_t, 
                 family = quasipoisson)
    tidy(model, conf.int = TRUE) %>%
      # Add a column for the response variable
      mutate(response_variable = response) %>%
      mutate(AIC = AIC(model))
    }) %>% 
  filter(term == "TypeControl") %>% 
  arrange(p.value) 

Quasipoisson_genus <- 
  results_genus_qpoisson %>% filter(p.value < 0.05/41) %>%
  dplyr::select(response_variable, p.value) %>%
  print(n = 20)

# A tibble: 12 × 2
   response_variable  p.value
   <chr>                <dbl>
 1 Escherichia       1.67e-17
 2 Prevotella        6.43e-16
 3 Streptococcus     8.92e-12
 4 Haemophilus       4.33e- 9
 5 Granulicatella    1.78e- 7
 6 Eubacterium       3.26e- 7
 7 Lactobacillus     2.69e- 6
 8 Enterobacter      3.70e- 6
 9 Shigella          9.08e- 6
10 Bacteroides       1.02e- 5
11 Neisseria         2.10e- 4
12 Rothia            8.74e- 4

mean(results_genus_qpoisson$AIC)

[1] NA

3.3.1.2 Negative binomial regression model

results_genus_nb2 <- 
  map_df(genusnames, function(response) {
  #  print(response)
    model <- glm(as.formula(paste(response, "~ Type")), 
                 family = negative.binomial(20),
                 data = genus_tbl_t)
    tidy(model, conf.int = TRUE) %>%
      # Add a column for the response variable
      mutate(response_variable = response) %>%
      mutate(AIC = AIC(model))
    }) %>% 
  filter(term == "TypeControl") %>% 
  arrange(p.value) 

Negative_binomial_genus <-
results_genus_nb2 %>% filter(p.value < 0.05/41) %>%
  dplyr::select(response_variable, p.value) %>%
  print(n = Inf)

# A tibble: 17 × 2
   response_variable  p.value
   <chr>                <dbl>
 1 Escherichia       1.80e-21
 2 Streptococcus     1.19e-16
 3 Haemophilus       4.72e-16
 4 Prevotella        1.22e-15
 5 Granulicatella    3.85e-11
 6 Eubacterium       1.11e- 7
 7 Lactobacillus     1.72e- 7
 8 Enterobacter      1.96e- 6
 9 Shigella          3.67e- 6
10 Bacteroides       4.00e- 6
11 Neisseria         8.11e- 6
12 Blautia           1.03e- 4
13 Weissella         1.51e- 4
14 Campylobacter     4.08e- 4
15 Rothia            5.75e- 4
16 Acinetobacter     1.09e- 3
17 Sarcina           1.11e- 3

mean(results_genus_nb2$AIC)

[1] 55640.53

3.3.1.3 Zero-inflated Poisson model

results_genus_zip <- 
  map_df(genusnames, function(response) {
    model <- zeroinfl(as.formula(paste(response, "~ Type")), 
                 data = genus_tbl_t |>
                 mutate(across(all_of(genusnames), ~ floor(.))),
                 dist = "poisson")
    summary(model)$coefficients |>
      as.data.frame() |>
      filter(row_number() == 2) |>
      rename(count.p.value = `count.Pr...z..`, 
             zero.p.value = `zero.Pr...z..`) |>
      # Add a column for the response variable
      mutate(response_variable = response) |>
      mutate(AIC = AIC(model))
    }) %>% 
  arrange(count.p.value, zero.p.value)

Zero_inflated_poisson_genus <-
results_genus_zip %>% 
  filter(count.p.value < 0.05/41
         & zero.p.value < 0.05/41) %>%
  as_tibble() %>%
  dplyr::select(response_variable, 
                count.p.value, 
                zero.p.value) %>%
  print(n = Inf)

# A tibble: 22 × 3
   response_variable count.p.value zero.p.value
   <chr>                     <dbl>        <dbl>
 1 Clostridium           0             1.09e-20
 2 Bacteroides           0             8.91e-12
 3 Eubacterium           0             1.35e-11
 4 Blautia               0             1.96e-11
 5 Granulicatella        0             4.83e- 9
 6 Collinsella           0             4.90e- 9
 7 Acinetobacter         0             5.69e- 9
 8 Campylobacter         0             2.29e- 7
 9 Catenibacterium       0             1.21e- 4
10 Neisseria             0             7.92e- 4
11 Streptococcus         0             1.07e- 3
12 Faecalibacterium      1.72e-303     1.40e- 7
13 Shigella              3.12e-283     1.18e- 9
14 Enterobacter          2.55e-251     5.57e-10
15 Dorea                 1.17e-195     5.08e- 7
16 Roseburia             1.32e-191     2.32e- 8
17 Coriobacterium        8.42e-119     3.73e- 6
18 Citrobacter           1.76e- 99     8.41e- 9
19 Coprococcus           4.42e- 88     2.81e- 6
20 Rothia                2.90e- 85     9.29e-12
21 Ruminococcus          3.17e- 32     5.43e-15
22 Parabacteroides       4.97e- 16     1.04e- 7

mean(results_genus_zip$AIC)

[1] 681531.1

3.3.1.4 Zero-inflated negative binomial regression model

results_genus_zinb <- 
  map_df(genusnames, function(response) {
    model <- zeroinfl(as.formula(paste(response, "~ Type")), 
                 data = genus_tbl_t |>
                 mutate(across(all_of(genusnames), ~ floor(.))),
                 dist = "negbin")
    cbind(summary(model)$coefficients$count[2,4],
          summary(model)$coefficients$zero[2,4]) %>%
      as_tibble() %>%
      rename(count.p.value = V1,
             zero.p.value = V2)  |>
      # Add a column for the response variable
      mutate(response_variable = response) |>
      mutate(AIC = AIC(model))
    }) %>% 
  arrange(count.p.value, zero.p.value)

Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.

Warning in sqrt(diag(vc)[np]): NaNs produced

Warning in sqrt(diag(object$vcov)): NaNs produced
Warning in sqrt(diag(object$vcov)): NaNs produced
Warning in sqrt(diag(object$vcov)): NaNs produced
Warning in sqrt(diag(object$vcov)): NaNs produced

Zero_inflated_negative_binomial_genus <-
results_genus_zinb %>%
  filter(count.p.value < 0.05/41
         & zero.p.value < 0.05/41) %>%
  as_tibble() %>%
  dplyr::select(response_variable, 
                count.p.value, 
                zero.p.value) %>%
  print(n = Inf)

# A tibble: 3 × 3
  response_variable count.p.value zero.p.value
  <chr>                     <dbl>        <dbl>
1 Shigella               5.06e-10 0.000400    
2 Eubacterium            1.40e- 5 0.0000000501
3 Enterobacter           9.65e- 4 0.0000257

mean(results_genus_zinb$AIC)

[1] 6788.241

3.3.1.5 Comparison

Compare the results from these models and discuss the pros and cons of each model.
- Create a table to summarize the significant microbiome clusters from these models.

tibble::tibble(
  Model = c("Quasi-Poisson", "Negative Binomial", "Zero Inflated Poisson", "Zero Inflated Negative Binomial"),
  Significant = c(12, 17, 22, 3)
) |>
  knitr::kable()

Model	Significant
Quasi-Poisson	12
Negative Binomial	17
Zero Inflated Poisson	22
Zero Inflated Negative Binomial	3

Create a Venn Diagram to summarize the significant microbiome clusters overlaps from these models.

# create a list of significant genera from each model
significant_genera <- list(
  Quasi_Poisson = 
    Quasipoisson_genus$response_variable,
  Negative_Binomial = 
    Negative_binomial_genus$response_variable,
  Zero_Inflated_Poisson = 
    Zero_inflated_poisson_genus$response_variable,
  Zero_Inflated_Negative_Binomial = 
    Zero_inflated_negative_binomial_genus$response_variable
)

# create a venn diagram
ggVennDiagram(significant_genera, set_size = 3.5, edge_size = 0.5) +
  scale_fill_gradient(low = "grey90", high = "red") +
  labs(title = "Significant Genera from Different Models")

3.3.2 Species level

3.3.2.1 Quasipoisson regression model

species_tbl_t <- read.table("species_tbl_t.txt", header = TRUE) |>
  as_tibble()

results_species_qpoisson <- 
  map_df(names(species_tbl_t)[2:102], function(response) {
    model <- glm(as.formula(paste("`", response, "`", "~ Type", sep = "")), 
                 data = species_tbl_t, 
                 family = quasipoisson)
    tidy(model, conf.int = TRUE) %>%
      # Add a column for the response variable
      mutate(response_variable = response) |>
      mutate(AIC = AIC(model))
    }) %>% 
  filter(term == "TypeControl") %>% 
  arrange(p.value) 

Quasipoisson_species <-
results_species_qpoisson %>% 
  filter(p.value < 0.05/101) %>%
  dplyr::select(response_variable, p.value) %>%
  print(n = 20)

# A tibble: 20 × 2
   response_variable                    p.value
   <chr>                                  <dbl>
 1 Escherichia.coli                    1.71e-17
 2 Prevotella.copri                    1.89e-11
 3 Prevotella.sp..DJF_B112             7.02e-11
 4 Prevotella.sp..DJF_RP53             5.24e- 9
 5 Prevotella.sp..oral.clone.BP1.28    4.53e- 8
 6 Haemophilus.parainfluenzae          3.03e- 7
 7 Acinetobacter.sp..SF6               3.06e- 7
 8 Enterobacter.cloacae                4.05e- 7
 9 Prevotella.sp..BI.42                1.19e- 6
10 Streptococcus.sp..oral.clone.ASCC04 1.30e- 6
11 Streptococcus.peroris               7.89e- 6
12 Streptococcus.mitis                 9.02e- 6
13 Streptococcus.sp..C101              1.78e- 5
14 Streptococcus.equinus               8.94e- 5
15 Streptococcus.salivarius            9.86e- 5
16 Streptococcus.sp..oral.clone.ASCE09 1.94e- 4
17 Prevotella.sp..DJF_B116             2.08e- 4
18 Bacteroides.fragilis                2.59e- 4
19 Megasphaera.elsdenii                3.05e- 4
20 Eubacterium.rectale                 4.56e- 4

mean(results_species_qpoisson$AIC)

[1] NA

3.3.2.2 Negative binomial regression model

speciesnames <- names(species_tbl_t)[2:102]

results_species_nb2 <- 
  map_df(speciesnames, function(response) {
    # print(response)
    model <- glm(as.formula(paste("`", response, "`", "~ Type", sep = "")), 
                 family = negative.binomial(20),
                 data = species_tbl_t)
    tidy(model, conf.int = TRUE) %>%
      # Add a column for the response variable
      mutate(response_variable = response) |>
      mutate(AIC = AIC(model))
    }) %>% 
  filter(term == "TypeControl") %>% 
  arrange(p.value)

Negative_binomial_species <-
results_species_nb2 %>% 
  filter(p.value < 0.05/101) %>%
  dplyr::select(response_variable, p.value) %>%
  print(n = 20)

# A tibble: 27 × 2
   response_variable                    p.value
   <chr>                                  <dbl>
 1 Escherichia.coli                    1.77e-21
 2 Prevotella.copri                    2.66e-11
 3 Haemophilus.parainfluenzae          5.26e-11
 4 Prevotella.sp..DJF_B112             1.18e-10
 5 Prevotella.sp..DJF_RP53             1.29e- 8
 6 Streptococcus.mitis                 3.19e- 8
 7 Prevotella.sp..oral.clone.BP1.28    3.34e- 8
 8 Enterobacter.cloacae                5.74e- 8
 9 Acinetobacter.sp..SF6               1.26e- 7
10 Streptococcus.pasteurianus          3.27e- 7
11 Streptococcus.sp..oral.clone.ASCC04 3.82e- 7
12 Streptococcus.equinus               8.12e- 7
13 Streptococcus.peroris               2.19e- 6
14 Streptococcus.sp..C101              2.25e- 6
15 Prevotella.sp..BI.42                2.54e- 6
16 Streptococcus.salivarius            4.13e- 6
17 Streptococcus.sp..oral.clone.ASCE09 4.67e- 6
18 Haemophilus.haemolyticus            2.64e- 5
19 Lactobacillus.sp..KLDS.1.0713       5.56e- 5
20 Megasphaera.elsdenii                7.93e- 5
# ℹ 7 more rows

mean(results_species_nb2$AIC)

[1] 50974.4

3.3.2.3 Zero-inflated poisson regression model

results_species_zip <- 
  map_df(speciesnames, function(response) {
    model <- zeroinfl(as.formula(paste("`", response, "`", "~ Type", 
                                       sep = "")), 
                 data = species_tbl_t |>
                 mutate(across(all_of(speciesnames), ~ floor(.))),
                 dist = "poisson")
    summary(model)$coefficients |>
      as.data.frame() |>
      filter(row_number() == 2) |>
      rename(count.p.value = `count.Pr...z..`, 
             zero.p.value = `zero.Pr...z..`) |>
      # Add a column for the response variable
      mutate(response_variable = response) |>
      mutate(AIC = AIC(model))
    }) %>% 
  arrange(count.p.value, zero.p.value)

Zero_inflated_poisson_species <-
results_species_zip %>%
  filter(count.p.value <= 0.05/101
         & zero.p.value <= 0.05/101) %>%
  as_tibble() %>%
  dplyr::select(response_variable, 
                count.p.value, 
                zero.p.value) %>%
  print(n = Inf)

# A tibble: 48 × 3
   response_variable                   count.p.value zero.p.value
   <chr>                                       <dbl>        <dbl>
 1 Ruminococcus.gnavus                     0             1.06e-16
 2 Streptococcus.mitis                     0             3.17e-15
 3 Haemophilus.haemolyticus                0             5.87e-15
 4 Streptococcus.sp..oral.clone.ASCE09     0             2.12e-13
 5 Blautia.wexlerae                        0             7.20e-11
 6 Collinsella.sp..CB20                    0             7.26e-10
 7 Prevotella.sp..DJF_RP53                 0             6.54e- 9
 8 Citrobacter.freundii                    0             1.06e- 7
 9 Prevotella.sp..BI.42                    0             2.65e- 7
10 Bacteroides.fragilis                    0             2.97e- 7
11 Prevotella.copri                        0             6.84e- 7
12 Enterococcus.sp..L2                     0             2.67e- 6
13 Campylobacter.jejuni                    0             5.67e- 6
14 Collinsella.aerofaciens                 0             6.73e- 6
15 Bacteroides.vulgatus                    0             2.34e- 5
16 Ruminococcus.lactaris                   0             2.99e- 5
17 Eubacterium.hallii                      0             6.85e- 5
18 Catenibacterium.mitsuokai               0             1.21e- 4
19 Prevotella.sp..oral.clone.BP1.28        0             1.95e- 4
20 Campylobacter.upsaliensis               0             2.00e- 4
21 Prevotella.sp..DJF_B112                 0             2.16e- 4
22 Clostridium.difficile                   0             2.49e- 4
23 Lactobacillus.sp..KLDS.1.0713           0             3.02e- 4
24 Streptococcus.sp..C151                  0             3.72e- 4
25 Megasphaera.sp..TrE9262                 2.30e-283     2.85e- 5
26 Eubacterium.rectale                     6.93e-256     3.19e- 7
27 Acinetobacter.sp..SF6                   6.45e-229     3.82e- 8
28 Enterobacter.cloacae                    6.96e-213     4.19e-11
29 Faecalibacterium.prausnitzii            4.03e-207     5.21e- 9
30 Dorea.longicatena                       1.17e-195     5.08e- 7
31 Streptococcus.parasanguinis             4.24e-170     2.44e- 6
32 Clostridium.lituseburense               8.78e-170     4.85e-20
33 Streptococcus.sp..C101                  2.80e-142     2.11e-16
34 Clostridium.hathewayi                   6.53e-134     1.23e- 8
35 Escherichia.sp..oral.clone.3RH.30       6.89e-117     3.19e- 9
36 Clostridium.paraputrificum              1.95e-105     2.40e-16
37 Streptococcus.peroris                   4.95e- 88     4.21e-11
38 Rothia.mucilaginosa                     7.77e- 85     4.12e-11
39 Fusobacterium.sp..CO6                   1.61e- 75     1.87e- 6
40 Streptococcus.sp..oral.clone.ASCC04     6.89e- 66     4.23e-12
41 Coprococcus.eutactus                    8.43e- 57     1.69e- 5
42 Ruminococcus.sp..M76                    1.06e- 30     1.17e- 4
43 Faecalibacterium.sp..DJF_VR20           9.60e- 27     2.36e-11
44 Bacteroides.uniformis                   6.28e- 23     5.18e- 5
45 Parabacteroides.distasonis              8.57e- 16     9.06e- 6
46 Clostridium.neonatale                   3.55e- 15     1.50e- 4
47 Bacteroides.ovatus                      4.79e- 12     1.00e- 6
48 Collinsella.sp..HA6                     2.46e-  5     2.91e- 4

mean(results_species_zip$AIC)

[1] 299218.2

3.3.2.4 Zero-inflated negative binomial regression

results_species_zinb <- 
  map_df(speciesnames, function(response) {
    model <- zeroinfl(as.formula(paste("`", response, "`", "~ Type", 
                                       sep = "")), 
                 data = species_tbl_t |>
                 mutate(across(all_of(speciesnames), ~ floor(.))),
                 dist = "negbin")
    cbind(summary(model)$coefficients$count[2,4],
          summary(model)$coefficients$zero[2,4]) %>%
      as_tibble() %>%
      rename(count.p.value = V1,
             zero.p.value = V2)  |>
      # Add a column for the response variable
      mutate(response_variable = response) |>
      mutate(AIC = AIC(model))
    }) %>% 
  arrange(count.p.value, zero.p.value)

Warning in sqrt(diag(vc)[np]): NaNs produced

Warning in sqrt(diag(object$vcov)): NaNs produced
Warning in sqrt(diag(object$vcov)): NaNs produced
Warning in sqrt(diag(object$vcov)): NaNs produced
Warning in sqrt(diag(object$vcov)): NaNs produced

Zero_inflated_negative_binomial_species <-
results_species_zinb %>%
  filter(count.p.value < 0.05/101
         & zero.p.value < 0.05/101) %>%
  as_tibble() %>%
  dplyr::select(response_variable, 
                count.p.value, 
                zero.p.value) %>%
  print(n = Inf)

# A tibble: 6 × 3
  response_variable                 count.p.value zero.p.value
  <chr>                                     <dbl>        <dbl>
1 Prevotella.copri                    0.000000141 0.00000973  
2 Acinetobacter.sp..SF6               0.00000129  0.00000875  
3 Enterobacter.cloacae                0.00000184  0.0000000505
4 Prevotella.sp..DJF_RP53             0.00000449  0.000000743 
5 Prevotella.sp..BI.42                0.0000428   0.0000173   
6 Escherichia.sp..oral.clone.3RH.30   0.000381    0.00000456

mean(results_species_zinb$AIC)

[1] 4761.426

3.3.2.5 Comparison

Compare the results from these models and discuss the pros and cons of each model.
- Create a table to summarize the significant microbiome clusters from these models.

tibble::tibble(
  Model = c("Quasi-Poisson", "Negative Binomial", "Zero Inflated Poisson", "Zero Inflated Negative Binomial"),
  Significant = c(20, 27, 48, 6)
) |>
  knitr::kable(caption = "Significant Species from Different Models")

Significant Species from Different Models
Model	Significant
Quasi-Poisson	20
Negative Binomial	27
Zero Inflated Poisson	48
Zero Inflated Negative Binomial	6

Create a Venn Diagram to summarize the significant microbiome clusters overlaps from these models.

# create a list of significant genera from each model
significant_species <- list(
  Quasi_Poisson = 
    Quasipoisson_species$response_variable,
  Negative_Binomial = 
    Negative_binomial_species$response_variable,
  Zero_Inflated_Poisson = 
    Zero_inflated_poisson_species$response_variable,
  Zero_Inflated_Negative_Binomial = 
    Zero_inflated_negative_binomial_species$response_variable
)

# create a venn diagram
ggVennDiagram(significant_species, set_size = 3.5, edge_size = 0.5) +
  scale_fill_gradient(low = "grey90", high = "red") +
  labs(title = "Significant Species Overlaps")

3.3.3 Discussion

Quasi-Poisson model:
- Pros:
  - Handles overdispersion: Quasi-Poisson models can account for overdispersion (variance greater than the mean) by introducing a dispersion parameter, which makes it more flexible than the standard Poisson model.
  - The quasi-likelihood approach does not require specifying a full probability distribution, only the mean and variance relationship.
- Cons:
  - The quasi-likelihood approach does not provide a full probability distribution, which can limit the interpretation of the model.
  - The quasi-likelihood approach can be less efficient than the full likelihood approach, leading to less precise estimates.
  - While it handles overdispersion, it might not perform as well as other models in cases of severe overdispersion or excess zeros.
Negative Binomial model:
- Pros:
  - Handles overdispersion and excess zeros: The negative binomial model can account for both overdispersion and excess zeros, making it a versatile choice for count data with these characteristics.
  - Provides a full probability distribution: The negative binomial model provides a full probability distribution, allowing for more detailed interpretation of the model.
  - More efficient than the quasi-Poisson model: The negative binomial model is more efficient than the quasi-Poisson model in cases of overdispersion, leading to more precise estimates.
- Cons:
  - Can be computationally intensive: The negative binomial model can be computationally intensive, especially for large datasets or complex models.
  - Requires specifying a full probability distribution: The negative binomial model requires specifying a full probability distribution, which can be more complex than the quasi-likelihood approach.
Zero-Inflated Poisson model:
- Pros:
  - Handles excess zeros: The zero-inflated Poisson model can account for excess zeros in the data, making it suitable for count data with many zero values.
  - Provides a full probability distribution: The zero-inflated Poisson model provides a full probability distribution, allowing for detailed interpretation of the model.
  - More efficient than the quasi-Poisson model: The zero-inflated Poisson model is more efficient than the quasi-Poisson model in cases of excess zeros, leading to more precise estimates.
- Cons:
  - Can be computationally intensive: The zero-inflated Poisson model can be computationally intensive, especially for large datasets or complex models.
  - Requires specifying a full probability distribution: The zero-inflated Poisson model requires specifying a full probability distribution, which can be more complex than the quasi-likelihood approach.
Zero-Inflated Negative Binomial model:
- Pros:
  - Handles overdispersion and excess zeros: The zero-inflated negative binomial model can account for both overdispersion and excess zeros, making it a versatile choice for count data with these characteristics.
  - Provides a full probability distribution: The zero-inflated negative binomial model provides a full probability distribution, allowing for detailed interpretation of the model.
  - More efficient than the quasi-Poisson model: The zero-inflated negative binomial model is more efficient than the quasi-Poisson model in cases of overdispersion and excess zeros, leading to more precise estimates.
- Cons:
  - Can be computationally intensive: The zero-inflated negative binomial model can be computationally intensive, especially for large datasets or complex models.
  - Requires specifying a full probability distribution: The zero-inflated negative binomial model requires specifying a full probability distribution, which can be more complex than the quasi-likelihood approach.

3.4 Evaluate Type I Error rate

Evaluate Type I Error rate

This section evaluates the Type I error rate of the models by simulating data under the null hypothesis of no association between the microbiome and the outcome variable.
We simulate binary labels 100 times using a Bernoulli distribution with a probability of 0.512 (estimated from the data) for each sample.
We evaluate the association between each genus and the outcome variable using different regression models, document the proportion of significant associations using threshold of 0.05 over 100 replicates (e.g., empirical type I error).
We repeat the same process for the species level data.
If the empirical type I error rate is close to the nominal type I error rate (0.05), then the model is considered to have good control over the Type I error rate.
If the empirical type I error rate is much lower than the nominal type I error rate, then the model is considered to be conservative.
If the empirical type I error rate is much higher than the nominal type I error rate, then the model is considered to be inflated (i.e., false positive rate is higher than expected).

3.4.1 Genus level

3.4.1.1 Quasipoisson regression model

# Set the number of columns 
num_columns <- 100
num_rows <- dim(genus_tbl_t)[1] 
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(genusname) {
  sim_tbl <- tibble(
    y = genus_tbl_t %>% dplyr::select(all_of(genusname)) %>% pull,
    as_tibble(
    matrix(sample(0:1, num_columns * num_rows, replace=T, prob=c(p_control, p_case)), 
           nrow = num_rows, ncol = num_columns, 
         dimnames = list(NULL, paste0("x", 1:num_columns)))))

  results <- map_df(names(sim_tbl)[-1], 
                  ~ tidy(glm(reformulate(.x, response = "y"), 
                             data = sim_tbl,                              
                             family = quasipoisson)), 
                  .id = "variable") %>%
              filter(term != "(Intercept)")
  return(mean(results$p.value < 0.05))
}

type1e_tbl = tibble(genus_name = genusnames, 
       etype_1_error = map_dbl(genusnames, type1e)) 
median(type1e_tbl$etype_1_error)
IQR(type1e_tbl$etype_1_error)

3.4.1.2 Negative binomial regression model

# Set the number of columns 
num_columns <- 100
num_rows <- dim(genus_tbl_t)[1] 
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(genusname) {
  sim_tbl <- tibble(
    y = genus_tbl_t %>% dplyr::select(all_of(genusname)) %>% pull,
    as_tibble(
    matrix(sample(0:1, num_columns * num_rows, replace=T, prob=c(p_control, p_case)), 
           nrow = num_rows, ncol = num_columns, 
         dimnames = list(NULL, paste0("x", 1:num_columns)))))

  results <- map_df(names(sim_tbl)[-1], 
                  ~ tidy(glm(reformulate(.x, response = "y"), 
                             data = sim_tbl, 
                             family = negative.binomial(20),
                             control = glm.control(maxit = 100))), 
                  .id = "variable") %>%
              filter(term != "(Intercept)")
  return(mean(results$p.value < 0.05))
}

type1e_tbl = tibble(genus_name = genusnames, 
       etype_1_error = map_dbl(genusnames, type1e)) 
median(type1e_tbl$etype_1_error)
# calculate the IQR for the median
IQR(type1e_tbl$etype_1_error)

3.4.1.3 Zero-inflated Poisson regression model

# Set the number of columns
num_columns <- 100
num_rows <- dim(genus_tbl_t)[1]
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(genusname) {
  sim_tbl <- tibble(
    y = genus_tbl_t %>% dplyr::select(all_of(genusname)) %>% pull,
    as_tibble(
      matrix(sample(0:1, num_columns * num_rows, replace = T, 
                    prob = c(p_control, p_case)),
             nrow = num_rows, ncol = num_columns,
             dimnames = list(NULL, paste0("x", 1:num_columns)))))

  results <- map_df(names(sim_tbl)[-1], 
                    ~ {
                      model <- zeroinfl(reformulate(.x, response = "y"), 
                                        data = sim_tbl %>%
                                          mutate(across(where(is.numeric), 
                                                        ~ floor(.))),
                                        dist = "poisson")
                      coef_table <- coef(summary(model))
                      tidy_results <- as.data.frame(coef_table)
                      filter(tidy_results, row_number() == 2) 
                    },
                    .id = "variable")
  return(mean(results$zero.Pr...z.. < 0.05))
}

if (file.exists("zero_inflated_poisson_type1e.RDS")) {
  type1e_tbl <- readRDS("zero_inflated_poisson_type1e.RDS")
} else {
  type1e_tbl = tibble(genus_name = genusnames, 
                      etype_1_error = map_dbl(genusnames, type1e))
  saveRDS(type1e_tbl, "zero_inflated_poisson_type1e.RDS")
}
median(type1e_tbl$etype_1_error)
IQR(type1e_tbl$etype_1_error)

3.4.1.4 Zero-inflated negative binomial regression

# Set the number of columns
num_columns <- 100
num_rows <- dim(genus_tbl_t)[1]
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(genusname) {
  sim_tbl <- tibble(
    y = genus_tbl_t %>% dplyr::select(all_of(genusname)) %>% pull,
    as_tibble(
      matrix(sample(0:1, num_columns * num_rows, replace = T, 
                    prob = c(p_control, p_case)),
             nrow = num_rows, ncol = num_columns,
             dimnames = list(NULL, paste0("x", 1:num_columns)))))

  results <- map_df(names(sim_tbl)[-1], 
                    ~ {
                      model <- zeroinfl(reformulate(.x, response = "y"), 
                                        data = sim_tbl %>%
                                          mutate(across(where(is.numeric), 
                                                        ~ floor(.))),
                                        dist = "negbin")
                      cbind(summary(model)$coefficients$count[2,4],
                            summary(model)$coefficients$zero[2,4]) %>%
                        as_tibble() %>%
                        rename(count.p.value = V1,
                               zero.p.value = V2)
                    },
                    .id = "variable")
  return(mean(results$zero.p.value < 0.05))
}

if (file.exists("zero_inflated_negbin_type1e.RDS")) {
  type1e_tbl <- readRDS("zero_inflated_negbin_type1e.RDS")
} else {
  type1e_tbl = tibble(genus_name = genusnames, 
                      etype_1_error = map_dbl(genusnames, type1e))
  saveRDS(type1e_tbl, "zero_inflated_negbin_type1e.RDS")
}
median(type1e_tbl |> pull(etype_1_error) |> na.omit())
IQR(type1e_tbl |> pull(etype_1_error) |> na.omit())

3.4.2 Species level

3.4.2.1 Quasipoisson regression model

# Set the number of columns 
num_columns <- 100
num_rows <- 992 
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(speciesnames) {
  sim_tbl <- tibble(
    y = species_tbl_t %>% dplyr::select(all_of(speciesnames)) %>% pull,
    as_tibble(
    matrix(sample(0:1, num_columns * num_rows, replace=T, prob=c(p_control, p_case)), 
          nrow = num_rows, ncol = num_columns, 
         dimnames = list(NULL, paste0("x", 1:num_columns)))))

  results <- map_df(names(sim_tbl)[-1], 
                  ~ tidy(glm(reformulate(.x, response = "y"), 
                             data = sim_tbl), family = quasipoisson), 
                  .id = "variable") %>%
              filter(term != "(Intercept)")
  return(mean(results$p.value < 0.05))
}

type1e_tbl = tibble(species_name = names(species_tbl_t)[2:102], 
       etype_1_error = map_dbl(names(species_tbl_t)[2:102], type1e))
median(type1e_tbl$etype_1_error)

[1] 0.03

IQR(type1e_tbl$etype_1_error)

[1] 0.04

3.4.2.2 Negative binomial regression model

# Set the number of columns
num_columns <- 100
num_rows <- 992


# Generate the tibble with 100 columns of random normal values
type1e <- function(speciesname) {
  sim_tbl <- tibble(
    y = species_tbl_t %>% 
      dplyr::select(all_of(speciesname)) %>% pull, 
    as_tibble(
    matrix(sample(0:1, num_columns * num_rows, replace=T, 
                  prob=c(p_control, p_case)),
           nrow = num_rows, ncol = num_columns, 
           dimnames = list(NULL, paste0("x", 1:num_columns)))))

    results <- map_df(names(sim_tbl)[-1], 
                 ~ tidy(glm(reformulate(.x, response = "y"), 
                             data = sim_tbl, 
                             family = negative.binomial(30),
                             control = glm.control(maxit = 100))), 
                  .id = "variable") %>%
              filter(term != "(Intercept)")
  return(mean(results$p.value < 0.05))
}

type1e_tbl = tibble(
       species_name = speciesnames, 
       etype_1_error = map_dbl(speciesnames, type1e))
median(type1e_tbl$etype_1_error)

[1] 0.07

IQR(type1e_tbl$etype_1_error)

[1] 0.05

3.4.2.3 Zero-inflated poisson regression model

# Set the number of columns
num_columns <- 100
num_rows <- 992
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(speciesname) {
  sim_tbl <- tibble(
    y = species_tbl_t %>% 
      dplyr::select(all_of(speciesname)) %>% pull, 
    as_tibble(
    matrix(sample(0:1, num_columns * num_rows, replace=T, 
                  prob=c(p_control, p_case)),
           nrow = num_rows, ncol = num_columns, 
           dimnames = list(NULL, paste0("x", 1:num_columns)))))

    results <- map_df(names(sim_tbl)[-1], 
                 ~ {
                   model <- zeroinfl(reformulate(.x, response = "y"), 
                                     data = sim_tbl %>%
                                       mutate(across(where(is.numeric), 
                                                     ~ floor(.))),
                                     dist = "poisson")
                   cbind(summary(model)$coefficients$count[2,4],
                         summary(model)$coefficients$zero[2,4]) %>%
                     as_tibble() %>%
                     rename(count.p.value = V1,
                            zero.p.value = V2)
                 },
                  .id = "variable")
  return(mean(results$zero.p.value < 0.05))
}

if (file.exists("zero_inflated_poisson_species_type1e.RDS")) {
  type1e_tbl <- readRDS("zero_inflated_poisson_species_type1e.RDS")
} else {
  type1e_tbl = tibble(species_name = speciesnames, 
                      etype_1_error = map_dbl(speciesnames, type1e))
  saveRDS(type1e_tbl, "zero_inflated_poisson_species_type1e.RDS")
}

median(type1e_tbl$etype_1_error, na.rm = T)

[1] 0.05

IQR(type1e_tbl$etype_1_error, na.rm = T)

[1] 0.02

3.4.2.4 Zero-inflated negative binomial regression model

# Set the number of columns
num_columns <- 100
num_rows <- 992
set.seed(10)

# Generate the tibble with 100 columns of random normal values
type1e <- function(speciesname) {
  sim_tbl <- tibble(
    y = species_tbl_t %>% 
      dplyr::select(all_of(speciesname)) %>% pull, 
    as_tibble(
    matrix(sample(0:1, num_columns * num_rows, replace=T, 
                  prob=c(p_control, p_case)),
           nrow = num_rows, ncol = num_columns, 
           dimnames = list(NULL, paste0("x", 1:num_columns)))))

    results <- map_df(names(sim_tbl)[-1], 
                 ~ {
                   model <- zeroinfl(reformulate(.x, response = "y"), 
                                     data = sim_tbl %>%
                                       mutate(across(where(is.numeric), 
                                                     ~ floor(.))),
                                     dist = "negbin")
                   cbind(summary(model)$coefficients$count[2,4],
                         summary(model)$coefficients$zero[2,4]) %>%
                     as_tibble() %>%
                     rename(count.p.value = V1,
                            zero.p.value = V2)
                 },
                  .id = "variable")
  return(mean(results$zero.p.value < 0.05))
}

if (file.exists("zero_inflated_negbin_species_type1e.RDS")) {
  type1e_tbl <- readRDS("zero_inflated_negbin_species_type1e.RDS")
} else {
  type1e_tbl = tibble(species_name = speciesnames, 
                      etype_1_error = map_dbl(speciesnames, type1e))
  saveRDS(type1e_tbl, "zero_inflated_negbin_species_type1e.RDS")
}

median(type1e_tbl |> pull(etype_1_error) |> na.omit())

[1] 0.04

IQR(type1e_tbl |> pull(etype_1_error) |> na.omit())

[1] 0.045

sum(is.na(type1e_tbl$etype_1_error))

[1] 70

3.5 Comments

In part 1, we have fitted four models: Quasi-Poisson, Negative Binomial, Zero Inflated Poisson, and Zero Inflated Negative Binomial. The AIC values for these models are as follows:

tibble::tibble(
  Model = c("Quasi-Poisson", "Negative Binomial", "Zero Inflated Poisson", "Zero Inflated Negative Binomial"),
  Genus.AIC = c("NA", "55640.53", "681531.1", "6788.241"),
  Species.AIC = c("NA", "50974.4", "299218.2", "4761.426")
) |> knitr::kable()

Model	Genus.AIC	Species.AIC
Quasi-Poisson	NA	NA
Negative Binomial	55640.53	50974.4
Zero Inflated Poisson	681531.1	299218.2
Zero Inflated Negative Binomial	6788.241	4761.426

In part 2, we have calculated the type I error rate for the four models. The median and IQR of the type I error rate for the four models are as follows:

tibble::tibble(
  Method = c("Quasi-Poisson", "Negative Binomial", "Zero Inflated Poisson", "Zero Inflated Negative Binomial"),
  Genus = c("0.03 (0.04)" , "0.07 (0.05)", "0.04 (0.03)", "0.01 (0.04)"),
  Species = c("0.03 (0.04)", "0.07 (0.05)", "0.05 (0.02)", "0.04 (0.045)")
) |> knitr::kable()

Method	Genus	Species
Quasi-Poisson	0.03 (0.04)	0.03 (0.04)
Negative Binomial	0.07 (0.05)	0.07 (0.05)
Zero Inflated Poisson	0.04 (0.03)	0.05 (0.02)
Zero Inflated Negative Binomial	0.01 (0.04)	0.04 (0.045)

From the above results, we can see that the Zero Inflated Negative Binomial model has the lowest AIC value for both Genus and Species data. The Zero Inflated Negative Binomial model also relatively performs better in terms of the type I error rate. Therefore, Zero Inflated Negative Binomial model seems to be good model for the given data. However, zero-inflated negative binomial regression models are computationally expensive and this model might fail to converge for some species, leading to NA values in the coefficients or p-values.

For a reliable Type I error rate and fewer NA issues, the Zero Inflated Poisson model is a safer choice. The Zero Inflated Poisson model has a higher AIC value compared to the Zero Inflated Negative Binomial model, but it is more robust and less computationally expensive. Therefore, the Zero Inflated Poisson model is a good alternative to the Zero Inflated Negative Binomial model. Given the importance of reliability in published results, I would recommend the Zero Inflated Poisson model for the given data due to its balance between fit (though not the best AIC) and robustness in Type I error rate estimation.