Biostat 203B Homework 5

Due Mar 22 @ 11:59PM

Author

Jiachen Ai (UID: 206182615)

Predicting ICU duration

Using the ICU cohort mimiciv_icu_cohort.rds you built in Homework 4, develop at least three machine learning approaches (logistic regression with enet regularization, random forest, boosting, SVM, MLP, etc.) plus a model stacking approach for predicting whether a patient’s ICU stay will be longer than 2 days. You should use the los_long variable as the outcome. Your algorithms can use patient demographic information (gender, age at ICU intime, marital status, race), ICU admission information (first care unit), the last lab measurements before the ICU stay, and first vital measurements during ICU stay as features. You are welcome to use any feature engineering techniques you think are appropriate; but make sure not to use features that are not available at an ICU stay’s intime. For instance, last_careunit cannot be used in your algorithms.

Answer

  • Data preprocessing and feature engineering.
# Load libraries
library(gtsummary)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(haven)
library(recipes)
library(GGally)
library(ranger)
library(xgboost)
library(stacks)
library(yardstick)
library(purrr)
library(vip)

First, load the data from Homework 4 and check for missing values. Then, examine the distribution of each variable. For categorical variables, I recode missing values into a new category “Unknown”. For continuous variables, I impute missing values with the median. Finally, I summarize the data to review the distribution of each variable.

# Load "mimiciv_icu_cohort.rds" data
mimic_icu_cohort <- read_rds("mimic_icu_cohort.rds") |>
  select(
    -last_careunit,
    -dod,
    -discharge_location,
    -hospital_expire_flag,
    -los
  )

# Check missing values of the data
colSums(is.na(mimic_icu_cohort))
                           subject_id                               hadm_id 
                                    0                                     0 
                              stay_id                        first_careunit 
                                    0                                     0 
                       admission_type                    admission_location 
                                    0                                     0 
                            insurance                              language 
                                    0                                     0 
                       marital_status                                  race 
                                 5113                                     0 
                               gender                                sodium 
                                    0                                  8872 
                             chloride                            creatinine 
                                 8883                                  5770 
                            potassium                               glucose 
                                 8901                                  9099 
                           hematocrit                white_blood_cell_count 
                                 5017                                  5094 
                          bicarbonate             temperature_in_Fahrenheit 
                                 9050                                  1365 
diastolic_non_invasive_blood_pressure                      respiratory_rate 
                                  984                                    98 
 systolic_non_invasive_blood_pressure                            heart_rate 
                                  980                                    20 
                           age_intime                              los_long 
                                    0                                     0 
# Define a vector with the names of the continuous variables
colnames <- c(
  "sodium", "chloride", "creatinine",
  "potassium", "glucose", "hematocrit",
  "white_blood_cell_count", "bicarbonate",
  "temperature_in_Fahrenheit", 
  "diastolic_non_invasive_blood_pressure",
  "systolic_non_invasive_blood_pressure",
  "respiratory_rate", "heart_rate",
  "age_intime"
)

# Check the distribution of each continuous variable.
# Most of the continuous variables are right-skewed,
# so I will use the median to impute missing values later.
hist_data <- lapply(mimic_icu_cohort[, colnames], function(x) {
  data.frame(value = x)
})

names(hist_data) <- colnames

hist_data <- bind_rows(hist_data, .id = "variable")

ggplot(hist_data, aes(x = value)) +
  geom_density(kernel = "cosine", fill = "skyblue", alpha = 0.5) +
  facet_wrap(~ variable, scales = "free") +
  labs(x = "Value") +
  theme_minimal()
Warning: Removed 64133 rows containing non-finite outside the scale range
(`stat_density()`).
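
To back up the right-skewness claim with numbers (a quick sketch, not part of the original output), one can compare the mean and median of each continuous variable; a mean noticeably above the median suggests right skew.

# Compare mean and median of each continuous variable;
# mean > median suggests right skew (NAs removed before computing).
# `colnames` is the vector of continuous variable names defined above.
mimic_icu_cohort |>
  select(all_of(colnames)) |>
  summarise(across(everything(),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE)))) |>
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(mean|median)")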

# Deal with categorical variables
mimic_icu_cohort <- mimic_icu_cohort |>
  
  # For categorical variables "marital_status", 
  # I use a new category "Unknown" to get rid of missing values
  mutate(
    marital_status = ifelse(
      is.na(marital_status), "Unknown", marital_status)
  ) |>
  
  # Make the categorical variables as factors 
  # to make it easier for modeling
  mutate(marital_status = as.factor(marital_status)) |>
  mutate(los_long = as.factor(los_long))

# summarize the data
mimic_icu_cohort |>
  tbl_summary()
Characteristic N = 73,181¹
subject_id 14,998,941 (12,491,257, 17,513,267)
hadm_id 24,969,636 (22,483,349, 27,471,796)
stay_id 34,993,892 (32,489,353, 37,488,401)
first_careunit
    Cardiac Vascular Intensive Care Unit (CVICU) 11,582 (16%)
    Medical Intensive Care Unit (MICU) 15,898 (22%)
    Medical/Surgical Intensive Care Unit (MICU/SICU) 12,733 (17%)
    Surgical Intensive Care Unit (SICU) 11,161 (15%)
    Other 21,807 (30%)
admission_type
    EW EMER. 38,672 (53%)
    OBSERVATION ADMIT 8,974 (12%)
    SURGICAL SAME DAY ADMISSION 7,373 (10%)
    URGENT 12,453 (17%)
    Other 5,709 (7.8%)
admission_location
    EMERGENCY ROOM 35,881 (49%)
    PHYSICIAN REFERRAL 16,398 (22%)
    TRANSFER FROM HOSPITAL 15,798 (22%)
    Other 5,104 (7.0%)
insurance
    Medicaid 5,528 (7.6%)
    Medicare 33,091 (45%)
    Other 34,562 (47%)
language
    ? 7,443 (10%)
    ENGLISH 65,738 (90%)
marital_status
    DIVORCED 5,404 (7.4%)
    MARRIED 32,768 (45%)
    SINGLE 20,858 (29%)
    Unknown 5,113 (7.0%)
    WIDOWED 9,038 (12%)
race
    ASIAN 2,155 (2.9%)
    BLACK 7,960 (11%)
    HISPANIC 2,741 (3.7%)
    Other 10,756 (15%)
    WHITE 49,569 (68%)
gender
    F 32,363 (44%)
    M 40,818 (56%)
sodium 139.0 (136.0, 141.0)
    Unknown 8,872
chloride 102 (98, 105)
    Unknown 8,883
creatinine 1.00 (0.80, 1.50)
    Unknown 5,770
potassium 4.20 (3.80, 4.60)
    Unknown 8,901
glucose 120 (99, 156)
    Unknown 9,099
hematocrit 35 (30, 40)
    Unknown 5,017
white_blood_cell_count 9.3 (6.8, 13.2)
    Unknown 5,094
bicarbonate 25.0 (22.0, 27.0)
    Unknown 9,050
temperature_in_Fahrenheit 98.20 (97.60, 98.70)
    Unknown 1,365
diastolic_non_invasive_blood_pressure 67 (57, 79)
    Unknown 984
respiratory_rate 18.0 (15.0, 22.0)
    Unknown 98
systolic_non_invasive_blood_pressure 121 (106, 139)
    Unknown 980
heart_rate 86 (74, 101)
    Unknown 20
age_intime 66 (54, 77)
los_long
    FALSE 38,050 (52%)
    TRUE 35,131 (48%)
¹ Median (IQR); n (%)
  • Partition data into 50% training set and 50% test set. Stratify partitioning according to los_long. For grading purposes, sort the data by subject_id, hadm_id, and stay_id and use the seed 203 for the initial data split. Below is the sample code.
  1. Partition data into 50% training set and 50% test set.
set.seed(203)

# sort the data by subject_id, hadm_id, and stay_id
mimic_icu_cohort <- mimic_icu_cohort |>
  arrange(subject_id, hadm_id, stay_id) |>
  select(-subject_id, -hadm_id, -stay_id)

# partition data into 50% training set and 50% test set
data_split <- initial_split(
  mimic_icu_cohort, 
  # stratify by los_long
  strata = "los_long", 
  prop = 0.5
  )

data_split
<Training/Testing/Total>
<36590/36591/73181>
# check the training set
train_data <- training(data_split)
dim(train_data)
[1] 36590    23
# check the test set
test_data <- testing(data_split)
dim(test_data)
[1] 36591    23
  2. Pre-processing of data
# impute missing values
recipe <- recipe(los_long ~ ., data = train_data) |>
  
  # For continuous variables, I use median to impute missing values
  step_impute_median(sodium, chloride, creatinine,
                     potassium, glucose, hematocrit,
                     white_blood_cell_count,
                     bicarbonate,
                     temperature_in_Fahrenheit,
                     diastolic_non_invasive_blood_pressure,
                     systolic_non_invasive_blood_pressure,
                     respiratory_rate,
                     heart_rate
                     ) |>
  
  # For categorical variables, I create traditional dummy variables
  step_dummy(all_nominal_predictors()) |>
  
  # zero-variance filter to remove predictors with zero variance
  step_zv(all_numeric_predictors()) |>
  
  # center and scale numeric data 
  # to have a mean of 0 and a standard deviation of 1
  # so that the model can be trained more efficiently
  step_normalize(all_numeric_predictors()) 
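
As a quick sanity check (a hedged sketch, not part of the required output), the recipe can be prepped on the training data and baked to confirm that no missing values remain after imputation and dummy coding.

# Prep the recipe on the training set, bake the processed training data,
# and count remaining missing values (expected to be 0 if every predictor
# with missing values is covered by the imputation step)
recipe |>
  prep(training = train_data) |>
  bake(new_data = NULL) |>
  is.na() |>
  sum()
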
  • Train and tune the models using the training set.

Logistic regression with elastic net regularization

First, define the model specification and the workflow for the logistic regression model. Then, tune the model using 5-fold cross-validation.

  1. Define the model specification
# define the recipe
logit_recipe <- recipe
# define the model specification:
# use the glmnet engine for classification
logit_model <- logistic_reg(
  mode = "classification",
  penalty = tune(),
  mixture = tune()) |>
  
  # set the engine to glmnet
  set_engine("glmnet", standardize = FALSE) |>
  print()
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = tune()

Engine-Specific Arguments:
  standardize = FALSE

Computational engine: glmnet 
  2. Define the workflow
logit_workflow <- workflow() |>
  add_recipe(logit_recipe) |>
  add_model(logit_model) |>
  print()
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = tune()

Engine-Specific Arguments:
  standardize = FALSE

Computational engine: glmnet 
  3. Tuning grid
# tune penalty and mixture over a regular grid
logit_param_grid <- grid_regular(
  penalty(range = c(-4, 1)), 
  mixture(),
  levels = c(50, 3)
  ) |>
  print()
# A tibble: 150 × 2
    penalty mixture
      <dbl>   <dbl>
 1 0.0001         0
 2 0.000126       0
 3 0.000160       0
 4 0.000202       0
 5 0.000256       0
 6 0.000324       0
 7 0.000409       0
 8 0.000518       0
 9 0.000655       0
10 0.000829       0
# ℹ 140 more rows
  4. Cross-validation

Define the cross-validation folds

set.seed(203)

# use 5 folds for cross-validation
logit_folds <- vfold_cv(train_data, v = 5)

logit_folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits               id   
  <list>               <chr>
1 <split [29272/7318]> Fold1
2 <split [29272/7318]> Fold2
3 <split [29272/7318]> Fold3
4 <split [29272/7318]> Fold4
5 <split [29272/7318]> Fold5

Fit cross-validated models

if (file.exists("logit_fit.rds")) {
  logit_fit <- read_rds("logit_fit.rds")
  logit_fit
  
} else {
  (logit_fit <- logit_workflow |>
    tune_grid(
      resamples = logit_folds,
      grid = logit_param_grid,
      
      # use roc_auc and accuracy as the evaluation metrics
      metrics = metric_set(roc_auc, accuracy),
      
      # prepare for later stacking
      control = control_stack_grid()
    )) |>
  system.time()

  logit_fit |>
    write_rds("logit_fit.rds")
  
  logit_fit
}
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 5
  splits               id    .metrics           .notes           .predictions
  <list>               <chr> <list>             <list>           <list>      
1 <split [29272/7318]> Fold1 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>    
2 <split [29272/7318]> Fold2 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>    
3 <split [29272/7318]> Fold3 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>    
4 <split [29272/7318]> Fold4 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>    
5 <split [29272/7318]> Fold5 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>    
  5. Visualize CV results and select the best model
logit_fit |>
  
  # aggregate metrics from 5-fold cross-validation
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(x = penalty, y = mean, 
                       color = factor(mixture))) +
  geom_point() +
  labs(x = "Penalty", y = "CV AUC") +
  scale_x_log10()
# A tibble: 300 × 8
    penalty mixture .metric  .estimator  mean     n std_err
      <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
 1 0.0001         0 accuracy binary     0.581     5 0.00310
 2 0.0001         0 roc_auc  binary     0.615     5 0.00264
 3 0.000126       0 accuracy binary     0.581     5 0.00310
 4 0.000126       0 roc_auc  binary     0.615     5 0.00264
 5 0.000160       0 accuracy binary     0.581     5 0.00310
 6 0.000160       0 roc_auc  binary     0.615     5 0.00264
 7 0.000202       0 accuracy binary     0.581     5 0.00310
 8 0.000202       0 roc_auc  binary     0.615     5 0.00264
 9 0.000256       0 accuracy binary     0.581     5 0.00310
10 0.000256       0 roc_auc  binary     0.615     5 0.00264
   .config               
   <chr>                 
 1 Preprocessor1_Model001
 2 Preprocessor1_Model001
 3 Preprocessor1_Model002
 4 Preprocessor1_Model002
 5 Preprocessor1_Model003
 6 Preprocessor1_Model003
 7 Preprocessor1_Model004
 8 Preprocessor1_Model004
 9 Preprocessor1_Model005
10 Preprocessor1_Model005
# ℹ 290 more rows

show top 5 models

logit_fit |>
  show_best("roc_auc")
# A tibble: 5 × 8
   penalty mixture .metric .estimator  mean     n std_err .config               
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1 0.00543      0   roc_auc binary     0.615     5 0.00265 Preprocessor1_Model018
2 0.00687      0   roc_auc binary     0.615     5 0.00266 Preprocessor1_Model019
3 0.000655     1   roc_auc binary     0.615     5 0.00270 Preprocessor1_Model109
4 0.000829     0.5 roc_auc binary     0.615     5 0.00268 Preprocessor1_Model060
5 0.000518     1   roc_auc binary     0.615     5 0.00269 Preprocessor1_Model108

select the best model

best_logit <- logit_fit |>
  select_best("roc_auc")
best_logit
# A tibble: 1 × 3
  penalty mixture .config               
    <dbl>   <dbl> <chr>                 
1 0.00543       0 Preprocessor1_Model018
  6. Finalize the model

Set the final workflow

# Final workflow
logit_final_workflow <- logit_workflow |>
  finalize_workflow(best_logit)
logit_final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = 0.00542867543932386
  mixture = 0

Engine-Specific Arguments:
  standardize = FALSE

Computational engine: glmnet 

Fit on the whole training set, then predict on the test set

logit_final_fit <- logit_final_workflow |>
  last_fit(data_split)

logit_final_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits                id             .metrics .notes   .predictions .workflow 
  <list>                <chr>          <list>   <list>   <list>       <list>    
1 <split [36590/36591]> train/test sp… <tibble> <tibble> <tibble>     <workflow>

Output the test set metrics

# Test metrics
logit_final_fit |> 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.578 Preprocessor1_Model1
2 roc_auc  binary         0.606 Preprocessor1_Model1

Plot the variable importance

# plot the variable importance
logit_final_fit |>
  extract_fit_parsnip() |>
  vip::vip() |>
  print()

  7. Interpret the model

First, the AUC of the final logistic regression model is 0.6064933. The AUC measures the model’s ability to distinguish between the two classes: it ranges from 0 to 1, with 0.5 indicating random guessing and 1 indicating perfect classification, so an AUC of about 0.61 is not very high. The accuracy of the final logistic regression model is 0.5777924, which means that the model correctly predicts only 57.8% of the test set.
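
To visualize this, the test-set ROC curve can be drawn from the collected predictions (a sketch; it assumes the first factor level, FALSE, is treated as the event, which is yardstick’s default and matches the use of .pred_FALSE later in this report).

# Plot the test-set ROC curve of the finalized logistic regression model
logit_final_fit |>
  collect_predictions() |>
  roc_curve(truth = los_long, .pred_FALSE) |>
  autoplot()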

The variable importance plot shows the top 10 predictors; the most important predictor is the first care unit.

Random forest

First, define the model specification and the workflow for the random forest model. Then, tune the model using 5-fold cross-validation.

  1. Define the recipe
# define the recipe
rf_recipe <- recipe
  2. Define the random forest model
rf_model <- rand_forest(
  mode = "classification",
  
  # Number of predictors randomly sampled in each split
  mtry = tune(),
  
  # Number of trees in ensemble
  trees = tune()
) |>
  
  # use the ranger package to do classification
  set_engine("ranger", importance = "impurity")

rf_model
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
  3. Define the workflow
rf_workflow <- workflow() |>
  add_recipe(rf_recipe) |>
  add_model(rf_model) |>
  print()
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
  4. Tuning grid
# define the tuning grid

# try 5 different values for trees and 3 different values for mtry
rf_param_grid <- grid_regular(
  trees(range = c(200L, 800L)), 
  mtry(range = c(2L, 5L)),
  levels = c(5, 3)
  )

rf_param_grid
# A tibble: 15 × 2
   trees  mtry
   <int> <int>
 1   200     2
 2   350     2
 3   500     2
 4   650     2
 5   800     2
 6   200     3
 7   350     3
 8   500     3
 9   650     3
10   800     3
11   200     5
12   350     5
13   500     5
14   650     5
15   800     5
  5. Cross-validation

Define the cross-validation folds

set.seed(203)

# use 5 folds for cross-validation
rf_folds <- vfold_cv(train_data, v = 5)

rf_folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits               id   
  <list>               <chr>
1 <split [29272/7318]> Fold1
2 <split [29272/7318]> Fold2
3 <split [29272/7318]> Fold3
4 <split [29272/7318]> Fold4
5 <split [29272/7318]> Fold5

Fit cross-validated models

if (file.exists("rf_fit.rds")) {
  rf_fit <- read_rds("rf_fit.rds")
  rf_fit
  
} else {
  (rf_fit <- rf_workflow |>
    tune_grid(
      resamples = rf_folds,
      grid = rf_param_grid,
      
      # use roc_auc and accuracy as the evaluation metrics
      metrics = metric_set(roc_auc, accuracy),
      
      # prepare for later stacking
      control = control_stack_grid()
    )) |>
  system.time()

  rf_fit |>
    write_rds("rf_fit.rds")
  
  rf_fit
}
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 5
  splits               id    .metrics          .notes           .predictions
  <list>               <chr> <list>            <list>           <list>      
1 <split [29272/7318]> Fold1 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
2 <split [29272/7318]> Fold2 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
3 <split [29272/7318]> Fold3 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
4 <split [29272/7318]> Fold4 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
5 <split [29272/7318]> Fold5 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
  6. Visualize CV results and select the best model
rf_fit |>
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(x = trees, y = mean, 
                       color = factor(mtry))) +
  geom_point() +
  labs(x = "Num. of Trees", y = "CV AUC")
# A tibble: 30 × 8
    mtry trees .metric  .estimator  mean     n std_err .config              
   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1     2   200 accuracy binary     0.600     5 0.00405 Preprocessor1_Model01
 2     2   200 roc_auc  binary     0.640     5 0.00245 Preprocessor1_Model01
 3     2   350 accuracy binary     0.601     5 0.00225 Preprocessor1_Model02
 4     2   350 roc_auc  binary     0.643     5 0.00213 Preprocessor1_Model02
 5     2   500 accuracy binary     0.601     5 0.00328 Preprocessor1_Model03
 6     2   500 roc_auc  binary     0.644     5 0.00266 Preprocessor1_Model03
 7     2   650 accuracy binary     0.600     5 0.00200 Preprocessor1_Model04
 8     2   650 roc_auc  binary     0.644     5 0.00231 Preprocessor1_Model04
 9     2   800 accuracy binary     0.602     5 0.00342 Preprocessor1_Model05
10     2   800 roc_auc  binary     0.644     5 0.00262 Preprocessor1_Model05
# ℹ 20 more rows

show top 5 models

rf_fit |>
  show_best("roc_auc")
# A tibble: 5 × 8
   mtry trees .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     3   650 roc_auc binary     0.645     5 0.00252 Preprocessor1_Model09
2     3   800 roc_auc binary     0.645     5 0.00257 Preprocessor1_Model10
3     2   800 roc_auc binary     0.644     5 0.00262 Preprocessor1_Model05
4     5   800 roc_auc binary     0.644     5 0.00297 Preprocessor1_Model15
5     2   500 roc_auc binary     0.644     5 0.00266 Preprocessor1_Model03

select the best model

best_rf <- rf_fit |>
  select_best("roc_auc")
best_rf
# A tibble: 1 × 3
   mtry trees .config              
  <int> <int> <chr>                
1     3   650 Preprocessor1_Model09
  7. Finalize the model

Set the final workflow

rf_final_workflow <- rf_workflow |>
  finalize_workflow(best_rf)
rf_final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 650

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 

Fit on the whole training set, then predict on the test set

if (file.exists("rf_final_fit.rds")) {
  rf_final_fit <- read_rds("rf_final_fit.rds")
  rf_final_fit
  
} else {
  
  # fit the final model on the whole training set
  rf_final_fit <- rf_final_workflow |>
    last_fit(data_split)
  
  rf_final_fit |>
    write_rds("rf_final_fit.rds")
  
  rf_final_fit
}
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits                id             .metrics .notes   .predictions .workflow 
  <list>                <chr>          <list>   <list>   <list>       <list>    
1 <split [36590/36591]> train/test sp… <tibble> <tibble> <tibble>     <workflow>

Output the test set metrics

rf_final_fit |> 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.604 Preprocessor1_Model1
2 roc_auc  binary         0.644 Preprocessor1_Model1

Predictor importance

# plot the variable importance 
rf_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()

  8. Interpret the model

First, the AUC of the final random forest model is 0.6437507, which is higher than the AUC of the final logistic regression model. The AUC measures the model’s ability to distinguish between the two classes: it ranges from 0 to 1, with 0.5 indicating random guessing and 1 indicating perfect classification, so this is a modest but clear improvement. The accuracy of the final random forest model is 0.6041103, which means that the model correctly predicts 60.4% of the test set.

The variable importance plot shows the top 10 predictors; the most important predictor is the systolic non-invasive blood pressure.
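
For the exact impurity-based importance scores behind the plot, the values can be pulled from the fitted ranger object directly (a sketch; it assumes the engine fit extracted from the last_fit result is a ranger object fitted with importance = "impurity", as specified above).

# Top 10 impurity-based importance scores from the fitted ranger model
rf_final_fit |>
  extract_fit_engine() |>
  ranger::importance() |>
  sort(decreasing = TRUE) |>
  head(10)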

Boosting (XGBoost)

First, define the model specification and the workflow for the boosting model. Then, tune the model using 5-fold cross-validation.

  1. Define the recipe
# define the recipe
gb_recipe <- recipe
  2. Define the boosting model
gb_model <- boost_tree(
  mode = "classification",
  
  # Number of trees in ensemble
  trees = 1000,
  
  # Shrinkage factor
  learn_rate = tune(),
  
  # Maximum depth of trees
  tree_depth = tune()
) |>
  
  # use the xgboost package to do classification
  set_engine("xgboost")

gb_model
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1000
  tree_depth = tune()
  learn_rate = tune()

Computational engine: xgboost 
  3. Define the workflow
gb_workflow <- workflow() |>
  add_recipe(gb_recipe) |>
  add_model(gb_model) 
gb_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1000
  tree_depth = tune()
  learn_rate = tune()

Computational engine: xgboost 
  4. Tuning grid
# set the tuning grid
gb_param_grid <- grid_regular(
  
  # try 3 different values for tree_depth, 
  # 5 different values for learn_rate
  tree_depth(range = c(1L, 3L)),
  learn_rate(range = c(-5, 2), trans = log10_trans()),
  levels = c(3, 5)
  )

gb_param_grid
# A tibble: 15 × 2
   tree_depth learn_rate
        <int>      <dbl>
 1          1   0.00001 
 2          2   0.00001 
 3          3   0.00001 
 4          1   0.000562
 5          2   0.000562
 6          3   0.000562
 7          1   0.0316  
 8          2   0.0316  
 9          3   0.0316  
10          1   1.78    
11          2   1.78    
12          3   1.78    
13          1 100       
14          2 100       
15          3 100       
  5. Cross-validation: set the cross-validation partitions
set.seed(203)

# use 5 folds for cross-validation
gb_folds <- vfold_cv(train_data, v = 5)

gb_folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits               id   
  <list>               <chr>
1 <split [29272/7318]> Fold1
2 <split [29272/7318]> Fold2
3 <split [29272/7318]> Fold3
4 <split [29272/7318]> Fold4
5 <split [29272/7318]> Fold5

Fit cross-validated models

if (file.exists("gb_fit.rds")) {
  gb_fit <- read_rds("gb_fit.rds")
  gb_fit
  
} else {
  (gb_fit <- gb_workflow |>
    tune_grid(
      resamples = gb_folds,
      grid = gb_param_grid,
      
      # use roc_auc and accuracy as the evaluation metrics
      metrics = metric_set(roc_auc, accuracy),
      
      # prepare for later stacking
      control = control_stack_grid()
      )) |>
    system.time()
  
  gb_fit |>
    write_rds("gb_fit.rds")
  
  gb_fit
}
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 5
  splits               id    .metrics          .notes           .predictions
  <list>               <chr> <list>            <list>           <list>      
1 <split [29272/7318]> Fold1 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
2 <split [29272/7318]> Fold2 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
3 <split [29272/7318]> Fold3 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
4 <split [29272/7318]> Fold4 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
5 <split [29272/7318]> Fold5 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>    
  6. Visualize CV results and select the best model
gb_fit |>
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(x = learn_rate, y = mean, 
                       color = factor(tree_depth))) +
  geom_point() +
  labs(x = "Learning Rate", y = "CV AUC") +
  scale_x_log10()
# A tibble: 30 × 8
   tree_depth learn_rate .metric  .estimator  mean     n std_err
        <int>      <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
 1          1   0.00001  accuracy binary     0.550     5 0.00338
 2          1   0.00001  roc_auc  binary     0.539     5 0.00172
 3          2   0.00001  accuracy binary     0.557     5 0.00236
 4          2   0.00001  roc_auc  binary     0.562     5 0.00347
 5          3   0.00001  accuracy binary     0.561     5 0.00386
 6          3   0.00001  roc_auc  binary     0.574     5 0.00543
 7          1   0.000562 accuracy binary     0.554     5 0.00354
 8          1   0.000562 roc_auc  binary     0.577     5 0.00308
 9          2   0.000562 accuracy binary     0.562     5 0.00367
10          2   0.000562 roc_auc  binary     0.592     5 0.00322
   .config              
   <chr>                
 1 Preprocessor1_Model01
 2 Preprocessor1_Model01
 3 Preprocessor1_Model02
 4 Preprocessor1_Model02
 5 Preprocessor1_Model03
 6 Preprocessor1_Model03
 7 Preprocessor1_Model04
 8 Preprocessor1_Model04
 9 Preprocessor1_Model05
10 Preprocessor1_Model05
# ℹ 20 more rows

show top 5 models

gb_fit |>
  show_best("roc_auc")
# A tibble: 5 × 8
  tree_depth learn_rate .metric .estimator  mean     n std_err .config          
       <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
1          3   0.0316   roc_auc binary     0.645     5 0.00320 Preprocessor1_Mo…
2          2   0.0316   roc_auc binary     0.642     5 0.00351 Preprocessor1_Mo…
3          1   0.0316   roc_auc binary     0.633     5 0.00306 Preprocessor1_Mo…
4          1   1.78     roc_auc binary     0.619     5 0.00241 Preprocessor1_Mo…
5          3   0.000562 roc_auc binary     0.602     5 0.00358 Preprocessor1_Mo…

select the best model

best_gb <- gb_fit |>
  select_best("roc_auc")
best_gb
# A tibble: 1 × 3
  tree_depth learn_rate .config              
       <int>      <dbl> <chr>                
1          3     0.0316 Preprocessor1_Model09
  7. Finalize the model
gb_final_workflow <- gb_workflow |>
  finalize_workflow(best_gb)
gb_final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1000
  tree_depth = 3
  learn_rate = 0.0316227766016838

Computational engine: xgboost 

Fit on the whole training set, then predict on the test set

if (file.exists("gb_final_fit.rds")) {
  gb_final_fit <- read_rds("gb_final_fit.rds")
  gb_final_fit
  
} else {
  
  # fit the final model on the whole training set
  gb_final_fit <- gb_final_workflow |>
    last_fit(data_split)
  
  gb_final_fit |>
    write_rds("gb_final_fit.rds")
  
  gb_final_fit
}
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits                id             .metrics .notes   .predictions .workflow 
  <list>                <chr>          <list>   <list>   <list>       <list>    
1 <split [36590/36591]> train/test sp… <tibble> <tibble> <tibble>     <workflow>

Output the test set metrics

gb_final_fit |> 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.605 Preprocessor1_Model1
2 roc_auc  binary         0.645 Preprocessor1_Model1

Predictor importance

# plot the variable importance
gb_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()

  8. Interpret the model

First, the AUC of the final gradient boosting model is 0.6449823, the highest of the three models. The AUC measures the model’s ability to distinguish between the two classes: it ranges from 0 to 1, with 0.5 indicating random guessing and 1 indicating perfect classification, so the boosting model edges out the random forest. The accuracy of the final gradient boosting model is 0.6045476, which means that the model correctly predicts 60.5% of the test set.

Second, the variable importance plot shows the top 10 predictors; the most important predictor is the temperature in Fahrenheit, and the second most important is the systolic non-invasive blood pressure, which is consistent with the random forest model.
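
A confusion matrix of the boosting model’s test-set predictions gives a more concrete view of these error rates (a quick sketch based on the collected predictions).

# Confusion matrix of the finalized boosting model on the test set
gb_final_fit |>
  collect_predictions() |>
  conf_mat(truth = los_long, estimate = .pred_class)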

  • Compare model classification performance on the test set. Report both the area under ROC curve and accuracy for each machine learning algorithm and the model stacking. Interpret the results. What are the most important features in predicting long ICU stays? How do the models compare in terms of performance and interpretability?

Model comparison

I trained three machine learning algorithms: logistic regression with elastic net regularization, random forest, and gradient boosting (XGBoost). Here, I compare their classification performance on the test set using the area under the ROC curve (AUC) and accuracy.

# collect the metrics of the three models
logit_final_fit |> collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.578 Preprocessor1_Model1
2 roc_auc  binary         0.606 Preprocessor1_Model1
rf_final_fit |> collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.604 Preprocessor1_Model1
2 roc_auc  binary         0.644 Preprocessor1_Model1
gb_final_fit |> collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.605 Preprocessor1_Model1
2 roc_auc  binary         0.645 Preprocessor1_Model1

From the output, we can see that the gradient boosting model has the highest AUC (0.6449823) and accuracy (0.6045476) on the test set. The logistic regression model has the lowest AUC (0.6065411) and accuracy (0.5776284). The random forest model has an AUC of 0.6437507 and accuracy of 0.6041103. Therefore, the gradient boosting model is the best-performing model among the three.
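
For a side-by-side view, the three sets of test metrics can be gathered into a single table (a sketch using the objects above).

# Combine the test-set metrics of the three finalized models into one table
bind_rows(
  logit_final_fit |> collect_metrics() |> mutate(model = "logistic (enet)"),
  rf_final_fit |> collect_metrics() |> mutate(model = "random forest"),
  gb_final_fit |> collect_metrics() |> mutate(model = "boosting (xgboost)")
) |>
  select(model, .metric, .estimate) |>
  pivot_wider(names_from = .metric, values_from = .estimate)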

Model stacking

Build the stacked ensemble.

if (file.exists("stacks.rds")) {
  model_st <- read_rds("stacks.rds")
  model_st
} else {
  model_st <- 
    stacks() |>
    
    # add candidate models
    add_candidates(logit_fit) |>
    add_candidates(gb_fit) |>
    add_candidates(rf_fit) |>
    
    # determine how to combine their predictions
    blend_predictions(
      penalty = 10^(-3:1),
      metrics = c("roc_auc"),
      
      # use 3 bootstrap resamples to reduce computation time
      times = 3) |>
    
    # fit the candidates with nonzero stacking coefficients
    fit_members()
  
  model_st |> write_rds("stacks.rds")
  
  model_st
}
# A tibble: 10 × 3
   member                 type        weight
   <chr>                  <chr>        <dbl>
 1 .pred_TRUE_gb_fit_1_09 boost_tree   1.80 
 2 .pred_TRUE_rf_fit_1_09 rand_forest  0.691
 3 .pred_TRUE_rf_fit_1_07 rand_forest  0.590
 4 .pred_TRUE_rf_fit_1_02 rand_forest  0.468
 5 .pred_TRUE_rf_fit_1_15 rand_forest  0.443
 6 .pred_TRUE_rf_fit_1_06 rand_forest  0.427
 7 .pred_TRUE_rf_fit_1_13 rand_forest  0.358
 8 .pred_TRUE_rf_fit_1_03 rand_forest  0.299
 9 .pred_TRUE_rf_fit_1_10 rand_forest  0.163
10 .pred_TRUE_gb_fit_1_10 boost_tree   0.116

Plot the stacked results

autoplot(model_st)

To show each member model’s performance more directly

autoplot(model_st, type = "members")

To see the stacking weights of the member models

autoplot(model_st, type = "weights")

To identify the stacking coefficients of the random forest members

collect_parameters(model_st, "rf_fit")
# A tibble: 15 × 5
   member       mtry trees terms                   coef
   <chr>       <int> <int> <chr>                  <dbl>
 1 rf_fit_1_01     2   200 .pred_TRUE_rf_fit_1_01 0    
 2 rf_fit_1_02     2   350 .pred_TRUE_rf_fit_1_02 0.468
 3 rf_fit_1_03     2   500 .pred_TRUE_rf_fit_1_03 0.299
 4 rf_fit_1_04     2   650 .pred_TRUE_rf_fit_1_04 0    
 5 rf_fit_1_05     2   800 .pred_TRUE_rf_fit_1_05 0    
 6 rf_fit_1_06     3   200 .pred_TRUE_rf_fit_1_06 0.427
 7 rf_fit_1_07     3   350 .pred_TRUE_rf_fit_1_07 0.590
 8 rf_fit_1_08     3   500 .pred_TRUE_rf_fit_1_08 0    
 9 rf_fit_1_09     3   650 .pred_TRUE_rf_fit_1_09 0.691
10 rf_fit_1_10     3   800 .pred_TRUE_rf_fit_1_10 0.163
11 rf_fit_1_11     5   200 .pred_TRUE_rf_fit_1_11 0    
12 rf_fit_1_12     5   350 .pred_TRUE_rf_fit_1_12 0    
13 rf_fit_1_13     5   500 .pred_TRUE_rf_fit_1_13 0.358
14 rf_fit_1_14     5   650 .pred_TRUE_rf_fit_1_14 0    
15 rf_fit_1_15     5   800 .pred_TRUE_rf_fit_1_15 0.443

To identify the stacking coefficients of the gradient boosting members

collect_parameters(model_st, "gb_fit")
# A tibble: 15 × 5
   member      tree_depth learn_rate terms                    coef
   <chr>            <int>      <dbl> <chr>                   <dbl>
 1 gb_fit_1_01          1   0.00001  .pred_TRUE_gb_fit_1_01 0     
 2 gb_fit_1_02          2   0.00001  .pred_TRUE_gb_fit_1_02 0     
 3 gb_fit_1_03          3   0.00001  .pred_TRUE_gb_fit_1_03 0     
 4 gb_fit_1_04          1   0.000562 .pred_TRUE_gb_fit_1_04 0     
 5 gb_fit_1_05          2   0.000562 .pred_TRUE_gb_fit_1_05 0     
 6 gb_fit_1_06          3   0.000562 .pred_TRUE_gb_fit_1_06 0     
 7 gb_fit_1_07          1   0.0316   .pred_TRUE_gb_fit_1_07 0     
 8 gb_fit_1_08          2   0.0316   .pred_TRUE_gb_fit_1_08 0     
 9 gb_fit_1_09          3   0.0316   .pred_TRUE_gb_fit_1_09 1.80  
10 gb_fit_1_10          1   1.78     .pred_TRUE_gb_fit_1_10 0.116 
11 gb_fit_1_11          2   1.78     .pred_TRUE_gb_fit_1_11 0.0455
12 gb_fit_1_12          3   1.78     .pred_TRUE_gb_fit_1_12 0.0701
13 gb_fit_1_13          1 100        .pred_TRUE_gb_fit_1_13 0     
14 gb_fit_1_14          2 100        .pred_TRUE_gb_fit_1_14 0     
15 gb_fit_1_15          3 100        .pred_TRUE_gb_fit_1_15 0.0138

To identify the stacking coefficients of the logistic regression members

collect_parameters(model_st, "logit_fit")
# A tibble: 81 × 5
   member          penalty mixture terms                         coef
   <chr>             <dbl>   <dbl> <chr>                        <dbl>
 1 logit_fit_1_001 0.0001        0 .pred_TRUE_logit_fit_1_001 0.00213
 2 logit_fit_1_018 0.00543       0 .pred_TRUE_logit_fit_1_018 0      
 3 logit_fit_1_019 0.00687       0 .pred_TRUE_logit_fit_1_019 0      
 4 logit_fit_1_020 0.00869       0 .pred_TRUE_logit_fit_1_020 0      
 5 logit_fit_1_021 0.0110        0 .pred_TRUE_logit_fit_1_021 0      
 6 logit_fit_1_022 0.0139        0 .pred_TRUE_logit_fit_1_022 0      
 7 logit_fit_1_023 0.0176        0 .pred_TRUE_logit_fit_1_023 0      
 8 logit_fit_1_024 0.0222        0 .pred_TRUE_logit_fit_1_024 0      
 9 logit_fit_1_025 0.0281        0 .pred_TRUE_logit_fit_1_025 0      
10 logit_fit_1_026 0.0356        0 .pred_TRUE_logit_fit_1_026 0      
# ℹ 71 more rows

Final classification

Apply the model on the test data and output the final classification

if (file.exists("final_classification.rds")) {
  final_classification <- read_rds("final_classification.rds")
  
  final_classification
  
} else {
  final_classification <- test_data |>
    
    # apply the model on the test data 
    # with type = "prob" to get the predicted probabilities
    bind_cols(predict(model_st, test_data, type = "prob")) |>
    print(width = Inf)
  
  final_classification
  
  final_classification |>
    write_rds("final_classification.rds")
}
# A tibble: 36,591 × 25
   first_careunit           admission_type admission_location insurance language
   <fct>                    <fct>          <fct>              <chr>     <chr>   
 1 Medical Intensive Care … EW EMER.       EMERGENCY ROOM     Medicaid  ENGLISH 
 2 Surgical Intensive Care… EW EMER.       EMERGENCY ROOM     Other     ?       
 3 Surgical Intensive Care… Other          PHYSICIAN REFERRAL Other     ?       
 4 Medical/Surgical Intens… EW EMER.       Other              Other     ENGLISH 
 5 Cardiac Vascular Intens… SURGICAL SAME… PHYSICIAN REFERRAL Medicare  ENGLISH 
 6 Other                    EW EMER.       Other              Other     ENGLISH 
 7 Medical Intensive Care … EW EMER.       EMERGENCY ROOM     Medicare  ENGLISH 
 8 Medical Intensive Care … EW EMER.       EMERGENCY ROOM     Medicare  ENGLISH 
 9 Other                    URGENT         TRANSFER FROM HOS… Medicare  ENGLISH 
10 Other                    URGENT         TRANSFER FROM HOS… Medicare  ENGLISH 
# ℹ 36,581 more rows
# ℹ 20 more variables: marital_status <fct>, race <chr>, gender <chr>,
#   sodium <dbl>, chloride <dbl>, creatinine <dbl>, potassium <dbl>,
#   glucose <dbl>, hematocrit <dbl>, white_blood_cell_count <dbl>,
#   bicarbonate <dbl>, temperature_in_Fahrenheit <dbl>,
#   diastolic_non_invasive_blood_pressure <dbl>, respiratory_rate <dbl>,
#   systolic_non_invasive_blood_pressure <dbl>, heart_rate <dbl>, …

Compute the ROC AUC and accuracy of the final classification

# compute the ROC AUC of the final classification
yardstick::roc_auc(
  final_classification,
  truth = los_long,
  contains(".pred_FALSE")
  )
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.650
# compute the accuracy of the final classification

# First, convert the predicted probabilities to 
# predicted classes
final_classification <- final_classification |>
  mutate(.pred_class = as.factor(
    ifelse(.pred_TRUE > .pred_FALSE, "TRUE", "FALSE")))

# Then, compute the accuracy
yardstick::accuracy(
  final_classification,
  truth = los_long,
  estimate = .pred_class
  )
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.606

Use the members argument to generate class predictions from each individual member model

if (file.exists("class_predictions.rds")) {
  class_predictions <- read_rds("class_predictions.rds")
  
  class_predictions
} else {
  
  class_predictions <- test_data |>
  select(los_long) |>
  
    # apply the model on the test data
    bind_cols(predict(model_st, 
                    test_data, 
                    type = "class", 
                    members = TRUE))
  
  class_predictions
  
  class_predictions |>
  write_rds("class_predictions.rds")
}
# A tibble: 36,591 × 17
   los_long .pred_class .pred_class_logit_fit_1_001 .pred_class_logit_fit_1_051
   <fct>    <fct>       <fct>                       <fct>                      
 1 FALSE    FALSE       FALSE                       FALSE                      
 2 FALSE    FALSE       FALSE                       FALSE                      
 3 FALSE    FALSE       FALSE                       FALSE                      
 4 FALSE    FALSE       FALSE                       FALSE                      
 5 FALSE    FALSE       FALSE                       FALSE                      
 6 TRUE     FALSE       FALSE                       FALSE                      
 7 TRUE     TRUE        TRUE                        TRUE                       
 8 TRUE     FALSE       FALSE                       FALSE                      
 9 TRUE     TRUE        TRUE                        TRUE                       
10 TRUE     TRUE        TRUE                        TRUE                       
# ℹ 36,581 more rows
# ℹ 13 more variables: .pred_class_gb_fit_1_10 <fct>,
#   .pred_class_gb_fit_1_11 <fct>, .pred_class_gb_fit_1_09 <fct>,
#   .pred_class_gb_fit_1_12 <fct>, .pred_class_gb_fit_1_15 <fct>,
#   .pred_class_rf_fit_1_02 <fct>, .pred_class_rf_fit_1_03 <fct>,
#   .pred_class_rf_fit_1_06 <fct>, .pred_class_rf_fit_1_07 <fct>,
#   .pred_class_rf_fit_1_09 <fct>, .pred_class_rf_fit_1_10 <fct>, …

Compute each model’s accuracy on the test set

# set the factor columns to character for comparison
class_predictions <- class_predictions |> 
  mutate(across(where(is.factor), as.character))

# compute the accuracy (proportion of correct class predictions) of each model
map(colnames(class_predictions), 
    ~ mean(class_predictions$los_long == 
             pull(class_predictions, .x))) |> 
  set_names(colnames(class_predictions)) |> 
  as_tibble() |> 
  pivot_longer(c(everything(), -los_long))
# A tibble: 16 × 3
   los_long name                        value
      <dbl> <chr>                       <dbl>
 1        1 .pred_class                 0.606
 2        1 .pred_class_logit_fit_1_001 0.578
 3        1 .pred_class_logit_fit_1_051 0.578
 4        1 .pred_class_gb_fit_1_10     0.588
 5        1 .pred_class_gb_fit_1_11     0.574
 6        1 .pred_class_gb_fit_1_09     0.605
 7        1 .pred_class_gb_fit_1_12     0.553
 8        1 .pred_class_gb_fit_1_15     0.444
 9        1 .pred_class_rf_fit_1_02     0.602
10        1 .pred_class_rf_fit_1_03     0.602
11        1 .pred_class_rf_fit_1_06     0.602
12        1 .pred_class_rf_fit_1_07     0.605
13        1 .pred_class_rf_fit_1_09     0.604
14        1 .pred_class_rf_fit_1_10     0.605
15        1 .pred_class_rf_fit_1_13     0.603
16        1 .pred_class_rf_fit_1_15     0.604
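
The accuracies of the members are close, so the same per-member comparison could also be done on AUC (a hedged sketch; it assumes predict() on the model stack with type = "prob" and members = TRUE returns one .pred_FALSE_* probability column per member, analogous to the class predictions above).

# Probability predictions from the stack and from each member model
prob_predictions <- test_data |>
  select(los_long) |>
  bind_cols(predict(model_st, test_data, type = "prob", members = TRUE))

# AUC for the ensemble and each member, using the .pred_FALSE_* columns
prob_predictions |>
  select(starts_with(".pred_FALSE")) |>
  colnames() |>
  map_dfr(~ tibble(
    member = .x,
    auc = roc_auc(prob_predictions, truth = los_long,
                  all_of(.x))$.estimate
  ))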

Report AUC and accuracy and interpret the results

The gradient boosting model has the highest AUC (0.6449823) and accuracy (0.6045476) on the test set. The logistic regression model has the lowest AUC (0.6065411) and accuracy (0.5776284). The random forest model has an AUC of 0.6437507 and accuracy of 0.6041103. Therefore, the gradient boosting model is the best-performing model among the three.

From the output of each model’s performance, we can see that the gradient boosting model has the greatest weight in the stacked ensemble. The random forest model has the second greatest weight, and the logistic regression model has the least weight.

The final stacked classification has an AUC of 0.6499224 and an accuracy of 0.6064606 on the test set. An AUC of about 0.65 means the probability that the model ranks a randomly chosen positive observation higher than a randomly chosen negative one is about 0.65; an accuracy of about 0.606 means that roughly 60.6% of the test-set predictions are correct. Both are slightly better than any of the individual models.

As for the stacking coefficients, the single member with the largest weight is a boosting model, followed by several random forest members; the logistic regression members receive almost no weight. Counting members, the random forest contributes the most members with nonzero weight, while the logistic regression contributes the fewest.

From the per-member accuracies above, the random forest members are consistently accurate (around 0.60), the best boosting member is comparable (0.605), and the logistic regression members have the lowest accuracy (0.578). The stacked ensemble’s accuracy (0.606) is slightly higher than that of any single member.

In conclusion, the stacked ensemble slightly outperforms every individual model, with an AUC of 0.6499224 and an accuracy of 0.6064606. Within the stack, the boosting model carries the single largest weight, the random forest members collectively carry the most weight, and the logistic regression members contribute almost nothing.

The most important features

What are the most important features in predicting long ICU stays?

# retrieve the most important features
logit_final_fit |> 
  extract_fit_engine() |>
  vip::vip() |>
  print()

rf_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()

gb_final_fit |> 
  extract_fit_parsnip() |>
  pluck("fit") |>
  vip::vip() |>
  print()

From the plots above, we can see that the most important features differ across models. For the logistic regression model, the most important features include the first care unit and the heart rate; for the random forest model, the systolic non-invasive blood pressure and the heart rate; and for the gradient boosting model, the temperature in Fahrenheit and the systolic non-invasive blood pressure.
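
To report the boosting model’s importance scores as numbers rather than a plot (a sketch; it assumes vip::vi() supports the extracted xgboost booster object):

# Top 10 importance scores from the fitted xgboost model
gb_final_fit |>
  extract_fit_engine() |>
  vip::vi() |>
  slice_head(n = 10)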

Performance and interpretability

How do the models compare in terms of performance and interpretability?

Overall, in terms of performance, the gradient boosting model performs best among the single models, followed closely by the random forest, with the logistic regression model performing worst; the stacked ensemble does slightly better still. In terms of interpretability, the logistic regression model is the easiest to interpret, since its coefficients directly quantify each predictor’s effect on the odds of a long stay, whereas the tree ensembles (random forest and gradient boosting) trade interpretability for predictive performance and must be interpreted through variable importance measures.
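
To illustrate the interpretability point, the elastic-net coefficients of the finalized logistic regression can be read off directly (a sketch; it assumes the tidy() method returns coefficients at the selected penalty).

# Largest (in absolute value) standardized coefficients of the enet logistic model
logit_final_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  arrange(desc(abs(estimate))) |>
  slice_head(n = 10)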