# Load libraries
library(gtsummary)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(haven)
library(recipes)
library(GGally)
library(ranger)
library(xgboost)
library(stacks)
library(yardstick)
library(purrr)
library(vip)
Biostat 203B Homework 5
Due Mar 22 @ 11:59PM
Predicting ICU duration
Using the ICU cohort mimiciv_icu_cohort.rds you built in Homework 4, develop at least three machine learning approaches (logistic regression with enet regularization, random forest, boosting, SVM, MLP, etc.) plus a model stacking approach for predicting whether a patient's ICU stay will be longer than 2 days. You should use the los_long variable as the outcome. Your algorithms can use patient demographic information (gender, age at ICU intime, marital status, race), ICU admission information (first care unit), the last lab measurements before the ICU stay, and the first vital measurements during the ICU stay as features. You are welcome to use any feature engineering techniques you think are appropriate, but make sure not to use features that are not available at an ICU stay's intime. For instance, last_careunit cannot be used in your algorithms.
Answer
- Data preprocessing and feature engineering.
First, load the data from Homework 4 and check for missing values. Then, examine the distribution of each variable. For categorical variables, I replace missing values with a new category "Unknown"; for continuous variables, I impute missing values with the median. Finally, I summarize the data to check the distribution of each variable.
# Load "mimiciv_icu_cohort.rds" data
mimic_icu_cohort <- read_rds("mimic_icu_cohort.rds") |>
  select(
    -last_careunit,
    -dod,
    -discharge_location,
    -hospital_expire_flag,
    -los
  )
# Check missing values of the data
colSums(is.na(mimic_icu_cohort))
subject_id hadm_id
0 0
stay_id first_careunit
0 0
admission_type admission_location
0 0
insurance language
0 0
marital_status race
5113 0
gender sodium
0 8872
chloride creatinine
8883 5770
potassium glucose
8901 9099
hematocrit white_blood_cell_count
5017 5094
bicarbonate temperature_in_Fahrenheit
9050 1365
diastolic_non_invasive_blood_pressure respiratory_rate
984 98
systolic_non_invasive_blood_pressure heart_rate
980 20
age_intime los_long
0 0
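The counts above are easier to judge as proportions of the 73,181 stays. A minimal sketch of that extra check (my addition, using the same cohort object as above):
colMeans(is.na(mimic_icu_cohort)) |>
  # proportion of missing values per column, largest first
  sort(decreasing = TRUE) |>
  round(3)
No variable is missing for more than roughly 12% of stays, so simple imputation is reasonable.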
# Define a vector with the names of the continuous variables
colnames <- c(
  "sodium", "chloride", "creatinine",
  "potassium", "glucose", "hematocrit",
  "white_blood_cell_count", "bicarbonate",
  "temperature_in_Fahrenheit",
  "diastolic_non_invasive_blood_pressure",
  "systolic_non_invasive_blood_pressure",
  "respiratory_rate", "heart_rate",
  "age_intime"
)
# Check the distribution of each continuous variable
# and I found that most of the continuous variables are right-skewed
# so I decide to use median to impute missing values later
hist_data <- lapply(mimic_icu_cohort[, colnames], function(x) {
  data.frame(value = x)
})
names(hist_data) <- colnames
hist_data <- bind_rows(hist_data, .id = "variable")
ggplot(hist_data, aes(x = value)) +
geom_density(kernel = "cosine", fill = "skyblue", alpha = 0.5) +
facet_wrap(~ variable, scales = "free") +
labs(x = "Value") +
theme_minimal()
Warning: Removed 64133 rows containing non-finite outside the scale range
(`stat_density()`).
# Deal with categorical variables
mimic_icu_cohort <- mimic_icu_cohort |>
  # For the categorical variable "marital_status",
  # I use a new category "Unknown" to handle missing values
  mutate(
    marital_status = ifelse(
      is.na(marital_status), "Unknown", marital_status)
  ) |>
  # Convert the categorical variables to factors
  # to make modeling easier
  mutate(marital_status = as.factor(marital_status)) |>
  mutate(los_long = as.factor(los_long))
# summarize the data
mimic_icu_cohort |>
  tbl_summary()
Characteristic | N = 73,181¹ |
---|---|
subject_id | 14,998,941 (12,491,257, 17,513,267) |
hadm_id | 24,969,636 (22,483,349, 27,471,796) |
stay_id | 34,993,892 (32,489,353, 37,488,401) |
first_careunit | |
Cardiac Vascular Intensive Care Unit (CVICU) | 11,582 (16%) |
Medical Intensive Care Unit (MICU) | 15,898 (22%) |
Medical/Surgical Intensive Care Unit (MICU/SICU) | 12,733 (17%) |
Surgical Intensive Care Unit (SICU) | 11,161 (15%) |
Other | 21,807 (30%) |
admission_type | |
EW EMER. | 38,672 (53%) |
OBSERVATION ADMIT | 8,974 (12%) |
SURGICAL SAME DAY ADMISSION | 7,373 (10%) |
URGENT | 12,453 (17%) |
Other | 5,709 (7.8%) |
admission_location | |
EMERGENCY ROOM | 35,881 (49%) |
PHYSICIAN REFERRAL | 16,398 (22%) |
TRANSFER FROM HOSPITAL | 15,798 (22%) |
Other | 5,104 (7.0%) |
insurance | |
Medicaid | 5,528 (7.6%) |
Medicare | 33,091 (45%) |
Other | 34,562 (47%) |
language | |
? | 7,443 (10%) |
ENGLISH | 65,738 (90%) |
marital_status | |
DIVORCED | 5,404 (7.4%) |
MARRIED | 32,768 (45%) |
SINGLE | 20,858 (29%) |
Unknown | 5,113 (7.0%) |
WIDOWED | 9,038 (12%) |
race | |
ASIAN | 2,155 (2.9%) |
BLACK | 7,960 (11%) |
HISPANIC | 2,741 (3.7%) |
Other | 10,756 (15%) |
WHITE | 49,569 (68%) |
gender | |
F | 32,363 (44%) |
M | 40,818 (56%) |
sodium | 139.0 (136.0, 141.0) |
Unknown | 8,872 |
chloride | 102 (98, 105) |
Unknown | 8,883 |
creatinine | 1.00 (0.80, 1.50) |
Unknown | 5,770 |
potassium | 4.20 (3.80, 4.60) |
Unknown | 8,901 |
glucose | 120 (99, 156) |
Unknown | 9,099 |
hematocrit | 35 (30, 40) |
Unknown | 5,017 |
white_blood_cell_count | 9.3 (6.8, 13.2) |
Unknown | 5,094 |
bicarbonate | 25.0 (22.0, 27.0) |
Unknown | 9,050 |
temperature_in_Fahrenheit | 98.20 (97.60, 98.70) |
Unknown | 1,365 |
diastolic_non_invasive_blood_pressure | 67 (57, 79) |
Unknown | 984 |
respiratory_rate | 18.0 (15.0, 22.0) |
Unknown | 98 |
systolic_non_invasive_blood_pressure | 121 (106, 139) |
Unknown | 980 |
heart_rate | 86 (74, 101) |
Unknown | 20 |
age_intime | 66 (54, 77) |
los_long | |
FALSE | 38,050 (52%) |
TRUE | 35,131 (48%) |
¹ Median (IQR); n (%) |
- Partition data into 50% training set and 50% test set. Stratify partitioning according to los_long. For grading purposes, sort the data by subject_id, hadm_id, and stay_id, and use the seed 203 for the initial data split. Below is the sample code.
- Partition data into 50% training set and 50% test set.
set.seed(203)
# sort the data by subject_id, hadm_id, and stay_id
mimic_icu_cohort <- mimic_icu_cohort |>
  arrange(subject_id, hadm_id, stay_id) |>
  select(-subject_id, -hadm_id, -stay_id)
# partition data into 50% training set and 50% test set
data_split <- initial_split(
  mimic_icu_cohort,
  # stratify by los_long
  strata = "los_long",
  prop = 0.5
)
data_split
<Training/Testing/Total>
<36590/36591/73181>
# check the training set
train_data <- training(data_split)
dim(train_data)
[1] 36590 23
# check the test set
test_data <- testing(data_split)
dim(test_data)
[1] 36591 23
- Pre-processing of data
# impute missing values
recipe <- recipe(los_long ~ ., data = train_data) |>
  # For continuous variables, I use the median to impute missing values
  step_impute_median(
    sodium, potassium,
    glucose, hematocrit,
    chloride, creatinine,
    white_blood_cell_count,
    bicarbonate,
    temperature_in_Fahrenheit,
    diastolic_non_invasive_blood_pressure,
    systolic_non_invasive_blood_pressure,
    respiratory_rate,
    heart_rate
  ) |>
  # For categorical variables, I create traditional dummy variables
  step_dummy(all_nominal_predictors()) |>
  # zero-variance filter to remove predictors with zero variance
  step_zv(all_numeric_predictors()) |>
  # center and scale numeric data
  # to have a mean of 0 and a standard deviation of 1
  # so that the model can be trained more efficiently
  step_normalize(all_numeric_predictors())
- Train and tune the models using the training set.
Logistic regression with elastic net regularization
First, define the model specification and the workflow for the logistic regression model. Then, tune the model using 5-fold cross-validation.
- Define the model specification
# define the recipe
logit_recipe <- recipe
# define the model specification:
# use the glmnet package to do classification
logit_model <- logistic_reg(
  mode = "classification",
  penalty = tune(),
  mixture = tune()) |>
  # set the engine to glmnet
  set_engine("glmnet", standardize = FALSE) |>
  print()
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = tune()
Engine-Specific Arguments:
standardize = FALSE
Computational engine: glmnet
- define the workflow
logit_workflow <- workflow() |>
  add_recipe(logit_recipe) |>
  add_model(logit_model) |>
  print()
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = tune()
Engine-Specific Arguments:
standardize = FALSE
Computational engine: glmnet
- tuning grid
# I tune the model over a regular grid of penalty and mixture values
logit_param_grid <- grid_regular(
  penalty(range = c(-4, 1)),
  mixture(),
  levels = c(50, 3)
) |>
  print()
# A tibble: 150 × 2
penalty mixture
<dbl> <dbl>
1 0.0001 0
2 0.000126 0
3 0.000160 0
4 0.000202 0
5 0.000256 0
6 0.000324 0
7 0.000409 0
8 0.000518 0
9 0.000655 0
10 0.000829 0
# ℹ 140 more rows
- cross-validation
Define the cross-validation folds
set.seed(203)
# use 5 folds for cross-validation
logit_folds <- vfold_cv(train_data, v = 5)
logit_folds
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [29272/7318]> Fold1
2 <split [29272/7318]> Fold2
3 <split [29272/7318]> Fold3
4 <split [29272/7318]> Fold4
5 <split [29272/7318]> Fold5
Fit cross-validated models
if (file.exists("logit_fit.rds")) {
<- read_rds("logit_fit.rds")
logit_fit
logit_fit
else {
} <- logit_workflow |>
(logit_fit tune_grid(
resamples = logit_folds,
grid = logit_param_grid,
# use roc_auc and accuracy as the evaluation metrics
metrics = metric_set(roc_auc, accuracy),
# prepare for later stacking
control = control_stack_grid()
|>
)) system.time()
|>
logit_fit write_rds("logit_fit.rds")
logit_fit }
# Tuning results
# 5-fold cross-validation
# A tibble: 5 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [29272/7318]> Fold1 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>
2 <split [29272/7318]> Fold2 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>
3 <split [29272/7318]> Fold3 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>
4 <split [29272/7318]> Fold4 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>
5 <split [29272/7318]> Fold5 <tibble [300 × 6]> <tibble [0 × 3]> <tibble>
- visualize CV results and select the best model
logit_fit |>
  # aggregate metrics from 5-fold cross-validation
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(x = penalty, y = mean,
                       color = factor(mixture))) +
  geom_point() +
  labs(x = "Penalty", y = "CV AUC") +
  scale_x_log10()
# A tibble: 300 × 8
penalty mixture .metric .estimator mean n std_err
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 0.0001 0 accuracy binary 0.581 5 0.00310
2 0.0001 0 roc_auc binary 0.615 5 0.00264
3 0.000126 0 accuracy binary 0.581 5 0.00310
4 0.000126 0 roc_auc binary 0.615 5 0.00264
5 0.000160 0 accuracy binary 0.581 5 0.00310
6 0.000160 0 roc_auc binary 0.615 5 0.00264
7 0.000202 0 accuracy binary 0.581 5 0.00310
8 0.000202 0 roc_auc binary 0.615 5 0.00264
9 0.000256 0 accuracy binary 0.581 5 0.00310
10 0.000256 0 roc_auc binary 0.615 5 0.00264
.config
<chr>
1 Preprocessor1_Model001
2 Preprocessor1_Model001
3 Preprocessor1_Model002
4 Preprocessor1_Model002
5 Preprocessor1_Model003
6 Preprocessor1_Model003
7 Preprocessor1_Model004
8 Preprocessor1_Model004
9 Preprocessor1_Model005
10 Preprocessor1_Model005
# ℹ 290 more rows
show top 5 models
logit_fit |>
  show_best("roc_auc")
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00543 0 roc_auc binary 0.615 5 0.00265 Preprocessor1_Model018
2 0.00687 0 roc_auc binary 0.615 5 0.00266 Preprocessor1_Model019
3 0.000655 1 roc_auc binary 0.615 5 0.00270 Preprocessor1_Model109
4 0.000829 0.5 roc_auc binary 0.615 5 0.00268 Preprocessor1_Model060
5 0.000518 1 roc_auc binary 0.615 5 0.00269 Preprocessor1_Model108
select the best model
best_logit <- logit_fit |>
  select_best("roc_auc")
best_logit
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.00543 0 Preprocessor1_Model018
- finalize the model
Set the final workflow
# Final workflow
logit_final_workflow <- logit_workflow |>
  finalize_workflow(best_logit)
logit_final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = 0.00542867543932386
mixture = 0
Engine-Specific Arguments:
standardize = FALSE
Computational engine: glmnet
Fit on the whole training set, then predict on the test set
logit_final_fit <- logit_final_workflow |>
  last_fit(data_split)
logit_final_fit
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [36590/36591]> train/test sp… <tibble> <tibble> <tibble> <workflow>
Output the test set predictions
# Test metrics
logit_final_fit |>
  collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.578 Preprocessor1_Model1
2 roc_auc binary 0.606 Preprocessor1_Model1
Plot the variable importance
# plot the variable importance
logit_final_fit |>
  extract_fit_parsnip() |>
  vip::vip() |>
  print()
- Interpret the model
First, the AUC of the final logistic regression model on the test set is 0.6064933. The AUC measures the model's ability to distinguish between the two classes: it is the probability that the model ranks a randomly chosen positive observation higher than a randomly chosen negative one, ranging from 0 to 1, with 0.5 indicating random guessing and 1 indicating perfect classification. An AUC of about 0.61 is therefore not very high. The accuracy of the final logistic regression model is 0.5777924, meaning that about 57.8% of the test-set predictions are correct.
The variable importance plot shows the top 10 predictors; the most important one is the first care unit.
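As a complement to the single AUC value, the ROC curve behind it can be plotted from the test-set predictions collected by last_fit(). A minimal sketch, assuming the yardstick and ggplot2 helpers already loaded above:
logit_final_fit |>
  collect_predictions() |>
  # FALSE is the first factor level, matching the roc_auc() call used later
  roc_curve(truth = los_long, .pred_FALSE) |>
  autoplot()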
Random forest
First, define the model specification and the workflow for the random forest model. Then, tune the model using 5-fold cross-validation.
- define the model specification
# define the recipe
rf_recipe <- recipe
- define random forest model
rf_model <- rand_forest(
  mode = "classification",
  # Number of predictors randomly sampled in each split
  mtry = tune(),
  # Number of trees in ensemble
  trees = tune()
) |>
  # use the ranger package to do classification
  set_engine("ranger", importance = "impurity")
rf_model
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
- define the workflow
rf_workflow <- workflow() |>
  add_recipe(rf_recipe) |>
  add_model(rf_model) |>
  print()
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
- tuning grid
# define the tuning grid
# try 5 different values for trees and 3 different values for mtry
rf_param_grid <- grid_regular(
  trees(range = c(200L, 800L)),
  mtry(range = c(2L, 5L)),
  levels = c(5, 3)
)
rf_param_grid
# A tibble: 15 × 2
trees mtry
<int> <int>
1 200 2
2 350 2
3 500 2
4 650 2
5 800 2
6 200 3
7 350 3
8 500 3
9 650 3
10 800 3
11 200 5
12 350 5
13 500 5
14 650 5
15 800 5
- cross-validation
Define the cross-validation folds
set.seed(203)
# use 5 folds for cross-validation
rf_folds <- vfold_cv(train_data, v = 5)
rf_folds
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [29272/7318]> Fold1
2 <split [29272/7318]> Fold2
3 <split [29272/7318]> Fold3
4 <split [29272/7318]> Fold4
5 <split [29272/7318]> Fold5
Fit cross-validated models
if (file.exists("rf_fit.rds")) {
<- read_rds("rf_fit.rds")
rf_fit
rf_fit
else {
} <- rf_workflow |>
(rf_fit tune_grid(
resamples = rf_folds,
grid = rf_param_grid,
# use roc_auc and accuracy as the evaluation metrics
metrics = metric_set(roc_auc, accuracy),
# prepare for later stacking
control = control_stack_grid()
|>
)) system.time()
|>
rf_fit write_rds("rf_fit.rds")
rf_fit }
# Tuning results
# 5-fold cross-validation
# A tibble: 5 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [29272/7318]> Fold1 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
2 <split [29272/7318]> Fold2 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
3 <split [29272/7318]> Fold3 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
4 <split [29272/7318]> Fold4 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
5 <split [29272/7318]> Fold5 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
- visualize CV results and select the best model
rf_fit |>
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(x = trees, y = mean,
                       color = factor(mtry))) +
  geom_point() +
  labs(x = "Num. of Trees", y = "CV AUC")
# A tibble: 30 × 8
mtry trees .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2 200 accuracy binary 0.600 5 0.00405 Preprocessor1_Model01
2 2 200 roc_auc binary 0.640 5 0.00245 Preprocessor1_Model01
3 2 350 accuracy binary 0.601 5 0.00225 Preprocessor1_Model02
4 2 350 roc_auc binary 0.643 5 0.00213 Preprocessor1_Model02
5 2 500 accuracy binary 0.601 5 0.00328 Preprocessor1_Model03
6 2 500 roc_auc binary 0.644 5 0.00266 Preprocessor1_Model03
7 2 650 accuracy binary 0.600 5 0.00200 Preprocessor1_Model04
8 2 650 roc_auc binary 0.644 5 0.00231 Preprocessor1_Model04
9 2 800 accuracy binary 0.602 5 0.00342 Preprocessor1_Model05
10 2 800 roc_auc binary 0.644 5 0.00262 Preprocessor1_Model05
# ℹ 20 more rows
show top 5 models
rf_fit |>
  show_best("roc_auc")
# A tibble: 5 × 8
mtry trees .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 3 650 roc_auc binary 0.645 5 0.00252 Preprocessor1_Model09
2 3 800 roc_auc binary 0.645 5 0.00257 Preprocessor1_Model10
3 2 800 roc_auc binary 0.644 5 0.00262 Preprocessor1_Model05
4 5 800 roc_auc binary 0.644 5 0.00297 Preprocessor1_Model15
5 2 500 roc_auc binary 0.644 5 0.00266 Preprocessor1_Model03
select the best model
best_rf <- rf_fit |>
  select_best("roc_auc")
best_rf
# A tibble: 1 × 3
mtry trees .config
<int> <int> <chr>
1 3 650 Preprocessor1_Model09
- finalize the model
Set the final workflow
rf_final_workflow <- rf_workflow |>
  finalize_workflow(best_rf)
rf_final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 3
trees = 650
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
Fit on the whole training set, then predict on the test set
if (file.exists("rf_final_fit.rds")) {
<- read_rds("rf_final_fit.rds")
rf_final_fit
rf_final_fit
else {
}
# fit the final model on the whole training set
<- rf_final_workflow |>
rf_final_fit last_fit(data_split)
|>
rf_final_fit write_rds("rf_final_fit.rds")
rf_final_fit }
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [36590/36591]> train/test sp… <tibble> <tibble> <tibble> <workflow>
Output the test set predictions
rf_final_fit |>
  collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.604 Preprocessor1_Model1
2 roc_auc binary 0.644 Preprocessor1_Model1
Predictor importance
# plot the variable importance
rf_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()
- Interpret the model
First, the AUC of the final random forest model on the test set is 0.6437507, i.e., the probability that the model ranks a randomly chosen positive observation higher than a randomly chosen negative one is about 64.4%. This is higher than the AUC of the final logistic regression model. The accuracy of the final random forest model is 0.6041103, meaning that about 60.4% of the test-set predictions are correct.
The variable importance plot shows the top 10 predictors; the most important one is the systolic non-invasive blood pressure.
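As an extra sanity check (my addition, not part of the tuned workflow), the fitted ranger object also stores an out-of-bag prediction error, which should roughly agree with one minus the test accuracy; a minimal sketch:
rf_final_fit |>
  extract_fit_engine() |>
  # OOB misclassification rate reported by ranger
  pluck("prediction.error")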
Boosting (XGBoost)
First, define the model specification and the workflow for the boosting model. Then, tune the model using 5-fold cross-validation.
- define the model specification
# define the recipe
gb_recipe <- recipe
- define boosting model
gb_model <- boost_tree(
  mode = "classification",
  # Number of trees in ensemble
  trees = 1000,
  # Shrinkage factor
  learn_rate = tune(),
  # Maximum depth of trees
  tree_depth = tune()
) |>
  # use the xgboost package to do classification
  set_engine("xgboost")
gb_model
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 1000
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
- define the workflow
gb_workflow <- workflow() |>
  add_recipe(gb_recipe) |>
  add_model(gb_model)
gb_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 1000
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
- tuning grid
# set the tuning grid
# try 3 different values for tree_depth,
# 5 different values for learn_rate
gb_param_grid <- grid_regular(
  tree_depth(range = c(1L, 3L)),
  learn_rate(range = c(-5, 2), trans = log10_trans()),
  levels = c(3, 5)
)
gb_param_grid
# A tibble: 15 × 2
tree_depth learn_rate
<int> <dbl>
1 1 0.00001
2 2 0.00001
3 3 0.00001
4 1 0.000562
5 2 0.000562
6 3 0.000562
7 1 0.0316
8 2 0.0316
9 3 0.0316
10 1 1.78
11 2 1.78
12 3 1.78
13 1 100
14 2 100
15 3 100
- cross-validation
Set the cross-validation partitions
set.seed(203)
# use 5 folds for cross-validation
gb_folds <- vfold_cv(train_data, v = 5)
gb_folds
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [29272/7318]> Fold1
2 <split [29272/7318]> Fold2
3 <split [29272/7318]> Fold3
4 <split [29272/7318]> Fold4
5 <split [29272/7318]> Fold5
Fit cross-validated models
if (file.exists("gb_fit.rds")) {
<- read_rds("gb_fit.rds")
gb_fit
gb_fit
else {
} <- gb_workflow |>
(gb_fit tune_grid(
resamples = gb_folds,
grid = gb_param_grid,
# use roc_auc and accuracy as the evaluation metrics
metrics = metric_set(roc_auc, accuracy),
# prepare for later stacking
control = control_stack_grid()
|>
)) system.time()
|>
gb_fit write_rds("gb_fit.rds")
gb_fit }
# Tuning results
# 5-fold cross-validation
# A tibble: 5 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [29272/7318]> Fold1 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
2 <split [29272/7318]> Fold2 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
3 <split [29272/7318]> Fold3 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
4 <split [29272/7318]> Fold4 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
5 <split [29272/7318]> Fold5 <tibble [30 × 6]> <tibble [0 × 3]> <tibble>
- visualize CV results and select the best model
gb_fit |>
  collect_metrics() |>
  print(width = Inf) |>
  filter(.metric == "roc_auc") |>
  ggplot(mapping = aes(x = learn_rate, y = mean,
                       color = factor(tree_depth))) +
  geom_point() +
  labs(x = "Learning Rate", y = "CV AUC") +
  scale_x_log10()
# A tibble: 30 × 8
tree_depth learn_rate .metric .estimator mean n std_err
<int> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 1 0.00001 accuracy binary 0.550 5 0.00338
2 1 0.00001 roc_auc binary 0.539 5 0.00172
3 2 0.00001 accuracy binary 0.557 5 0.00236
4 2 0.00001 roc_auc binary 0.562 5 0.00347
5 3 0.00001 accuracy binary 0.561 5 0.00386
6 3 0.00001 roc_auc binary 0.574 5 0.00543
7 1 0.000562 accuracy binary 0.554 5 0.00354
8 1 0.000562 roc_auc binary 0.577 5 0.00308
9 2 0.000562 accuracy binary 0.562 5 0.00367
10 2 0.000562 roc_auc binary 0.592 5 0.00322
.config
<chr>
1 Preprocessor1_Model01
2 Preprocessor1_Model01
3 Preprocessor1_Model02
4 Preprocessor1_Model02
5 Preprocessor1_Model03
6 Preprocessor1_Model03
7 Preprocessor1_Model04
8 Preprocessor1_Model04
9 Preprocessor1_Model05
10 Preprocessor1_Model05
# ℹ 20 more rows
show top 5 models
gb_fit |>
  show_best("roc_auc")
# A tibble: 5 × 8
tree_depth learn_rate .metric .estimator mean n std_err .config
<int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 3 0.0316 roc_auc binary 0.645 5 0.00320 Preprocessor1_Mo…
2 2 0.0316 roc_auc binary 0.642 5 0.00351 Preprocessor1_Mo…
3 1 0.0316 roc_auc binary 0.633 5 0.00306 Preprocessor1_Mo…
4 1 1.78 roc_auc binary 0.619 5 0.00241 Preprocessor1_Mo…
5 3 0.000562 roc_auc binary 0.602 5 0.00358 Preprocessor1_Mo…
select the best model
best_gb <- gb_fit |>
  select_best("roc_auc")
best_gb
# A tibble: 1 × 3
tree_depth learn_rate .config
<int> <dbl> <chr>
1 3 0.0316 Preprocessor1_Model09
- finalize the model
gb_final_workflow <- gb_workflow |>
  finalize_workflow(best_gb)
gb_final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_median()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 1000
tree_depth = 3
learn_rate = 0.0316227766016838
Computational engine: xgboost
Fit on the whole training set, then predict on the test set
if (file.exists("gb_final_fit.rds")) {
<- read_rds("gb_final_fit.rds")
gb_final_fit
gb_final_fit
else {
}
# fit the final model on the whole training set
<- gb_final_workflow |>
gb_final_fit last_fit(data_split)
|>
gb_final_fit write_rds("gb_final_fit.rds")
gb_final_fit }
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [36590/36591]> train/test sp… <tibble> <tibble> <tibble> <workflow>
Output the test set predictions
gb_final_fit |>
  collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.605 Preprocessor1_Model1
2 roc_auc binary 0.645 Preprocessor1_Model1
Predictor importance
# plot the variable importance
gb_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()
- Interpret the model
First, the AUC of the final gradient boosting model on the test set is 0.6449823, i.e., the probability that the model ranks a randomly chosen positive observation higher than a randomly chosen negative one is about 64.5%. This is slightly higher than the AUC of the final random forest model. The accuracy of the final gradient boosting model is 0.6045476, meaning that about 60.5% of the test-set predictions are correct.
Second, the variable importance plot shows the top 10 predictors; the most important one is the temperature in Fahrenheit, followed by the systolic non-invasive blood pressure, which is consistent with the random forest model.
- Compare model classification performance on the test set. Report both the area under ROC curve and accuracy for each machine learning algorithm and the model stacking. Interpret the results. What are the most important features in predicting long ICU stays? How do the models compare in terms of performance and interpretability?
Model comparison
I used three machine learning algorithms: logistic regression, gradient boosting, and random forest. Here, I compare their classification performance on the test set using the area under the ROC curve (AUC) and accuracy.
# collect the metrics of the three models
logit_final_fit |> collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.578 Preprocessor1_Model1
2 roc_auc binary 0.606 Preprocessor1_Model1
rf_final_fit |> collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.604 Preprocessor1_Model1
2 roc_auc binary 0.644 Preprocessor1_Model1
gb_final_fit |> collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.605 Preprocessor1_Model1
2 roc_auc binary 0.645 Preprocessor1_Model1
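For a side-by-side view, the same three metric tables can be combined into one tibble; a minimal sketch (the model labels are ones I add here):
bind_rows(
  logit_final_fit |> collect_metrics() |> mutate(model = "logistic (enet)"),
  rf_final_fit |> collect_metrics() |> mutate(model = "random forest"),
  gb_final_fit |> collect_metrics() |> mutate(model = "xgboost")
) |>
  select(model, .metric, .estimate) |>
  # one row per model, one column per metric
  pivot_wider(names_from = .metric, values_from = .estimate)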
From the output, we can see that the gradient boosting model has the highest AUC (0.6449823) and accuracy (0.6045476) on the test set. The logistic regression model has the lowest AUC (0.6065411) and accuracy (0.5776284). The random forest model has an AUC of 0.6437507 and accuracy of 0.6041103. Therefore, the gradient boosting model is the best-performing model among the three.
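The same comparison can also be shown as overlaid test-set ROC curves; a minimal sketch using the predictions collected by last_fit():
bind_rows(
  logit_final_fit |> collect_predictions() |> mutate(model = "logistic (enet)"),
  rf_final_fit |> collect_predictions() |> mutate(model = "random forest"),
  gb_final_fit |> collect_predictions() |> mutate(model = "xgboost")
) |>
  # one ROC curve per model
  group_by(model) |>
  roc_curve(truth = los_long, .pred_FALSE) |>
  autoplot()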
Model stacking
Build the stacked ensemble.
if (file.exists("stacks.rds")) {
<- read_rds("stacks.rds")
model_st
model_stelse {
} <-
model_st stacks() |>
# add candidate models
add_candidates(logit_fit) |>
add_candidates(gb_fit) |>
add_candidates(rf_fit) |>
# determine how to combine their predictions
blend_predictions(
penalty = 10^(-3:1),
metrics = c("roc_auc"),
# set the number of resamples to 3 to reduce computation time
times = 3) |>
# fit the candidates with nonzero stacking coefficients
fit_members()
|> write_rds("stacks.rds")
model_st
model_st }
# A tibble: 10 × 3
member type weight
<chr> <chr> <dbl>
1 .pred_TRUE_gb_fit_1_09 boost_tree 1.80
2 .pred_TRUE_rf_fit_1_09 rand_forest 0.691
3 .pred_TRUE_rf_fit_1_07 rand_forest 0.590
4 .pred_TRUE_rf_fit_1_02 rand_forest 0.468
5 .pred_TRUE_rf_fit_1_15 rand_forest 0.443
6 .pred_TRUE_rf_fit_1_06 rand_forest 0.427
7 .pred_TRUE_rf_fit_1_13 rand_forest 0.358
8 .pred_TRUE_rf_fit_1_03 rand_forest 0.299
9 .pred_TRUE_rf_fit_1_10 rand_forest 0.163
10 .pred_TRUE_gb_fit_1_10 boost_tree 0.116
Plot the stacked results
autoplot(model_st)
To show the relationship more directly
autoplot(model_st, type = "members")
To see the top models
autoplot(model_st, type = "weights")
To identify coefficients of random forest model
collect_parameters(model_st, "rf_fit")
# A tibble: 15 × 5
member mtry trees terms coef
<chr> <int> <int> <chr> <dbl>
1 rf_fit_1_01 2 200 .pred_TRUE_rf_fit_1_01 0
2 rf_fit_1_02 2 350 .pred_TRUE_rf_fit_1_02 0.468
3 rf_fit_1_03 2 500 .pred_TRUE_rf_fit_1_03 0.299
4 rf_fit_1_04 2 650 .pred_TRUE_rf_fit_1_04 0
5 rf_fit_1_05 2 800 .pred_TRUE_rf_fit_1_05 0
6 rf_fit_1_06 3 200 .pred_TRUE_rf_fit_1_06 0.427
7 rf_fit_1_07 3 350 .pred_TRUE_rf_fit_1_07 0.590
8 rf_fit_1_08 3 500 .pred_TRUE_rf_fit_1_08 0
9 rf_fit_1_09 3 650 .pred_TRUE_rf_fit_1_09 0.691
10 rf_fit_1_10 3 800 .pred_TRUE_rf_fit_1_10 0.163
11 rf_fit_1_11 5 200 .pred_TRUE_rf_fit_1_11 0
12 rf_fit_1_12 5 350 .pred_TRUE_rf_fit_1_12 0
13 rf_fit_1_13 5 500 .pred_TRUE_rf_fit_1_13 0.358
14 rf_fit_1_14 5 650 .pred_TRUE_rf_fit_1_14 0
15 rf_fit_1_15 5 800 .pred_TRUE_rf_fit_1_15 0.443
To identify coefficients of gradient boosting model
collect_parameters(model_st, "gb_fit")
# A tibble: 15 × 5
member tree_depth learn_rate terms coef
<chr> <int> <dbl> <chr> <dbl>
1 gb_fit_1_01 1 0.00001 .pred_TRUE_gb_fit_1_01 0
2 gb_fit_1_02 2 0.00001 .pred_TRUE_gb_fit_1_02 0
3 gb_fit_1_03 3 0.00001 .pred_TRUE_gb_fit_1_03 0
4 gb_fit_1_04 1 0.000562 .pred_TRUE_gb_fit_1_04 0
5 gb_fit_1_05 2 0.000562 .pred_TRUE_gb_fit_1_05 0
6 gb_fit_1_06 3 0.000562 .pred_TRUE_gb_fit_1_06 0
7 gb_fit_1_07 1 0.0316 .pred_TRUE_gb_fit_1_07 0
8 gb_fit_1_08 2 0.0316 .pred_TRUE_gb_fit_1_08 0
9 gb_fit_1_09 3 0.0316 .pred_TRUE_gb_fit_1_09 1.80
10 gb_fit_1_10 1 1.78 .pred_TRUE_gb_fit_1_10 0.116
11 gb_fit_1_11 2 1.78 .pred_TRUE_gb_fit_1_11 0.0455
12 gb_fit_1_12 3 1.78 .pred_TRUE_gb_fit_1_12 0.0701
13 gb_fit_1_13 1 100 .pred_TRUE_gb_fit_1_13 0
14 gb_fit_1_14 2 100 .pred_TRUE_gb_fit_1_14 0
15 gb_fit_1_15 3 100 .pred_TRUE_gb_fit_1_15 0.0138
To identify coefficients of logistic regression model
collect_parameters(model_st, "logit_fit")
# A tibble: 81 × 5
member penalty mixture terms coef
<chr> <dbl> <dbl> <chr> <dbl>
1 logit_fit_1_001 0.0001 0 .pred_TRUE_logit_fit_1_001 0.00213
2 logit_fit_1_018 0.00543 0 .pred_TRUE_logit_fit_1_018 0
3 logit_fit_1_019 0.00687 0 .pred_TRUE_logit_fit_1_019 0
4 logit_fit_1_020 0.00869 0 .pred_TRUE_logit_fit_1_020 0
5 logit_fit_1_021 0.0110 0 .pred_TRUE_logit_fit_1_021 0
6 logit_fit_1_022 0.0139 0 .pred_TRUE_logit_fit_1_022 0
7 logit_fit_1_023 0.0176 0 .pred_TRUE_logit_fit_1_023 0
8 logit_fit_1_024 0.0222 0 .pred_TRUE_logit_fit_1_024 0
9 logit_fit_1_025 0.0281 0 .pred_TRUE_logit_fit_1_025 0
10 logit_fit_1_026 0.0356 0 .pred_TRUE_logit_fit_1_026 0
# ℹ 71 more rows
Final classification
Apply the model on the test data and output the final classification
if (file.exists("final_classification.rds")) {
<- read_rds("final_classification.rds")
final_classification
final_classification
else {
} <- test_data |>
final_classification
# apply the model on the test data
# with type = "prob" to get the predicted probabilities
bind_cols(predict(model_st, test_data, type = "prob")) |>
print(width = Inf)
final_classification
|>
final_classification write_rds("final_classification.rds")
}
# A tibble: 36,591 × 25
first_careunit admission_type admission_location insurance language
<fct> <fct> <fct> <chr> <chr>
1 Medical Intensive Care … EW EMER. EMERGENCY ROOM Medicaid ENGLISH
2 Surgical Intensive Care… EW EMER. EMERGENCY ROOM Other ?
3 Surgical Intensive Care… Other PHYSICIAN REFERRAL Other ?
4 Medical/Surgical Intens… EW EMER. Other Other ENGLISH
5 Cardiac Vascular Intens… SURGICAL SAME… PHYSICIAN REFERRAL Medicare ENGLISH
6 Other EW EMER. Other Other ENGLISH
7 Medical Intensive Care … EW EMER. EMERGENCY ROOM Medicare ENGLISH
8 Medical Intensive Care … EW EMER. EMERGENCY ROOM Medicare ENGLISH
9 Other URGENT TRANSFER FROM HOS… Medicare ENGLISH
10 Other URGENT TRANSFER FROM HOS… Medicare ENGLISH
# ℹ 36,581 more rows
# ℹ 20 more variables: marital_status <fct>, race <chr>, gender <chr>,
# sodium <dbl>, chloride <dbl>, creatinine <dbl>, potassium <dbl>,
# glucose <dbl>, hematocrit <dbl>, white_blood_cell_count <dbl>,
# bicarbonate <dbl>, temperature_in_Fahrenheit <dbl>,
# diastolic_non_invasive_blood_pressure <dbl>, respiratory_rate <dbl>,
# systolic_non_invasive_blood_pressure <dbl>, heart_rate <dbl>, …
Compute the ROC AUC and accuracy of the final classification
# compute the ROC AUC of the final classification
yardstick::roc_auc(
  final_classification,
  truth = los_long,
  contains(".pred_FALSE")
)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.650
# compute the accuracy of the final classification
# First, convert the predicted probabilities to
# predicted classes
final_classification <- final_classification |>
  mutate(.pred_class = as.factor(
    ifelse(.pred_TRUE > .pred_FALSE, "TRUE", "FALSE")))
# Then, compute the accuracy
yardstick::accuracy(
  final_classification,
  truth = los_long,
  estimate = .pred_class
)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.606
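Beyond a single accuracy number, a confusion matrix of the stacked predictions shows how the errors split between the two classes. A minimal sketch, using the .pred_class column created above:
yardstick::conf_mat(
  final_classification,
  truth = los_long,
  estimate = .pred_class
)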
Use the members argument to generate predictions from each individual member model
if (file.exists("class_predictions.rds")) {
<- read_rds("class_predictions.rds")
class_predictions
class_predictionselse {
}
<- test_data |>
class_predictions select(los_long) |>
# apply the model on the test data
bind_cols(predict(model_st,
test_data, type = "class",
members = TRUE))
class_predictions
|>
class_predictions write_rds("class_predictions.rds")
}
# A tibble: 36,591 × 17
los_long .pred_class .pred_class_logit_fit_1_001 .pred_class_logit_fit_1_051
<fct> <fct> <fct> <fct>
1 FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE
6 TRUE FALSE FALSE FALSE
7 TRUE TRUE TRUE TRUE
8 TRUE FALSE FALSE FALSE
9 TRUE TRUE TRUE TRUE
10 TRUE TRUE TRUE TRUE
# ℹ 36,581 more rows
# ℹ 13 more variables: .pred_class_gb_fit_1_10 <fct>,
# .pred_class_gb_fit_1_11 <fct>, .pred_class_gb_fit_1_09 <fct>,
# .pred_class_gb_fit_1_12 <fct>, .pred_class_gb_fit_1_15 <fct>,
# .pred_class_rf_fit_1_02 <fct>, .pred_class_rf_fit_1_03 <fct>,
# .pred_class_rf_fit_1_06 <fct>, .pred_class_rf_fit_1_07 <fct>,
# .pred_class_rf_fit_1_09 <fct>, .pred_class_rf_fit_1_10 <fct>, …
Get the value of each model’s performance
# set the factor columns to character for comparison
class_predictions <- class_predictions |>
  mutate(across(where(is.factor), as.character))
# compute the accuracy of each member's predicted classes
map(colnames(class_predictions),
    ~ mean(class_predictions$los_long ==
             pull(class_predictions, .x))) |>
  set_names(colnames(class_predictions)) |>
  as_tibble() |>
  pivot_longer(c(everything(), -los_long))
# A tibble: 16 × 3
los_long name value
<dbl> <chr> <dbl>
1 1 .pred_class 0.606
2 1 .pred_class_logit_fit_1_001 0.578
3 1 .pred_class_logit_fit_1_051 0.578
4 1 .pred_class_gb_fit_1_10 0.588
5 1 .pred_class_gb_fit_1_11 0.574
6 1 .pred_class_gb_fit_1_09 0.605
7 1 .pred_class_gb_fit_1_12 0.553
8 1 .pred_class_gb_fit_1_15 0.444
9 1 .pred_class_rf_fit_1_02 0.602
10 1 .pred_class_rf_fit_1_03 0.602
11 1 .pred_class_rf_fit_1_06 0.602
12 1 .pred_class_rf_fit_1_07 0.605
13 1 .pred_class_rf_fit_1_09 0.604
14 1 .pred_class_rf_fit_1_10 0.605
15 1 .pred_class_rf_fit_1_13 0.603
16 1 .pred_class_rf_fit_1_15 0.604
Report AUC and accuracy and interpret the results
Among the individual models, the gradient boosting model has the highest AUC (0.6449823) and accuracy (0.6045476) on the test set, the random forest model is close behind (AUC 0.6437507, accuracy 0.6041103), and the logistic regression model performs worst (AUC 0.6065411, accuracy 0.5776284).
In the stacked ensemble, a gradient boosting member receives the single largest stacking coefficient, the random forest contributes the largest number of members with nonzero coefficients, and the logistic regression candidates receive essentially no weight.
The stacked ensemble achieves an AUC of 0.6499224 and an accuracy of 0.6064606 on the test set, slightly better than any single model. The AUC means that the probability of the ensemble ranking a randomly chosen positive observation higher than a randomly chosen negative observation is about 65.0%, and the accuracy means that about 60.6% of the test-set predictions are correct.
Looking at the per-member accuracies above, the stacked prediction (.pred_class, 0.606) edges out the best individual members: the strongest gradient boosting and random forest members reach about 0.605, while the logistic regression members sit around 0.578.
The most important features
What are the most important features in predicting long ICU stays?
# retrieve the most important features
logit_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()
rf_final_fit |>
  extract_fit_engine() |>
  vip::vip() |>
  print()
gb_final_fit |>
  extract_fit_parsnip() |>
  pluck("fit") |>
  vip::vip() |>
  print()
From the plots above, the most important features differ across models. For the logistic regression model, the most important features are the first care unit, heart rate, etc. For the random forest model, they are the systolic non-invasive blood pressure, heart rate, etc. For the gradient boosting model, they are the temperature in Fahrenheit, the systolic non-invasive blood pressure, etc.
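To compare the top features numerically rather than only visually, the importance scores can also be pulled as a tibble. A minimal sketch for the random forest, using vip::vi() on the same ranger fit as the plots above:
rf_final_fit |>
  extract_fit_engine() |>
  vip::vi() |>
  # ten largest impurity-based importance scores
  slice_head(n = 10)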
Performance and interpretability
How do the models compare in terms of performance and interpretability?
Overall, the tree-based models perform best: gradient boosting and random forest reach test AUCs of about 0.64, with gradient boosting slightly ahead and the stacked ensemble slightly better still, while logistic regression trails at about 0.61. In terms of interpretability the ordering is reversed: the logistic regression model is the easiest to interpret through its coefficients, the random forest and gradient boosting models are harder to interpret and rely on variable importance measures, and the stacked ensemble, though the best performer, is the least interpretable.
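To illustrate the interpretability point, the elastic net coefficients (on the standardized predictors) can be read off directly from the fitted model. A minimal sketch, assuming the tidy() method that tidymodels provides for glmnet fits:
logit_final_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  # largest standardized coefficients in absolute value
  arrange(desc(abs(estimate))) |>
  slice_head(n = 10)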