Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Jiachen Ai. UID: 206182615

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.1       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.25    knitr_1.44        jsonlite_1.8.7    xfun_0.40        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.22    

Load necessary libraries (you can add more as needed).

library(arrow)
library(gtsummary)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(DescTools)

Display your machine memory.

memuse::Sys.meminfo()
<<<<<<< HEAD
Totalram:   8.000 GiB 
Freeram:   97.000 MiB 
=======
Totalram:    8.000 GiB 
Freeram:   158.344 MiB 
>>>>>>> develop

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

Step I: getting the demographic information

(1) get to know the structure of patients.csv.gz and admissions.csv.gz by zcat and head.

zcat < ~/mimic/hosp/patients.csv.gz | head
zcat < ~/mimic/hosp/admissions.csv.gz | head

(2) read the patients.csv.gz and admissions.csv.gz by read_csv and filter the patient with subject_id 10013310. And then get the demographic information of the patient. Here I found that the race column in admissions.csv.gz is in upper case which is different from the example. So, I lower case the race column.

race <- read_csv("~/mimic/hosp/admissions.csv.gz") %>%
  filter(subject_id == 10013310) %>%
  distinct(race)

demographics <- read_csv("~/mimic/hosp/patients.csv.gz") %>%
  filter(subject_id == 10013310) %>%
  
  # Lower case the race column
  mutate(race = tolower(race$race))  

Step II: getting the top 3 diagnoses of the patient

(1) get to know the structure of diagnoses_icd.csv.gz and d_icd_diagnoses.csv.gz by zcat and head.

zcat < ~/mimic/hosp/diagnoses_icd.csv.gz | head
zcat < ~/mimic/hosp/d_icd_diagnoses.csv.gz | head

(2) read the diagnoses_icd.csv.gz and d_icd_diagnoses.csv.gz by read_csv and filter the patient with subject_id 10013310. Combine the two tables by left_join. Here I found that there are several diagnoses for the patient in each patient hospitalization. So, I count the frequency of all diagnosis and select the top 3 diagnoses.

top_3_diagnoses <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") %>%
  filter(subject_id == 10013310) %>%
  left_join(read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz"), 
            by = c("icd_code" = "icd_code", 
                   "icd_version" = "icd_version")) %>%
  
  group_by(long_title) %>%
  
  # Count the frequency of each diagnosis
  summarise(freq = n()) %>%
  
  # Select the top 3 diagnoses
  arrange(desc(freq)) %>%
  slice(1:3) 

top_3_diagnoses
# A tibble: 3 × 2
  long_title                                            freq
  <chr>                                                <int>
1 Acute on chronic systolic (congestive) heart failure     3
2 Hyperlipidemia, unspecified                              3
3 Long term (current) use of insulin                       3

Step III: getting the ADT history of the patient

(1) get to know the structure of diagnoses_icd.csv.gz by zcat and head.

zcat < ~/mimic/hosp/transfers.csv.gz | head

(2) read the transfers.csv.gz by read_csv and filter the patient with subject_id 10013310. And then get the ADT history of the patient. Also, I add a new column segment_thickness to represent the size of the line segment for the convenience of plotting.

# read the transfers.csv.gz and filter the patient 10013310
ADT <- read_csv("~/mimic/hosp/transfers.csv.gz") %>%
  filter(subject_id == 10013310) %>%
  filter(!is.na(careunit)) %>%
  
  # Add a new column segment_thickness 
  # to represent the size of the line segment
  mutate(segment_thickness = if_else(
    str_detect(careunit, "Care Unit"), 10, 8))

Step IV: getting the lab events of the patient from labevents.parquet with subject_id 10013310 by arrow::open_dataset.

(1) get to know the structure of labevents.parquet by zcat and head.

zcat < ~/mimic/hosp/labevents.csv.gz | head

(2) decompressing the labevents.csv.gz file.

system("gzip -d -k ~/mimic/hosp/labevents.csv.gz")

(3) re-writing the csv file labevents.csv in the Parquet format.

# Define the file paths
file_path <- "~/mimic/hosp/labevents.csv"
parquet_file_path <- "~/mimic/hosp/labevents.parquet"

# Rewrite the CSV file in Parquet format    
arrow::write_dataset(arrow::open_dataset(file_path, format = "csv"), 
                     parquet_file_path, format = "parquet")

Also, I create a symbolic link for the original directory to the Parquet file labevents.parquet under the path under the folder ./labevents_pq available at the current working directory hw3 for reproducibility.

<<<<<<< HEAD
ln -s ~/mimic/hosp/labevents.parquet/
=======
ln -s ~/mimic/hosp/labevents.parquet/ ./labevents_pq
>>>>>>> develop

(4) Reading the labevents.parquet file using the arrow::open_dataset. The information I need is the subject_id and charttime columns. So, I use the dplyr::select function to select these two columns and the dplyr::filter function to filter the subject_id 10013310. Also, there are duplicated charttime values because several lab events can happen at the same time, so I use the dplyr::distinct function to keep the unique charttime values.

# read the labevents.parquet file
labevents <- arrow::open_dataset('./labevents_pq', 
                                 format = "parquet") %>%
  
  # Select the subject_id and charttime columns
  dplyr::select(subject_id, charttime) %>%
  
  # Filter the patient 10013310
  dplyr::filter(subject_id == 10013310) %>%
  
  # get rid of the duplicated charttime values
  dplyr::distinct(subject_id, charttime) %>%
  collect()

Step V: getting the procedure events of the patient.

(1) get to know the structure of procedures_icd.csv.gz and d_icd_procedures.csv.gz by zcat and head.

zcat < ~/mimic/hosp/procedures_icd.csv.gz | head
zcat < ~/mimic/hosp/d_icd_procedures.csv.gz | head

(2) read the procedures_icd.csv.gz and d_icd_procedures.csv.gz by read_csv and filter the patient with subject_id 10013310. Combine the two tables using left_join by the icd_code and icd_version columns.

# read the procedures_icd.csv.gz and d_icd_procedures.csv.gz
procedures <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") %>%
  filter(subject_id == 10013310) %>%
  
  # Combine the two tables by left_join
  left_join(read_csv("~/mimic/hosp/d_icd_procedures.csv.gz"), 
            by = c("icd_code" = "icd_code", 
                   "icd_version" = "icd_version"))

Step VI: getting the ADT_history plot by ggplot2.

First, I use the ggplot function to create a new plot. Then, I use the scale_x_datetime function to specify the x axis limits, the scale_y_discrete function to specify the y axis with 3 levels, the geom_point function to add procedure events, the scale_shape_manual function to manually set the number of shapes according to the number of titles. Also, I use the geom_segment function to add ADT history, the geom_point function to add lab events, and the geom_text function to add diagnosis events. Finally, I use the theme function to set the theme of the plot.

ADT_history <- ggplot() +
  
  # Specify x axis limits
  scale_x_datetime(name = "Calendar Time", 
                   
                   # minus 1 day from the minimum intime 
                   # to display all the data
                   limits = c(min(ADT$intime) - days(1), 
                              max(ADT$outtime))) +
  
  # Specify y axis with 3 levels
  scale_y_discrete(name = NULL, 
                   limits = c("Procedure", "Lab", "ADT")) + 
  
  # Add procedure events
  geom_point(data = procedures, 
             
             # change the format of chartdate
             aes(x = as.POSIXct(chartdate), 
                 y = "Procedure", 
                 
                 # Get shorter title using regular expression
                 shape = sub(",.*", "", long_title)),
             size = 3) + 
  
  # Manually set the number of shapes according to the number of titles
  # For reproducibility, I use n_distinct to determine the length
  scale_shape_manual(values = c(1:n_distinct(procedures$long_title))) +
  
  # Add lab events
  geom_point(data = labevents, 
             aes(x = charttime, y = "Lab"), 
             shape = 3, size = 2) +
  
  # Add ADT events
  geom_segment(data = ADT, 
               aes(x = intime, 
                   xend = outtime, 
                   y = "ADT", 
                   yend = "ADT", 
                   color = careunit, 
                   
                   # define the thickness of the segment
                   size = segment_thickness)) +
  
  # Set legend position and arrangement
  theme_bw() +
  theme(legend.position = "bottom", 
        legend.box = "vertical", 
        legend.key.size = unit(0, "pt"),
        legend.text = element_text(size = 7)) +
  
  # Set legend titles and arrangement
  guides(color = guide_legend(title = "Care Unit", 
                              ncol = 3,
                              keywidth = 1),
         shape = guide_legend(title = "Procedure", 
                              ncol = 2),
         
         # Remove ADT's legend
         size = FALSE) +
  
  # Add patient information as title and subtitle
  labs(title = paste("Patient", demographics$subject_id[1], ", ",
                     demographics$gender[1], ", ",
                     demographics$anchor_age[1], "years old, ",
                     demographics$race[1]),
       subtitle = paste(top_3_diagnoses$long_title[1],
                        top_3_diagnoses$long_title[2],
                        top_3_diagnoses$long_title[3],
                        sep = "\n"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
# Print the plot
ADT_history
<<<<<<< HEAD

=======

>>>>>>> develop

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer

(1) get to know the structure of chartevents.csv.gz and d_items.csv.gz by zcat and head.

zcat < ~/mimic/icu/chartevents.csv.gz | head
zcat < ~/mimic/icu/d_items.csv.gz | head

(2) decompressing the chartevents.csv.gz by gzip.

system("gzip -d -k ~/mimic/icu/chartevents.csv.gz")

(3) prepare the data for the plot.

Read the d_items.csv.gz and chartevents.csv.gz by read_csv and filter the patient with subject_id 10013310. Combine the two tables by left_join. Then, read the chartevents.csv.gz and filter the patient with subject_id 10013310. Combine the two tables by left_join.

# read the d_items.csv.gz
items <- read_csv("~/mimic/icu/d_items.csv.gz") %>%
  dplyr::select(c(itemid, label, abbreviation)) %>%
  
  # filter the abbreviation with HR, NBPd, NBPs, RR, Temperature F
  dplyr::filter(abbreviation %in% c("HR", "NBPd", 
                                    "NBPs", "RR", 
                                    "Temperature F"))

# define the file path of chartevents.csv
file_path_chartevents <- "~/mimic/icu/chartevents.csv"

# read the chartevents.csv.gz
chartevents <- arrow::open_dataset(file_path_chartevents, 
                                   format = "csv") %>%
  
  # filter out ids number of interested vitals
  dplyr::filter(itemid %in% c(220045, 220179, 
                              220180, 220210, 
                              223761)) %>%
  
  # filter the patient with subject_id 10013310
  dplyr::filter(subject_id == 10013310) %>%
  
  # exclude unnecessary columns
  dplyr::select(-c(hadm_id, caregiver_id, storetime, warning)) %>%
  collect() %>%
  left_join(items, by = c("itemid" = "itemid"))

(4) plot the data.

To plot the data, use ggplot and geom_point to add points, geom_line to add lines, facet_grid to show all combinations of abbreviation and stay_id, labs to add title and axis labels, theme_light to set the theme, and guides to avoid overlapping of the x-axis labels.

vitals_line_plot <- ggplot(chartevents, 
                           aes(x = charttime, 
                               y = valuenum, 
                               color = abbreviation)) +
  geom_point() +
  geom_line() +
  
  # use facet_grid to show all 
  # combinations of abbreviation and stay_id
  facet_grid(abbreviation ~ stay_id, scales = "free") +
  labs(title = paste("Patient", 
                     chartevents$subject_id[1], 
                     "ICU stays - Vitals"),
       x = "",
       y = "") +
  
  theme_light(base_size = 9) +
  
  # remove legend
  theme(legend.position = "none") +
  
  # To avoid overlapping of the x-axis labels, 
  # using guides with n.dodge = 2
  guides(x = guide_axis(n.dodge = 2)) 

# print the plot
vitals_line_plot

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer

icustays_tble <- arrow::open_dataset("~/mimic/icu/icustays.csv.gz", 
                                     format = "csv") %>%
  collect()

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer

(1) By counting the number of unique subject_id, I can find that there are 50920 unique subject_id.

# calculate the number of unique subject_id
icustays_tble %>%
  dplyr::distinct(subject_id) %>%
  dplyr::count()
# A tibble: 1 × 1
      n
  <int>
1 50920

(2) By grouping the data by subject_id and counting the number of unique stay_id, I find that a subject_id can have multiple ICU stays. And there are 12448 subject_id with multiple ICU stays.

# calculate the number of unique stay_id
icustays_tble %>%
  dplyr::distinct(stay_id) %>%
  dplyr::count()
# A tibble: 1 × 1
      n
  <int>
1 73181
# count the number of subject_id with multiple ICU stays
icustays_tble %>%
  dplyr::group_by(subject_id) %>%
  summarise(num_icu_stays = n_distinct(stay_id)) %>%
  dplyr::filter(num_icu_stays > 1) %>%
  dplyr::count()
# A tibble: 1 × 1
      n
  <int>
1 12448

(3) Summarize the number of ICU stays per subject_id by graphs.

# plot the distribution of ICU stays per `subject_id` by histogram
icustays_tble %>%
  group_by(subject_id) %>%
  summarise(num_icu_stays = n_distinct(stay_id)) %>%
  ggplot(aes(x = num_icu_stays)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of ICU Stays per Subject",
       x = "Number of ICU stays per subject",
       y = "Frequency") +
  theme_minimal()

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer

admissions_tble <- arrow::open_dataset(
  "~/mimic/hosp/admissions.csv.gz",
  format = "csv") %>%
  collect()

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer

(1) Number of admissions per patient:

  • Pattern:

The majority of patients have only one admission, and a small portion of patients have multiple admissions.

  • Explanation:

This may be due to the fact that most people do not need to be admitted to the hospital multiple times. However, some patients may have chronic diseases or other health problems that require multiple admissions.

# number of admissions per patient
admissions_tble %>%
  group_by(subject_id) %>%
  summarise(num_admissions = n_distinct(hadm_id)) %>%
  ggplot(aes(x = num_admissions)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of admissions per patient",
       x = "Number of admissions per subject_id",
       y = "Frequency") +
  theme_minimal()

(2) Admission hour:

  • Usual?

Yes. There are some unusual patterns in the admission hour.

  • Pattern:

The number of admission hour decreases from 0 to 6, and then increases from 8 to 18, with an exception at 7. Then, the number of admission hour decreases from 18 to 23. The frequencies from 0 and 7 seem unusual because they are higher than expected for late night and early morning times.

  • Explanation:

This could be due to certain hospital operations or activities during those hours. For example, the number of admissions at 0 may be due to the fact that the admission time is recorded at 0 when the exact time is not available. The number of admissions at 7 may be due to the fact that the hospital is busy at that time.

# change the timezone to UTC
admissions_tble_admission_hour <- admissions_tble %>%
  mutate(admittime = with_tz(admittime, "UTC")) 

# plot the admission hour
admissions_tble_admission_hour %>%
  ggplot(aes(x = hour(admittime))) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Admission Hour",
       x = "Hour",
       y = "Frequency") +
  theme_minimal()

(3) Admission minute:

  • Usual?

Yes. There are unusual patterns in the admission minute.

  • Pattern:

The frequencies of the admission minute are extremely high at some specific minutes, such as 0, 15, 30, and 45. The frequencies of the rest of the minutes are relatively low.

  • Explanation:

This could be due to the fact that the admission time is recorded at these specific minutes, or the data is rounded to these specific minutes. For the rest of the minutes, the frequencies are relatively low.

admissions_tble_admission_hour %>%
  ggplot(aes(x = minute(admittime))) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Admission Minute",
       x = "Minute",
       y = "Frequency") +
  theme_minimal()

(4) Length of hospital stay:

  • Usual?

No. I think the distribution of the length of hospital stay is not unusual.

  • Pattern:

The majority of patients stayed in the hospital for less than 10 days, and only a few patients stayed in the hospital for more than 10 days. The maximum length of stay is around 120 days.

  • Explanation:

This may be due to the fact that most patients are admitted for short-term care, and only a few patients require long-term care.

# standardize the time zone and calculate the length of stay
length_of_stay <- admissions_tble %>%
  mutate(dischtime = with_tz(dischtime, "UTC")) %>%
  mutate(admittime = with_tz(admittime, "UTC")) %>%
  mutate(length_of_stay = as.numeric(
    difftime(dischtime, admittime, units = "days")))

# plot the length of hospital stay
length_of_stay %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Length of Hospital Stay",
       x = "Length of Stay (days)",
       y = "Frequency") +
  theme_minimal()

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer

Import patients.csv.gz as a tibble patients_tble.

patients_tble <- arrow::open_dataset(
  sources = "~/mimic/hosp/patients.csv.gz", 
  format = "csv") %>%
  dplyr::collect()

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer

(1) gender:

  • Pattern:

According to the pie chart, there are more female than male patients. And according to the bar chart, in each age group, it shows the same pattern.

  • Explanation:

This may be due to the gender distribution in the general population in the area where the hospital is located.

# Plot a pie chart to show the gender distribution

# Add percentage of each gender
gender_summary <- patients_tble %>%
  count(gender) %>%
  mutate(percentage = n / sum(n) * 100)

# plot the pie chart
ggplot(gender_summary, aes(x = "", y = percentage, 
                           fill = gender)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(label = paste0(round(percentage), "%")), 
            position = position_stack(vjust = 0.5)) +
  
  # use coord_polar to make the pie chart
  coord_polar(theta = "y") +
  labs(title = "Distribution of Gender", fill = "Gender") +
  theme_void()

# plot the bar chart
patients_tble %>%
  ggplot(aes(x = anchor_age, fill = gender)) +
  geom_bar() +
  labs(title = "Distribution of age",
       x = "Age",
       y = "Frequency") +
  theme_minimal()

(2) anchor_age:

  • Pattern:

The distribution of anchor_age is multimodal, with peaks at around 21, 60, and over 89. This suggests that there are different age groups represented in the dataset, with a relatively high number of patients in their early 20s and around 60 years old.

  • Explanation:

This could be due to the fact that certain age groups are more likely to require hospital care, or it could be due to the demographics of the general population in the area where the hospital is located. There are also a few patients who are over 89 years old, which may due to that some elderly patients’ age is recorded as fixed value if they are over 89 years old.

# using bar plot to show the distribution of anchor_age
patients_tble %>%
  ggplot(aes(x = anchor_age)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Anchor Age",
       x = "Anchor Age",
       y = "Frequency") +
  theme_minimal()

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer

Step 1: Load the icustays.csv.gz and d_labitems.csv.gz files as tibbles icustays_tble and dlabitems_tble using open_dataset and collect functions, respectively.

# load the icustays.csv.gz file
icustays_tble <- arrow::open_dataset("~/mimic/icu/icustays.csv.gz", 
                                     format = "csv") %>%
  collect() %>%
  print(width = Inf)
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <int>    <int>    <int> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 07:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 01:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 11:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 07:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 08:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-10 20:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 03:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 13:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 05:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-23 17:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 16:50:47 0.410
 2 2189-06-27 13:38:27 0.498
 3 2157-11-21 14:08:00 1.12 
 4 2157-12-20 06:27:41 0.948
 5 2110-04-12 16:59:56 1.34 
 6 2131-01-20 00:27:30 9.17 
 7 2160-05-19 10:33:33 1.31 
 8 2131-03-10 10:09:21 0.859
 9 2129-08-10 10:02:38 6.18 
10 2130-09-27 15:13:41 3.89 
# ℹ 73,171 more rows
# define the target items
target_items <- c(50912, 50971, 
                  50983, 50902, 
                  50882, 51221, 
                  51301, 50931)

# load the d_labitems.csv.gz file
dlabitems_tble <- arrow::open_dataset("~/mimic/hosp/d_labitems.csv.gz", 
                                      format = "csv") %>%
  filter(itemid %in% target_items) %>%
  collect() %>%
  print(width = Inf)
# A tibble: 8 × 4
  itemid label             fluid category  
   <int> <chr>             <chr> <chr>     
1  50882 Bicarbonate       Blood Chemistry 
2  50902 Chloride          Blood Chemistry 
3  50912 Creatinine        Blood Chemistry 
4  50931 Glucose           Blood Chemistry 
5  50971 Potassium         Blood Chemistry 
6  50983 Sodium            Blood Chemistry 
7  51221 Hematocrit        Blood Hematology
8  51301 White Blood Cells Blood Hematology

Step 2: Load the labevents.csv.gz file as a tibble labevents_tble and filter the rows with the target items and the patients in icustays_tble.

Here I select the columns of subject_id, itemid, storetime, and valuenum is because I only need these columns for the next step. The storetime is the time when the lab measurement was stored, and I will use it to find the last available measurement before the ICU stay. And the subject_id and itemid are the prepared columns for the next step. The valuenum is the value of the lab measurement which I interested in.

labevents <- arrow::open_dataset('./labevents_pq', 
                                 format = "parquet") %>%
  # select the columns of interest
  select(subject_id, itemid, storetime, valuenum) %>%
  filter(itemid %in% target_items) %>%
  filter(subject_id %in% icustays_tble$subject_id) %>%
  collect() %>%
  print(width = Inf)
# A tibble: 13,473,870 × 4
   subject_id itemid storetime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10002013  50983 2159-12-20 09:20:00    139  
 2   10002013  50882 2160-01-04 06:40:00     27  
 3   10002013  50902 2160-01-04 06:40:00    102  
 4   10002013  50912 2160-01-04 06:40:00      0.9
 5   10002013  50971 2160-01-04 06:40:00      3.9
 6   10002013  50983 2160-01-04 06:40:00    138  
 7   10002013  50931 2160-01-04 06:29:00    161  
 8   10002013  51221 2160-01-04 08:32:00     35.8
 9   10002013  51301 2160-01-04 08:32:00      7.9
10   10002013  50931 2160-01-18 08:23:00    220  
# ℹ 13,473,860 more rows

Step 3: Join labevents_tble with icustays_tble to get the stay_id and intime for each lab measurement. Then filter the rows to keep the last available measurement before the ICU stay.

Here I use the left_join function to join the labevents_tble with icustays_tble by the subject_id column. Then I use the filter function to keep the lab measurement before the ICU stay. The group_by and arrange functions are used to sort the lab measurements by storetime and keep the last measurement for each subject_id, stay_id, and itemid. The slice_tail function is used to keep the last measurement for each group. Finally, I use the select function to remove the storetime and intime columns.

# I use if else statement and write_rds function 
# to save my rendering time
if (file.exists("labevents_tble.rds")) {
  
  # If the file exists, print it
  labevents_tble <- read_rds("labevents_tble.rds") %>% 
    print(width = Inf)
} else {
  
  labevents_tble <- labevents %>%
    
    # join with icustays_tble to the labevents table to 
    # get the stay_id, and intime
    left_join(icustays_tble[, c('subject_id', 'stay_id', 'intime')],
              by = "subject_id") %>%
    
    # make sure the lab measurement is before the ICU stay
    filter(storetime <= intime) %>%
    group_by(subject_id, stay_id, itemid) %>%
    
    # sort by storetime and keep the last measurement
    arrange(storetime, by_group = TRUE) %>%
    slice_tail(n = 1) %>%
    
    # remove the storetime and intime columns
    select(-storetime, -intime) %>%
    
    # because the form of the `itemid` is in long format, 
    # I need to convert it to wide format
    pivot_wider(names_from = itemid, values_from = valuenum) %>%
    
    # use the more meaningful names by renaming the columns
    rename(creatinine = `50912`, 
           potassium = `50971`, 
           sodium = `50983`, 
           chloride = `50902`,
           bicarbonate = `50882`, 
           hematocrit = `51221`, 
           white_blood_cell_count = `51301`, 
           glucose = `50931`) %>%
    
<<<<<<< HEAD
    print(width, Inf) %>%
=======
    print(width = Inf) %>%
>>>>>>> develop
    write_rds("labevents_tble.rds")
}
# A tibble: 68,468 × 10
# Groups:   subject_id, stay_id [68,468]
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <int>    <int>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
   hematocrit white_blood_cell_count
        <dbl>                  <dbl>
 1       41.1                    6.9
 2       27.3                    5.3
 3       37.4                    5.4
 4       38.1                   15.7
 5       NA                     NA  
 6       39.7                   12.2
 7       34.9                    7.2
 8       25.5                   17.9
 9       22.4                    9.8
10       39.7                    7.9
# ℹ 68,458 more rows

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer

I create a symbolic link for the original directory to the Parquet file chartevents.parquet under the path under the folder ./chartevents_pq available at the current working directory hw3.

<<<<<<< HEAD
ln -s ~/mimic/icu/chartevents.parquet/
=======
ln -s ~/mimic/icu/chartevents.parquet/ ./chartevents_pq
>>>>>>> develop

Step 1: Load the d_items.csv.gz file and filter the rows to keep the itemid for the vitals of interest.

# load the d_items.csv.gz file 
d_items_tble <- arrow::open_dataset("~/mimic/icu/d_items.csv.gz", 
                                    format = "csv") %>%
  
  # filter the rows to keep the itemid for the vitals of interest
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210)) %>%
  collect() %>%
  print(width = Inf)
# A tibble: 5 × 9
  itemid label                                 abbreviation  linksto    
   <int> <chr>                                 <chr>         <chr>      
1 220045 Heart Rate                            HR            chartevents
2 220179 Non Invasive Blood Pressure systolic  NBPs          chartevents
3 220180 Non Invasive Blood Pressure diastolic NBPd          chartevents
4 220210 Respiratory Rate                      RR            chartevents
5 223761 Temperature Fahrenheit                Temperature F chartevents
  category            unitname param_type lownormalvalue highnormalvalue
  <chr>               <chr>    <chr>               <int>           <dbl>
1 Routine Vital Signs bpm      Numeric                NA              NA
2 Routine Vital Signs mmHg     Numeric                NA              NA
3 Routine Vital Signs mmHg     Numeric                NA              NA
4 Respiratory         insp/min Numeric                NA              NA
5 Routine Vital Signs °F       Numeric                NA              NA

Step 2: Load the chartevents.csv.gz file and filter the rows to keep the itemid for the vitals of interest.

Here, I select the subject_id, itemid, charttime, and valuenum, which are the columns of interest. The subject_id and itemid are preparation for the next step. charttime will be used to filter the charted events within the ICU stay.

# load the chartevents.csv.gz file
chartevents <- arrow::open_dataset('./chartevents_pq', 
                                   format = "parquet") %>%
  select(subject_id, itemid, charttime, valuenum) %>%
  
  # filter the rows to keep the itemid for the vitals of interest
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210)) %>%
  filter(subject_id %in% icustays_tble$subject_id) %>%
  collect() %>%
  print(width = Inf)
# A tibble: 22,504,119 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10001884 220179 2131-01-12 12:01:00      149
 2   10001884 220180 2131-01-12 12:01:00       97
 3   10001884 220045 2131-01-12 13:00:00       78
 4   10001884 220210 2131-01-12 13:00:00       18
 5   10001884 220179 2131-01-12 13:01:00      145
 6   10001884 220180 2131-01-12 13:01:00       80
 7   10001884 220045 2131-01-12 14:00:00       82
 8   10001884 220210 2131-01-12 14:00:00       28
 9   10001884 220179 2131-01-12 14:01:00      144
10   10001884 220180 2131-01-12 14:01:00       83
# ℹ 22,504,109 more rows

Step 3: Left join chartevents with icustays_tble to get stay_id and intime. Then, filter the charted events within the ICU stay. Keep the first measurement for each subject_id and itemid.

Here, I use left_join function to join the chartevents with icustays_tble to get stay_id and intime. Then, I filter the charted events within the ICU stay. I use group_by and arrange functions to keep the first measurement for each subject_id and itemid. Also, I use filter function to keep the charttime within the ICU stay. And I use slice_head function to keep the first measurement for each subject_id and itemid. Finally, I use select function to keep the columns of interest, use pivot_wider function to convert the form of the itemid from long to wide format, and use rename function to use the more meaningful names by renaming the columns.

# I use the if-else statement and write_rds function 
# to save my rendering time
if (file.exists("chartevents_tble.rds")) {
  
  # If the file exists, read the data from the file and print it
  chartevents_tble <- read_rds("chartevents_tble.rds") 
  print(chartevents_tble, width = Inf)
  
} else {
  
  # If the file doesn't exist, create the chartevents_tble
  
  # Left join chartevents with icustays_tble 
  # to get stay_id, intime, and outtime
  chartevents_tble <- chartevents %>%
    left_join(icustays_tble %>% select(subject_id, 
                                       stay_id, 
                                       intime, 
                                       outtime),
              by = "subject_id") %>%
    
    # Filter charted events within ICU stay
    filter(charttime >= intime, charttime <= outtime) %>%
    
    # Keep the first measurement for each subject_id, stay_id, itemid
    group_by(subject_id, stay_id, itemid) %>%
    arrange(charttime, .by_group = TRUE) %>%
    slice_head(n = 1) %>%
    
    # Remove unnecessary columns
    select(-charttime, -intime, -outtime) %>%
    
    # Convert itemid to wide format
    pivot_wider(names_from = itemid, values_from = valuenum) %>%
    
    # Rename columns with meaningful names
    rename(heart_rate = `220045`, 
           systolic_non_invasive_blood_pressure = `220179`, 
           diastolic_non_invasive_blood_pressure = `220180`, 
           temperature_in_Fahrenheit = `223761`, 
           respiratory_rate = `220210`) %>%
    
    print(width = Inf) %>%
    write_rds("chartevents_tble.rds")  
}
# A tibble: 73,164 × 7
# Groups:   subject_id, stay_id [73,164]
   subject_id  stay_id heart_rate systolic_non_invasive_blood_pressure
        <int>    <int>      <dbl>                                <dbl>
 1   10000032 39553978         91                                   84
 2   10000980 39765666         77                                  150
 3   10001217 34592300         96                                  167
 4   10001217 37067082         86                                  151
 5   10001725 31205490         55                                   73
 6   10001884 37510196         38                                  180
 7   10002013 39060235         80                                  104
 8   10002155 31090461         94                                  118
 9   10002155 32358465         98                                  109
10   10002155 33685454         68                                  126
   diastolic_non_invasive_blood_pressure respiratory_rate
                                   <dbl>            <dbl>
 1                                    48               24
 2                                    77               23
 3                                    95               11
 4                                    90               18
 5                                    56               19
 6                                    12               10
 7                                    70               14
 8                                    51               18
 9                                    65               23
10                                    61               18
   temperature_in_Fahrenheit
                       <dbl>
 1                      98.7
 2                      98  
 3                      97.6
 4                      98.5
 5                      97.7
 6                      98.1
 7                      97.2
 8                      96.9
 9                      97.7
10                      95.9
# ℹ 73,154 more rows

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer

Here, I use left_join function to join all the five tibbles. Also, I add a age_intime column to calculate the age at intime and filter the rows to keep the adults (age at intime >= 18).

mimic_icu_cohort <- icustays_tble %>%
  
  # join with admissions_tble to the icustays table to get the
  # admission information
  left_join(admissions_tble, by = c("hadm_id", "subject_id")) %>%
  
  # join with patients_tble to the icustays table to get the
  # patient information
  left_join(patients_tble, by = "subject_id") %>%
  
  # join with labevents_tble to the icustays table to get the
  # last lab measurements before ICU stay
  left_join(labevents_tble, by = c("subject_id", "stay_id")) %>%
  
  # join with chartevents_tble to the icustays table to get the
  # first vital measurements within the ICU stay
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) %>%
  
  # create a new column age_intime
  mutate(age_intime = year(intime) - anchor_year + anchor_age) %>%
  
  # select the rows where age_intime is greater than or equal to 18
  filter(age_intime >= 18) %>%
  print(width = Inf)
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit                                  
        <int>    <int>    <int> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 07:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 01:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 11:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 07:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 08:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-10 20:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 03:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 13:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 05:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-23 17:50:00
   outtime               los admittime           dischtime          
   <dttm>              <dbl> <dttm>              <dttm>             
 1 2180-07-23 16:50:47 0.410 2180-07-23 05:35:00 2180-07-25 10:55:00
 2 2189-06-27 13:38:27 0.498 2189-06-27 00:38:00 2189-07-02 20:00:00
 3 2157-11-21 14:08:00 1.12  2157-11-18 14:56:00 2157-11-25 10:00:00
 4 2157-12-20 06:27:41 0.948 2157-12-18 08:58:00 2157-12-24 06:55:00
 5 2110-04-12 16:59:56 1.34  2110-04-11 08:08:00 2110-04-14 08:00:00
 6 2131-01-20 00:27:30 9.17  2131-01-07 12:39:00 2131-01-19 21:15:00
 7 2160-05-19 10:33:33 1.31  2160-05-18 00:45:00 2160-05-23 06:30:00
 8 2131-03-10 10:09:21 0.859 2131-03-09 12:33:00 2131-03-09 17:55:00
 9 2129-08-10 10:02:38 6.18  2129-08-04 05:44:00 2129-08-18 09:53:00
10 2130-09-27 15:13:41 3.89  2130-09-23 14:59:00 2130-09-29 11:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-19 21:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 13:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-22 22:54:00 2180-07-23 07:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-26 23:25:00 2189-06-27 01:42:00
 3 MARRIED        WHITE                  2157-11-18 09:38:00 2157-11-18 17:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 05:36:00 2131-01-07 14:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 11:14:00 2131-03-09 13:33:00
 9 MARRIED        WHITE                  2129-08-04 04:00:00 2129-08-04 05:35:00
10 MARRIED        WHITE                  2130-09-23 12:59:00 2130-09-23 17:50:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <int> <chr>       <int>       <int> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    1 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    1 F              80        2128 2008 - 2010      
 9                    0 F              80        2128 2008 - 2010      
10                    0 F              80        2128 2008 - 2010      
   dod        bicarbonate chloride creatinine glucose potassium sodium
   <date>           <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1 2180-09-09          25       95        0.7     102       6.7    126
 2 2193-08-26          21      109        2.3      89       3.9    144
 3 NA                  22      108        0.6     112       4.2    142
 4 NA                  30      104        0.5      87       4.1    142
 5 NA                  NA       98       NA        NA       4.1    139
 6 2131-01-20          30       88        1.1     141       4.5    130
 7 NA                  24      102        0.9     288       3.5    137
 8 2131-03-10          26       85        1.4     133       5.7    120
 9 2131-03-10          24      105        1.1     138       4.6    139
10 2131-03-10          23       98        2.8     117       4.9    135
   hematocrit white_blood_cell_count heart_rate
        <dbl>                  <dbl>      <dbl>
 1       41.1                    6.9         91
 2       27.3                    5.3         77
 3       38.1                   15.7         86
 4       37.4                    5.4         96
 5       NA                     NA           55
 6       39.7                   12.2         38
 7       34.9                    7.2         80
 8       22.4                    9.8         98
 9       39.7                    7.9         68
10       25.5                   17.9         94
   systolic_non_invasive_blood_pressure diastolic_non_invasive_blood_pressure
                                  <dbl>                                 <dbl>
 1                                   84                                    48
 2                                  150                                    77
 3                                  151                                    90
 4                                  167                                    95
 5                                   73                                    56
 6                                  180                                    12
 7                                  104                                    70
 8                                  109                                    65
 9                                  126                                    61
10                                  118                                    51
   respiratory_rate temperature_in_Fahrenheit age_intime
              <dbl>                     <dbl>      <dbl>
 1               24                      98.7         52
 2               23                      98           76
 3               18                      98.5         55
 4               11                      97.6         55
 5               19                      97.7         46
 6               10                      98.1         77
 7               14                      97.2         57
 8               23                      97.7         83
 9               18                      95.9         81
10               18                      96.9         82
# ℹ 73,171 more rows

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay filter storetime intime

  • Length of ICU stay los vs the first vital measurements within the ICU stay

  • Length of ICU stay los vs first ICU unit

Answer

(1.1) Length of ICU stay los vs demographic

Getting both the bar plot and the summary statistics for the length of ICU stay

# group by gender and plot the bar plot
mimic_icu_cohort %>%
  group_by(race) %>%
  ggplot(aes(x = los, y = race)) +
  geom_point() +
  labs(title = "Summary of Length of ICU stay by race",
       x = "Length of ICU stay (days)",
       y = "Race") +  
  theme_minimal()

# get the mean, standard deviation and IQR of the length of ICU stay
# for different ethnicial groups
mimic_icu_cohort %>%
  group_by(race) %>%
  summarize(mean_los = mean(los, na.rm = TRUE),
            std_dev = sd(los, na.rm = TRUE),
            iqr = IQR(los, na.rm = TRUE)) 
# A tibble: 33 × 4
   race                          mean_los std_dev   iqr
   <chr>                            <dbl>   <dbl> <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     4.46    6.54  4.01
 2 ASIAN                             3.49    4.90  2.70
 3 ASIAN - ASIAN INDIAN              4.20    7.61  2.81
 4 ASIAN - CHINESE                   3.36    5.39  2.28
 5 ASIAN - KOREAN                    4.23    6.35  2.85
 6 ASIAN - SOUTH EAST ASIAN          3.04    5.42  1.90
 7 BLACK/AFRICAN                     3.82    5.90  3.04
 8 BLACK/AFRICAN AMERICAN            3.28    4.80  2.43
 9 BLACK/CAPE VERDEAN                3.22    4.94  2.53
10 BLACK/CARIBBEAN ISLAND            3.61    5.13  2.45
# ℹ 23 more rows

(1.2) Length of ICU stay los vs insurance

Using the box plot to show the summary of the length of ICU stay by insurance

# group by insurance and plot the box plot
mimic_icu_cohort %>%
  group_by(insurance) %>%
  ggplot(aes(x = los, y = insurance)) +
  geom_boxplot() +
  
  labs(title = "Summary of Length of ICU stay by insurance",
       x = "Length of ICU stay (days)",
       y = "Insurance") +  
  theme_minimal()

(1.3) Length of ICU stay los vs marital_status

Using the violin plot to show the summary of the length of ICU stay by marital status. And also, adding the points to show the distribution of the length of ICU stay.

# group by marital_status and plot the violin plot
mimic_icu_cohort %>%
  group_by(marital_status) %>%
  ggplot(aes(x = los, y = marital_status)) +
  geom_violin() +
  geom_point() +
  
  labs(title = "Summary of Length of ICU stay by Marital Status",
       x = "Length of ICU stay (days)",
       y = "Marital Status") +  
  theme_minimal()

(1.4) Length of ICU stay los vs gender

Using the count plot to show the summary of the length of ICU

# group by gender and plot the count plot
mimic_icu_cohort %>%
  group_by(gender) %>%
  ggplot(aes(x = gender, y = los)) +  
  geom_count() +
  
  labs(title = "Summary of Length of ICU stay by Gender",
       x = "Gender",
       y = "Length of ICU stay (days)") +
  theme_minimal()

(1.5) Length of ICU stay los vs age at intime

Using the scatter plot to show the summary of the length of ICU stay by age at intime

# group by age at intime and plot the scatter plot
mimic_icu_cohort %>%
  ggplot(aes(y = los, x = age_intime)) +
  geom_point() +
  
  labs(title = "Summary of Length of ICU stay by Age at intime",
       x = "Age at intime",
       y = "Length of ICU stay (days)") +
  theme_minimal()

(2) Length of ICU stay los vs the last available lab measurements before ICU stay I will dot plot to show the relationship between the length of ICU stay and each last available lab measurements before ICU stay.

# prepare the variables to be used in the plot
<<<<<<< HEAD
variables <- c("white_blood_cell_count", 
               "heart_rate", "respiratory_rate", 
               "temperature_in_Fahrenheit", 
               "systolic_non_invasive_blood_pressure", 
               "diastolic_non_invasive_blood_pressure")
=======
variables <- c("potassium", "sodium", 
               "chloride", "bicarbonate", 
               "hematocrit", 
               "white_blood_cell_count", 
               "glucose") 
>>>>>>> develop

# get the last lab measurements before ICU stay
mimic_icu_cohort_select <- mimic_icu_cohort %>%
  
  # select the last lab measurements before ICU stay
  select(all_of(variables), los) %>%
  
  # remove missing values
  filter(across(c(all_of(variables), los), ~ !is.na(.))) %>%
  
  # trim extreme values and replace old columns
  mutate(across(c(all_of(variables), los), 
                ~ trim(., trim_proportion = 1, na.rm = TRUE)))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
# using a for loop to create separate plots for each variable
plots <- list()

for (variable in variables) {
  plots[[variable]] <- mimic_icu_cohort %>%
    ggplot(aes_string(y = "los", x = variable)) +
    geom_point() +
    labs(title = paste("Length of ICU stay vs", variable),
         x = variable,
         y = "Length of ICU stay (days)") +
    theme_minimal()
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
# show the plots
plots
<<<<<<< HEAD
$white_blood_cell_count
Warning: Removed 5092 rows containing missing values (`geom_point()`).
=======
$potassium
Warning: Removed 8888 rows containing missing values or values outside the scale range
(`geom_point()`).


$sodium
Warning: Removed 8858 rows containing missing values or values outside the scale range
(`geom_point()`).


$chloride
Warning: Removed 8869 rows containing missing values or values outside the scale range
(`geom_point()`).


$bicarbonate
Warning: Removed 9035 rows containing missing values or values outside the scale range
(`geom_point()`).


$hematocrit
Warning: Removed 5015 rows containing missing values or values outside the scale range
(`geom_point()`).


$white_blood_cell_count
Warning: Removed 5092 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop


<<<<<<< HEAD
$heart_rate
Warning: Removed 18 rows containing missing values (`geom_point()`).
======= $glucose
Warning: Removed 9084 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop
<<<<<<< HEAD


$respiratory_rate
Warning: Removed 98 rows containing missing values (`geom_point()`).


$temperature_in_Fahrenheit
Warning: Removed 1362 rows containing missing values (`geom_point()`).


$systolic_non_invasive_blood_pressure
Warning: Removed 979 rows containing missing values (`geom_point()`).


$diastolic_non_invasive_blood_pressure
Warning: Removed 983 rows containing missing values (`geom_point()`).

=======

>>>>>>> develop

(3) Length of ICU stay los vs the first vital measurements within the ICU stay

I will use the scatter plot to show the relationship between the length of ICU stay and each first vital measurements within the ICU stay.

<<<<<<< HEAD
# prepare the variables to be used in the plot
variables <- c("heart_rate", "respiratory_rate", 
               "temperature_in_Fahrenheit", 
               "systolic_non_invasive_blood_pressure", 
               "diastolic_non_invasive_blood_pressure")

# get the first vital measurements within the ICU stay
mimic_icu_cohort_select <- mimic_icu_cohort %>%
  
  # select the first vital measurements within the ICU stay
  select(all_of(variables), los) %>%
  
  # remove missing values
  filter(across(c(all_of(variables), los), ~ !is.na(.))) %>%
  
  # trim extreme values and replace old values
  mutate(across(c(all_of(variables), los), 
                ~ trim(., trim_proportion = 1, na.rm = TRUE)))
=======
# prepare the variables to be used in the plot
variables <- c("heart_rate", "respiratory_rate", 
               "temperature_in_Fahrenheit", 
               "systolic_non_invasive_blood_pressure", 
               "diastolic_non_invasive_blood_pressure")

# get the first vital measurements within the ICU stay
mimic_icu_cohort_select <- mimic_icu_cohort %>%
  
  # select the first vital measurements within the ICU stay
  select(all_of(variables), los) %>%
  
  # remove missing values
  filter(across(c(all_of(variables), los), ~ !is.na(.))) %>%
  
  # trim extreme values and replace old values
  mutate(across(c(all_of(variables), los), 
                ~ trim(., trim_proportion = 1, na.rm = TRUE)))
>>>>>>> develop
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
<<<<<<< HEAD
# using a for loop to create separate plots for each variable
plots <- list()

for (variable in variables) {
  plots[[variable]] <- mimic_icu_cohort %>%
    ggplot(aes_string(y = "los", x = variable)) +
    geom_point() +
    labs(title = paste("Length of ICU stay vs", variable),
         x = variable,
         y = "Length of ICU stay (days)") +
    theme_minimal()
}

# show the plots
plots
=======
# using a for loop to create separate plots for each variable
plots <- list()

for (variable in variables) {
  plots[[variable]] <- mimic_icu_cohort %>%
    ggplot(aes_string(y = "los", x = variable)) +
    geom_point() +
    labs(title = paste("Length of ICU stay vs", variable),
         x = variable,
         y = "Length of ICU stay (days)") +
    theme_minimal()
}

# show the plots
plots
>>>>>>> develop
$heart_rate
<<<<<<< HEAD
Warning: Removed 18 rows containing missing values (`geom_point()`).
=======
Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop


$respiratory_rate
<<<<<<< HEAD
Warning: Removed 98 rows containing missing values (`geom_point()`).
=======
Warning: Removed 98 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop


$temperature_in_Fahrenheit
<<<<<<< HEAD
Warning: Removed 1362 rows containing missing values (`geom_point()`).
=======
Warning: Removed 1362 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop


$systolic_non_invasive_blood_pressure
<<<<<<< HEAD
Warning: Removed 979 rows containing missing values (`geom_point()`).
=======
Warning: Removed 979 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop


$diastolic_non_invasive_blood_pressure
<<<<<<< HEAD
Warning: Removed 983 rows containing missing values (`geom_point()`).
=======
Warning: Removed 983 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop

(4) Length of ICU stay los vs first ICU

I will use the bar plot to show the proportion of ICU stays by the first ICU.

<<<<<<< HEAD
# count the number of ICU stays and plot the bar plot
mimic_icu_cohort %>%
  count(first_careunit) %>%
  ggplot(aes(x = first_careunit, y = n, fill = first_careunit)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  
  labs(title = "Proportion of ICU Stays by First ICU",
       x = "First ICU",
       y = "Proportion",
       fill = "First ICU") +
  
  # add the percentage labels
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
=======
# count the number of ICU stays and plot the bar plot
mimic_icu_cohort %>%
  count(first_careunit) %>%
  ggplot(aes(x = first_careunit, y = n, fill = first_careunit)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  
  labs(title = "Proportion of ICU Stays by First ICU",
       x = "First ICU",
       y = "Proportion",
       fill = "First ICU") +
  
  # add the percentage labels
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
>>>>>>> develop