In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
Q1. Visualizing patient trajectory
Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.
Q1.1 ADT history
A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.
Do a similar visualization for the patient with subject_id 10013310 using ggplot.
Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.
Answer
Step I: getting the demographic information
(1) get to know the structure of patients.csv.gz and admissions.csv.gz by zcat and head.
(2) read the patients.csv.gz and admissions.csv.gz by read_csv and filter the patient with subject_id 10013310. And then get the demographic information of the patient. Here I found that the race column in admissions.csv.gz is in upper case which is different from the example. So, I lower case the race column.
race <-read_csv("~/mimic/hosp/admissions.csv.gz") %>%filter(subject_id ==10013310) %>%distinct(race)demographics <-read_csv("~/mimic/hosp/patients.csv.gz") %>%filter(subject_id ==10013310) %>%# Lower case the race columnmutate(race =tolower(race$race))
Step II: getting the top 3 diagnoses of the patient
(1) get to know the structure of diagnoses_icd.csv.gz and d_icd_diagnoses.csv.gz by zcat and head.
(2) read the diagnoses_icd.csv.gz and d_icd_diagnoses.csv.gz by read_csv and filter the patient with subject_id 10013310. Combine the two tables by left_join. Here I found that there are several diagnoses for the patient in each patient hospitalization. So, I count the frequency of all diagnosis and select the top 3 diagnoses.
top_3_diagnoses <-read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") %>%filter(subject_id ==10013310) %>%left_join(read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz"), by =c("icd_code"="icd_code", "icd_version"="icd_version")) %>%group_by(long_title) %>%# Count the frequency of each diagnosissummarise(freq =n()) %>%# Select the top 3 diagnosesarrange(desc(freq)) %>%slice(1:3) top_3_diagnoses
# A tibble: 3 × 2
long_title freq
<chr> <int>
1 Acute on chronic systolic (congestive) heart failure 3
2 Hyperlipidemia, unspecified 3
3 Long term (current) use of insulin 3
Step III: getting the ADT history of the patient
(1) get to know the structure of diagnoses_icd.csv.gz by zcat and head.
zcat< ~/mimic/hosp/transfers.csv.gz |head
(2) read the transfers.csv.gz by read_csv and filter the patient with subject_id 10013310. And then get the ADT history of the patient. Also, I add a new column segment_thickness to represent the size of the line segment for the convenience of plotting.
# read the transfers.csv.gz and filter the patient 10013310ADT <-read_csv("~/mimic/hosp/transfers.csv.gz") %>%filter(subject_id ==10013310) %>%filter(!is.na(careunit)) %>%# Add a new column segment_thickness # to represent the size of the line segmentmutate(segment_thickness =if_else(str_detect(careunit, "Care Unit"), 10, 8))
Step IV: getting the lab events of the patient from labevents.parquet with subject_id 10013310 by arrow::open_dataset.
(1) get to know the structure of labevents.parquet by zcat and head.
(3) re-writing the csv file labevents.csv in the Parquet format.
# Define the file pathsfile_path <-"~/mimic/hosp/labevents.csv"parquet_file_path <-"~/mimic/hosp/labevents.parquet"# Rewrite the CSV file in Parquet format arrow::write_dataset(arrow::open_dataset(file_path, format ="csv"), parquet_file_path, format ="parquet")
Also, I create a symbolic link for the original directory to the Parquet file labevents.parquet under the path under the folder ./labevents_pq available at the current working directory hw3 for reproducibility.
(4) Reading the labevents.parquet file using the arrow::open_dataset. The information I need is the subject_id and charttime columns. So, I use the dplyr::select function to select these two columns and the dplyr::filter function to filter the subject_id 10013310. Also, there are duplicated charttime values because several lab events can happen at the same time, so I use the dplyr::distinct function to keep the unique charttime values.
# read the labevents.parquet filelabevents <- arrow::open_dataset('./labevents_pq', format ="parquet") %>%# Select the subject_id and charttime columns dplyr::select(subject_id, charttime) %>%# Filter the patient 10013310 dplyr::filter(subject_id ==10013310) %>%# get rid of the duplicated charttime values dplyr::distinct(subject_id, charttime) %>%collect()
Step V: getting the procedure events of the patient.
(1) get to know the structure of procedures_icd.csv.gz and d_icd_procedures.csv.gz by zcat and head.
(2) read the procedures_icd.csv.gz and d_icd_procedures.csv.gz by read_csv and filter the patient with subject_id 10013310. Combine the two tables using left_join by the icd_code and icd_version columns.
# read the procedures_icd.csv.gz and d_icd_procedures.csv.gzprocedures <-read_csv("~/mimic/hosp/procedures_icd.csv.gz") %>%filter(subject_id ==10013310) %>%# Combine the two tables by left_joinleft_join(read_csv("~/mimic/hosp/d_icd_procedures.csv.gz"), by =c("icd_code"="icd_code", "icd_version"="icd_version"))
Step VI: getting the ADT_history plot by ggplot2.
First, I use the ggplot function to create a new plot. Then, I use the scale_x_datetime function to specify the x axis limits, the scale_y_discrete function to specify the y axis with 3 levels, the geom_point function to add procedure events, the scale_shape_manual function to manually set the number of shapes according to the number of titles. Also, I use the geom_segment function to add ADT history, the geom_point function to add lab events, and the geom_text function to add diagnosis events. Finally, I use the theme function to set the theme of the plot.
ADT_history <-ggplot() +# Specify x axis limitsscale_x_datetime(name ="Calendar Time", # minus 1 day from the minimum intime # to display all the datalimits =c(min(ADT$intime) -days(1), max(ADT$outtime))) +# Specify y axis with 3 levelsscale_y_discrete(name =NULL, limits =c("Procedure", "Lab", "ADT")) +# Add procedure eventsgeom_point(data = procedures, # change the format of chartdateaes(x =as.POSIXct(chartdate), y ="Procedure", # Get shorter title using regular expressionshape =sub(",.*", "", long_title)),size =3) +# Manually set the number of shapes according to the number of titles# For reproducibility, I use n_distinct to determine the lengthscale_shape_manual(values =c(1:n_distinct(procedures$long_title))) +# Add lab eventsgeom_point(data = labevents, aes(x = charttime, y ="Lab"), shape =3, size =2) +# Add ADT eventsgeom_segment(data = ADT, aes(x = intime, xend = outtime, y ="ADT", yend ="ADT", color = careunit, # define the thickness of the segmentsize = segment_thickness)) +# Set legend position and arrangementtheme_bw() +theme(legend.position ="bottom", legend.box ="vertical", legend.key.size =unit(0, "pt"),legend.text =element_text(size =7)) +# Set legend titles and arrangementguides(color =guide_legend(title ="Care Unit", ncol =3,keywidth =1),shape =guide_legend(title ="Procedure", ncol =2),# Remove ADT's legendsize =FALSE) +# Add patient information as title and subtitlelabs(title =paste("Patient", demographics$subject_id[1], ", ", demographics$gender[1], ", ", demographics$anchor_age[1], "years old, ", demographics$race[1]),subtitle =paste(top_3_diagnoses$long_title[1], top_3_diagnoses$long_title[2], top_3_diagnoses$long_title[3],sep ="\n"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
# Print the plotADT_history
<<<<<<< HEAD
=======
>>>>>>> develop
Q1.2 ICU stays
ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.
Do a similar visualization for the patient 10013310.
Answer
(1) get to know the structure of chartevents.csv.gz and d_items.csv.gz by zcat and head.
Read the d_items.csv.gz and chartevents.csv.gz by read_csv and filter the patient with subject_id 10013310. Combine the two tables by left_join. Then, read the chartevents.csv.gz and filter the patient with subject_id 10013310. Combine the two tables by left_join.
# read the d_items.csv.gzitems <-read_csv("~/mimic/icu/d_items.csv.gz") %>% dplyr::select(c(itemid, label, abbreviation)) %>%# filter the abbreviation with HR, NBPd, NBPs, RR, Temperature F dplyr::filter(abbreviation %in%c("HR", "NBPd", "NBPs", "RR", "Temperature F"))# define the file path of chartevents.csvfile_path_chartevents <-"~/mimic/icu/chartevents.csv"# read the chartevents.csv.gzchartevents <- arrow::open_dataset(file_path_chartevents, format ="csv") %>%# filter out ids number of interested vitals dplyr::filter(itemid %in%c(220045, 220179, 220180, 220210, 223761)) %>%# filter the patient with subject_id 10013310 dplyr::filter(subject_id ==10013310) %>%# exclude unnecessary columns dplyr::select(-c(hadm_id, caregiver_id, storetime, warning)) %>%collect() %>%left_join(items, by =c("itemid"="itemid"))
(4) plot the data.
To plot the data, use ggplot and geom_point to add points, geom_line to add lines, facet_grid to show all combinations of abbreviation and stay_id, labs to add title and axis labels, theme_light to set the theme, and guides to avoid overlapping of the x-axis labels.
vitals_line_plot <-ggplot(chartevents, aes(x = charttime, y = valuenum, color = abbreviation)) +geom_point() +geom_line() +# use facet_grid to show all # combinations of abbreviation and stay_idfacet_grid(abbreviation ~ stay_id, scales ="free") +labs(title =paste("Patient", chartevents$subject_id[1], "ICU stays - Vitals"),x ="",y ="") +theme_light(base_size =9) +# remove legendtheme(legend.position ="none") +# To avoid overlapping of the x-axis labels, # using guides with n.dodge = 2guides(x =guide_axis(n.dodge =2)) # print the plotvitals_line_plot
icustays_tble <- arrow::open_dataset("~/mimic/icu/icustays.csv.gz", format ="csv") %>%collect()
Q2.2 Summary and visualization
How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.
Answer
(1) By counting the number of unique subject_id, I can find that there are 50920 unique subject_id.
# calculate the number of unique subject_idicustays_tble %>% dplyr::distinct(subject_id) %>% dplyr::count()
# A tibble: 1 × 1
n
<int>
1 50920
(2) By grouping the data by subject_id and counting the number of unique stay_id, I find that a subject_idcan have multiple ICU stays. And there are 12448 subject_id with multiple ICU stays.
# calculate the number of unique stay_idicustays_tble %>% dplyr::distinct(stay_id) %>% dplyr::count()
# A tibble: 1 × 1
n
<int>
1 73181
# count the number of subject_id with multiple ICU staysicustays_tble %>% dplyr::group_by(subject_id) %>%summarise(num_icu_stays =n_distinct(stay_id)) %>% dplyr::filter(num_icu_stays >1) %>% dplyr::count()
# A tibble: 1 × 1
n
<int>
1 12448
(3) Summarize the number of ICU stays per subject_id by graphs.
# plot the distribution of ICU stays per `subject_id` by histogramicustays_tble %>%group_by(subject_id) %>%summarise(num_icu_stays =n_distinct(stay_id)) %>%ggplot(aes(x = num_icu_stays)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Distribution of ICU Stays per Subject",x ="Number of ICU stays per subject",y ="Frequency") +theme_minimal()
All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.
Answer
(1) Number of admissions per patient:
Pattern:
The majority of patients have only one admission, and a small portion of patients have multiple admissions.
Explanation:
This may be due to the fact that most people do not need to be admitted to the hospital multiple times. However, some patients may have chronic diseases or other health problems that require multiple admissions.
# number of admissions per patientadmissions_tble %>%group_by(subject_id) %>%summarise(num_admissions =n_distinct(hadm_id)) %>%ggplot(aes(x = num_admissions)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Distribution of admissions per patient",x ="Number of admissions per subject_id",y ="Frequency") +theme_minimal()
(2) Admission hour:
Usual?
Yes. There are some unusual patterns in the admission hour.
Pattern:
The number of admission hour decreases from 0 to 6, and then increases from 8 to 18, with an exception at 7. Then, the number of admission hour decreases from 18 to 23. The frequencies from 0 and 7 seem unusual because they are higher than expected for late night and early morning times.
Explanation:
This could be due to certain hospital operations or activities during those hours. For example, the number of admissions at 0 may be due to the fact that the admission time is recorded at 0 when the exact time is not available. The number of admissions at 7 may be due to the fact that the hospital is busy at that time.
# change the timezone to UTCadmissions_tble_admission_hour <- admissions_tble %>%mutate(admittime =with_tz(admittime, "UTC")) # plot the admission houradmissions_tble_admission_hour %>%ggplot(aes(x =hour(admittime))) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Admission Hour",x ="Hour",y ="Frequency") +theme_minimal()
(3) Admission minute:
Usual?
Yes. There are unusual patterns in the admission minute.
Pattern:
The frequencies of the admission minute are extremely high at some specific minutes, such as 0, 15, 30, and 45. The frequencies of the rest of the minutes are relatively low.
Explanation:
This could be due to the fact that the admission time is recorded at these specific minutes, or the data is rounded to these specific minutes. For the rest of the minutes, the frequencies are relatively low.
No. I think the distribution of the length of hospital stay is not unusual.
Pattern:
The majority of patients stayed in the hospital for less than 10 days, and only a few patients stayed in the hospital for more than 10 days. The maximum length of stay is around 120 days.
Explanation:
This may be due to the fact that most patients are admitted for short-term care, and only a few patients require long-term care.
# standardize the time zone and calculate the length of staylength_of_stay <- admissions_tble %>%mutate(dischtime =with_tz(dischtime, "UTC")) %>%mutate(admittime =with_tz(admittime, "UTC")) %>%mutate(length_of_stay =as.numeric(difftime(dischtime, admittime, units ="days")))# plot the length of hospital staylength_of_stay %>%ggplot(aes(x = length_of_stay)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Length of Hospital Stay",x ="Length of Stay (days)",y ="Frequency") +theme_minimal()
patients_tble <- arrow::open_dataset(sources ="~/mimic/hosp/patients.csv.gz", format ="csv") %>% dplyr::collect()
Q4.2 Summary and visualization
Summarize variables gender and anchor_age by graphics, and explain any patterns you see.
Answer
(1)gender:
Pattern:
According to the pie chart, there are more female than male patients. And according to the bar chart, in each age group, it shows the same pattern.
Explanation:
This may be due to the gender distribution in the general population in the area where the hospital is located.
# Plot a pie chart to show the gender distribution# Add percentage of each gendergender_summary <- patients_tble %>%count(gender) %>%mutate(percentage = n /sum(n) *100)# plot the pie chartggplot(gender_summary, aes(x ="", y = percentage, fill = gender)) +geom_bar(stat ="identity", width =1) +geom_text(aes(label =paste0(round(percentage), "%")), position =position_stack(vjust =0.5)) +# use coord_polar to make the pie chartcoord_polar(theta ="y") +labs(title ="Distribution of Gender", fill ="Gender") +theme_void()
# plot the bar chartpatients_tble %>%ggplot(aes(x = anchor_age, fill = gender)) +geom_bar() +labs(title ="Distribution of age",x ="Age",y ="Frequency") +theme_minimal()
(2) anchor_age:
Pattern:
The distribution of anchor_age is multimodal, with peaks at around 21, 60, and over 89. This suggests that there are different age groups represented in the dataset, with a relatively high number of patients in their early 20s and around 60 years old.
Explanation:
This could be due to the fact that certain age groups are more likely to require hospital care, or it could be due to the demographics of the general population in the area where the hospital is located. There are also a few patients who are over 89 years old, which may due to that some elderly patients’ age is recorded as fixed value if they are over 89 years old.
# using bar plot to show the distribution of anchor_agepatients_tble %>%ggplot(aes(x = anchor_age)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Distribution of Anchor Age",x ="Anchor Age",y ="Frequency") +theme_minimal()
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.
Answer
Step 1: Load the icustays.csv.gz and d_labitems.csv.gz files as tibbles icustays_tble and dlabitems_tble using open_dataset and collect functions, respectively.
# load the icustays.csv.gz fileicustays_tble <- arrow::open_dataset("~/mimic/icu/icustays.csv.gz", format ="csv") %>%collect() %>%print(width =Inf)
# A tibble: 73,181 × 8
subject_id hadm_id stay_id first_careunit
<int> <int> <int> <chr>
1 10000032 29079034 39553978 Medical Intensive Care Unit (MICU)
2 10000980 26913865 39765666 Medical Intensive Care Unit (MICU)
3 10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)
4 10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)
5 10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
6 10001884 26184834 37510196 Medical Intensive Care Unit (MICU)
7 10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)
8 10002155 20345487 32358465 Medical Intensive Care Unit (MICU)
9 10002155 23822395 33685454 Coronary Care Unit (CCU)
10 10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
last_careunit intime
<chr> <dttm>
1 Medical Intensive Care Unit (MICU) 2180-07-23 07:00:00
2 Medical Intensive Care Unit (MICU) 2189-06-27 01:42:00
3 Surgical Intensive Care Unit (SICU) 2157-11-20 11:18:02
4 Surgical Intensive Care Unit (SICU) 2157-12-19 07:42:24
5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 08:52:22
6 Medical Intensive Care Unit (MICU) 2131-01-10 20:20:05
7 Cardiac Vascular Intensive Care Unit (CVICU) 2160-05-18 03:00:53
8 Medical Intensive Care Unit (MICU) 2131-03-09 13:33:00
9 Coronary Care Unit (CCU) 2129-08-04 05:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-23 17:50:00
outtime los
<dttm> <dbl>
1 2180-07-23 16:50:47 0.410
2 2189-06-27 13:38:27 0.498
3 2157-11-21 14:08:00 1.12
4 2157-12-20 06:27:41 0.948
5 2110-04-12 16:59:56 1.34
6 2131-01-20 00:27:30 9.17
7 2160-05-19 10:33:33 1.31
8 2131-03-10 10:09:21 0.859
9 2129-08-10 10:02:38 6.18
10 2130-09-27 15:13:41 3.89
# ℹ 73,171 more rows
# define the target itemstarget_items <-c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)# load the d_labitems.csv.gz filedlabitems_tble <- arrow::open_dataset("~/mimic/hosp/d_labitems.csv.gz", format ="csv") %>%filter(itemid %in% target_items) %>%collect() %>%print(width =Inf)
Step 2: Load the labevents.csv.gz file as a tibble labevents_tble and filter the rows with the target items and the patients in icustays_tble.
Here I select the columns of subject_id, itemid, storetime, and valuenum is because I only need these columns for the next step. The storetime is the time when the lab measurement was stored, and I will use it to find the last available measurement before the ICU stay. And the subject_id and itemid are the prepared columns for the next step. The valuenum is the value of the lab measurement which I interested in.
labevents <- arrow::open_dataset('./labevents_pq', format ="parquet") %>%# select the columns of interestselect(subject_id, itemid, storetime, valuenum) %>%filter(itemid %in% target_items) %>%filter(subject_id %in% icustays_tble$subject_id) %>%collect() %>%print(width =Inf)
Step 3: Join labevents_tble with icustays_tble to get the stay_id and intime for each lab measurement. Then filter the rows to keep the last available measurement before the ICU stay.
Here I use the left_join function to join the labevents_tble with icustays_tble by the subject_id column. Then I use the filter function to keep the lab measurement before the ICU stay. The group_by and arrange functions are used to sort the lab measurements by storetime and keep the last measurement for each subject_id, stay_id, and itemid. The slice_tail function is used to keep the last measurement for each group. Finally, I use the select function to remove the storetime and intime columns.
# I use if else statement and write_rds function # to save my rendering timeif (file.exists("labevents_tble.rds")) {# If the file exists, print it labevents_tble <-read_rds("labevents_tble.rds") %>%print(width =Inf)} else { labevents_tble <- labevents %>%# join with icustays_tble to the labevents table to # get the stay_id, and intimeleft_join(icustays_tble[, c('subject_id', 'stay_id', 'intime')],by ="subject_id") %>%# make sure the lab measurement is before the ICU stayfilter(storetime <= intime) %>%group_by(subject_id, stay_id, itemid) %>%# sort by storetime and keep the last measurementarrange(storetime, by_group =TRUE) %>%slice_tail(n =1) %>%# remove the storetime and intime columnsselect(-storetime, -intime) %>%# because the form of the `itemid` is in long format, # I need to convert it to wide formatpivot_wider(names_from = itemid, values_from = valuenum) %>%# use the more meaningful names by renaming the columnsrename(creatinine =`50912`, potassium =`50971`, sodium =`50983`, chloride =`50902`,bicarbonate =`50882`, hematocrit =`51221`, white_blood_cell_count =`51301`, glucose =`50931`) %>%
<<<<<<< HEAD
print(width, Inf) %>%
=======
print(width =Inf) %>%
>>>>>>> develop
write_rds("labevents_tble.rds")}
chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.
Answer
I create a symbolic link for the original directory to the Parquet file chartevents.parquet under the path under the folder ./chartevents_pq available at the current working directory hw3.
Step 1: Load the d_items.csv.gz file and filter the rows to keep the itemid for the vitals of interest.
# load the d_items.csv.gz file d_items_tble <- arrow::open_dataset("~/mimic/icu/d_items.csv.gz", format ="csv") %>%# filter the rows to keep the itemid for the vitals of interestfilter(itemid %in%c(220045, 220179, 220180, 223761, 220210)) %>%collect() %>%print(width =Inf)
# A tibble: 5 × 9
itemid label abbreviation linksto
<int> <chr> <chr> <chr>
1 220045 Heart Rate HR chartevents
2 220179 Non Invasive Blood Pressure systolic NBPs chartevents
3 220180 Non Invasive Blood Pressure diastolic NBPd chartevents
4 220210 Respiratory Rate RR chartevents
5 223761 Temperature Fahrenheit Temperature F chartevents
category unitname param_type lownormalvalue highnormalvalue
<chr> <chr> <chr> <int> <dbl>
1 Routine Vital Signs bpm Numeric NA NA
2 Routine Vital Signs mmHg Numeric NA NA
3 Routine Vital Signs mmHg Numeric NA NA
4 Respiratory insp/min Numeric NA NA
5 Routine Vital Signs °F Numeric NA NA
Step 2: Load the chartevents.csv.gz file and filter the rows to keep the itemid for the vitals of interest.
Here, I select the subject_id, itemid, charttime, and valuenum, which are the columns of interest. The subject_id and itemid are preparation for the next step. charttime will be used to filter the charted events within the ICU stay.
# load the chartevents.csv.gz filechartevents <- arrow::open_dataset('./chartevents_pq', format ="parquet") %>%select(subject_id, itemid, charttime, valuenum) %>%# filter the rows to keep the itemid for the vitals of interestfilter(itemid %in%c(220045, 220179, 220180, 223761, 220210)) %>%filter(subject_id %in% icustays_tble$subject_id) %>%collect() %>%print(width =Inf)
Step 3: Left join chartevents with icustays_tble to get stay_id and intime. Then, filter the charted events within the ICU stay. Keep the first measurement for each subject_id and itemid.
Here, I use left_join function to join the chartevents with icustays_tble to get stay_id and intime. Then, I filter the charted events within the ICU stay. I use group_by and arrange functions to keep the first measurement for each subject_id and itemid. Also, I use filter function to keep the charttime within the ICU stay. And I use slice_head function to keep the first measurement for each subject_id and itemid. Finally, I use select function to keep the columns of interest, use pivot_wider function to convert the form of the itemid from long to wide format, and use rename function to use the more meaningful names by renaming the columns.
# I use the if-else statement and write_rds function # to save my rendering timeif (file.exists("chartevents_tble.rds")) {# If the file exists, read the data from the file and print it chartevents_tble <-read_rds("chartevents_tble.rds") print(chartevents_tble, width =Inf)} else {# If the file doesn't exist, create the chartevents_tble# Left join chartevents with icustays_tble # to get stay_id, intime, and outtime chartevents_tble <- chartevents %>%left_join(icustays_tble %>%select(subject_id, stay_id, intime, outtime),by ="subject_id") %>%# Filter charted events within ICU stayfilter(charttime >= intime, charttime <= outtime) %>%# Keep the first measurement for each subject_id, stay_id, itemidgroup_by(subject_id, stay_id, itemid) %>%arrange(charttime, .by_group =TRUE) %>%slice_head(n =1) %>%# Remove unnecessary columnsselect(-charttime, -intime, -outtime) %>%# Convert itemid to wide formatpivot_wider(names_from = itemid, values_from = valuenum) %>%# Rename columns with meaningful namesrename(heart_rate =`220045`, systolic_non_invasive_blood_pressure =`220179`, diastolic_non_invasive_blood_pressure =`220180`, temperature_in_Fahrenheit =`223761`, respiratory_rate =`220210`) %>%print(width =Inf) %>%write_rds("chartevents_tble.rds") }
Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables
all variables in icustays_tble
all variables in admissions_tble
all variables in patients_tble
the last lab measurements before the ICU stay in labevents_tble
the first vital measurements during the ICU stay in chartevents_tble
The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.
Answer
Here, I use left_join function to join all the five tibbles. Also, I add a age_intime column to calculate the age at intime and filter the rows to keep the adults (age at intime >= 18).
mimic_icu_cohort <- icustays_tble %>%# join with admissions_tble to the icustays table to get the# admission informationleft_join(admissions_tble, by =c("hadm_id", "subject_id")) %>%# join with patients_tble to the icustays table to get the# patient informationleft_join(patients_tble, by ="subject_id") %>%# join with labevents_tble to the icustays table to get the# last lab measurements before ICU stayleft_join(labevents_tble, by =c("subject_id", "stay_id")) %>%# join with chartevents_tble to the icustays table to get the# first vital measurements within the ICU stayleft_join(chartevents_tble, by =c("subject_id", "stay_id")) %>%# create a new column age_intimemutate(age_intime =year(intime) - anchor_year + anchor_age) %>%# select the rows where age_intime is greater than or equal to 18filter(age_intime >=18) %>%print(width =Inf)
# A tibble: 73,181 × 41
subject_id hadm_id stay_id first_careunit
<int> <int> <int> <chr>
1 10000032 29079034 39553978 Medical Intensive Care Unit (MICU)
2 10000980 26913865 39765666 Medical Intensive Care Unit (MICU)
3 10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)
4 10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)
5 10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
6 10001884 26184834 37510196 Medical Intensive Care Unit (MICU)
7 10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)
8 10002155 20345487 32358465 Medical Intensive Care Unit (MICU)
9 10002155 23822395 33685454 Coronary Care Unit (CCU)
10 10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
last_careunit intime
<chr> <dttm>
1 Medical Intensive Care Unit (MICU) 2180-07-23 07:00:00
2 Medical Intensive Care Unit (MICU) 2189-06-27 01:42:00
3 Surgical Intensive Care Unit (SICU) 2157-11-20 11:18:02
4 Surgical Intensive Care Unit (SICU) 2157-12-19 07:42:24
5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 08:52:22
6 Medical Intensive Care Unit (MICU) 2131-01-10 20:20:05
7 Cardiac Vascular Intensive Care Unit (CVICU) 2160-05-18 03:00:53
8 Medical Intensive Care Unit (MICU) 2131-03-09 13:33:00
9 Coronary Care Unit (CCU) 2129-08-04 05:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-23 17:50:00
outtime los admittime dischtime
<dttm> <dbl> <dttm> <dttm>
1 2180-07-23 16:50:47 0.410 2180-07-23 05:35:00 2180-07-25 10:55:00
2 2189-06-27 13:38:27 0.498 2189-06-27 00:38:00 2189-07-02 20:00:00
3 2157-11-21 14:08:00 1.12 2157-11-18 14:56:00 2157-11-25 10:00:00
4 2157-12-20 06:27:41 0.948 2157-12-18 08:58:00 2157-12-24 06:55:00
5 2110-04-12 16:59:56 1.34 2110-04-11 08:08:00 2110-04-14 08:00:00
6 2131-01-20 00:27:30 9.17 2131-01-07 12:39:00 2131-01-19 21:15:00
7 2160-05-19 10:33:33 1.31 2160-05-18 00:45:00 2160-05-23 06:30:00
8 2131-03-10 10:09:21 0.859 2131-03-09 12:33:00 2131-03-09 17:55:00
9 2129-08-10 10:02:38 6.18 2129-08-04 05:44:00 2129-08-18 09:53:00
10 2130-09-27 15:13:41 3.89 2130-09-23 14:59:00 2130-09-29 11:55:00
deathtime admission_type admit_provider_id
<dttm> <chr> <chr>
1 NA EW EMER. P30KEH
2 NA EW EMER. P30KEH
3 NA EW EMER. P4645A
4 NA DIRECT EMER. P99698
5 NA EW EMER. P35SU0
6 2131-01-19 21:15:00 OBSERVATION ADMIT P874LG
7 NA SURGICAL SAME DAY ADMISSION P47E1G
8 2131-03-10 13:53:00 EW EMER. P80515
9 NA EW EMER. P05HUO
10 NA EW EMER. P3529J
admission_location discharge_location insurance language
<chr> <chr> <chr> <chr>
1 EMERGENCY ROOM HOME Medicaid ENGLISH
2 EMERGENCY ROOM HOME HEALTH CARE Medicare ENGLISH
3 EMERGENCY ROOM HOME HEALTH CARE Other ?
4 PHYSICIAN REFERRAL HOME HEALTH CARE Other ?
5 PACU HOME Other ENGLISH
6 EMERGENCY ROOM DIED Medicare ENGLISH
7 PHYSICIAN REFERRAL HOME HEALTH CARE Medicare ENGLISH
8 EMERGENCY ROOM DIED Other ENGLISH
9 PROCEDURE SITE CHRONIC/LONG TERM ACUTE CARE Other ENGLISH
10 EMERGENCY ROOM HOME HEALTH CARE Other ENGLISH
marital_status race edregtime edouttime
<chr> <chr> <dttm> <dttm>
1 WIDOWED WHITE 2180-07-22 22:54:00 2180-07-23 07:00:00
2 MARRIED BLACK/AFRICAN AMERICAN 2189-06-26 23:25:00 2189-06-27 01:42:00
3 MARRIED WHITE 2157-11-18 09:38:00 2157-11-18 17:24:00
4 MARRIED WHITE NA NA
5 MARRIED WHITE NA NA
6 MARRIED BLACK/AFRICAN AMERICAN 2131-01-07 05:36:00 2131-01-07 14:13:00
7 SINGLE OTHER NA NA
8 MARRIED WHITE 2131-03-09 11:14:00 2131-03-09 13:33:00
9 MARRIED WHITE 2129-08-04 04:00:00 2129-08-04 05:35:00
10 MARRIED WHITE 2130-09-23 12:59:00 2130-09-23 17:50:00
hospital_expire_flag gender anchor_age anchor_year anchor_year_group
<int> <chr> <int> <int> <chr>
1 0 F 52 2180 2014 - 2016
2 0 F 73 2186 2008 - 2010
3 0 F 55 2157 2011 - 2013
4 0 F 55 2157 2011 - 2013
5 0 F 46 2110 2011 - 2013
6 1 F 68 2122 2008 - 2010
7 0 F 53 2156 2008 - 2010
8 1 F 80 2128 2008 - 2010
9 0 F 80 2128 2008 - 2010
10 0 F 80 2128 2008 - 2010
dod bicarbonate chloride creatinine glucose potassium sodium
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2180-09-09 25 95 0.7 102 6.7 126
2 2193-08-26 21 109 2.3 89 3.9 144
3 NA 22 108 0.6 112 4.2 142
4 NA 30 104 0.5 87 4.1 142
5 NA NA 98 NA NA 4.1 139
6 2131-01-20 30 88 1.1 141 4.5 130
7 NA 24 102 0.9 288 3.5 137
8 2131-03-10 26 85 1.4 133 5.7 120
9 2131-03-10 24 105 1.1 138 4.6 139
10 2131-03-10 23 98 2.8 117 4.9 135
hematocrit white_blood_cell_count heart_rate
<dbl> <dbl> <dbl>
1 41.1 6.9 91
2 27.3 5.3 77
3 38.1 15.7 86
4 37.4 5.4 96
5 NA NA 55
6 39.7 12.2 38
7 34.9 7.2 80
8 22.4 9.8 98
9 39.7 7.9 68
10 25.5 17.9 94
systolic_non_invasive_blood_pressure diastolic_non_invasive_blood_pressure
<dbl> <dbl>
1 84 48
2 150 77
3 151 90
4 167 95
5 73 56
6 180 12
7 104 70
8 109 65
9 126 61
10 118 51
respiratory_rate temperature_in_Fahrenheit age_intime
<dbl> <dbl> <dbl>
1 24 98.7 52
2 23 98 76
3 18 98.5 55
4 11 97.6 55
5 19 97.7 46
6 10 98.1 77
7 14 97.2 57
8 23 97.7 83
9 18 95.9 81
10 18 96.9 82
# ℹ 73,171 more rows
Q8. Exploratory data analysis (EDA)
Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:
Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)
Length of ICU stay los vs the last available lab measurements before ICU stay filter storetime intime
Length of ICU stay los vs the first vital measurements within the ICU stay
Length of ICU stay los vs first ICU unit
Answer
(1.1) Length of ICU stay los vs demographic
Getting both the bar plot and the summary statistics for the length of ICU stay
# group by gender and plot the bar plotmimic_icu_cohort %>%group_by(race) %>%ggplot(aes(x = los, y = race)) +geom_point() +labs(title ="Summary of Length of ICU stay by race",x ="Length of ICU stay (days)",y ="Race") +theme_minimal()
# get the mean, standard deviation and IQR of the length of ICU stay# for different ethnicial groupsmimic_icu_cohort %>%group_by(race) %>%summarize(mean_los =mean(los, na.rm =TRUE),std_dev =sd(los, na.rm =TRUE),iqr =IQR(los, na.rm =TRUE))
# A tibble: 33 × 4
race mean_los std_dev iqr
<chr> <dbl> <dbl> <dbl>
1 AMERICAN INDIAN/ALASKA NATIVE 4.46 6.54 4.01
2 ASIAN 3.49 4.90 2.70
3 ASIAN - ASIAN INDIAN 4.20 7.61 2.81
4 ASIAN - CHINESE 3.36 5.39 2.28
5 ASIAN - KOREAN 4.23 6.35 2.85
6 ASIAN - SOUTH EAST ASIAN 3.04 5.42 1.90
7 BLACK/AFRICAN 3.82 5.90 3.04
8 BLACK/AFRICAN AMERICAN 3.28 4.80 2.43
9 BLACK/CAPE VERDEAN 3.22 4.94 2.53
10 BLACK/CARIBBEAN ISLAND 3.61 5.13 2.45
# ℹ 23 more rows
(1.2) Length of ICU stay los vs insurance
Using the box plot to show the summary of the length of ICU stay by insurance
# group by insurance and plot the box plotmimic_icu_cohort %>%group_by(insurance) %>%ggplot(aes(x = los, y = insurance)) +geom_boxplot() +labs(title ="Summary of Length of ICU stay by insurance",x ="Length of ICU stay (days)",y ="Insurance") +theme_minimal()
(1.3) Length of ICU stay los vs marital_status
Using the violin plot to show the summary of the length of ICU stay by marital status. And also, adding the points to show the distribution of the length of ICU stay.
# group by marital_status and plot the violin plotmimic_icu_cohort %>%group_by(marital_status) %>%ggplot(aes(x = los, y = marital_status)) +geom_violin() +geom_point() +labs(title ="Summary of Length of ICU stay by Marital Status",x ="Length of ICU stay (days)",y ="Marital Status") +theme_minimal()
(1.4) Length of ICU stay los vs gender
Using the count plot to show the summary of the length of ICU
# group by gender and plot the count plotmimic_icu_cohort %>%group_by(gender) %>%ggplot(aes(x = gender, y = los)) +geom_count() +labs(title ="Summary of Length of ICU stay by Gender",x ="Gender",y ="Length of ICU stay (days)") +theme_minimal()
(1.5) Length of ICU stay los vs age at intime
Using the scatter plot to show the summary of the length of ICU stay by age at intime
# group by age at intime and plot the scatter plotmimic_icu_cohort %>%ggplot(aes(y = los, x = age_intime)) +geom_point() +labs(title ="Summary of Length of ICU stay by Age at intime",x ="Age at intime",y ="Length of ICU stay (days)") +theme_minimal()
(2) Length of ICU stay los vs the last available lab measurements before ICU stay I will dot plot to show the relationship between the length of ICU stay and each last available lab measurements before ICU stay.
# prepare the variables to be used in the plot
<<<<<<< HEAD
variables <-c("white_blood_cell_count", "heart_rate", "respiratory_rate", "temperature_in_Fahrenheit", "systolic_non_invasive_blood_pressure", "diastolic_non_invasive_blood_pressure")
=======
variables <-c("potassium", "sodium", "chloride", "bicarbonate", "hematocrit", "white_blood_cell_count", "glucose")
>>>>>>> develop
# get the last lab measurements before ICU staymimic_icu_cohort_select <- mimic_icu_cohort %>%# select the last lab measurements before ICU stayselect(all_of(variables), los) %>%# remove missing valuesfilter(across(c(all_of(variables), los), ~!is.na(.))) %>%# trim extreme values and replace old columnsmutate(across(c(all_of(variables), los), ~trim(., trim_proportion =1, na.rm =TRUE)))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
# using a for loop to create separate plots for each variableplots <-list()for (variable in variables) { plots[[variable]] <- mimic_icu_cohort %>%ggplot(aes_string(y ="los", x = variable)) +geom_point() +labs(title =paste("Length of ICU stay vs", variable),x = variable,y ="Length of ICU stay (days)") +theme_minimal()}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
(3) Length of ICU stay los vs the first vital measurements within the ICU stay
I will use the scatter plot to show the relationship between the length of ICU stay and each first vital measurements within the ICU stay.
<<<<<<< HEAD
# prepare the variables to be used in the plotvariables <-c("heart_rate", "respiratory_rate", "temperature_in_Fahrenheit", "systolic_non_invasive_blood_pressure", "diastolic_non_invasive_blood_pressure")# get the first vital measurements within the ICU staymimic_icu_cohort_select <- mimic_icu_cohort %>%# select the first vital measurements within the ICU stayselect(all_of(variables), los) %>%# remove missing valuesfilter(across(c(all_of(variables), los), ~!is.na(.))) %>%# trim extreme values and replace old valuesmutate(across(c(all_of(variables), los), ~trim(., trim_proportion =1, na.rm =TRUE)))
=======
# prepare the variables to be used in the plotvariables <-c("heart_rate", "respiratory_rate", "temperature_in_Fahrenheit", "systolic_non_invasive_blood_pressure", "diastolic_non_invasive_blood_pressure")# get the first vital measurements within the ICU staymimic_icu_cohort_select <- mimic_icu_cohort %>%# select the first vital measurements within the ICU stayselect(all_of(variables), los) %>%# remove missing valuesfilter(across(c(all_of(variables), los), ~!is.na(.))) %>%# trim extreme values and replace old valuesmutate(across(c(all_of(variables), los), ~trim(., trim_proportion =1, na.rm =TRUE)))
>>>>>>> develop
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
<<<<<<< HEAD
# using a for loop to create separate plots for each variableplots <-list()for (variable in variables) { plots[[variable]] <- mimic_icu_cohort %>%ggplot(aes_string(y ="los", x = variable)) +geom_point() +labs(title =paste("Length of ICU stay vs", variable),x = variable,y ="Length of ICU stay (days)") +theme_minimal()}# show the plotsplots
=======
# using a for loop to create separate plots for each variableplots <-list()for (variable in variables) { plots[[variable]] <- mimic_icu_cohort %>%ggplot(aes_string(y ="los", x = variable)) +geom_point() +labs(title =paste("Length of ICU stay vs", variable),x = variable,y ="Length of ICU stay (days)") +theme_minimal()}# show the plotsplots
Warning: Removed 983 rows containing missing values or values outside the scale range
(`geom_point()`).
>>>>>>> develop
(4) Length of ICU stay los vs first ICU
I will use the bar plot to show the proportion of ICU stays by the first ICU.
<<<<<<< HEAD
# count the number of ICU stays and plot the bar plotmimic_icu_cohort %>%count(first_careunit) %>%ggplot(aes(x = first_careunit, y = n, fill = first_careunit)) +geom_bar(width =1, stat ="identity", color ="white") +labs(title ="Proportion of ICU Stays by First ICU",x ="First ICU",y ="Proportion",fill ="First ICU") +# add the percentage labelsscale_y_continuous(labels = scales::percent_format(scale =1)) +theme_minimal() +theme(axis.text.x =element_blank(),axis.ticks.x =element_blank())
=======
# count the number of ICU stays and plot the bar plotmimic_icu_cohort %>%count(first_careunit) %>%ggplot(aes(x = first_careunit, y = n, fill = first_careunit)) +geom_bar(width =1, stat ="identity", color ="white") +labs(title ="Proportion of ICU Stays by First ICU",x ="First ICU",y ="Proportion",fill ="First ICU") +# add the percentage labelsscale_y_continuous(labels = scales::percent_format(scale =1)) +theme_minimal() +theme(axis.text.x =element_blank(),axis.ticks.x =element_blank())