Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Jiachen Ai, UID:206182615

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.1       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.25    knitr_1.44        jsonlite_1.8.7    xfun_0.40        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.22    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(readr)
library(dplyr)
library(duckdb)
library(DBI)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:  8.000 GiB 
Freeram:   2.124 GiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 35786920
-rw-rw-r--@ 1 jacenai  staff     15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 jacenai  staff       427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 jacenai  staff       859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 jacenai  staff       578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 jacenai  staff        12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 jacenai  staff     25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 jacenai  staff      7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 jacenai  staff    508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 jacenai  staff    471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 jacenai  staff      1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff  13730083993 Jan  5  2023 labevents.csv
-rw-rw-r--@ 1 jacenai  staff   1939088924 Jan  5  2023 labevents.csv.gz
drwxr-xr-x@ 3 jacenai  staff           96 Feb  5 16:52 labevents.parquet
-rw-r--r--@ 1 jacenai  staff     56623104 Feb  9 18:57 labevents_filtered.csv.gz
-rw-rw-r--@ 1 jacenai  staff     96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 jacenai  staff      2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 jacenai  staff    398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 jacenai  staff    498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 jacenai  staff     25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 jacenai  staff    458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 jacenai  staff      6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 jacenai  staff       122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 jacenai  staff      6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 jacenai  staff     36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 65148984
-rw-rw-r--@ 1 jacenai  staff        35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 jacenai  staff  30204420231 Jan  5  2023 chartevents.csv
-rw-rw-r--@ 1 jacenai  staff   2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff        57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 jacenai  staff     45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff      2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 jacenai  staff    251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff    324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

Answer

First, we read the admissions.csv.gz file with read.csv, read_csv, and fread, and compare their reading speeds.

mimic_path <- "~/mimic/hosp/"
# reading time for read.csv
system.time(data_read.csv <- read.csv(
  str_c(mimic_path,"admissions.csv.gz")))
   user  system elapsed 
  4.500   0.075   4.577 
# reading time for read_csv
system.time(data_read_csv <- read_csv(
  str_c(mimic_path,"admissions.csv.gz")))
   user  system elapsed 
  1.404   0.090   0.804 
# reading time for fread
system.time(data_fread <- fread(
  str_c(mimic_path,"admissions.csv.gz")))
   user  system elapsed 
  0.695   0.050   0.754 

Then, we will compare the memory usage.

# memory usage for read.csv
pryr::object_size(data_read.csv)
158.71 MB
# memory usage for read_csv
pryr::object_size(data_read_csv)
55.31 MB
# memory usage for fread
pryr::object_size(data_fread)
50.13 MB

Finally, we will compare the data types.

# data types for read.csv
str(data_read.csv)
'data.frame':   431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : chr  "2180-05-06 22:23:00" "2180-06-26 18:27:00" "2180-08-05 23:44:00" "2180-07-23 12:35:00" ...
 $ dischtime           : chr  "2180-05-07 17:15:00" "2180-06-27 18:49:00" "2180-08-07 17:50:00" "2180-07-25 17:55:00" ...
 $ deathtime           : chr  "" "" "" "" ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : chr  "2180-05-06 19:17:00" "2180-06-26 15:54:00" "2180-08-05 20:58:00" "2180-07-23 05:54:00" ...
 $ edouttime           : chr  "2180-05-06 23:30:00" "2180-06-26 21:31:00" "2180-08-06 01:44:00" "2180-07-23 14:00:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
# data types for read_csv
str(data_read_csv)
spc_tbl_ [431,231 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id          : num [1:431231] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
 $ hadm_id             : num [1:431231] 22595853 22841357 25742920 29079034 25022803 ...
 $ admittime           : POSIXct[1:431231], format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct[1:431231], format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct[1:431231], format: NA NA ...
 $ admission_type      : chr [1:431231] "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr [1:431231] "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr [1:431231] "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr [1:431231] "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr [1:431231] "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr [1:431231] "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr [1:431231] "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr [1:431231] "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct[1:431231], format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct[1:431231], format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: num [1:431231] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   hadm_id = col_double(),
  ..   admittime = col_datetime(format = ""),
  ..   dischtime = col_datetime(format = ""),
  ..   deathtime = col_datetime(format = ""),
  ..   admission_type = col_character(),
  ..   admit_provider_id = col_character(),
  ..   admission_location = col_character(),
  ..   discharge_location = col_character(),
  ..   insurance = col_character(),
  ..   language = col_character(),
  ..   marital_status = col_character(),
  ..   race = col_character(),
  ..   edregtime = col_datetime(format = ""),
  ..   edouttime = col_datetime(format = ""),
  ..   hospital_expire_flag = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# data types for fread
str(data_fread)

In terms of speed, fread is the fastest by both user and elapsed time, followed closely by read_csv, while read.csv is by far the slowest.

In terms of the memory usage of the resultant dataframe or tibble, fread uses the least memory, followed by read_csv, and read.csv uses the most memory.

In terms of the default parsed data types, fread and read_csv give similar results if "double" and "integer" are both regarded as numeric. read.csv differs for several variables; most notably, it reads the date-time columns (admittime, dischtime, deathtime, edregtime, edouttime) as plain character strings, whereas read_csv parses them as POSIXct date-times.
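For a more compact side-by-side comparison of the default column types, a small summary table can be built from the three objects created above (a sketch; output omitted here):

# Compare the default parsed type of each column across the three readers
tibble(
  column   = names(data_read.csv),
  read.csv = sapply(data_read.csv, function(x) class(x)[1]),
  read_csv = sapply(data_read_csv, function(x) class(x)[1]),
  fread    = sapply(data_fread, function(x) class(x)[1])
)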

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

Answer

I indicated appropriate column data types in read_csv by looking at the original data types in the admissions.csv.gz file.

system.time(data_read_csv <- read_csv(
  str_c(mimic_path,"admissions.csv.gz"), 
  col_types = "nnTTTfcffffffTTn"))
   user  system elapsed 
  1.221   0.077   0.559 
pryr::object_size(data_read_csv)
43.24 MB

Yes, the run time decreases: supplying col_types lets read_csv skip the column-type guessing step. The resultant tibble uses 43.24 MB of memory, smaller than the 55.31 MB in Q1.1, largely because the factor columns are stored more compactly than character columns.
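For readability, the compact type string "nnTTTfcffffffTTn" can equivalently be written as an explicit cols() specification (a sketch; the object names admissions_spec and data_read_csv2 are introduced here just for illustration):

# Explicit column specification equivalent to "nnTTTfcffffffTTn"
admissions_spec <- cols(
  subject_id           = col_number(),
  hadm_id              = col_number(),
  admittime            = col_datetime(),
  dischtime            = col_datetime(),
  deathtime            = col_datetime(),
  admission_type       = col_factor(),
  admit_provider_id    = col_character(),
  admission_location   = col_factor(),
  discharge_location   = col_factor(),
  insurance            = col_factor(),
  language             = col_factor(),
  marital_status       = col_factor(),
  race                 = col_factor(),
  edregtime            = col_datetime(),
  edouttime            = col_datetime(),
  hospital_expire_flag = col_number()
)
data_read_csv2 <- read_csv(str_c(mimic_path, "admissions.csv.gz"),
                           col_types = admissions_spec)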

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff  1939088924 Jan  5  2023 /Users/jacenai/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

Answer

I tried to read the file using read_csv function as follows:

read_csv(str_c("~/mimic/hosp/","labevents.csv.gz"))

Reading did not finish after more than 5 minutes, so I aborted the program. The main reason is the file size: labevents.csv.gz is about 1.9 GB compressed (roughly 13.7 GB uncompressed), so there is a very large amount of data to decompress and parse, and read_csv builds the entire result in memory. With only 8 GiB of RAM on my machine, this single-pass read is impractical.
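One way to confirm that the file itself parses correctly, without ingesting all of its roughly 118 million rows, is to cap the number of rows read (a sketch; labevents_peek is a hypothetical name and this chunk was not timed):

# Read only the first 1,000 rows to inspect the columns (sketch)
labevents_peek <- read_csv(str_c("~/mimic/hosp/", "labevents.csv.gz"),
                           n_max = 1000)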

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

Answer: I tried to read the file using read_csv function with col_select argument as follows:

read_csv("~/mimic/hosp/labevents.csv.gz", 
         col_select = c("subject_id", 
                        "itemid", 
                        "charttime", 
                        "valuenum"))
# A tibble: 118,171,367 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  51237 2180-03-23 11:51:00      1.4
 2   10000032  51274 2180-03-23 11:51:00     15.1
 3   10000032  50853 2180-03-23 11:51:00     15  
 4   10000032  50861 2180-03-23 11:51:00    102  
 5   10000032  50862 2180-03-23 11:51:00      3.3
 6   10000032  50863 2180-03-23 11:51:00    109  
 7   10000032  50864 2180-03-23 11:51:00      8  
 8   10000032  50868 2180-03-23 11:51:00     12  
 9   10000032  50878 2180-03-23 11:51:00    143  
10   10000032  50882 2180-03-23 11:51:00     27  
# ℹ 118,171,357 more rows

It does not fully solve the ingestion issue. Even with only four columns selected, the read still took more than 5 minutes, because the entire compressed file must be decompressed and scanned, and the resulting 118-million-row tibble still occupies a large amount of memory.
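Supplying col_types together with col_select would at least skip readr's type-guessing pass, although the whole compressed file must still be decompressed and scanned (a sketch only; labevents_sel is a hypothetical name and this chunk was not timed):

# Select four columns and fix their types up front (sketch)
labevents_sel <- read_csv(
  "~/mimic/hosp/labevents.csv.gz",
  col_select = c(subject_id, itemid, charttime, valuenum),
  col_types = cols_only(
    subject_id = col_integer(),
    itemid     = col_integer(),
    charttime  = col_datetime(),
    valuenum   = col_double()
  )
)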

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Answer

First, display the first 10 lines of labevents.csv.gz to see the structure of the file and the order of its columns.

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Then, extract specific columns and rows from the labevents.csv.gz file and save the result to a new file labevents_filtered.csv.gz

zcat < ~/mimic/hosp/labevents.csv.gz | \
awk -F ',' 'BEGIN {OFS = ","} {
    if ($5 ~ /^(50912|50971|50983|50902|50882|51221|51301|50931)$/) \
        print $2, $5, $7, $10
    }' | \
gzip > ~/mimic/hosp/labevents_filtered.csv.gz

Display the first 10 lines of the new file labevents_filtered.csv.gz

zcat < ~/mimic/hosp/labevents_filtered.csv.gz | head -n 10
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5

Next, count the number of lines in the new file labevents_filtered.csv.gz. The complete filtered file contains 24,855,909 lines, matching the row count obtained with Arrow in Q2.4. The smaller count reported below, together with the "unexpected end of file" warning, indicates that the copy of labevents_filtered.csv.gz on disk was incomplete when this chunk was rendered.

zcat < ~/mimic/hosp/labevents_filtered.csv.gz | wc -l
zcat: (stdin): unexpected end of file
 10568729

Finally, measure how long it takes read_csv to ingest labevents_filtered.csv.gz:

system.time({
  data <- read_csv("~/mimic/hosp/labevents_filtered.csv.gz")
})
   user  system elapsed 
 19.625   0.710  19.676 
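Because the filtered file has no header row, column names can be supplied at read time so the four columns are named as in the original data (a sketch; not timed here):

# Re-read the filtered file with explicit column names (sketch)
labevents_filtered <- read_csv(
  "~/mimic/hosp/labevents_filtered.csv.gz",
  col_names = c("subject_id", "itemid", "charttime", "valuenum")
)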

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

Answer

First, decompress the labevents.csv.gz file (the -k flag keeps the compressed original).

system("gzip -d -k ~/mimic/hosp/labevents.csv.gz")

Then, using arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3.

file_path <- "~/mimic/hosp/labevents.csv"

#ingest+select+filter process and measure the time it takes
system.time(labevents <- arrow::open_dataset(file_path, 
                                             format = "csv") %>% 
  dplyr::select(subject_id, itemid, charttime, valuenum) %>% 
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 
                              50882, 51221, 51301, 50931)) %>% 
  collect())
   user  system elapsed 
 52.324   3.724  49.725 

Then, display the number of rows in the result tibble.

#display the number of rows in the result tibble
nrow(labevents)
[1] 24855909

The ingest+select+filter process takes about 50 seconds of elapsed time, as shown above. The result tibble has 24,855,909 rows.

Then, display the first 10 rows of the result tibble.

The values in subject_id, itemid, and valuenum match those in Q2.3. The charttime column looks different only because Arrow parses the timestamps as UTC and R then prints them in the local time zone (America/Los_Angeles); the underlying time points are unchanged (see the sketch after the output below).

head(labevents, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  
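If one wants the printed clock times to match the raw values shown in Q2.3, the display time zone can be shifted back to UTC without changing the underlying instants (a sketch, assuming the timestamps were parsed as UTC as described above):

# Print charttime in UTC so the clock values match the raw CSV (sketch)
labevents %>%
  mutate(charttime = lubridate::with_tz(charttime, tzone = "UTC")) %>%
  head(10)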

Explanation of Apache Arrow: Apache Arrow is like a universal language for data that allows different software programs to communicate and share data more efficiently. It’s a technology that makes it easier for different systems to understand and work with large amounts of data, regardless of the programming language they are written in. So, it’s like a common ground where data can be exchanged and understood by various applications, making data analysis and processing faster and more seamless, just like how people from different countries can communicate more effectively using a common language.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

Answer

First, re-writing the csv file labevents.csv in the binary Parquet format.

# Define the file paths
parquet_file_path <- "~/mimic/hosp/labevents.parquet"

# Rewrite the CSV file in Parquet format    
arrow::write_dataset(arrow::open_dataset(file_path, format = "csv"), 
                     parquet_file_path, format = "parquet")
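arrow::write_dataset writes a directory of Parquet part files rather than a single file. Their names and total size can also be checked from within R (a sketch; the du command below reports the size of the same directory):

# List the Parquet part files and compute their total size in GB (sketch)
parquet_files <- list.files(parquet_file_path, recursive = TRUE,
                            full.names = TRUE)
parquet_files
sum(file.size(parquet_files)) / 1e9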

Next, measure the size of the Parquet file; it is about 1.9 GB.

#measure the size of the Parquet file
du -h ~/mimic/hosp/labevents.parquet
1.9G    /Users/jacenai/mimic/hosp/labevents.parquet

Then, measure the time taken by the ingest+select+filter process on the Parquet file.

system.time({
  parquet_dataset <- 
    arrow::open_dataset(parquet_file_path, format = "parquet") %>%
    dplyr::select(subject_id, itemid, charttime, valuenum) %>%
    dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 
                                50882, 51221, 51301, 50931)) %>%
    collect()
})
   user  system elapsed 
 12.341   3.460   4.102 

Then, count the rows in the result tibble.

nrow(parquet_dataset)
[1] 24855909

The ingest+select+filter process on the Parquet file takes about 4 seconds of elapsed time, as shown above, much faster than the CSV-based approaches. The result tibble again has 24,855,909 rows.

Then, display the first 10 rows of the result tibble.

The values in subject_id, itemid, and valuenum match those in Q2.3; as in Q2.4, charttime is printed in the local time zone rather than as the raw UTC clock time.

#display the first 10 rows of the result tibble
head(parquet_dataset, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  

Explanation of Parquet format: Parquet is a file format for storing data that is designed to be efficient and optimized for use with big data processing frameworks. It’s like a special type of container that organizes data in a highly compressed and columnar format, making it easier for computers to read and process large amounts of data quickly. Think of it as a way to pack and organize data in a smart and efficient way, similar to how a well-organized filing cabinet makes it easy to find and access documents quickly.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

Answer

First, ingest the Parquet file, convert it to a DuckDB table with arrow::to_duckdb, and measure the time taken by the ingest+convert+select+filter process.

parquet_file_path_1 <- "~/mimic/hosp/labevents.parquet/part-0.parquet"
# measure the time taken for the ingest+convert+select+filter process
system.time({
  parquet_dataset <- arrow::read_parquet(parquet_file_path_1)
  duckdb_table <- arrow::to_duckdb(parquet_dataset)
  result <- duckdb_table %>%
    dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 
                                50882, 51221, 51301, 50931)) %>%
    dplyr::select(subject_id, itemid, charttime, valuenum) %>%
    collect()
})
   user  system elapsed 
 58.387  24.040  64.523 
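Note that arrow::read_parquet() loads the entire table into memory before handing it to DuckDB. A lazier variant (a sketch, not benchmarked here; result_lazy is a hypothetical name) passes the Arrow dataset to DuckDB directly, so the filter is pushed down before anything is collected:

# Lazier alternative: open the Parquet dataset and let DuckDB push down
# the filter before collecting (sketch)
result_lazy <- arrow::open_dataset(parquet_file_path, format = "parquet") %>%
  arrow::to_duckdb() %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902,
                              50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum) %>%
  collect()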

Then, count the rows in the result tibble.

nrow(result)
[1] 24855909

The ingest+convert+select+filter process takes about 65 seconds of elapsed time, as shown above. The result tibble again has 24,855,909 rows.

Then, display the first 10 rows of the result tibble.

The result contains the same rows as in Q2.3, but DuckDB does not guarantee the original row order, so the first 10 rows printed below come from a different patient than in Q2.3. As before, charttime is printed in the local time zone rather than as the raw UTC clock time.

# Display the first 10 rows
head(result, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10011427  51221 2136-01-22 07:15:00     23.5
 2   10011427  51301 2136-01-22 07:15:00      3.5
 3   10011427  50882 2136-01-22 07:15:00     17  
 4   10011427  50902 2136-01-22 07:15:00    105  
 5   10011427  50912 2136-01-22 07:15:00      3.6
 6   10011427  50931 2136-01-22 07:15:00    104  
 7   10011427  50971 2136-01-22 07:15:00      4.8
 8   10011427  50983 2136-01-22 07:15:00    139  
 9   10011427  50882 2136-01-23 06:17:00     19  
10   10011427  50902 2136-01-23 06:17:00    103  

Explanation of DuckDB: DuckDB is like a digital filing cabinet for data. It’s a special kind of software that helps organize and store large amounts of data in a way that makes it easy to find and work with. Think of it as a super-efficient librarian that keeps all your data neatly organized and quickly accessible. It’s especially useful for tasks like analyzing data, running calculations, and generating reports, making it a valuable tool for businesses and researchers who deal with lots of data.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Answer

First, unzip the file chartevents.csv.gz (keeping the compressed original).

# Unzip the file
system("gzip -d -k ~/mimic/icu/chartevents.csv.gz")

Then, I use arrow::open_dataset to ingest the CSV file, select columns, and filter rows on itemid, because it is fast and handles larger-than-memory data efficiently.

file_path_3 <- "~/mimic/icu/chartevents.csv"

chartevents <- arrow::open_dataset(file_path_3, format = "csv") %>%
  dplyr::filter(itemid %in% c(220045, 220181, 220179, 
                              223761, 220210)) %>%
  dplyr::select(subject_id, itemid, charttime, value) %>%
  collect()

Finally, display the number of rows and the first 10 rows of the result tibble. There are 22,502,319 rows in the result tibble.

nrow(chartevents)
[1] 22502319

Display the first 10 rows of the result.

head(chartevents, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           value
        <int>  <int> <dttm>              <chr>
 1   10000032 220179 2180-07-23 14:01:00 82   
 2   10000032 220181 2180-07-23 14:01:00 63   
 3   10000032 220045 2180-07-23 15:00:00 94   
 4   10000032 220179 2180-07-23 15:00:00 85   
 5   10000032 220181 2180-07-23 15:00:00 62   
 6   10000032 220210 2180-07-23 15:00:00 20   
 7   10000032 220045 2180-07-23 12:00:00 97   
 8   10000032 220179 2180-07-23 12:00:00 93   
 9   10000032 220181 2180-07-23 12:00:00 56   
10   10000032 220210 2180-07-23 12:00:00 16
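As a sanity check (a sketch, not run here), the retained itemid values can be labeled with their descriptions from the d_items.csv.gz dictionary, which should return exactly the five vitals requested:

# Label each retained itemid with its description from d_items (sketch)
d_items <- read_csv("~/mimic/icu/d_items.csv.gz",
                    col_select = c(itemid, label))
chartevents %>%
  count(itemid) %>%
  left_join(d_items, by = "itemid")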