Linux Basic, Unix Shell Commands, and HPC

Author

Jiachen Ai

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.3.1       htmltools_0.5.8.1 rstudioapi_0.16.0 yaml_2.3.8       
 [9] rmarkdown_2.27    knitr_1.47        jsonlite_1.8.8    xfun_0.44        
[13] digest_0.6.35     rlang_1.1.5       evaluate_0.23    

Data ethics training

This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here.

Answer:

I completed the CITI training on 1/18/2021. The completion report is available here. The completion certificate is available here.

I also obtained the PhysioNet credential for using the MIMIC-IV data. Here is the screenshot of my PhysioNet credentialing: [screenshot of Credentialing]

Linux Shell Commands

  1. Make the MIMIC v2.2 data available at location ~/mimic.
ls -l ~/mimic/

Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of the data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These steps create unnecessarily big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in the following exercises.

Use Bash commands to answer the following questions.

Answer:

I downloaded the MIMIC-IV v2.2 data to my local machine and created a symbolic link to the original data directory at the path ~/mimic from the command line.

ln -s /Users/jacenai/Documents/23W/BIOSTAT_203B/mimic-iv-2.2/  ~/mimic

Now the data files are available at ~/mimic. They are not tracked in Git, not copied into my directory, and the gz data files are not decompressed.

ls -l ~/mimic/
total 48
-rw-rw-r--@  1 jacenai  staff  13332 Jan  5  2023 CHANGELOG.txt
-rw-rw-r--@  1 jacenai  staff   2518 Jan  5  2023 LICENSE.txt
-rw-rw-r--@  1 jacenai  staff   2884 Jan  6  2023 SHA256SUMS.txt
drwxr-xr-x@ 27 jacenai  staff    864 Mar  7 14:30 hosp
drwxr-xr-x@ 13 jacenai  staff    416 Mar  9  2024 icu
lrwxr-xr-x@  1 jacenai  staff     49 Jan 24  2024 mimic-iv-2.2 -> /Users/jacenai/Desktop/BIOSTAT_203B/mimic-iv-2.2/
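As an extra check (a quick sketch; output omitted), listing the link itself with ls -ld shows that ~/mimic is a symbolic link and where it points:

# show the symlink entry itself rather than the directory it points to
ls -ld ~/mimic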
  2. Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.

Answer: This is the content of the folder hosp:

ls -l ~/mimic/hosp
total 8970344
-rw-rw-r--@ 1 jacenai  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 jacenai  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 jacenai  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 jacenai  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 jacenai  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 jacenai  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 jacenai  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 jacenai  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 jacenai  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 jacenai  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff  1939088924 Jan  5  2023 labevents.csv.gz
drwxr-xr-x@ 3 jacenai  staff          96 Feb 28  2024 labevents.parquet
-rw-r--r--@ 1 jacenai  staff    56623104 Feb  9  2024 labevents_filtered.csv.gz
-rw-rw-r--@ 1 jacenai  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff    36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 jacenai  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 jacenai  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 jacenai  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 jacenai  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 jacenai  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 jacenai  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 jacenai  staff      122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 jacenai  staff     6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 jacenai  staff    36158338 Jan  5  2023 transfers.csv.gz

This is the content of the folder icu:

ls -l ~/mimic/icu
total 6155968
-rw-rw-r--@ 1 jacenai  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 jacenai  staff  2467761053 Jan  5  2023 chartevents.csv.gz
drwxr-xr-x@ 4 jacenai  staff         128 Feb 23  2024 chartevents.parquet
-rw-rw-r--@ 1 jacenai  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 jacenai  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 jacenai  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff    20717852 Jan  5  2023 procedureevents.csv.gz

These data files are distributed as .csv.gz files instead of .csv files because the .csv.gz extension indicates the file is compressed with gzip. The dataset contains comprehensive information for each patient's hospital stay, so the uncompressed files would require a huge amount of storage. Compression reduces the file size, making the data quicker to download and transfer and requiring less bandwidth, which matters for users with limited internet bandwidth.
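As a quick illustration (a sketch; output not shown), gzip -l reports the compressed size, uncompressed size, and compression ratio of a .gz file without decompressing it:

# inspect the compression ratio of one of the smaller files
gzip -l ~/mimic/hosp/admissions.csv.gz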

  3. Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.

Answer:

  • zcat is used to display the content of one or more compressed files on the standard output.

  • zless is used to view the contents of a compressed file one page at a time; like less, it lets you scroll both forward and backward through the file.

  • zmore is similar to zless: it is a pager for compressed files that displays the contents one page at a time, but it primarily supports forward navigation.

  • zgrep is used to search for a pattern within one or more compressed files.
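For example, each command can be tried on one of the compressed MIMIC files (a minimal sketch; the zgrep pattern is just a subject_id seen later in this report):

zcat < ~/mimic/hosp/patients.csv.gz | head -5    # print the first lines of the decompressed stream
zless ~/mimic/hosp/patients.csv.gz               # page through the file (forward and backward)
zmore ~/mimic/hosp/patients.csv.gz               # page forward through the file
zgrep '10000032' ~/mimic/hosp/admissions.csv.gz  # search the compressed file for a pattern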

  4. (Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done

Display the number of lines in each data file using a similar loop. (Hint: combine Linux commands zcat < and wc -l.)

Answer:

The first loop long-lists (ls -l) each .gz file in ~/mimic/hosp whose name starts with a, l, or pa, i.e. admissions.csv.gz, labevents.csv.gz, labevents_filtered.csv.gz, and patients.csv.gz.

Display the number of lines in each data file using a similar loop:

for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  echo "Number of lines in $datafile: $(zcat < $datafile | wc -l)"
done
Number of lines in /Users/jacenai/mimic/hosp/admissions.csv.gz:   431232
Number of lines in /Users/jacenai/mimic/hosp/labevents.csv.gz:  118171368
zcat: (stdin): unexpected end of file
Number of lines in /Users/jacenai/mimic/hosp/labevents_filtered.csv.gz:  10568729
Number of lines in /Users/jacenai/mimic/hosp/patients.csv.gz:   299713
  5. Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)

Answer:

Display the first few lines of admissions.csv.gz

zcat < ~/mimic/hosp/admissions.csv.gz | head -10
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0
#Count the number of rows in admissions.csv.gz
row_count=$(zcat < ~/mimic/hosp/admissions.csv.gz | wc -l)
echo "Number of rows in admissions.csv.gz: $row_count"
Number of rows in admissions.csv.gz:   431232
#Count the number of unique patients (identified by subject_id) in admissions.csv.gz:
subject_count=$(zcat < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F ',' '{print $1}' | sort -u | wc -l)
echo "Number of unique patients in admissions.csv.gz (excluding header): $subject_count"
Number of unique patients in admissions.csv.gz (excluding header):   180733
#Count the number of unique patients (identified by subject_id) in patients.csv.gz:
subject_count=$(zcat < ~/mimic/hosp/patients.csv.gz | tail -n +2 | awk -F ',' '{print $1}' | sort -u | wc -l)
echo "Number of unique patients in patients.csv.gz (excluding header): $subject_count"
Number of unique patients in patients.csv.gz (excluding header):   299712

Therefore, the number of patients listed in patients.csv.gz (299,712) does not match the number of unique patients in admissions.csv.gz (180,733); patients.csv.gz likely also includes patients who never had a hospital admission recorded in admissions.csv.gz.
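A further check (a sketch, not required by the question) is whether every subject_id in admissions.csv.gz also appears in patients.csv.gz, which would be consistent with patients.csv.gz simply covering more patients:

# subject_ids present in admissions but absent from patients; 0 means admissions is a subset
comm -23 \
  <(zcat < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F',' '{print $1}' | sort -u) \
  <(zcat < ~/mimic/hosp/patients.csv.gz | tail -n +2 | awk -F',' '{print $1}' | sort -u) | wc -l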

  6. What are the possible values taken by each of the variables admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)

Answer:

First, getting to know the variables’ names.

zcat < ~/mimic/hosp/admissions.csv.gz | head -n 1
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag

Get possible values taken by the variable admission_type and its count

zcat < ~/mimic/hosp/admissions.csv.gz | awk -F',' 'NR > 1 {print $6}' | sort | uniq -c
6626 AMBULATORY OBSERVATION
19554 DIRECT EMER.
18707 DIRECT OBSERVATION
10565 ELECTIVE
94776 EU OBSERVATION
149413 EW EMER.
52668 OBSERVATION ADMIT
34231 SURGICAL SAME DAY ADMISSION
44691 URGENT

Get possible values taken by the variable admission_location and its count

zcat < ~/mimic/hosp/admissions.csv.gz | awk -F',' 'NR > 1 {print $8}' | sort | uniq -c
 185 AMBULATORY SURGERY TRANSFER
10008 CLINIC REFERRAL
232595 EMERGENCY ROOM
 359 INFORMATION NOT AVAILABLE
4205 INTERNAL TRANSFER TO OR FROM PSYCH
5479 PACU
114963 PHYSICIAN REFERRAL
7804 PROCEDURE SITE
35974 TRANSFER FROM HOSPITAL
3843 TRANSFER FROM SKILLED NURSING FACILITY
15816 WALK-IN/SELF REFERRAL

Get possible values taken by the variable insurance and its count

zcat < ~/mimic/hosp/admissions.csv.gz | awk -F',' 'NR > 1 {print $10}' | sort | uniq -c
41330 Medicaid
160560 Medicare
229341 Other

Get possible values taken by the variable race/ethnicity and its count

zcat < ~/mimic/hosp/admissions.csv.gz | awk -F',' 'NR > 1 {print $13}' | sort | uniq -c
 919 AMERICAN INDIAN/ALASKA NATIVE
6156 ASIAN
1198 ASIAN - ASIAN INDIAN
5587 ASIAN - CHINESE
 506 ASIAN - KOREAN
1446 ASIAN - SOUTH EAST ASIAN
2530 BLACK/AFRICAN
59959 BLACK/AFRICAN AMERICAN
4765 BLACK/CAPE VERDEAN
2704 BLACK/CARIBBEAN ISLAND
7754 HISPANIC OR LATINO
 437 HISPANIC/LATINO - CENTRAL AMERICAN
 639 HISPANIC/LATINO - COLUMBIAN
 500 HISPANIC/LATINO - CUBAN
4383 HISPANIC/LATINO - DOMINICAN
1330 HISPANIC/LATINO - GUATEMALAN
 536 HISPANIC/LATINO - HONDURAN
 665 HISPANIC/LATINO - MEXICAN
8076 HISPANIC/LATINO - PUERTO RICAN
 892 HISPANIC/LATINO - SALVADORAN
 560 MULTIPLE RACE/ETHNICITY
 386 NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER
15102 OTHER
1761 PATIENT DECLINED TO ANSWER
1510 PORTUGUESE
 505 SOUTH AMERICAN
1603 UNABLE TO OBTAIN
10668 UNKNOWN
272932 WHITE
1103 WHITE - BRAZILIAN
1170 WHITE - EASTERN EUROPEAN
7925 WHITE - OTHER EUROPEAN
5024 WHITE - RUSSIAN
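The four tabulations above can also be produced with a single loop over the column positions (a sketch using the same column indices as above):

for col in 6 8 10 13; do
  echo "=== column $col ==="
  zcat < ~/mimic/hosp/admissions.csv.gz | awk -F',' -v c=$col 'NR > 1 {print $c}' | sort | uniq -c
done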
  7. To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare the compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/hosp/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade-off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)

Answer:

First, comparing the file sizes: the compressed file is 1.8G and the uncompressed file is 13G, so the uncompressed file is about seven times the size of the compressed one.

#compressed file size
ls -lh ~/mimic/hosp/labevents.csv.gz

#decompress the file 
gzip -dk < ~/mimic/hosp/labevents.csv.gz > ~/mimic/hosp/labevents.csv

#uncompressed file size
ls -lh ~/mimic/hosp/labevents.csv
-rw-rw-r--@ 1 jacenai  staff   1.8G Jan  5  2023 /Users/jacenai/mimic/hosp/labevents.csv.gz
-rw-r--r--@ 1 jacenai  staff    13G Mar  7 15:08 /Users/jacenai/mimic/hosp/labevents.csv

Then, comparing the run times: from the output below, the run time of zcat | wc -l on the compressed file (real, about 14.8 s) is slightly longer than that of wc -l on the uncompressed file (about 14.3 s).

#run time of zcat on the compressed file
time zcat < ~/mimic/hosp/labevents.csv.gz | wc -l

#run time of wc on the uncompressed file
time wc -l ~/mimic/hosp/labevents.csv
 118171368

real    0m14.751s
user    0m22.682s
sys 0m1.750s
 118171368 /Users/jacenai/mimic/hosp/labevents.csv

real    0m14.347s
user    0m12.338s
sys 0m1.525s

Trade-off between storage and speed: compressed files save storage space but may require additional time for decompression during access. Uncompressed files provide faster access but consume more storage space. If storage space is a critical concern and access speed can be tolerated, compression is beneficial. However, if rapid access is crucial and storage space is not a limiting factor, using uncompressed files might be preferred.

# remove the large uncompressed file
rm ~/mimic/hosp/labevents.csv

More fun with Linux

Try the following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.

Answer:

cal
     March 2025       
Su Mo Tu We Th Fr Sa  
                   1  
 2  3  4  5  6  7  8  
 9 10 11 12 13 14 15  
16 17 18 19 20 21 22  
23 24 25 26 27 28 29  
30 31                 

cal: displays the current month’s calendar.

cal 2024
                            2024
      January               February               March          
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6               1  2  3                  1  2  
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   3  4  5  6  7  8  9  
14 15 16 17 18 19 20  11 12 13 14 15 16 17  10 11 12 13 14 15 16  
21 22 23 24 25 26 27  18 19 20 21 22 23 24  17 18 19 20 21 22 23  
28 29 30 31           25 26 27 28 29        24 25 26 27 28 29 30  
                                            31                    

       April                  May                   June          
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6            1  2  3  4                     1  
 7  8  9 10 11 12 13   5  6  7  8  9 10 11   2  3  4  5  6  7  8  
14 15 16 17 18 19 20  12 13 14 15 16 17 18   9 10 11 12 13 14 15  
21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22  
28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29  
                                            30                    

        July                 August              September        
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6               1  2  3   1  2  3  4  5  6  7  
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   8  9 10 11 12 13 14  
14 15 16 17 18 19 20  11 12 13 14 15 16 17  15 16 17 18 19 20 21  
21 22 23 24 25 26 27  18 19 20 21 22 23 24  22 23 24 25 26 27 28  
28 29 30 31           25 26 27 28 29 30 31  29 30                 
                                                                  

      October               November              December        
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
       1  2  3  4  5                  1  2   1  2  3  4  5  6  7  
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   8  9 10 11 12 13 14  
13 14 15 16 17 18 19  10 11 12 13 14 15 16  15 16 17 18 19 20 21  
20 21 22 23 24 25 26  17 18 19 20 21 22 23  22 23 24 25 26 27 28  
27 28 29 30 31        24 25 26 27 28 29 30  29 30 31              
                                                                  

cal 2024: displays the calendar for the year 2024.

cal 9 1752
   September 1752     
Su Mo Tu We Th Fr Sa  
       1  2 14 15 16  
17 18 19 20 21 22 23  
24 25 26 27 28 29 30  
                      
                      
                      

cal 9 1752: displays the calendar for September 1752. It looks unusual compared to a modern month because the switch from the Julian to the Gregorian calendar happened in September 1752, so 11 days, from September 3 through September 13, are missing.

date
Fri Mar  7 15:08:47 PST 2025

date: Prints the current date and time.

hostname
Jacens-MacBook-Air.local

hostname: prints the hostname of the server that I’m currently logged into.

arch
arm64

arch command in Linux is used to display the architecture of the current system. It provides information about the instruction set architecture of the processor, helping users identify whether it’s a 32-bit or 64-bit system. In my case, my system is running on an arm64 architecture, and it means that my processor supports 64-bit instructions.

uname -a
Darwin Jacens-MacBook-Air.local 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:16:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T8112 arm64

uname -a command in Linux is used to display detailed information about the system’s kernel and hardware, including the kernel name, network node hostname, kernel release, kernel version, machine hardware name, and processor type. In my case, the kernel name is “Darwin,” the network node hostname is “Jacens-MacBook-Air.local,” the kernel release is 23.6.0, and the timestamp “Mon Jul 29 21:16:46 PDT 2024” denotes when the kernel was built.

uptime
15:08  up 1 day,  1:16, 9 users, load averages: 2.81 2.57 2.56

uptime command in Linux is used to display the current time, how long the system has been running, and information about the system’s load averages. It provides a quick overview of the system’s status and activity.

who am i
jacenai                       Mar  7 15:08 

who am i command in Linux is used to display information about the current user who is logged into the system, such as the username and the login time. In my case, the username is “jacenai,” and the login time is “Mar 7 15:08.”

who
jacenai          console      Mar  6 13:52 
jacenai          ttys000      Mar  7 14:40 
jacenai          ttys001      Mar  7 14:08 
jacenai          ttys002      Mar  7 14:09 
jacenai          ttys003      Mar  7 14:49 
jacenai          ttys004      Mar  7 14:41 
jacenai          ttys005      Mar  7 15:03 
jacenai          ttys007      Mar  7 14:56 
jacenai          ttys008      Mar  7 14:57 

who command displays information about currently logged-in users. It provides details such as the username, terminal, and login time.

w
15:08  up 1 day,  1:16, 9 users, load averages: 2.81 2.57 2.56
USER       TTY      FROM    LOGIN@  IDLE WHAT
jacenai    console  -      Thu13   25:15 -
jacenai    s000     -      14:40      28 ssh -o ServerAliveCountMax=5 -o IPQoS=
jacenai    s001     -      14:08      31 ssh jia@hoffman2.idre.ucla.edu
jacenai    s002     -      14:09      57 ssh jia@hoffman2.idre.ucla.edu
jacenai    s003     -      14:49      19 ssh -N -L 8787:n1827:8787 jia@hoffman2
jacenai    s004     -      14:41      19 ssh jia@hoffman2.idre.ucla.edu
jacenai    s005     -      15:03       4 ssh -o ServerAliveCountMax=5 -o IPQoS=
jacenai    s007     -      14:56       1 ssh jia@hoffman2.idre.ucla.edu
jacenai    s008     -      14:57       3 ssh jia@hoffman2.idre.ucla.edu

w command displays information about the currently logged-in users and their activities. It provides a summary of user-related information, including details about each user’s login session, the time they’ve been idle, and the commands they are currently running.

id
uid=501(jacenai) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),98(_lpadmin),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),399(com.apple.access_ssh),400(com.apple.access_remote_ae),701(com.apple.sharepoint.group.1)

id command displays information about the user and group identities (ID) associated with the current user or a specified username.

last | head
jacenai    ttys005                         Fri Mar  7 15:03   still logged in
jacenai    ttys008                         Fri Mar  7 14:57   still logged in
jacenai    ttys007                         Fri Mar  7 14:56   still logged in
jacenai    ttys006                         Fri Mar  7 14:55 - 14:55  (00:00)
jacenai    ttys005                         Fri Mar  7 14:52 - 14:52  (00:00)
jacenai    ttys003                         Fri Mar  7 14:49   still logged in
jacenai    ttys004                         Fri Mar  7 14:41   still logged in
jacenai    ttys000                         Fri Mar  7 14:40   still logged in
jacenai    ttys004                         Fri Mar  7 14:39 - 14:39  (00:00)
jacenai    ttys003                         Fri Mar  7 14:37 - 14:37  (00:00)

last command in Linux is used to display information about previously logged-in users, including their login and logout times. Piping it to head limits the output to the 10 most recent entries.

echo {con,pre}{sent,fer}{s,ed}
consents consented confers confered presents presented prefers prefered

echo {con,pre}{sent,fer}{s,ed} uses brace expansion to generate strings from the patterns enclosed in curly braces {}. The comma-separated values within the braces are alternatives, and the command generates every combination of them. In this case, the output is: “consents,” “consented,” “confers,” “confered,” “presents,” “presented,” “prefers,” and “prefered” (note that “confered” and “prefered” are misspelled English words because the expansion is purely mechanical).
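Another small illustration of the same mechanism (a sketch): brace expansion also accepts numeric ranges, and because the expansion is purely textual no such files need to exist:

echo file_{1..3}.txt   # expands to: file_1.txt file_2.txt file_3.txt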

time sleep 5

real    0m5.008s
user    0m0.000s
sys 0m0.001s

time command in Linux is used to measure the execution time of a given command or script, and sleep 5 pauses the shell for a duration of five seconds. When combined, the command measures the execution time of the sleep 5 command.

history | tail

history command in Linux is used to display previously executed commands. Piping it to tail limits the output to the 10 most recent entries. In my case, since each command in this document is executed separately, the output of the history command in the rendered report is empty; below is the expected output if the commands were run in a single interactive session.

# 100  arch
# 101  uname -a
# 102  uptime
# 103  who am i
# 104  who
# 105  w
# 106  id
# 107  last | head
# 108  time sleep 5
# 109  history | tail

Book

  1. Git clone the repository https://github.com/christophergandrud/Rep-Res-Book for the book Reproducible Research with R and RStudio to your local machine.

  2. Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)

The point of this exercise is (1) to get the book for free and (2) to see an example how a complicated project such as a book can be organized in a reproducible way.

For grading purpose, include a screenshot of Section 4.1.5 of the book here.

Answer: First, I used fork to create a copy of the repository on GitHub. Then, I used the following command to clone the repository to my local machine.

# git clone git@github.com:jacenai/Rep-Res-Book.git

Next, I opened the project by clicking rep-res-3rd-edition.Rproj and compiled the book by clicking Build Book in the Build panel of RStudio, during which I downloaded a few R packages. Finally, I was able to build git_book. The following screenshot shows the output of Section 4.1.5 of the book: [screenshot of Section 4.1.5]

Notes on Usage of Hoffman2

# log into the cluster
ssh jia@hoffman2.idre.ucla.edu

# Then enter the password

# request an interactive session (e.g., 4 hours and 4GB of memory)
qrsh -l h_rt=4:00:00,h_data=4G

# create a directory
mkdir mice_colon

# remove the directory
rm -rf Bed_Human_Blood_Reed-2023-9206

# remove the file
rm Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed

# display the content of the directory
cat /u/home/j/jia/CellFi/Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed

# copy the directory/file/folder to the cluster
# when uploading to Hoffman2, run this from the local machine (in the local directory)
scp -r /Users/jacenai/Desktop/Matteo_Lab/LabResourse/CellFi/samples jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/CellFi

# rename the file
mv hoffman2_indiv_network.py hoffman2_indiv_network_1_12.py

# move the file to the directory
mv /u/home/j/jia/mice_colon/age_seperated_cor_matrix /u/home/j/jia/mice_colon/age_cluster/

# load the python module on hoffman2
module load python/2.7.15

# activate the virtual environment
source ./Env/bin/activate

# check the python version
python --version

# check the pandas version
python -c "import pandas as pd; print(pd.__version__)"
pip show pandas

# check the numpy version
python -c "import numpy as np; print(np.__version__)"
pip show numpy

# check the scipy version
python -c "import scipy as sp; print(sp.__version__)"
pip show scipy

# quit hoffman2
exit

Example: Run the .sh file on the cluster

module load R

qsub -l h_rt=4:00:00,h_data=8G -o $SCRATCH/metilene_job.out -j y -M jiachenai@g.ucla.edu -m bea -b y /u/home/j/jia/mice_colon/run_metilene_comparisons_hoffman2.sh
# $SCRATCH is an environment variable that refers to a user-specific directory 
# within the high-performance scratch storage area. This directory is used for 
# temporary files and data that are needed during a job's execution. 
# The scratch space is not backed up and is intended for intermediate or 
# temporary storage while running jobs. You can access your scratch directory 
# by using the $SCRATCH variable in paths, or directly via /u/scratch/username, 
#   where username is your Hoffman2 username.

# quit the running job
qdel 4671427

# check the status of the job
qstat -u jia

# edit the .sh file using vi
vi run_metilene_comparisons_hoffman2.sh
# see the vi cheat sheet section below


# download the output file from the cluster to the local machine
scp -r jia@dtn2.hoffman2.idre.ucla.edu:/u/scratch/j/jia/metilene_job.out jacenai@jacens-air.ad.medctr.ucla.edu:/Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/Hoffman2_output_short/
# remember to Enable Remote Login on macOS: System Preferences > Sharing > Remote Login
# Then enter the password of the local machine


# check the local machine's user name and hostname
whoami

hostname


# Better way: run the following command on local machine
# Use `scp` to Pull the File from the Cluster
scp jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/

# Using `rsync` for Large Files
rsync -avz jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
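When a large transfer may get interrupted, rsync can also resume from a partial copy; a sketch using standard rsync flags (--partial keeps partially transferred files, --progress shows transfer progress):

rsync -avz --partial --progress jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/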

# Verify the File Transfer with `ls` command 
ls /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/

Cheat sheet for vi

Details on the run_metilene_comparisons_hoffman2.sh file.

Here I used a loop to compare the methylation levels of different groups of samples. The script is as follows:

#!/bin/bash

groups=("3M" "9M" "15M" "24M" "28M")
datafile="/u/home/j/jia/mice_colon/whole_data.tsv"
metilene_path="/u/home/j/jia/mice_colon/metilene_v0.2-8"
output_dir="/u/home/j/jia/mice_colon/output"

for ((i=0; i<${#groups[@]}-1; i++)); do
    for ((j=i+1; j<${#groups[@]}; j++)); do
        group_a=${groups[i]}
        group_b=${groups[j]}
        output_file="${output_dir}/${group_a}_vs_${group_b}_output.txt"
        filtered_file="${output_dir}/${group_a}_vs_${group_b}_filtered.bed"
        
        echo "Running comparison: $group_a vs $group_b"
        
        # Run the comparison
        $metilene_path/metilene_linux64 -a "$group_a" -b "$group_b" -m 8 "$datafile" | sort -V -k1,1 -k2,2n > "$output_file"
        
        # Run the filtering process
        echo "Filtering results for $group_a vs $group_b"
        perl $metilene_path/metilene_output.pl -q "$output_file" -o "$filtered_file" -p 0.05 -c 8 -l 8 -a "$group_a" -b "$group_b"
        
    done
done

Run a python file on the cluster

First, prepare all the necessary files and data on the cluster. For example, I have a python file named hoffman2_indiv_network.py.

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import community.community_louvain as community_louvain  # Correct import for modularity calculation
import glob  # For file handling

# Function to process each file
def process_file(file_path):
    
    # Load the adjacency matrix
    adjacency_matrix = pd.read_csv(file_path, index_col=0)
    
    # Create a graph from the adjacency matrix
    G = nx.from_pandas_adjacency(adjacency_matrix)

    # 1. Calculate Number of Edges
    num_edges = G.number_of_edges()

    # 2. Centrality (Degree Centrality)
    degree_centrality = nx.degree_centrality(G)

    # Calculate Average Degree Centrality
    average_degree_centrality = sum(degree_centrality.values()) / len(degree_centrality)

    # 3. Modularity
    partition = community_louvain.best_partition(G)
    modularity = community_louvain.modularity(partition, G)

    # 4. Clustering Coefficient
    average_clustering_coefficient = nx.average_clustering(G)

    # Collect results in a dictionary
    results = {
        # Extract just the file name and also keep the part before _adjacency_matrix.csv
        'Individual': file_path.split('/')[-1].split('_adjacency_matrix.csv')[0],
        'Number of Edges': num_edges,
        'Average Degree Centrality': average_degree_centrality,
        'Modularity': modularity,
        'Average Clustering Coefficient': average_clustering_coefficient
    }

    return results


# Create an empty DataFrame to store results
results_df = pd.DataFrame(columns=['Individual', 
                                   'Number of Edges', 
                                   'Average Degree Centrality', 
                                   'Modularity', 
                                   'Average Clustering Coefficient'])

# Update the path to where your input files are located on Hoffman2
input_path = '/u/home/j/jia/mice_colon/indiv_network/individual_network_stat_1_12/' 

# Collect all CSV files in the input directory
file_paths = glob.glob(input_path + '*.csv')

# Process each file and append results to the DataFrame
for file_path in file_paths:
    result = process_file(file_path)
    results_df = pd.concat([results_df, pd.DataFrame([result])], ignore_index=True)

# Save the results to a CSV file instead of displaying them
output_path = '/u/home/j/jia/mice_colon/indiv_network/output/'
results_df.to_csv(output_path + 'individual_network_results_1_12.csv',
                  index=False)

print(f"Results saved to {output_path}")

Make sure all required packages are installed on the cluster. If not, install them using the following commands in the terminal (on the cluster, not inside Python).

# first check all the modules available
module av

# check all the python modules available
module av python

# here's the example output:
# ------------------------- /u/local/Modules/modulefiles -------------------------
# python/2.7.15  python/3.6.8  python/3.9.6(default)  
# python/2.7.18  python/3.7.3 

# load the python module
module load python/3.9.6

pip3 install python-louvain --user   # replace python-louvain with whatever package name you need

# upload the script/code to the cluster
scp -r /Users/jacenai/Desktop/GSR_Project/Colon_Methylation/indiv_network_submit.sh jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/indiv_network

Then, create a .sh file (submission script) to run the python file. The .sh file is as follows:

#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on:   " `hostname -s`
echo "Job $JOB_ID started on:   " `date `
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Install required Python packages if not already installed
echo "Installing required Python packages..."
pip3 install --user pandas numpy scipy tqdm

# Optional: Verify the Python environment
python3 --version
pip3 list | grep -E "pandas|numpy|scipy|tqdm"

# Run the Python script
echo "Running wasserstein_distance_hoffman2.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_distance_hoffman2.py

# Check for errors during execution
if [ $? -ne 0 ]; then
  echo "Job $JOB_ID failed on:   " `hostname -s`
  echo "Job $JOB_ID failed on:   " `date `
  echo "Please check the log file for details: joblog.$JOB_ID"
  exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on:   " `hostname -s`
echo "Job $JOB_ID ended on:   " `date `

Another example (with creating virtual environment):

#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on:   " `hostname -s`
echo "Job $JOB_ID started on:   " `date`
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Set up a virtual environment
VENV_DIR="/u/home/j/jia/python_envs/wasserstein_env"
if [ ! -d "$VENV_DIR" ]; then
  echo "Creating virtual environment..."
  python3 -m venv $VENV_DIR
fi

# Activate the virtual environment
source $VENV_DIR/bin/activate

# Install required Python packages (if not already installed)
REQUIRED_PACKAGES="pandas numpy scipy tqdm networkx python-louvain"
for PACKAGE in $REQUIRED_PACKAGES; do
  pip3 show $PACKAGE > /dev/null 2>&1 || pip3 install $PACKAGE
done

# Optional: Verify the Python environment
echo "Python version:"
python3 --version
echo "Installed packages:"
pip3 list | grep -E "pandas|numpy|scipy|tqdm|networkx|python-louvain"

# Run the Python script
echo "Running wasserstein_indiv_network_1_7.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_indiv_network_1_7.py

# Check for errors during execution
if [ $? -ne 0 ]; then
  echo "Job $JOB_ID failed on:   " `hostname -s`
  echo "Job $JOB_ID failed on:   " `date`
  echo "Please check the log file for details: joblog.$JOB_ID"
  exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on:   " `hostname -s`
echo "Job $JOB_ID ended on:   " `date`

Then, submit the job to the cluster using the following command:

# make the .sh file executable
chmod +x hoffman2_indiv_network.sh

# run the .sh file directly
./organize_wasserstein.sh

# before submitting the job, make sure the .sh file is executable.
# If you submit with `qsub`, the resources come from the #$ directives in the
# .sh file, so they need not match what you requested interactively; if you
# instead run the script directly inside a `qrsh` session, request enough
# resources on the `qrsh` command line.
qrsh -l h_rt=4:00:00,h_data=4G

module load python/3.9.6

# go to the directory where the .sh file is located
cd /u/home/j/jia/mice_colon/indiv_network

qsub hoffman2_indiv_network.sh

After submitting the job, you can check the status of the job using the following command:

# check the log file for the job
# in the run directory
ls -lh
cat joblog.<job_id>


# check the status of your own jobs
qstat -u $USER

# or check the status of all jobs
qstat

# or check the status of a specific job
qstat <job_id>

# or check the status of all jobs in a queue
qstat -q

# or check the status of all jobs in a queue with more details
qstat -g c

# or check the status of all jobs in a queue with even more details
qstat -g t

# or check the status of all jobs in a queue with the most details
qstat -f

# or check the status of all jobs in a queue with the most details and filter by user
qstat -f -u $USER

If the storage is full, you can delete the files that are not needed using the following command:

# Ensure you have sufficient disk space and are not exceeding your quota on Hoffman2
quota -v

If the quota output shows a * next to the blocks column for a filesystem, you have exceeded your disk usage quota there.

# Check your disk usage
du -sh /u/home/j/jia/mice_colon/indiv_network

# Check your home directory for large or unnecessary files
du -sh /u/home/j/jia/* | sort -h
du -sh /u/home/j/jia/mice_colon/indiv_network/* | sort -h

# Delete files that are no longer needed
rm /u/home/j/jia/mice_colon/indiv_network/output/individual_network_results_1_12.csv

# Delete a directory
rm -r /u/home/j/jia/mice_colon/indiv_network/output

# Check your disk usage again
du -sh /u/home/j/jia/mice_colon/indiv_network

# check the most space-consuming files
du -ah ~ | sort -h | tail -n 20

Run code in a Jupyter notebook (run this command on the local machine):

# run the command
./h2jupynb -u jia -t 24 -m 10  # -t runtime in hours, -m memory in GB per core

# check the help message
./h2jupynb --help

h2jupynb [-u ] [-v ] [-t <time, integer number of hours>] [-m <memory, integer number of GB per core>] [-e <parallel environment: 1 for shared, 2 for distributed>] [-s ] [-o ] [-x ] [-a ] [-d ] [-g ] [-c ] [-l ] [-p ] [-j ] [-k ] [-b ] [-z <write ssh debug files?:yes/no>].

How to reconnect to a Jupyter notebook session on Hoffman2:

# First, get the information of the running Jupyter notebook 
# run this on Hoffman2
qstat -u $USER

# Knowing the job ID
qstat -j 7164149
qstat -j <job_id>

# Find the information at the bottom of the output: 
# exec_host_list        1:    n7361:1
# n7361 is the exec host where the Jupyter notebook is running on Hoffman2


# Second, ssh to the exec host # ssh means secure shell
ssh jia@n7361
ssh jia@<exec_host>

# Load the Appropriate Python Module
# For instance, you can load the Anaconda module, which comes with Jupyter Notebook pre-installed.
module load anaconda3/2020.11

# Verify the Jupyter Installation
jupyter --version

# Start the Jupyter Notebook Server
jupyter notebook --no-browser --ip=0.0.0.0 --port=8692
# This command starts the Jupyter Notebook server on the exec host, listening on all IP addresses (0.0.0.0) and on the specified port (8692).



# Finally on your local machine, run the following command to connect to the Jupyter notebook
ssh -L 8692:n7361:8692 jia@hoffman2.idre.ucla.edu
ssh -L <local_port>:<exec_host>:<exec_host_port> <username>@hoffman2.idre.ucla.edu

# Open a web browser and navigate to the following URL:
http://localhost:8692
http://localhost:<local_port>
# This URL connects to the Jupyter Notebook server running on the exec host through the SSH tunnel.

RStudio on Hoffman2

Create a permanent library for R on Hoffman2

When you install an R package in RStudio, it typically goes to:

# in R, run this command
.libPaths()

A list of applications available via modules

all_apps_via_modules

# or the module for a particular application can be searched via:
modules_lookup -m <app-name>

Creating a Permanent R Library Directory.

Step 1:

Choose a location where you want to store R packages permanently:

# bash command

# Choose a location where you want to store R packages permanently:
mkdir -p $HOME/R/library

# Or, if you prefer scratch storage (faster, but not backed up):
mkdir -p $SCRATCH/R/library

Run code on RStudio on Hoffman2

Step 2:

Run RStudio inside the container

# bash command

# Request an interactive job:
qrsh -l h_rt=4:00:00,h_data=4G  # this does not influence the running storage

# Load the apptainer module:
module load apptainer

# Set up a large temporary directory for RStudio to use:
mkdir -p $SCRATCH/rstudio_large_tmp
export TMPDIR=$SCRATCH/rstudio_large_tmp

# set the R_LIBS_USER environment variable to the location you chose in Step 1:
export R_LIBS_USER=$HOME/R/library


# You can replace export RSTUDIO_VERSION=4.1.0 with any Rstudio version available on Hoffman2
# This will display information and an ssh -L ... command to run in a separate terminal.
export RSTUDIO_VERSION=4.1.0

# Then, launch RStudio:
apptainer run -B $SCRATCH/rstudio_large_tmp:/tmp \
              -B $SCRATCH/rstudiotmp/var/lib:/var/lib/rstudio-server \
              -B $SCRATCH/rstudiotmp/var/run:/var/run/rstudio-server \
              $H2_CONTAINER_LOC/h2-rstudio_${RSTUDIO_VERSION}.sif

              
# Connect to the compute node's port:
# Open a another new SSH tunnel on your local computer by running:
ssh  -L 8787:nXXX:8787 username@hoffman2.idre.ucla.edu # Or whatever command was displayed earlier 
# Access RStudio in your web browser:
http://localhost:8787 #or whatever port number that was displayed

# exit Rstudio, run
[CTRL-C]

Alternative: Use Environment Variables in RStudio

Step 1:

Running RStudio: Automated Script

./h2_rstudio.sh -u H2USERNAME

This will start RStudio as a qrsh job, open a port tunnel, and allow you to access RStudio in your web browser.

Step 2:

Set the Library Path in RStudio.

Once RStudio is running inside the container, open the R console and run:

.libPaths(c("~/R/library", .libPaths()))

This tells R to first check your home directory (~/R/library) before using the default container library.

To make this permanent, add this to your .Rprofile in your home directory:

cat('.libPaths(c("~/R/library", .libPaths()))\n', file = "~/.Rprofile", append = TRUE)

# You can verify the installation location using:
.libPaths()
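Since this alternative is about environment variables, an equivalent approach (a sketch, assuming R inside the container reads ~/.Renviron as usual) is to set R_LIBS_USER there instead of editing .Rprofile:

# bash: append R_LIBS_USER to ~/.Renviron so every R session picks up the permanent library
echo 'R_LIBS_USER=~/R/library' >> ~/.Renviron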

Step 3:

Install Packages to the Permanent Directory

install.packages("ggplot2", lib="~/R/library")