R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.1 fastmap_1.2.0 cli_3.6.4
[5] tools_4.3.1 htmltools_0.5.8.1 rstudioapi_0.16.0 yaml_2.3.8
[9] rmarkdown_2.27 knitr_1.47 jsonlite_1.8.8 xfun_0.44
[13] digest_0.6.35 rlang_1.1.5 evaluate_0.23
Data ethics training
This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here.
Answer:
I completed the CITI training on 1/18/2021. The completion report is available here. The completion certificate is available here.
I also obtained the PhysioNet credential for using the MIMIC-IV data. Here is the screenshot of my PhysioNet credentialing.
Linux Shell Commands
Make the MIMIC v2.2 data available at location ~/mimic.
ls -l ~/mimic/
Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of the data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files; decompressing creates unnecessarily large files and is not a big-data-friendly practice. Read from the data folder ~/mimic directly in the following exercises.
Use Bash commands to answer the following questions.
Answer:
I downloaded the MIMIC-IV v2.2 data to my local machine and created a symbolic link to the original MIMIC-IV v2.2 data directory at the path ~/mimic from my terminal.
Now the data files are available at ~/mimic. They are not tracked in Git, they are not copied into my project directory, and the gz data files are not decompressed.
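A sketch of the linking command, with the source path standing in for wherever the downloaded data actually live:
# create the symbolic link (the source path below is a placeholder for the local download location)
ln -s /path/to/downloaded/mimic-iv-2.2 ~/mimic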
ls -l ~/mimic/
total 48
-rw-rw-r--@ 1 jacenai staff 13332 Jan 5 2023 CHANGELOG.txt
-rw-rw-r--@ 1 jacenai staff 2518 Jan 5 2023 LICENSE.txt
-rw-rw-r--@ 1 jacenai staff 2884 Jan 6 2023 SHA256SUMS.txt
drwxr-xr-x@ 27 jacenai staff 864 Mar 7 14:30 hosp
drwxr-xr-x@ 13 jacenai staff 416 Mar 9 2024 icu
lrwxr-xr-x@ 1 jacenai staff 49 Jan 24 2024 mimic-iv-2.2 -> /Users/jacenai/Desktop/BIOSTAT_203B/mimic-iv-2.2/
Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Answer: This is the content of the folder hosp:
ls -l ~/mimic/hosp
total 8970344
-rw-rw-r--@ 1 jacenai staff 15516088 Jan 5 2023 admissions.csv.gz
-rw-rw-r--@ 1 jacenai staff 427468 Jan 5 2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 jacenai staff 859438 Jan 5 2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 jacenai staff 578517 Jan 5 2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 jacenai staff 12900 Jan 5 2023 d_labitems.csv.gz
-rw-rw-r--@ 1 jacenai staff 25070720 Jan 5 2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 jacenai staff 7426955 Jan 5 2023 drgcodes.csv.gz
-rw-rw-r--@ 1 jacenai staff 508524623 Jan 5 2023 emar.csv.gz
-rw-rw-r--@ 1 jacenai staff 471096030 Jan 5 2023 emar_detail.csv.gz
-rw-rw-r--@ 1 jacenai staff 1767138 Jan 5 2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 1939088924 Jan 5 2023 labevents.csv.gz
drwxr-xr-x@ 3 jacenai staff 96 Feb 28 2024 labevents.parquet
-rw-r--r--@ 1 jacenai staff 56623104 Feb 9 2024 labevents_filtered.csv.gz
-rw-rw-r--@ 1 jacenai staff 96698496 Jan 5 2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 36124944 Jan 5 2023 omr.csv.gz
-rw-rw-r--@ 1 jacenai staff 2312631 Jan 5 2023 patients.csv.gz
-rw-rw-r--@ 1 jacenai staff 398753125 Jan 5 2023 pharmacy.csv.gz
-rw-rw-r--@ 1 jacenai staff 498505135 Jan 5 2023 poe.csv.gz
-rw-rw-r--@ 1 jacenai staff 25477219 Jan 5 2023 poe_detail.csv.gz
-rw-rw-r--@ 1 jacenai staff 458817415 Jan 5 2023 prescriptions.csv.gz
-rw-rw-r--@ 1 jacenai staff 6027067 Jan 5 2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 jacenai staff 122507 Jan 5 2023 provider.csv.gz
-rw-rw-r--@ 1 jacenai staff 6781247 Jan 5 2023 services.csv.gz
-rw-rw-r--@ 1 jacenai staff 36158338 Jan 5 2023 transfers.csv.gz
This is the content of the folder icu:
ls -l ~/mimic/icu
total 6155968
-rw-rw-r--@ 1 jacenai staff 35893 Jan 5 2023 caregiver.csv.gz
-rw-rw-r--@ 1 jacenai staff 2467761053 Jan 5 2023 chartevents.csv.gz
drwxr-xr-x@ 4 jacenai staff 128 Feb 23 2024 chartevents.parquet
-rw-rw-r--@ 1 jacenai staff 57476 Jan 5 2023 d_items.csv.gz
-rw-rw-r--@ 1 jacenai staff 45721062 Jan 5 2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 2614571 Jan 5 2023 icustays.csv.gz
-rw-rw-r--@ 1 jacenai staff 251962313 Jan 5 2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 324218488 Jan 5 2023 inputevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 38747895 Jan 5 2023 outputevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 20717852 Jan 5 2023 procedureevents.csv.gz
These data files are distributed as .csv.gz rather than .csv because the .csv.gz extension indicates that each file is compressed with gzip. The dataset contains comprehensive information for each patient's hospital stay, so the uncompressed files would require a huge amount of storage. Compression reduces the file sizes, making them quicker to download and transfer and requiring less bandwidth, which matters for users with limited internet connections.
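As a quick check of how much the compression helps, gzip -l reports the compressed size, uncompressed size, and compression ratio of a .gz file (a minimal sketch; for files larger than 4 GB the reported uncompressed size can be unreliable because it is stored in 32 bits):
# report compressed vs. uncompressed size for one of the smaller files
gzip -l ~/mimic/hosp/admissions.csv.gz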
Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.
Answer:
zcat is used to display the content of one or more compressed files on the standard output.
zless is used to view the contents of a compressed file one page at a time. It provides a convenient way to scroll through the contents of a compressed file.
zmore is similar to zless and serves as a pager for compressed files and is used to view the contents of a compressed file one page at a time. It primarily supports forward navigation through the content.
zgrep is used to search for a pattern within one or more compressed files.
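For illustration, a small sketch of each command applied to one of the MIMIC files (the admissions file and the URGENT pattern are arbitrary choices):
zcat < ~/mimic/hosp/admissions.csv.gz | head -3     # print decompressed content to standard output
zless ~/mimic/hosp/admissions.csv.gz                # page through, scrolling in both directions
zmore ~/mimic/hosp/admissions.csv.gz                # page through, forward only
zgrep -c "URGENT" ~/mimic/hosp/admissions.csv.gz    # count lines matching a pattern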
(Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done
Display the number of lines in each data file using a similar loop. (Hint: combine linux commands zcat < and wc -l.)
Answer:
Display the number of lines in each data file using a similar loop:
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  echo "Number of lines in $datafile: $(zcat < $datafile | wc -l)"
done
Number of lines in /Users/jacenai/mimic/hosp/admissions.csv.gz: 431232
Number of lines in /Users/jacenai/mimic/hosp/labevents.csv.gz: 118171368
zcat: (stdin): unexpected end of file
Number of lines in /Users/jacenai/mimic/hosp/labevents_filtered.csv.gz: 10568729
Number of lines in /Users/jacenai/mimic/hosp/patients.csv.gz: 299713
Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)
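The first few lines of admissions.csv.gz can be displayed with a command along these lines (output omitted here):
# header plus the first few data rows
zcat < ~/mimic/hosp/admissions.csv.gz | head -5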
# Count the number of rows in admissions.csv.gz
row_count=$(zcat < ~/mimic/hosp/admissions.csv.gz | wc -l)
echo "Number of rows in admissions.csv.gz: $row_count"
Number of rows in admissions.csv.gz: 431232
# Count the number of unique patients (identified by subject_id) in admissions.csv.gz:
subject_count=$(zcat < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F',' '{print $1}' | sort -u | wc -l)
echo "Number of unique patients in admissions.csv.gz (excluding header): $subject_count"
Number of unique patients in admissions.csv.gz (excluding header): 180733
# Count the number of unique patients (identified by subject_id) in patients.csv.gz:
subject_count=$(zcat < ~/mimic/hosp/patients.csv.gz | tail -n +2 | awk -F',' '{print $1}' | sort -u | wc -l)
echo "Number of unique patients in patients.csv.gz (excluding header): $subject_count"
Number of unique patients in patients.csv.gz (excluding header): 299712
Therefore, the number of patients listed in the patients.csv.gz file (299,712) does not match the number of unique patients in the admissions.csv.gz file (180,733); not every patient in patients.csv.gz has a hospital admission record.
What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)
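The tabulation commands were not echoed above; a sketch that would produce counts like these, looking up each column position from the header (the column is named race in MIMIC-IV v2.2, and plain comma splitting is assumed to be adequate for these columns), is:
for col in admission_type admission_location insurance race; do
  # find the 1-based position of the column in the header
  idx=$(zcat < ~/mimic/hosp/admissions.csv.gz | head -1 | tr ',' '\n' | grep -n -x "$col" | cut -d: -f1)
  echo "== $col =="
  # skip the header, extract the column, and count each unique value
  zcat < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F',' -v c="$idx" '{print $c}' | sort | uniq -c
done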
Get possible values taken by the variable admission_type and its count
6626 AMBULATORY OBSERVATION
19554 DIRECT EMER.
18707 DIRECT OBSERVATION
10565 ELECTIVE
94776 EU OBSERVATION
149413 EW EMER.
52668 OBSERVATION ADMIT
34231 SURGICAL SAME DAY ADMISSION
44691 URGENT
Get possible values taken by the variable admission_location and its count
185 AMBULATORY SURGERY TRANSFER
10008 CLINIC REFERRAL
232595 EMERGENCY ROOM
359 INFORMATION NOT AVAILABLE
4205 INTERNAL TRANSFER TO OR FROM PSYCH
5479 PACU
114963 PHYSICIAN REFERRAL
7804 PROCEDURE SITE
35974 TRANSFER FROM HOSPITAL
3843 TRANSFER FROM SKILLED NURSING FACILITY
15816 WALK-IN/SELF REFERRAL
Get possible values taken by the variable race (ethnicity) and its count
919 AMERICAN INDIAN/ALASKA NATIVE
6156 ASIAN
1198 ASIAN - ASIAN INDIAN
5587 ASIAN - CHINESE
506 ASIAN - KOREAN
1446 ASIAN - SOUTH EAST ASIAN
2530 BLACK/AFRICAN
59959 BLACK/AFRICAN AMERICAN
4765 BLACK/CAPE VERDEAN
2704 BLACK/CARIBBEAN ISLAND
7754 HISPANIC OR LATINO
437 HISPANIC/LATINO - CENTRAL AMERICAN
639 HISPANIC/LATINO - COLUMBIAN
500 HISPANIC/LATINO - CUBAN
4383 HISPANIC/LATINO - DOMINICAN
1330 HISPANIC/LATINO - GUATEMALAN
536 HISPANIC/LATINO - HONDURAN
665 HISPANIC/LATINO - MEXICAN
8076 HISPANIC/LATINO - PUERTO RICAN
892 HISPANIC/LATINO - SALVADORAN
560 MULTIPLE RACE/ETHNICITY
386 NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER
15102 OTHER
1761 PATIENT DECLINED TO ANSWER
1510 PORTUGUESE
505 SOUTH AMERICAN
1603 UNABLE TO OBTAIN
10668 UNKNOWN
272932 WHITE
1103 WHITE - BRAZILIAN
1170 WHITE - EASTERN EUROPEAN
7925 WHITE - OTHER EUROPEAN
5024 WHITE - RUSSIAN
To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
Answer:
First, comparing the file sizes: The compressed file size is 1.8G and the uncompressed file size is 13G. The uncompressed file size is more than 7 times larger than the compressed file size.
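The uncompressed copy was produced along the lines of the hint; a sketch of the commands behind the listing below:
# make an uncompressed copy while keeping the original .gz file
gzip -dk < ~/mimic/hosp/labevents.csv.gz > ~/mimic/hosp/labevents.csv
# compare the two file sizes
ls -lh ~/mimic/hosp/labevents.csv.gz ~/mimic/hosp/labevents.csv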
-rw-rw-r--@ 1 jacenai staff 1.8G Jan 5 2023 /Users/jacenai/mimic/hosp/labevents.csv.gz
-rw-r--r--@ 1 jacenai staff 13G Mar 7 15:08 /Users/jacenai/mimic/hosp/labevents.csv
Then, comparing the run times: from the output below, the run time of zcat | wc -l on the compressed file is slightly longer than that of wc -l on the uncompressed file, and noticeably longer in CPU (user) time because of the decompression work.
# run time of zcat on the compressed file
time zcat < ~/mimic/hosp/labevents.csv.gz | wc -l
# run time of wc on the uncompressed file
time wc -l ~/mimic/hosp/labevents.csv
118171368
real 0m14.751s
user 0m22.682s
sys 0m1.750s
118171368 /Users/jacenai/mimic/hosp/labevents.csv
real 0m14.347s
user 0m12.338s
sys 0m1.525s
Trade-off between storage and speed: compressed files save storage space but may require additional time for decompression during access. Uncompressed files provide faster access but consume more storage space. If storage space is a critical concern and access speed can be tolerated, compression is beneficial. However, if rapid access is crucial and storage space is not a limiting factor, using uncompressed files might be preferred.
# remove the large uncompressed file
rm ~/mimic/hosp/labevents.csv
Text Mining: Who's popular in Pride and Prejudice
You and your friend just have finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.
Explain what wget -nc does. Do not put this text file pg42671.txt in Git. Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
for char in Elizabeth Jane Lydia Darcy
do
  echo $char:
  # some bash commands here
done
Answer:
Explanation: The wget command with the option -nc is used to download a file from the web URL, but it will not overwrite an existing file. Specifically, wget is the command for retrieving files from the web; -nc stands for no-clobber. It ensures that wget won’t overwrite existing files. If the file already exists locally, wget will not download it again.
wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
for char in Elizabeth Jane Lydia Darcy
do
  echo "$char:"
  grep -o -i "$char" pg42671.txt | wc -l
done
The number of times each of the four characters is mentioned: Elizabeth 634, Jane 293, Lydia 171, Darcy 418. So Elizabeth is indeed the most mentioned.
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Answer:
In the first command echo 'hello, world' > test1.txt, > is used for output redirection and creates or overwrites the content of the specified file (test1.txt) with the output of the echo command. While for the second command echo 'hello, world' >> test2.txt, >> is also used for output redirection but appends the output of the echo command to the end of the specified file (test2.txt). In summary, > overwrites the file with new content, while >> appends the content to the end of the file.
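A small illustration of the difference (test1.txt is just a scratch file):
echo 'hello, world' > test1.txt     # creates the file (or overwrites it if it already exists)
echo 'hello, world' >> test1.txt    # appends a second line to the same file
wc -l test1.txt                     # now reports 2 lines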
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Use chmod to make the file executable by the owner, and run
./middle.sh pg42671.txt 20 5
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
Answer:
First, I used vi to type the script and saved the file as middle.sh; the same file can also be created with echo:
echo '#!/bin/sh' > middle.sh
echo '# Select lines from the middle of a file.' >> middle.sh
echo '# Usage: bash middle.sh filename end_line num_lines' >> middle.sh
echo 'head -n "$2" "$1" | tail -n "$3"' >> middle.sh
Then, I used chmod to make the file executable by the owner and ran the command:
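A sketch of those two commands (chmod u+x grants execute permission to the owner only):
chmod u+x middle.sh
./middle.sh pg42671.txt 20 5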
Release date: May 9, 2013 [eBook #42671]
Language: English
Output Explanation: The output is 5 lines from the middle of the file pg42671.txt, namely lines 16 through 20. pg42671.txt specifies the file to read, 20 tells head to extract the first 20 lines of the file, and 5 tells tail to keep the last 5 lines of that output.
Explanation of $1, $2, and $3:
Since the following command gives the same output as ./middle.sh pg42671.txt 20 5:
head -20 pg42671.txt | tail -5
Release date: May 9, 2013 [eBook #42671]
Language: English
I can conclude that:
$1: the first argument passed to the script. It represents the name of the input file; in this case, pg42671.txt.
$2: the second argument passed to the script. It represents the last line to take from the top of the file; in this case, 20, so head extracts the first 20 lines of pg42671.txt.
$3: the third argument passed to the script. It represents how many lines to keep from the end of that selection; in this case, 5, so tail keeps the last 5 lines of the output of head, and the result is printed to standard output.
Why do we need the first line of the shell script? / Purpose of #!/bin/sh:
The first line #!/bin/sh (the shebang) indicates which shell interpreter should execute the script. Here it specifies /bin/sh, the standard location of the Bourne shell. Without this line, the script might be interpreted by a different shell, and its behavior could vary.
More fun with Linux
Try following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
Answer:
cal
March 2025
Su Mo Tu We Th Fr Sa
1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31
cal: displays the current month’s calendar.
cal 2024
2024
January February March
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 1 2
7 8 9 10 11 12 13 4 5 6 7 8 9 10 3 4 5 6 7 8 9
14 15 16 17 18 19 20 11 12 13 14 15 16 17 10 11 12 13 14 15 16
21 22 23 24 25 26 27 18 19 20 21 22 23 24 17 18 19 20 21 22 23
28 29 30 31 25 26 27 28 29 24 25 26 27 28 29 30
31
April May June
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 4 1
7 8 9 10 11 12 13 5 6 7 8 9 10 11 2 3 4 5 6 7 8
14 15 16 17 18 19 20 12 13 14 15 16 17 18 9 10 11 12 13 14 15
21 22 23 24 25 26 27 19 20 21 22 23 24 25 16 17 18 19 20 21 22
28 29 30 26 27 28 29 30 31 23 24 25 26 27 28 29
30
July August September
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 1 2 3 4 5 6 7
7 8 9 10 11 12 13 4 5 6 7 8 9 10 8 9 10 11 12 13 14
14 15 16 17 18 19 20 11 12 13 14 15 16 17 15 16 17 18 19 20 21
21 22 23 24 25 26 27 18 19 20 21 22 23 24 22 23 24 25 26 27 28
28 29 30 31 25 26 27 28 29 30 31 29 30
October November December
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 1 2 1 2 3 4 5 6 7
6 7 8 9 10 11 12 3 4 5 6 7 8 9 8 9 10 11 12 13 14
13 14 15 16 17 18 19 10 11 12 13 14 15 16 15 16 17 18 19 20 21
20 21 22 23 24 25 26 17 18 19 20 21 22 23 22 23 24 25 26 27 28
27 28 29 30 31 24 25 26 27 28 29 30 29 30 31
cal 2024: displays the calendar for the year 2024.
cal 9 1752
September 1752
Su Mo Tu We Th Fr Sa
1 2 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
cal 9 1752: displays the calendar for September 1752. The calendar is unusual compared with a modern month because the switch from the Julian to the Gregorian calendar happened in September 1752, so the calendar is missing 11 days, from September 3 to September 13.
date
Fri Mar 7 15:08:47 PST 2025
date: Prints the current date and time.
hostname
Jacens-MacBook-Air.local
hostname: prints the hostname of the server that I’m currently logged into.
arch
arm64
arch command in Linux is used to display the architecture of the current system. It provides information about the instruction set architecture of the processor, helping users identify whether it’s a 32-bit or 64-bit system. In my case, my system is running on an arm64 architecture, and it means that my processor supports 64-bit instructions.
uname -a
Darwin Jacens-MacBook-Air.local 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:16:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T8112 arm64
uname -a command in Linux is used to display detailed information about the system's kernel and hardware. It provides a comprehensive set of details about the system, including the kernel name, network node hostname, kernel release, kernel version, machine hardware name, processor type, hardware platform, and the operating system. In my case, the kernel name is "Darwin," and the network node hostname is "Jacens-MacBook-Air.local." The kernel release, version 23.6.0, indicates the specific version of the Darwin kernel. The timestamp "Mon Jul 29 21:16:46 PDT 2024" denotes when the kernel was built.
uptime command in Linux is used to display the current time, how long the system has been running, and information about the system’s load averages. It provides a quick overview of the system’s status and activity.
who am i
jacenai Mar 7 15:08
who am i command in Linux is used to display information about the current user who is logged into the system. It provides details such as the username and the time the user logged in. In my case, the username is "jacenai," and the login time is "Mar 7 15:08."
who
jacenai console Mar 6 13:52
jacenai ttys000 Mar 7 14:40
jacenai ttys001 Mar 7 14:08
jacenai ttys002 Mar 7 14:09
jacenai ttys003 Mar 7 14:49
jacenai ttys004 Mar 7 14:41
jacenai ttys005 Mar 7 15:03
jacenai ttys007 Mar 7 14:56
jacenai ttys008 Mar 7 14:57
who command displays information about currently logged-in users. It provides details such as the username, terminal, and login time.
w command displays information about the currently logged-in users and their activities. It provides a summary of user-related information, including details about each user’s login session, the time they’ve been idle, and the commands they are currently running.
id command displays information about the user and group identities (ID) associated with the current user or a specified username.
last | head
jacenai ttys005 Fri Mar 7 15:03 still logged in
jacenai ttys008 Fri Mar 7 14:57 still logged in
jacenai ttys007 Fri Mar 7 14:56 still logged in
jacenai ttys006 Fri Mar 7 14:55 - 14:55 (00:00)
jacenai ttys005 Fri Mar 7 14:52 - 14:52 (00:00)
jacenai ttys003 Fri Mar 7 14:49 still logged in
jacenai ttys004 Fri Mar 7 14:41 still logged in
jacenai ttys000 Fri Mar 7 14:40 still logged in
jacenai ttys004 Fri Mar 7 14:39 - 14:39 (00:00)
jacenai ttys003 Fri Mar 7 14:37 - 14:37 (00:00)
last command in Linux is used to display information about previously logged-in users, including their login and logout times. When combined with the head command, it limits the output to the specified 10 newest lines.
echo {con,pre}{sent,fer}{s,ed} uses brace expansion to generate strings from the patterns enclosed in curly braces {}. The comma-separated values within each pair of braces are options, and the command generates all possible combinations of those options. In this case, the command outputs: consents consented confers confered presents presented prefers prefered (note that "confered" and "prefered" are literal concatenations rather than correctly spelled words).
time sleep 5
real 0m5.008s
user 0m0.000s
sys 0m0.001s
time command in Linux is used to measure the execution time of a given command or script, and sleep 5 pauses the shell for a duration of five seconds. When combined, the command measures the execution time of the sleep 5 command.
history | tail
history command in Linux is used to display previously executed commands. When combined with the tail command, it limits the output to the last 10 lines. In my case, since each command in this document is executed separately, the output of the history command is empty here; the listing below shows the expected output if the commands were run in a single interactive session.
# 100 arch
# 101 uname -a
# 102 uptime
# 103 who am i
# 104 who
# 105 w
# 106 id
# 107 last | head
# 108 time sleep 5
# 109 history | tail
Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)
The point of this exercise is (1) to get the book for free and (2) to see an example how a complicated project such as a book can be organized in a reproducible way.
For grading purposes, include a screenshot of Section 4.1.5 of the book here.
Answer: First, I forked the repository to create a copy under my GitHub account. Then, I used the following command to clone the repository to my local machine.
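The clone command itself was not echoed; a hypothetical sketch, with the GitHub username as a placeholder:
# clone my fork of the book repository (the username is a placeholder)
git clone https://github.com/<my-github-username>/rep-res-3rd-edition.git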
Next, I opened the project by clicking rep-res-3rd-edition.Rproj and compiled the book by clicking Build Book in the Build panel of RStudio, during which I downloaded a few R packages. Finally, I was able to build git_book. The following screenshot shows the output of Section 4.1.5 of the book.
Notes on Usage of Hoffman2
# log into the cluster
ssh jia@hoffman2.idre.ucla.edu
# Then enter the password
# request an interactive session (e.g., 4 hours and 4GB of memory)
qrsh -l h_rt=4:00:00,h_data=4G
# create a directory
mkdir mice_colon
# remove the directory
rm -rf Bed_Human_Blood_Reed-2023-9206
# remove the file
rm Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed
# display the content of the file
cat /u/home/j/jia/CellFi/Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed
# copy the directory/file/folder to the cluster
# when uploading to Hoffman2, make sure that I'm in the local directory
scp -r /Users/jacenai/Desktop/Matteo_Lab/LabResourse/CellFi/samples jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/CellFi
# rename the file
mv hoffman2_indiv_network.py hoffman2_indiv_network_1_12.py
# move the file to the directory
mv /u/home/j/jia/mice_colon/age_seperated_cor_matrix /u/home/j/jia/mice_colon/age_cluster/
# load the python module on hoffman2
module load python/2.7.15
# activate the virtual environment
source ./Env/bin/activate
# check the python version
python --version
# check the pandas version
python -c "import pandas as pd; print(pd.__version__)"
pip show pandas
# check the numpy version
python -c "import numpy as np; print(np.__version__)"
pip show numpy
# check the scipy version
python -c "import scipy as sp; print(sp.__version__)"
pip show scipy
# quit hoffman2
exit
Example: Run the .sh file on the cluster
module load R
qsub -l h_rt=4:00:00,h_data=8G -o $SCRATCH/metilene_job.out -j y -M jiachenai@g.ucla.edu -m bea -b y /u/home/j/jia/mice_colon/run_metilene_comparisons_hoffman2.sh
# $SCRATCH is an environment variable that refers to a user-specific directory
# within the high-performance scratch storage area. This directory is used for
# temporary files and data that are needed during a job's execution.
# The scratch space is not backed up and is intended for intermediate or
# temporary storage while running jobs. You can access your scratch directory
# by using the $SCRATCH variable in paths, or directly via /u/scratch/username,
# where username is your Hoffman2 username.

# quit the running job
qdel 4671427
# check the status of the job
qstat -u jia
# edit the .sh file using vi
vi run_metilene_comparisons_hoffman2.sh
# looking for cheatsheet for vi

# download the output file from the cluster to the local machine
scp -r jia@dtn2.hoffman2.idre.ucla.edu:/u/scratch/j/jia/metilene_job.out jacenai@jacens-air.ad.medctr.ucla.edu:/Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/Hoffman2_output_short/
# remember to Enable Remote Login on macOS: System Preferences > Sharing > Remote Login
# Then enter the password of the local machine
# check the local machine's user name and hostname
whoami
hostname

# Better way: run the following command on the local machine
# Use `scp` to pull the file from the cluster
scp jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
# Using `rsync` for large files
rsync -avz jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
# Verify the file transfer with `ls`
ls /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
Details on the run_metilene_comparisons_hoffman2.sh file.
Here I used a loop to compare the methylation levels of different groups of samples. The script is as follows:
#!/bin/bash
groups=("3M" "9M" "15M" "24M" "28M")
datafile="/u/home/j/jia/mice_colon/whole_data.tsv"
metilene_path="/u/home/j/jia/mice_colon/metilene_v0.2-8"
output_dir="/u/home/j/jia/mice_colon/output"

for ((i = 0; i < ${#groups[@]} - 1; i++)); do
  for ((j = i + 1; j < ${#groups[@]}; j++)); do
    group_a=${groups[i]}
    group_b=${groups[j]}
    output_file="${output_dir}/${group_a}_vs_${group_b}_output.txt"
    filtered_file="${output_dir}/${group_a}_vs_${group_b}_filtered.bed"

    echo "Running comparison: $group_a vs $group_b"
    # Run the comparison
    $metilene_path/metilene_linux64 -a "$group_a" -b "$group_b" -m 8 "$datafile" | sort -V -k1,1 -k2,2n > "$output_file"

    # Run the filtering process
    echo "Filtering results for $group_a vs $group_b"
    perl $metilene_path/metilene_output.pl -q "$output_file" -o "$filtered_file" -p 0.05 -c 8 -l 8 -a "$group_a" -b "$group_b"
  done
done
Run a python file on the cluster
First, prepare all the necessary files and data on the cluster. For example, I have a python file named hoffman2_indiv_network.py.
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import community.community_louvain as community_louvain  # Correct import for modularity calculation
import glob  # For file handling

# Function to process each file
def process_file(file_path):
    # Load the adjacency matrix
    adjacency_matrix = pd.read_csv(file_path, index_col=0)

    # Create a graph from the adjacency matrix
    G = nx.from_pandas_adjacency(adjacency_matrix)

    # 1. Calculate number of edges
    num_edges = G.number_of_edges()

    # 2. Centrality (degree centrality)
    degree_centrality = nx.degree_centrality(G)
    # Calculate average degree centrality
    average_degree_centrality = sum(degree_centrality.values()) / len(degree_centrality)

    # 3. Modularity
    partition = community_louvain.best_partition(G)
    modularity = community_louvain.modularity(partition, G)

    # 4. Clustering coefficient
    average_clustering_coefficient = nx.average_clustering(G)

    # Collect results in a dictionary
    results = {
        # Extract just the file name and keep the part before _adjacency_matrix.csv
        'Individual': file_path.split('/')[-1].split('_adjacency_matrix.csv')[0],
        'Number of Edges': num_edges,
        'Average Degree Centrality': average_degree_centrality,
        'Modularity': modularity,
        'Average Clustering Coefficient': average_clustering_coefficient
    }
    return results

# Create an empty DataFrame to store results
results_df = pd.DataFrame(columns=['Individual', 'Number of Edges', 'Average Degree Centrality',
                                   'Modularity', 'Average Clustering Coefficient'])

# Update the path to where your input files are located on Hoffman2
input_path = '/u/home/j/jia/mice_colon/indiv_network/individual_network_stat_1_12/'

# Collect all CSV files in the input directory
file_paths = glob.glob(input_path + '*.csv')

# Process each file and append results to the DataFrame
for file_path in file_paths:
    result = process_file(file_path)
    results_df = pd.concat([results_df, pd.DataFrame([result])], ignore_index=True)

# Save the results to a CSV file instead of displaying them
output_path = '/u/home/j/jia/mice_colon/indiv_network/output/'
results_df.to_csv(output_path + 'individual_network_results_1_12.csv', index=False)
print(f"Results saved to {output_path}")
Make sure that all required packages are installed on the cluster. If not, install them using the following commands in the terminal (on the cluster, not inside Python).
# first check all the modules available
module av
# check all the python modules available
module av python
# here's the example output:
# ------------------------- /u/local/Modules/modulefiles -------------------------
# python/2.7.15    python/3.6.8    python/3.9.6(default)
# python/2.7.18    python/3.7.3
# load the python module
module load python/3.9.6
pip3 install python-louvain --user   # python-louvain is the package name
# upload the script/code to the cluster
scp -r /Users/jacenai/Desktop/GSR_Project/Colon_Methylation/indiv_network_submit.sh jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/indiv_network
Then, create a .sh file (submission script) to run the python file. The .sh file is as follows:
#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on: " `hostname -s`
echo "Job $JOB_ID started on: " `date`
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Install required Python packages if not already installed
echo "Installing required Python packages..."
pip3 install --user pandas numpy scipy tqdm

# Optional: verify the Python environment
python3 --version
pip3 list | grep -E "pandas|numpy|scipy|tqdm"

# Run the Python script
echo "Running wasserstein_distance_hoffman2.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_distance_hoffman2.py

# Check for errors during execution
if [ $? -ne 0 ]; then
  echo "Job $JOB_ID failed on: " `hostname -s`
  echo "Job $JOB_ID failed on: " `date`
  echo "Please check the log file for details: joblog.$JOB_ID"
  exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on: " `hostname -s`
echo "Job $JOB_ID ended on: " `date`
Another example (with creating virtual environment):
#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on: " `hostname -s`
echo "Job $JOB_ID started on: " `date`
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Set up a virtual environment
VENV_DIR="/u/home/j/jia/python_envs/wasserstein_env"
if [ ! -d "$VENV_DIR" ]; then
  echo "Creating virtual environment..."
  python3 -m venv $VENV_DIR
fi

# Activate the virtual environment
source $VENV_DIR/bin/activate

# Install required Python packages (if not already installed)
REQUIRED_PACKAGES="pandas numpy scipy tqdm networkx python-louvain"
for PACKAGE in $REQUIRED_PACKAGES; do
  pip3 show $PACKAGE > /dev/null 2>&1 || pip3 install $PACKAGE
done

# Optional: verify the Python environment
echo "Python version:"
python3 --version
echo "Installed packages:"
pip3 list | grep -E "pandas|numpy|scipy|tqdm|networkx|python-louvain"

# Run the Python script
echo "Running wasserstein_indiv_network_1_7.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_indiv_network_1_7.py

# Check for errors during execution
if [ $? -ne 0 ]; then
  echo "Job $JOB_ID failed on: " `hostname -s`
  echo "Job $JOB_ID failed on: " `date`
  echo "Please check the log file for details: joblog.$JOB_ID"
  exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on: " `hostname -s`
echo "Job $JOB_ID ended on: " `date`
Then, submit the job to the cluster using the following command:
# make the .sh file executable
chmod +x hoffman2_indiv_network.sh
# run the .sh file directly
./organize_wasserstein.sh
# before submitting the job, make sure the .sh file is executable and request enough resources;
# the resources requested here need not match those in the .sh file if you submit with `qsub`,
# but if you run the script inside a `qrsh` session you need to request the resources here
qrsh -l h_rt=4:00:00,h_data=4G
module load python/3.9.6
# go to the directory where the .sh file is located
cd /u/home/j/jia/mice_colon/indiv_network
qsub hoffman2_indiv_network.sh
After submitting the job, you can check the status of the job using the following command:
# check the log file for the job
# in the run directory
ls -lh
cat joblog.<job_id>
# check the status of your own jobs
qstat -u $USER
# or check the status of all jobs
qstat
# or check the status of a specific job
qstat <job_id>
# or check the status of all jobs in a queue
qstat -q
# or check the status of all jobs in a queue with more details
qstat -g c
# or check the status of all jobs in a queue with even more details
qstat -g t
# or check the status of all jobs in a queue with the most details
qstat -f
# or check the status of all jobs in a queue with the most details and filter by user
qstat -f -u $USER
If the storage is full, you can delete the files that are not needed using the following command:
# Ensure you have sufficient disk space and are not exceeding your quota on Hoffman2
quota -v
The quota output indicates whether you have exceeded your disk usage quota on any of the filesystems; a * next to the blocks column confirms that you are over quota on that filesystem.
# Check your disk usage
du -sh /u/home/j/jia/mice_colon/indiv_network
# Check your home directory for large or unnecessary files
du -sh /u/home/j/jia/* | sort -h
du -sh /u/home/j/jia/mice_colon/indiv_network/* | sort -h
# Delete files that are no longer needed
rm /u/home/j/jia/mice_colon/indiv_network/output/individual_network_results_1_12.csv
# Delete a directory
rm -r /u/home/j/jia/mice_colon/indiv_network/output
# Check your disk usage again
du -sh /u/home/j/jia/mice_colon/indiv_network
# check the most space-consuming files
du -ah ~ | sort -h | tail -n 20
Run code on Jupyter notebook: (run this command on local machine)
# run the command
./h2jupynb -u jia -t 24 -m 10   # -t runtime in hours, -m memory in GB per core
# check the help message
./h2jupynb --help
h2jupynb [-u ] [-v ] [-t <time, integer number of hours>] [-m <memory, integer number of GB per core>] [-e <parallel environment: 1 for shared, 2 for distributed>] [-s ] [-o ] [-x ] [-a ] [-d ] [-g ] [-c ] [-l ] [-p ] [-j ] [-k ] [-b ] [-z <write ssh debug files?:yes/no>].
How to reconnect to a Jupyter notebook session on Hoffman2:
# First, get the information of the running Jupyter notebook
# run this on Hoffman2
qstat -u $USER
# Knowing the job ID
qstat -j 7164149
qstat -j <job_id>
# Find the information at the bottom of the output:
# exec_host_list 1: n7361:1
# n7361 is the exec host where the Jupyter notebook is running on Hoffman2

# Second, ssh to the exec host
# ssh means secure shell
ssh jia@n7361
ssh jia@<exec_host>

# Load the appropriate Python module
# For instance, you can load the Anaconda module, which comes with Jupyter Notebook pre-installed.
module load anaconda3/2020.11
# Verify the Jupyter installation
jupyter --version
# Start the Jupyter Notebook server
jupyter notebook --no-browser --ip=0.0.0.0 --port=8692
# This command starts the Jupyter Notebook server on the exec host, listening on all IP addresses (0.0.0.0) and on the specified port (8692).

# Finally, on your local machine, run the following command to connect to the Jupyter notebook
ssh -L 8692:n7361:8692 jia@hoffman2.idre.ucla.edu
ssh -L <local_port>:<exec_host>:<exec_host_port> <username>@hoffman2.idre.ucla.edu
# Open a web browser and navigate to the following URL:
# http://localhost:8692
# http://localhost:<local_port>
# This URL connects to the Jupyter Notebook server running on the exec host through the SSH tunnel.
RStudio on Hoffman2
Create a permanent library for R on Hoffman2
When you install an R package in RStudio, it typically goes to:
# in R, run this command
.libPaths()
A list of applications available via modules
all_apps_via_modules
# or the module for a particular application can be searched via:
modules_lookup -m <app-name>
Creating a Permanent R Library Directory.
Step 1:
Choose a location where you want to store R packages permanently:
# bash command
# Choose a location where you want to store R packages permanently:
mkdir -p $HOME/R/library
# Or, if you prefer scratch storage (faster, but not backed up):
mkdir -p $SCRATCH/R/library
Run code on RStudio on Hoffman2
Step 2:
Run RStudio inside the container
# bash command
# Request an interactive job:
qrsh -l h_rt=4:00:00,h_data=4G   # this does not influence the running storage
# Load the apptainer module:
module load apptainer
# Set up a large temporary directory for RStudio to use:
mkdir -p $SCRATCH/rstudio_large_tmp
export TMPDIR=$SCRATCH/rstudio_large_tmp
# Set the R_LIBS_USER environment variable to the location you chose in Step 1:
export R_LIBS_USER=$HOME/R/library
# You can replace export RSTUDIO_VERSION=4.1.0 with any RStudio version available on Hoffman2
export RSTUDIO_VERSION=4.1.0
# Then, launch RStudio:
# This will display information and an ssh -L ... command to run in a separate terminal.
apptainer run -B $SCRATCH/rstudio_large_tmp:/tmp \
  -B $SCRATCH/rstudiotmp/var/lib:/var/lib/rstudio-server \
  -B $SCRATCH/rstudiotmp/var/run:/var/run/rstudio-server \
  $H2_CONTAINER_LOC/h2-rstudio_${RSTUDIO_VERSION}.sif
# Connect to the compute node's port:
# Open another new SSH tunnel on your local computer by running:
ssh -L 8787:nXXX:8787 username@hoffman2.idre.ucla.edu   # Or whatever command was displayed earlier
# Access RStudio in your web browser:
# http://localhost:8787   (or whatever port number was displayed)
# To exit RStudio, press [CTRL-C]
Alternative: Use Environment Variables in RStudio
Step 1:
Running RStudio: Automated Script
./h2_rstudio.sh -u H2USERNAME
This will start RStudio as a qrsh job, open a port tunnel, and allow you to access RStudio in your web browser.
Step 2:
Set the Library Path in RStudio.
Once RStudio is running inside the container, open the R console and run:
.libPaths(c("~/R/library", .libPaths()))
This tells R to first check your home directory (~/R/library) before using the default container library.
To make this permanent, add this to your .Rprofile in your home directory:
cat('.libPaths(c("~/R/library", .libPaths()))\n', file = "~/.Rprofile", append = TRUE)
# You can verify the installation location using:
.libPaths()