Basic Commands
# log into the cluster
ssh jia@hoffman2.idre.ucla.edu
# Then enter the password
# request an interactive session (e.g., 4 hours and 4GB of memory)
qrsh -l h_rt=4:00:00,h_data=4G
# create a directory
mkdir mice_colon
# remove a directory
rm -rf Bed_Human_Blood_Reed-2023-9206
# remove a file
rm Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed
# display the contents of a file
cat /u/home/j/jia/CellFi/Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed
# copy a directory/file/folder to the cluster
# when uploading to Hoffman2, make sure that I'm in the local directory
scp -r /Users/jacenai/Desktop/Matteo_Lab/LabResourse/CellFi/samples jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/CellFi
# rename a file
mv hoffman2_indiv_network.py hoffman2_indiv_network_1_12.py
# move a file to a directory
mv /u/home/j/jia/mice_colon/age_seperated_cor_matrix /u/home/j/jia/mice_colon/age_cluster/
# load the python module on hoffman2
module load python/2.7.15
# activate the virtual environment
source ./Env/bin/activate
# check the python version
python --version
# check the pandas version
python -c "import pandas as pd; print(pd.__version__)"
pip show pandas
# check the numpy version
python -c "import numpy as np; print(np.__version__)"
pip show numpy
# check the scipy version
python -c "import scipy as sp; print(sp.__version__)"
pip show scipy
# Get the information of my resource usage
myresources
# quit hoffman2
exit
Example: Run the .sh file on the cluster
module load R
qsub -l h_rt=4:00:00,h_data=8G -o $SCRATCH/metilene_job.out -j y -M jiachenai@g.ucla.edu -m bea -b y /u/home/j/jia/mice_colon/run_metilene_comparisons_hoffman2.sh
# $SCRATCH is an environment variable that refers to a user-specific directory
# within the high-performance scratch storage area. This directory is used for
# temporary files and data that are needed during a job's execution.
# The scratch space is not backed up and is intended for intermediate or
# temporary storage while running jobs. You can access your scratch directory
# by using the $SCRATCH variable in paths, or directly via /u/scratch/username,
# where username is your Hoffman2 username.
# quit a running job
qdel 4671427
# check the status of the job
qstat -u jia
# edit the .sh file using vi
vi run_metilene_comparisons_hoffman2.sh
# look up a cheatsheet for vi if needed
# download the output file from the cluster to the local machine
scp -r jia@dtn2.hoffman2.idre.ucla.edu:/u/scratch/j/jia/metilene_job.out jacenai@jacens-air.ad.medctr.ucla.edu:/Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/Hoffman2_output_short/
# remember to Enable Remote Login on macOS: System Preferences > Sharing > Remote Login
# Then enter the password of the local machine
# check the local machine's user name and hostname
whoami
hostname
# Better way: run the following command on the local machine
# Use `scp` to pull the file from the cluster
scp jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
# Use `rsync` for large files
rsync -avz jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
# Verify the file transfer with `ls`
ls /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
Details on the run_metilene_comparisons_hoffman2.sh file.
Here I used a loop to compare the methylation levels of different groups of samples. The script is as follows:
#!/bin/bash
groups=("3M" "9M" "15M" "24M" "28M")
datafile="/u/home/j/jia/mice_colon/whole_data.tsv"
metilene_path="/u/home/j/jia/mice_colon/metilene_v0.2-8"
output_dir="/u/home/j/jia/mice_colon/output"

for ((i = 0; i < ${#groups[@]} - 1; i++)); do
  for ((j = i + 1; j < ${#groups[@]}; j++)); do
    group_a=${groups[i]}
    group_b=${groups[j]}
    output_file="${output_dir}/${group_a}_vs_${group_b}_output.txt"
    filtered_file="${output_dir}/${group_a}_vs_${group_b}_filtered.bed"

    echo "Running comparison: $group_a vs $group_b"
    # Run the comparison
    $metilene_path/metilene_linux64 -a "$group_a" -b "$group_b" -m 8 "$datafile" | sort -V -k1,1 -k2,2n > "$output_file"

    # Run the filtering process
    echo "Filtering results for $group_a vs $group_b"
    perl $metilene_path/metilene_output.pl -q "$output_file" -o "$filtered_file" -p 0.05 -c 8 -l 8 -a "$group_a" -b "$group_b"
  done
done
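The nested loop enumerates every unordered pair of groups, so with 5 age groups metilene runs 10 comparisons. The enumeration itself can be checked on its own, independent of metilene:

```shell
#!/bin/bash
# Enumerate all unordered pairs of groups, exactly as the loop above does
groups=("3M" "9M" "15M" "24M" "28M")
count=0
for ((i = 0; i < ${#groups[@]} - 1; i++)); do
  for ((j = i + 1; j < ${#groups[@]}; j++)); do
    echo "${groups[i]}_vs_${groups[j]}"
    count=$((count + 1))
  done
done
echo "total: $count"  # 5 choose 2 = 10 comparisons
```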
Run a Python file on the cluster
First, prepare all the necessary files and data on the cluster. For example, I have a Python file named hoffman2_indiv_network.py.
import glob  # For file handling

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import community.community_louvain as community_louvain  # Correct import for modularity calculation

# Function to process each file
def process_file(file_path):
    # Load the adjacency matrix
    adjacency_matrix = pd.read_csv(file_path, index_col=0)
    # Create a graph from the adjacency matrix
    G = nx.from_pandas_adjacency(adjacency_matrix)
    # 1. Calculate number of edges
    num_edges = G.number_of_edges()
    # 2. Centrality (degree centrality)
    degree_centrality = nx.degree_centrality(G)
    # Calculate average degree centrality
    average_degree_centrality = sum(degree_centrality.values()) / len(degree_centrality)
    # 3. Modularity
    partition = community_louvain.best_partition(G)
    modularity = community_louvain.modularity(partition, G)
    # 4. Clustering coefficient
    average_clustering_coefficient = nx.average_clustering(G)
    # Collect results in a dictionary
    results = {
        # Extract just the file name, keeping the part before _adjacency_matrix.csv
        'Individual': file_path.split('/')[-1].split('_adjacency_matrix.csv')[0],
        'Number of Edges': num_edges,
        'Average Degree Centrality': average_degree_centrality,
        'Modularity': modularity,
        'Average Clustering Coefficient': average_clustering_coefficient
    }
    return results

# Create an empty DataFrame to store results
results_df = pd.DataFrame(columns=['Individual', 'Number of Edges', 'Average Degree Centrality',
                                   'Modularity', 'Average Clustering Coefficient'])

# Update the path to where your input files are located on Hoffman2
input_path = '/u/home/j/jia/mice_colon/indiv_network/individual_network_stat_1_12/'

# Collect all CSV files in the input directory
file_paths = glob.glob(input_path + '*.csv')

# Process each file and append results to the DataFrame
for file_path in file_paths:
    result = process_file(file_path)
    results_df = pd.concat([results_df, pd.DataFrame([result])], ignore_index=True)

# Save the results to a CSV file instead of displaying them
output_path = '/u/home/j/jia/mice_colon/indiv_network/output/'
results_df.to_csv(output_path + 'individual_network_results_1_12.csv', index=False)
print(f"Results saved to {output_path}")
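The 'Individual' field comes entirely from the input file name. A quick standalone check of that parsing logic, using a made-up path:

```python
# Hypothetical file path, just to illustrate the ID extraction used above
file_path = '/u/home/j/jia/mice_colon/indiv_network/individual_network_stat_1_12/mouse_03_adjacency_matrix.csv'

# Take the file name, then keep the part before "_adjacency_matrix.csv"
individual = file_path.split('/')[-1].split('_adjacency_matrix.csv')[0]
print(individual)  # mouse_03
```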
Make sure that all required packages are installed on the cluster. If not, install them using the following commands in the terminal (on the cluster, not inside Python):
# first check all the modules available
module av
# check all the python modules available
module av python
# here's the example output:
# ------------------------- /u/local/Modules/modulefiles -------------------------
# python/2.7.15    python/3.6.8    python/3.9.6(default)
# python/2.7.18    python/3.7.3
# load the python module
module load python/3.9.6
pip3 install python-louvain --user  # replace python-louvain with the package name
# upload the script/code to the cluster
scp -r /Users/jacenai/Desktop/GSR_Project/Colon_Methylation/indiv_network_submit.sh jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/indiv_network
Then, create a .sh file (submission script) to run the python file. The .sh file is as follows:
#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on: " `hostname -s`
echo "Job $JOB_ID started on: " `date`
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Install required Python packages if not already installed
echo "Installing required Python packages..."
pip3 install --user pandas numpy scipy tqdm

# Optional: Verify the Python environment
python3 --version
pip3 list | grep -E "pandas|numpy|scipy|tqdm"

# Run the Python script
echo "Running wasserstein_distance_hoffman2.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_distance_hoffman2.py

# Check for errors during execution
if [ $? -ne 0 ]; then
    echo "Job $JOB_ID failed on: " `hostname -s`
    echo "Job $JOB_ID failed on: " `date`
    echo "Please check the log file for details: joblog.$JOB_ID"
    exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on: " `hostname -s`
echo "Job $JOB_ID ended on: " `date`
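The error check at the end hinges on `$?`, which holds the exit status of the most recent command (0 on success, non-zero on failure). A minimal standalone sketch of the same pattern:

```shell
#!/bin/bash
# $? is the exit status of the previous command: 0 = success, non-zero = failure
false  # a command that always fails (exit status 1)
if [ $? -ne 0 ]; then
    echo "previous command failed"
fi

true   # a command that always succeeds (exit status 0)
if [ $? -ne 0 ]; then
    echo "this line is never printed"
fi
```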
Another example (which also creates a virtual environment):
#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on: " `hostname -s`
echo "Job $JOB_ID started on: " `date`
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Set up a virtual environment
VENV_DIR="/u/home/j/jia/python_envs/wasserstein_env"
if [ ! -d "$VENV_DIR" ]; then
    echo "Creating virtual environment..."
    python3 -m venv $VENV_DIR
fi

# Activate the virtual environment
source $VENV_DIR/bin/activate

# Install required Python packages (if not already installed)
REQUIRED_PACKAGES="pandas numpy scipy tqdm networkx python-louvain"
for PACKAGE in $REQUIRED_PACKAGES; do
    pip3 show $PACKAGE > /dev/null 2>&1 || pip3 install $PACKAGE
done

# Optional: Verify the Python environment
echo "Python version:"
python3 --version
echo "Installed packages:"
pip3 list | grep -E "pandas|numpy|scipy|tqdm|networkx|python-louvain"

# Run the Python script
echo "Running wasserstein_indiv_network_1_7.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_indiv_network_1_7.py

# Check for errors during execution
if [ $? -ne 0 ]; then
    echo "Job $JOB_ID failed on: " `hostname -s`
    echo "Job $JOB_ID failed on: " `date`
    echo "Please check the log file for details: joblog.$JOB_ID"
    exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on: " `hostname -s`
echo "Job $JOB_ID ended on: " `date`
Then, submit the job to the cluster using the following command:
# make the .sh file executable
chmod +x hoffman2_indiv_network.sh
# run the .sh file directly
./organize_wasserstein.sh
# Before submitting, make sure the .sh file is executable and that enough
# resources are requested. With `qsub`, the job's resources come from the
# #$ directives inside the .sh file, so they need not match an interactive
# session's. If you instead run the script inside a `qrsh` session, request
# the resources when starting that session:
qrsh -l h_rt=4:00:00,h_data=4G
module load python/3.9.6
# go to the directory where the .sh file is located
cd /u/home/j/jia/mice_colon/indiv_network
qsub hoffman2_indiv_network.sh
After submitting the job, you can check the status of the job using the following command:
# check the log file for the job (in the run directory)
ls -lh
cat joblog.<job_id>
# check the status of your own jobs
qstat -u $USER
# or check the status of all jobs
qstat
# or check the status of a specific job
qstat -j <job_id>
# or check the status of the queues
qstat -q
# or show a cluster-level summary of the queues
qstat -g c
# or show job information for each queue instance
qstat -g t
# or show full queue and job details
qstat -f
# or show full details, filtered by user
qstat -f -u $USER
If your storage is full, you can delete files that are no longer needed using the following commands:
# Ensure you have sufficient disk space and are not exceeding your quota on Hoffman2
quota -v
The quota output indicates whether you have exceeded your disk usage quota on any filesystem: a * next to the blocks column confirms that you are over quota.
# Check your disk usage
du -sh /u/home/j/jia/mice_colon/indiv_network
# Check your home directory for large or unnecessary files
du -sh /u/home/j/jia/* | sort -h
du -sh /u/home/j/jia/mice_colon/indiv_network/* | sort -h
# Delete files that are no longer needed
rm /u/home/j/jia/mice_colon/indiv_network/output/individual_network_results_1_12.csv
# Delete a directory
rm -r /u/home/j/jia/mice_colon/indiv_network/output
# Check your disk usage again
du -sh /u/home/j/jia/mice_colon/indiv_network
# check the most space-consuming files
du -ah ~ | sort -h | tail -n 20
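`sort -h` in the pipelines above orders human-readable sizes (K < M < G) numerically rather than alphabetically, which is why the largest entries end up at the bottom for `tail` to pick out. A quick check with made-up sizes:

```shell
#!/bin/bash
# sort -h understands the human-readable size suffixes produced by du -h
printf '2G\n512K\n100M\n' | sort -h
# prints 512K, then 100M, then 2G
```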
Run code in a Jupyter notebook (run this command on the local machine):
# run the command
./h2jupynb -u jia -t 24 -m 10  # -t: runtime in hours, -m: memory in GB per core
# check the help message
./h2jupynb --help
h2jupynb [-u ] [-v ] [-t <time, integer number of hours>] [-m <memory, integer number of GB per core>] [-e <parallel environment: 1 for shared, 2 for distributed>] [-s ] [-o ] [-x ] [-a ] [-d ] [-g ] [-c ] [-l ] [-p ] [-j ] [-k ] [-b ] [-z <write ssh debug files?:yes/no>].
How to reconnect to a Jupyter notebook session on Hoffman2:
# First, get the information of the running Jupyter notebook
# (run this on Hoffman2)
qstat -u $USER
# Knowing the job ID:
qstat -j 7164149
qstat -j <job_id>
# Find this line near the bottom of the output:
# exec_host_list 1: n7361:1
# n7361 is the exec host where the Jupyter notebook is running on Hoffman2

# Second, ssh to the exec host (ssh = secure shell)
ssh jia@n7361
ssh jia@<exec_host>

# Load the appropriate Python module.
# For instance, the Anaconda module comes with Jupyter Notebook pre-installed.
module load anaconda3/2020.11
# Verify the Jupyter installation
jupyter --version
# Start the Jupyter Notebook server
jupyter notebook --no-browser --ip=0.0.0.0 --port=8692
# This starts the Jupyter Notebook server on the exec host, listening on all
# IP addresses (0.0.0.0) and on the specified port (8692).

# Finally, on your local machine, run the following command to connect to the notebook
ssh -L 8692:n7361:8692 jia@hoffman2.idre.ucla.edu
ssh -L <local_port>:<exec_host>:<exec_host_port> <username>@hoffman2.idre.ucla.edu
# Open a web browser and navigate to:
# http://localhost:8692
# http://localhost:<local_port>
# This URL connects to the Jupyter Notebook server running on the exec host
# through the SSH tunnel.
RStudio on Hoffman2
Create a permanent library for R on Hoffman2
When you install an R package in RStudio, it typically goes to:
# in R, run this command
.libPaths()
A list of applications available via modules
all_apps_via_modules
# or the module for a particular application can be searched via:
modules_lookup -m <app-name>
Creating a Permanent R Library Directory.
Step 1:
Choose a location where you want to store R packages permanently:
# bash command
# Choose a location where you want to store R packages permanently:
mkdir -p $HOME/R/library
# Or, if you prefer scratch storage (faster, but not backed up):
mkdir -p $SCRATCH/R/library
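`mkdir -p` creates any missing parent directories and exits successfully if the target already exists, so either command is safe to re-run. A quick demonstration with a throwaway path (the path below is made up):

```shell
#!/bin/bash
# -p: create parent directories as needed; no error if the path already exists
mkdir -p /tmp/demo_r_lib/R/library
mkdir -p /tmp/demo_r_lib/R/library  # re-running is a harmless no-op
ls -d /tmp/demo_r_lib/R/library     # the directory exists
rm -r /tmp/demo_r_lib               # clean up the demo directory
```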
Run code on RStudio on Hoffman2
Step 2:
Run RStudio inside the container
# bash command
# Request an interactive job:
qrsh -l h_rt=4:00:00,h_data=4G  # this does not influence the running storage
# Load the apptainer module:
module load apptainer
# Set up a large temporary directory for RStudio to use:
mkdir -p $SCRATCH/rstudio_large_tmp
export TMPDIR=$SCRATCH/rstudio_large_tmp
# Set the R_LIBS_USER environment variable to the location you chose in Step 1:
export R_LIBS_USER=$HOME/R/library
# You can replace RSTUDIO_VERSION=4.1.0 with any RStudio version available on Hoffman2
export RSTUDIO_VERSION=4.1.0
# Then, launch RStudio. This will display information and an ssh -L ... command
# to run in a separate terminal.
apptainer run \
    -B $SCRATCH/rstudio_large_tmp:/tmp \
    -B $SCRATCH/rstudiotmp/var/lib:/var/lib/rstudio-server \
    -B $SCRATCH/rstudiotmp/var/run:/var/run/rstudio-server \
    $H2_CONTAINER_LOC/h2-rstudio_${RSTUDIO_VERSION}.sif
# Connect to the compute node's port: open another new SSH tunnel on your
# local computer by running the following (or whatever command was displayed earlier):
ssh -L 8787:nXXX:8787 username@hoffman2.idre.ucla.edu
# Access RStudio in your web browser:
# http://localhost:8787  (or whatever port number was displayed)
# To exit RStudio, press [CTRL-C]
Alternative: Use Environment Variables in RStudio
Step 1:
Running RStudio: Automated Script
./h2_rstudio.sh -u H2USERNAME
This will start RStudio as a qrsh job, open a port tunnel, and allow you to access RStudio in your web browser.
Step 2:
Set the Library Path in RStudio.
Once RStudio is running inside the container, open the R console and run:
.libPaths(c("~/R/library", .libPaths()))
This tells R to first check your home directory (~/R/library) before using the default container library.
To make this permanent, add this to your .Rprofile in your home directory:
cat('.libPaths(c("~/R/library", .libPaths()))\n', file = "~/.Rprofile", append = TRUE)
# You can verify the installation location using:
.libPaths()