Notes on Usage of Hoffman2

Author

Jiachen Ai

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.2.0     cli_3.6.4        
 [5] tools_4.3.1       htmltools_0.5.8.1 rstudioapi_0.16.0 yaml_2.3.8       
 [9] rmarkdown_2.27    knitr_1.47        jsonlite_1.8.8    xfun_0.44        
[13] digest_0.6.35     rlang_1.1.5       evaluate_0.23    

Basic Commands

# log into the cluster
ssh jia@hoffman2.idre.ucla.edu

# Then enter the password

# request an interactive session (e.g., 4 hours and 4GB of memory)
qrsh -l h_rt=4:00:00,h_data=4G
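# you can also request multiple cores; a sketch, assuming -pe shared as used
# in the submission scripts below (note that h_data is requested per core)
qrsh -l h_rt=4:00:00,h_data=4G -pe shared 4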

# create a directory
mkdir mice_colon

# remove the directory
rm -rf Bed_Human_Blood_Reed-2023-9206

# remove the file
rm Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed

# display the contents of a file
cat /u/home/j/jia/CellFi/Bed_Human_Blood_Reed-2023-9206/CIVR_UGA6_009C.bed

# copy the directory/file/folder to the cluster
# when uploading to Hoffman2, make sure to run this from the local machine
scp -r /Users/jacenai/Desktop/Matteo_Lab/LabResourse/CellFi/samples jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/CellFi

# rename the file
mv hoffman2_indiv_network.py hoffman2_indiv_network_1_12.py

# move the file to the directory
mv /u/home/j/jia/mice_colon/age_seperated_cor_matrix /u/home/j/jia/mice_colon/age_cluster/

# load the python module on hoffman2
module load python/2.7.15

# activate the virtual environment
source ./Env/bin/activate

# check the python version
python --version

# check the pandas version
python -c "import pandas as pd; print(pd.__version__)"
pip show pandas

# check the numpy version
python -c "import numpy as np; print(np.__version__)"
pip show numpy

# check the scipy version
python -c "import scipy as sp; print(sp.__version__)"
pip show scipy
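
# or check all three versions in one line
python -c "import pandas, numpy, scipy; print(pandas.__version__, numpy.__version__, scipy.__version__)"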

# Get the information of my resource usage
myresources

# quit hoffman2
exit

Example: Run a .sh file on the cluster

module load R

qsub -l h_rt=4:00:00,h_data=8G -o $SCRATCH/metilene_job.out -j y -M jiachenai@g.ucla.edu -m bea -b y /u/home/j/jia/mice_colon/run_metilene_comparisons_hoffman2.sh
# $SCRATCH is an environment variable that points to your user-specific
# directory in the high-performance scratch storage area. It is meant for
# temporary or intermediate files needed while a job runs; scratch space is
# NOT backed up. You can reference it as $SCRATCH in paths, or directly as
# /u/scratch/<first letter of username>/<username> (e.g., /u/scratch/j/jia).
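
# for example, to see where your scratch directory is and what it holds:
echo $SCRATCH
ls -lh $SCRATCH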

# delete (kill) a running job by its job ID
qdel 4671427

# check the status of the job
qstat -u jia

# edit the .sh file using vi
vi run_metilene_comparisons_hoffman2.sh
# see the vi cheat sheet section below


# download the output file from the cluster to the local machine
# (run this on Hoffman2; it pushes the file to the Mac over SSH)
scp -r jia@dtn2.hoffman2.idre.ucla.edu:/u/scratch/j/jia/metilene_job.out jacenai@jacens-air.ad.medctr.ucla.edu:/Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/Hoffman2_output_short/
# remember to Enable Remote Login on macOS: System Preferences > Sharing > Remote Login
# Then enter the password of the local machine


# check the local machine's user name and hostname
whoami

hostname


# Better way: run the following command on local machine
# Use `scp` to Pull the File from the Cluster
scp jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/

# Using `rsync` for Large Files
rsync -avz jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/
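
# `rsync` can also resume an interrupted transfer; a sketch with two flags I
# find useful: --partial keeps partially transferred files, --progress prints
# per-file status
rsync -avz --partial --progress jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/wasserstein_network/output/wasserstein_network_results_78_82.csv /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/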

# Verify the File Transfer with `ls` command 
ls /Users/jacenai/Desktop/GSR_Project/Mice_Colon_Data/wasserstein_network/

Cheat sheet for vi

Details on the run_metilene_comparisons_hoffman2.sh file.

Here I use a nested loop to run metilene on every pair of age groups and then filter the results with metilene_output.pl. The script is as follows:

#!/bin/bash

groups=("3M" "9M" "15M" "24M" "28M")
datafile="/u/home/j/jia/mice_colon/whole_data.tsv"
metilene_path="/u/home/j/jia/mice_colon/metilene_v0.2-8"
output_dir="/u/home/j/jia/mice_colon/output"

for ((i=0; i<${#groups[@]}-1; i++)); do
    for ((j=i+1; j<${#groups[@]}; j++)); do
        group_a=${groups[i]}
        group_b=${groups[j]}
        output_file="${output_dir}/${group_a}_vs_${group_b}_output.txt"
        filtered_file="${output_dir}/${group_a}_vs_${group_b}_filtered.bed"
        
        echo "Running comparison: $group_a vs $group_b"
        
        # Run the comparison
        $metilene_path/metilene_linux64 -a "$group_a" -b "$group_b" -m 8 "$datafile" | sort -V -k1,1 -k2,2n > "$output_file"
        
        # Run the filtering process
        echo "Filtering results for $group_a vs $group_b"
        perl $metilene_path/metilene_output.pl -q "$output_file" -o "$filtered_file" -p 0.05 -c 8 -l 8 -a "$group_a" -b "$group_b"
        
    done
done
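
One caveat: the loop assumes the output directory already exists, and the redirect into $output_file fails otherwise. A minimal guard (my own addition, not part of the original script) is to create it up front:

# create the output directory before running the comparisons
mkdir -p /u/home/j/jia/mice_colon/output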

Run a Python file on the cluster

First, prepare all the necessary files and data on the cluster. For example, I have a Python file named hoffman2_indiv_network.py:

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt  # imported for optional plotting; not used below
import community.community_louvain as community_louvain  # Correct import for modularity calculation
import glob  # For file handling

# Function to process each file
def process_file(file_path):
    
    # Load the adjacency matrix
    adjacency_matrix = pd.read_csv(file_path, index_col=0)
    
    # Create a graph from the adjacency matrix
    G = nx.from_pandas_adjacency(adjacency_matrix)

    # 1. Calculate Number of Edges
    num_edges = G.number_of_edges()

    # 2. Centrality (Degree Centrality)
    degree_centrality = nx.degree_centrality(G)

    # Calculate Average Degree Centrality
    average_degree_centrality = sum(degree_centrality.values()) / len(degree_centrality)

    # 3. Modularity
    partition = community_louvain.best_partition(G)
    modularity = community_louvain.modularity(partition, G)

    # 4. Clustering Coefficient
    average_clustering_coefficient = nx.average_clustering(G)

    # Collect results in a dictionary
    results = {
        # Extract just the file name and also keep the part before _adjacency_matrix.csv
        'Individual': file_path.split('/')[-1].split('_adjacency_matrix.csv')[0],
        'Number of Edges': num_edges,
        'Average Degree Centrality': average_degree_centrality,
        'Modularity': modularity,
        'Average Clustering Coefficient': average_clustering_coefficient
    }

    return results


# Create an empty DataFrame to store results
results_df = pd.DataFrame(columns=['Individual', 
                                   'Number of Edges', 
                                   'Average Degree Centrality', 
                                   'Modularity', 
                                   'Average Clustering Coefficient'])

# Update the path to where your input files are located on Hoffman2
input_path = '/u/home/j/jia/mice_colon/indiv_network/individual_network_stat_1_12/' 

# Collect all CSV files in the input directory
file_paths = glob.glob(input_path + '*.csv')

# Process each file and append results to the DataFrame
for file_path in file_paths:
    result = process_file(file_path)
    results_df = pd.concat([results_df, pd.DataFrame([result])], ignore_index=True)

# Save the results to a CSV file instead of displaying them
output_path = '/u/home/j/jia/mice_colon/indiv_network/output/'
output_file = output_path + 'individual_network_results_1_12.csv'
results_df.to_csv(output_file, index=False)

print(f"Results saved to {output_file}")

Make sure all the required packages are installed on the cluster. If not, install them with the following commands in the terminal (on the cluster, not inside Python):

# first check all the modules available
module av

# check all the python modules available
module av python

# here's the example output:
# ------------------------- /u/local/Modules/modulefiles -------------------------
# python/2.7.15  python/3.6.8  python/3.9.6(default)  
# python/2.7.18  python/3.7.3 

# load the python module
module load python/3.9.6

# install a package for your user account (replace python-louvain with the package name)
pip3 install --user python-louvain
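
# quick sanity check that the Louvain package imports under the name the
# script uses (see the import at the top of hoffman2_indiv_network.py)
python3 -c "import community.community_louvain as community_louvain; print('python-louvain OK')"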

# upload the script/code to the cluster
scp -r /Users/jacenai/Desktop/GSR_Project/Colon_Methylation/indiv_network_submit.sh jia@dtn2.hoffman2.idre.ucla.edu:/u/home/j/jia/mice_colon/indiv_network

Then, create a .sh file (submission script) to run the Python file. The .sh file is as follows:

#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on:   " `hostname -s`
echo "Job $JOB_ID started on:   " `date `
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Install required Python packages if not already installed
echo "Installing required Python packages..."
pip3 install --user pandas numpy scipy tqdm networkx python-louvain

# Optional: Verify the Python environment
python3 --version
pip3 list | grep -E "pandas|numpy|scipy|tqdm|networkx|louvain"

# Run the Python script
echo "Running hoffman2_indiv_network.py..."
python3 /u/home/j/jia/mice_colon/indiv_network/hoffman2_indiv_network.py

# Check for errors during execution
if [ $? -ne 0 ]; then
  echo "Job $JOB_ID failed on:   " `hostname -s`
  echo "Job $JOB_ID failed on:   " `date `
  echo "Please check the log file for details: joblog.$JOB_ID"
  exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on:   " `hostname -s`
echo "Job $JOB_ID ended on:   " `date `

Another example (this time creating a virtual environment):

#!/bin/bash
#$ -cwd                        # Run the job in the current directory
#$ -o joblog.$JOB_ID           # Save the job output in a file named "joblog.<job_id>"
#$ -j y                        # Combine output and error logs
#$ -l h_rt=24:00:00,h_data=8G  # Set runtime to 24 hours and memory to 8 GB
#$ -pe shared 1                # Request 1 CPU core
#$ -M jiachenai@g.ucla.edu     # Email notifications to your UCLA email
#$ -m beas                     # Send email at the start, end, abort, or suspend

# Print job start information
echo "Job $JOB_ID started on:   " `hostname -s`
echo "Job $JOB_ID started on:   " `date`
echo " "

# Initialize the module environment
. /u/local/Modules/default/init/modules.sh

# Load Python module
module load python/3.9.6

# Set up a virtual environment
VENV_DIR="/u/home/j/jia/python_envs/wasserstein_env"
if [ ! -d "$VENV_DIR" ]; then
  echo "Creating virtual environment..."
  python3 -m venv $VENV_DIR
fi

# Activate the virtual environment
source $VENV_DIR/bin/activate

# Install required Python packages (if not already installed)
REQUIRED_PACKAGES="pandas numpy scipy tqdm networkx python-louvain"
for PACKAGE in $REQUIRED_PACKAGES; do
  pip3 show $PACKAGE > /dev/null 2>&1 || pip3 install $PACKAGE
done

# Optional: Verify the Python environment
echo "Python version:"
python3 --version
echo "Installed packages:"
pip3 list | grep -E "pandas|numpy|scipy|tqdm|networkx|python-louvain"

# Run the Python script
echo "Running wasserstein_indiv_network_1_7.py..."
python3 /u/home/j/jia/mice_colon/wasserstein_network/wasserstein_indiv_network_1_7.py

# Check for errors during execution
if [ $? -ne 0 ]; then
  echo "Job $JOB_ID failed on:   " `hostname -s`
  echo "Job $JOB_ID failed on:   " `date`
  echo "Please check the log file for details: joblog.$JOB_ID"
  exit 1
fi

# Print job end information
echo "Job $JOB_ID completed successfully on:   " `hostname -s`
echo "Job $JOB_ID ended on:   " `date`
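
Before submitting, it can help to activate the same virtual environment in a short interactive session and confirm that the imports work; a sketch, reusing the module and venv paths from the script above:

qrsh -l h_rt=1:00:00,h_data=4G
module load python/3.9.6
source /u/home/j/jia/python_envs/wasserstein_env/bin/activate
python3 -c "import pandas, numpy, scipy, networkx; print('imports OK')"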

Then, submit the job to the cluster using the following commands:

# make the .sh file executable
chmod +x hoffman2_indiv_network.sh

# or run a .sh file directly (shown here with a different script, as an example)
./organize_wasserstein.sh

# Before submitting, make sure the .sh file is executable. With `qsub`, the
# scheduler takes the resource requests from the `#$ -l ...` lines inside the
# submission script, so they need not match any interactive session you have
# open; with `qrsh`, you request the resources on the command line instead.
qrsh -l h_rt=4:00:00,h_data=4G

module load python/3.9.6

# go to the directory where the .sh file is located
cd /u/home/j/jia/mice_colon/indiv_network

qsub hoffman2_indiv_network.sh

After submitting the job, you can check its status using the following commands:

# check the log file for the job
# in the run directory
ls -lh
cat joblog.<job_id>


# check the status of your own jobs
qstat -u $USER

# or check the status of all jobs
qstat

# or check the status of a specific job (detailed view)
qstat -j <job_id>

# or check the jobs in a specific queue
qstat -q <queue_name>

# or show a one-line summary of each cluster queue
qstat -g c

# or show jobs expanded into their individual tasks/slots
qstat -g t

# or show the full listing of every queue and the jobs running in it
qstat -f

# or show the full listing filtered by user
qstat -f -u $USER
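
# To keep an eye on a job without retyping, `watch` can re-run qstat
# periodically (a general Linux utility, not Hoffman2-specific; Ctrl-C stops it):
watch -n 30 qstat -u $USER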

If your storage is full, you can delete files that are no longer needed using the following commands:

# Ensure you have sufficient disk space and are not exceeding your quota on Hoffman2
quota -v

# In the quota output, a * next to the blocks column indicates that you have
# exceeded your disk quota on that filesystem.

# Check your disk usage
du -sh /u/home/j/jia/mice_colon/indiv_network

# Check your home directory for large or unnecessary files
du -sh /u/home/j/jia/* | sort -h
du -sh /u/home/j/jia/mice_colon/indiv_network/* | sort -h

# Delete files that are no longer needed
rm /u/home/j/jia/mice_colon/indiv_network/output/individual_network_results_1_12.csv

# Delete a directory
rm -r /u/home/j/jia/mice_colon/indiv_network/output

# Check your disk usage again
du -sh /u/home/j/jia/mice_colon/indiv_network

# check the most space-consuming files
du -ah ~ | sort -h | tail -n 20
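
# another way to hunt down space hogs (a sketch; adjust the size threshold):
# list all files under the home directory larger than 1 GB
find ~ -type f -size +1G -exec ls -lh {} \;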

Run code in a Jupyter notebook (run these commands on the local machine):

# run the command
./h2jupynb -u jia -t 24 -m 10  # -t runtime in hours, -m memory in GB per core

# check the help message
./h2jupynb --help

h2jupynb [-u <username>] [-v ] [-t <time, integer number of hours>] [-m <memory, integer number of GB per core>] [-e <parallel environment: 1 for shared, 2 for distributed>] [-s ] [-o ] [-x ] [-a ] [-d ] [-g ] [-c ] [-l ] [-p ] [-j ] [-k ] [-b ] [-z <write ssh debug files?:yes/no>]
# (most bracketed argument descriptions were lost in these notes; run ./h2jupynb --help for the full usage)
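
# for example, a shorter shared-memory session (a sketch using only the -u,
# -t, -m, and -e flags described above):
./h2jupynb -u jia -t 8 -m 4 -e 1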

How to reconnect to a Jupyter notebook session on Hoffman2:

# First, get the information of the running Jupyter notebook 
# run this on Hoffman2
qstat -u $USER

# Then, knowing the job ID, get the job details
qstat -j 7164149
qstat -j <job_id>

# Find the information at the bottom of the output: 
# exec_host_list        1:    n7361:1
# n7361 is the exec host where the Jupyter notebook is running on Hoffman2


# Second, ssh to the exec host (ssh = secure shell)
ssh jia@n7361
ssh jia@<exec_host>

# Load the Appropriate Python Module
# For instance, you can load the Anaconda module, which comes with Jupyter Notebook pre-installed.
module load anaconda3/2020.11

# Verify the Jupyter Installation
jupyter --version

# Start the Jupyter Notebook Server
jupyter notebook --no-browser --ip=0.0.0.0 --port=8692
# This starts the Jupyter Notebook server on the exec host, listening on all
# of its network interfaces (0.0.0.0) on the specified port (8692).
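
# Note: on startup Jupyter prints a URL containing an access token; keep it,
# as the browser may prompt for the token after you connect through the SSH
# tunnel below.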



# Finally on your local machine, run the following command to connect to the Jupyter notebook
ssh -L 8692:n7361:8692 jia@hoffman2.idre.ucla.edu
ssh -L <local_port>:<exec_host>:<exec_host_port> <username>@hoffman2.idre.ucla.edu

# Open a web browser and navigate to the following URL:
http://localhost:8692
http://localhost:<local_port>
# This URL connects to the Jupyter Notebook server running on the exec host through the SSH tunnel.

RStudio on Hoffman2

Create a permanent library for R on Hoffman2

When you install an R package in RStudio, it typically goes to the first directory returned by:

# in R, run this command
.libPaths()

A list of applications available via modules can be printed with:

all_apps_via_modules

# or the module for a particular application can be searched via:
modules_lookup -m <app-name>

Creating a Permanent R Library Directory.

Step 1:

Choose a location where you want to store R packages permanently:

# bash command

# Choose a location where you want to store R packages permanently:
mkdir -p $HOME/R/library

# Or, if you prefer scratch storage (faster, but not backed up):
mkdir -p $SCRATCH/R/library
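
To make batch jobs and plain R sessions pick up this library as well, you can export R_LIBS_USER from your shell startup file; a sketch, assuming the $HOME/R/library location chosen above:

# append the export to ~/.bashrc so every new shell picks it up
echo 'export R_LIBS_USER=$HOME/R/library' >> ~/.bashrc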

Run code on RStudio on Hoffman2

Step 2:

Run RStudio inside the container

# bash command

# Request an interactive job:
qrsh -l h_rt=4:00:00,h_data=4G  # note: h_data requests memory, not disk storage

# Load the apptainer module:
module load apptainer

# Set up a large temporary directory for RStudio to use:
mkdir -p $SCRATCH/rstudio_large_tmp
export TMPDIR=$SCRATCH/rstudio_large_tmp

# set the R_LIBS_USER environment variable to the location you chose in Step 1:
export R_LIBS_USER=$HOME/R/library


# You can replace export RSTUDIO_VERSION=4.1.0 with any RStudio version available on Hoffman2.
# This will display information and an ssh -L ... command to run in a separate terminal.
export RSTUDIO_VERSION=4.1.0

# Then, launch RStudio (create the bind directories first; apptainer fails if
# a bind source does not exist):
mkdir -p $SCRATCH/rstudiotmp/var/lib $SCRATCH/rstudiotmp/var/run
apptainer run -B $SCRATCH/rstudio_large_tmp:/tmp \
              -B $SCRATCH/rstudiotmp/var/lib:/var/lib/rstudio-server \
              -B $SCRATCH/rstudiotmp/var/run:/var/run/rstudio-server \
              $H2_CONTAINER_LOC/h2-rstudio_${RSTUDIO_VERSION}.sif

# Connect to the compute node's port:
# Open another new SSH tunnel on your local computer by running:
ssh -L 8787:nXXX:8787 username@hoffman2.idre.ucla.edu  # or whatever command was displayed earlier
# Access RStudio in your web browser:
http://localhost:8787  # or whatever port number was displayed

# To exit RStudio, press Ctrl-C in the terminal running the container.

Alternative: Use Environment Variables in RStudio

Step 1:

Running RStudio: Automated Script

./h2_rstudio.sh -u H2USERNAME

This will start RStudio as a qrsh job, open a port tunnel, and allow you to access RStudio in your web browser.

Step 2:

Set the Library Path in RStudio.

Once RStudio is running inside the container, open the R console and run:

.libPaths(c("~/R/library", .libPaths()))

This tells R to first check your home directory (~/R/library) before using the default container library.

To make this permanent, add this to your .Rprofile in your home directory:

cat('.libPaths(c("~/R/library", .libPaths()))\n', file = "~/.Rprofile", append = TRUE)

# You can verify the installation location using:
.libPaths()

Step 3:

Install Packages to the Permanent Directory

install.packages("ggplot2", lib="~/R/library")