Using Prism

Prism is equipped with several powerful servers built specifically for accelerating AI/ML/DL workloads with GPUs. This cutting-edge platform is easy to access, and the preinstalled software and libraries provide foundational tools that enable scientists to get the most out of their workflows.

Environment

System        Sockets  Cores per socket  Total Cores  Memory (GB)  NVMe Storage (TB)  GPUs
gpu[001-022]  2        20                40           768          3.8                4x NVIDIA V100 32GB
gpu100        2        64                128          1024         14                 8x NVIDIA A100 40GB
Operating System and Software

The systems come with the following software and libraries pre-installed. However, users will have to use Anaconda environments to access Python machine-learning packages.


Access

To gain access to the Prism GPU Cluster, please contact NCCS User Support and request access to Prism on ADAPT. You may connect by logging in to adapt.nccs.nasa.gov and then ssh'ing to gpulogin1. Once you are connected to the login node, you will need to use SLURM to access the Prism GPU resources.
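
For example, the login sequence might look like the following sketch (replace the username placeholder with your own NCCS username):

ssh <YOUR_NCCS_USERNAME>@adapt.nccs.nasa.gov   # log in to ADAPT
ssh gpulogin1                                  # hop to the Prism login node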

For more information on SLURM, see the 'SLURM' section below.

For more information on access and login, see NCCS Account Setup.

Tools

Ganglia
To see gpu0 utilization on every node, go to: Prism Ganglia

All nodes have four GPUs; to see utilization for GPUs 1, 2, and 3, select the corresponding “gpuX_util” metric. Once the DGX is added, it will report GPUs 4-7 as well.

JupyterHub
See here for information on accessing and using the current JupyterHubs on ADAPT.

Anaconda
Anaconda environments have been used to install the Python machine learning frameworks. These environments can be accessed by loading the anaconda module with:

  • Prism GPU Cluster: ‘module load anaconda’

Once the module is loaded, activate the environment of your choice by running ‘conda activate <ENV>’.

Users can inspect the complete list of packages and versions installed within an environment by running: $ conda list

Users can also inspect other available environments by running: $ conda env list
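
Putting these commands together, a typical session might look like the following sketch (<ENV> stands for whichever environment you choose):

module load anaconda
conda env list        # list the available environments
conda activate <ENV>  # activate the environment of your choice
conda list            # inspect the packages and versions it provides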

Users may also create Anaconda environments in their home directory. This will allow users to maintain the environment on their own.
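
As a minimal sketch, assuming you want a personal environment under your home directory (the path and packages below are only examples):

module load anaconda
conda create --prefix $HOME/envs/my-ml-env python=3.9   # create a personal environment in your home directory
conda activate $HOME/envs/my-ml-env                     # activate it by path
conda install numpy                                     # install packages into it as needed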

For more information regarding Anaconda usage in ADAPT, see our Instructional Video and Tech Talk slides.

If you are experiencing issues with Anaconda, and/or if additional package installation support is needed, please contact NCCS support.

Modules
Users should load both the Anaconda and NVIDIA modules on Prism using 'module load nvidia' and 'module load anaconda'; use 'module spider' to see what else is available. For more information on how to load modules, see the Tips & Info for New NCCS ADAPT Users Tech Talk slides under "Modules".
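
For example, the start of a session or batch script might load the modules like this (a minimal sketch based on the commands above):

module load nvidia     # load the NVIDIA module
module load anaconda   # load the Anaconda module
module spider          # browse other available modules and versions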

SLURM

SLURM allows for more efficient resource allocation, fairer sharing, and easier management of the resources on the GPU nodes. There are three main ways to interface with SLURM on the GPU nodes:

  1. ‘sbatch’: Submit a batch script to SLURM. Create a job script that can be submitted to the queue and that calls multiple tasks from within.

  2. ‘srun’: Specify resources for running a single command or execute a job step.

  3. ‘salloc’: Run interactively on allocated resources, or run a step by step job.

All three of these mechanisms share most, but not all, of the standard SLURM configuration flags. Some of the most useful are as follows:

Flag                 Description
-G <NUM>             Specifies the total number of GPUs to be allocated to the job.
-t <TIME>            Sets a time limit for your jobs and allocations. Acceptable formats include '-t <MINUTES>', '-t HH:MM:SS', and '-t D-HH:MM:SS'.
--nodelist=<NODES>   Specifies the nodes you would like your jobs to run on. By default the pool includes all available nodes, but you can name one node (or several, separated by commas) to restrict where your work will run. This is not recommended, though, since you may end up waiting in a queue for a particular system when other resources are already available.
-n <NUM_TASKS>       Specifies the number of tasks to run. In sbatch this is the maximum number of tasks to run at any given time, which allows adequate resources to be allocated at job submission.
-c <CPUS>            Specifies the number of processors to be allocated to each task.
-N <NUM_NODES>       Specifies the number of nodes to run on.
-J <JOB_NAME>        Names your job.
--mem=<SIZE>         Specifies the minimum amount of memory to allocate per node.


Examples & Links to Documentation:
Here are some examples to get you started. If you desire more advanced configuration to optimize your jobs, reference the documentation (linked below) for each command to see the complete list of available flags and usage options.

SBATCH: ‘sbatch job.sh’
#!/bin/bash
#SBATCH -G5 -t 60 -n5 -N1 -J myBatchScriptSLURMJob --export=ALL
module load anaconda
conda activate <ENV>
# Run tasks in parallel with '&' and 'wait'
srun -G3 -n1 python my_program1.py &
srun -G2 -n1 python my_program2.py &
wait
# Run tasks sequentially without '&'
srun -G5 -n1 python my_program3.py
srun -G5 -n1 python my_program4.py

SRUN:
srun -G2 -t 60 -n1 --mem-per-cpu=100 -J myOneLineSLURMJob python myScript.py

SALLOC:
salloc -G1 -t 60 -n1 -c6 --mem-per-cpu=1028 --nodelist=gpu001 -J myInteractiveSLURMJob

SALLOC for the DGX (NVIDIA A100s):
salloc -G2 -p dgx

Running salloc will give you an interactive shell with access to the specified resources. This is similar to SSHing into one of the nodes; however, the resources you can use will be limited to those requested through your allocation.
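
As a sketch, an interactive session might look like the following, assuming salloc drops you into a shell on the allocated node as described above (the environment and script names are placeholders):

salloc -G1 -t 60 -n1 -J myInteractiveSLURMJob
nvidia-smi                 # confirm the allocated GPU is visible
module load anaconda
conda activate <ENV>
python myScript.py
exit                       # release the allocation when finished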

Other Useful SLURM Commands for Managing Your Jobs:

  • squeue: Shows the list of jobs in the current queue.
  • scancel <JOB_ID>: Cancel one of your active jobs.
  • sinfo: Lists available resources in the SLURM cluster.
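
For example (the job ID below is only a placeholder):

squeue -u $(whoami)   # show only your own jobs
scancel 12345         # cancel job 12345
sinfo                 # list available nodes and partitions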

Learn more about SLURM on ADAPT

High Speed Data Storage

If your workload is limited by file I/O, each system has several TB of high-speed NVMe-based storage available for use. This storage is located at “/lscratch”. Run the command "mkdir /lscratch/$(whoami)" to create a directory owned by you. The storage is local to the node and is not shared with other nodes in the cluster. Note that this space is temporary, and data stored there is subject to deletion at the completion of your job.
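
As a rough sketch, a job could stage its data through /lscratch like this (the data paths, script name, and its --data option are hypothetical):

mkdir -p /lscratch/$(whoami)                            # create your local scratch directory
cp -r $HOME/mydata /lscratch/$(whoami)/                 # stage input data onto the fast local NVMe
python my_program.py --data /lscratch/$(whoami)/mydata  # run against the local copy
cp -r /lscratch/$(whoami)/results $HOME/                # copy results back before the job ends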