HPC Cluster

CGD/ISG maintains a small high-performance computing (HPC) cluster that's ideal for model development and low-resolution model runs. Best of all, it's FREE! Yep, we don't charge GAUs.

As of March 2021, the hardware specifications for the cluster izumi.cgd.ucar.edu are:

  • 48 x Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
  • 14 nodes, each with 96 GB of memory and 48 cores

As of August 2016, the hardware specifications for the cluster hobart.cgd.ucar.edu are:

  • 32 x Dell R430 servers
  • R430 specifications: 96 GB RAM, 48 x Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
  • Interconnect: 40 Gb/s QLogic InfiniBand
  • 1536 processors total

For more information on the cluster, see the sections below.

Cluster Accounts

The cluster is designed for multi-threaded, multi-node code written specifically to take advantage of the architecture. In other words, software not written for high-performance computing systems, e.g., MATLAB, will not automatically run faster just because it's run on a cluster.

The CGD cluster is a comparatively small cluster that is ideal for testing, debugging, and scientific problems at lower resolutions or with smaller domain sizes. It provides a solid alternative to spending precious GAUs and/or cycles on other supercomputers.

Accounts on the CGD cluster are available to CGD members and designated collaborators. Please send a request to help@ucar.edu.

Available Software

In addition to the software you'll normally find in /usr/local, the following packages were compiled specifically for the cluster.

Compilers

  • GNU - gcc, gfortran
  • Intel Cluster Studio - icc, icpc, ifort
  • Portland Group - pgcc, pgCC, pgc++, pgfortran
  • NAG - nagfor

Debuggers

  • GDB
  • Intel
  • PGI
  • TotalView

MPI Libraries

  • mvapich
  • mvapich2
  • openmpi
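
As a rough sketch of the typical MPI workflow (the module and file names here are examples only; run "module avail" to see the MPI builds actually installed), a program is compiled with the wrapper compiler supplied by whichever library you load and launched with mpiexec:

# Load a compiler/MPI environment (example module name taken from the job script below).
module load compiler/intel/default

# Build a (hypothetical) MPI Fortran program with the wrapper compiler.
mpif90 -o hello_mpi hello_mpi.f90

# Quick test on 4 ranks.
mpiexec -np 4 ./hello_mpi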

NetCDF

NetCDF is compiled against each of the available compilers. Note that when you load the module for your compiler, the NetCDF libraries are also loaded.
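
As a minimal sketch of how this is typically used (assuming the netCDF-Fortran nf-config utility is on your PATH after the module load; the source file name is just a placeholder), a Fortran program can be built against the NetCDF build that matches the loaded compiler:

module load compiler/intel/default

# nf-config reports the compile and link flags for the NetCDF build
# paired with the currently loaded compiler.
ifort -o read_data read_data.f90 `nf-config --fflags` `nf-config --flibs`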

Modules

Modules are used to quickly and easily set up your computing environment. Loading a module correctly sets the paths to your binaries and libraries so you don't have to do it manually. Only a few simple commands are needed; run "module help" to see all options:

  • module avail - see what environments are available
  • module load - load your environment
  • module list - see what you have loaded
  • module purge - remove all modules
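
For example, a typical session looks like this (compiler/intel/default is the module used in the sample job script below; substitute whatever "module avail" lists for your compiler):

module avail                        # see what environments are available
module load compiler/intel/default  # load the Intel compiler environment
module list                         # confirm what is loaded
module purge                        # remove all modules and start clean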

Missing Software?

Are we missing software that you need to do your work? If so, send an email to help@ucar.edu and we can discuss its installation on the cluster.

Submitting Jobs

The following qsub script provides a comprehensive template for submitting jobs to the queue manager.

#!/bin/sh
#

### Job name
#
#PBS -N MostExcellentJob

### Declare job non-rerunable
#
#PBS -r n

### Output files - sort to top of directory.
#
#PBS -e AAA_MostExcellentJob.err
#PBS -o AAA_MostExcellentJob.log

# Mail to user
#
#PBS -m ae

### Queue name (short, medium, long, verylong)
#
#PBS -q medium
#
# Number of nodes, number of processors
#
# nodes = physical host
# ppn = processors per node (i.e., number of cores)
#
#PBS -l nodes=1:ppn=48

#
# This job's working directory
#
echo `date`
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR

# May be necessary for some OMP jobs.
#
#export KMP_STACKSIZE=50m

echo "Environment:"
echo "--------------"
echo ""

# Print out some job information for debugging.
#
echo Running ./MostExcellentJob on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Configure the run environment.
#
module load compiler/intel/default

# Convert the host file to use IB
#
/cluster/bin/make_ib_hosts.sh

# Get the number of procs by counting the nodes file,
# which was generated from the #PBS -l line above.
#
NPROCS=`wc -l < $PBS_NODEFILE`

echo "Node File:"
echo "----------"
cat "$PBS_NODEFILE"
echo ""

# Run the parallel MPI executable
#
echo "`date` mpiexec - Start"

mpiexec -v -np $NPROCS ./MostExcellentJob

echo ""
echo "`date` MPIRUN - END"

exit 0

Then submit the job using:

/usr/local/torque/bin/qsub <script>
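
For example, if the template above is saved as MostExcellentJob.sh (the file name is arbitrary), a typical submit-and-monitor sequence looks like:

/usr/local/torque/bin/qsub MostExcellentJob.sh   # prints the job id
qstat -u $USER                                   # check the status of your jobs
qdel <jobid>                                     # delete a job if needed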

Interactive Sessions

An interactive session for debugging can be started with a qsub command of the form:

qsub -I -q <queue> -l nodes=<#nodes>:ppn=<processors per node>

See an example below:

% qsub -I -q medium -l nodes=2:ppn=48
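
Once the interactive shell opens on a compute node, set up the environment and run just as you would in a batch script. A rough sketch (the executable name is only a placeholder):

module load compiler/intel/default
cd $PBS_O_WORKDIR
mpiexec -np `wc -l < $PBS_NODEFILE` ./MostExcellentJob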

Queues

Available queues and limits on the cluster can be displayed with 'qstat -q':

% qstat -q

server: hobart.cgd.ucar.edu

Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
short            --     16:00:00 02:00:00 8    2   0   50 E R
overnight        --     18432:00 32:00:00 16   0   0   10 E S
default          --     --       --       --   0   0   10 E R
monster          --     17280:00 3000:00: 32   0   0   10 E R
shared           --     12:00:00 12:00:00 1    0   0   16 E R
medium           --     72:00:00 08:00:00 6    0   0   50 E R
verylong         --     480:00:0 80:00:00 8    0   0   50 E R
long             --     240:00:0 40:00:00 8    2   0   50 E R

The monster queue is reserved for systems testing and is not available for general use.

The overnight queue is turned on at 6:00 pm nightly and off at 6:00 am. It allows more nodes per job and longer run times than are available during the day. Jobs still running in the overnight queue at 6:00 am are killed; other queues are not affected.
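
To target the overnight queue, request it explicitly, either on the qsub command line or in the script's #PBS directives. For example (the node count is only an illustration, up to the queue's limit shown above):

qsub -q overnight MostExcellentJob.sh

or, in the job script:

#PBS -q overnight
#PBS -l nodes=16:ppn=48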