Welcome to CNQO!
Here are some notes for how to use the CNQO lab computers and servers.
Accessing the System
Let a member of Physics IT Support know your username, and they will get you onto the system.
Ways to log in
Once you are set up, you can log in using:
- SSH terminal (for example, PuTTY)
- ThinLinc remote desktop app
- The PCs in the CNQO labs (JA7.13)
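For SSH, connect to the login server with your username, e.g. (replace username with your own):
ssh username@wildebeest.phys.strath.ac.uk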
Password-less Access
Once logged in, you can ssh between servers without a password, provided you use the servername-s form of the hostname, which goes over the CNQO private network between the servers. For example:
ssh phys-vole-s
Servers configured with host-based authentication are:
- wildebeest-s
- phys-vole-s
- ribbo1-17
- hippo2
- phys-porpoise-s
CNQO Servers and what they do
wildebeest.phys.strath.ac.uk is the main access server, but don't run anything serious on it. Instead, SSH from there to one of the other servers below to perform calculations:
Wildebeest
- Login server for off-campus access
Phys-vole
- Server for submitting jobs to SLURM queue
- Login access for on-campus users
CNQO_intel queue
Node | Cores | Memory
---|---|---
Ribbo1 | 24 | 96GB
Ribbo2 | 24 | 96GB
Ribbo3 | 24 | 128GB
Ribbo8 | 40 | 128GB
Ribbo9 | 40 | 128GB
Ribbo10 | 16 | 192GB
Ribbo14 | 20 | 64GB
Ribbo15 | 20 | 64GB
Ribbo16 | 20 | 64GB
Ribbo17 | 20 | 64GB
- Compute nodes in the cnqo_intel queue
- Intel chips
SMTS_intel queue
- Compute nodes Ribbo4-7
- Intel chips, 40 cores on each node
- 192GB RAM
Marine_intel queue
- Compute nodes Ribbo11-13
- Intel chips, 40 cores on each node
- 256GB or 384GB RAM
Hippo2
- Compute node
- AMD chips, 48 cores
- 128GB RAM
Phys-porpoise
- Server for standalone non-SLURM work
- JupyterHub server - please ask if you want access
Olinguito
- Workstation with dedicated graphics card, for visualisation (JA7.18)
Armadillo
- File server, with 22TB of space for user home folders
- Application server
Phys-tapir
- Ubuntu virtual server where users can mount their I: drive in order to move files off the cluster
Phys-mongoose
- Ansible configuration management
Applications
The shared /opt/local folder hosts applications:
- Matlab R2018a
- Mathematica 13.3 and Wolfram 14.2
- Intel oneAPI
- IDL
- GCC 5.5.0/7.3.0/8.4.0/11.3.0/14.2.0
- OpenMPI
- Scalapack
- HDF
- FFTW
- Puffin
- Miniconda for Python3
To use the applications, load the relevant module:
module avail
- shows available modules
module load <module name>
- loads the module, adding it to your path
module list
- shows the currently loaded modules
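For example, to set up the GCC 5.5.0 compiler used in the SLURM examples below:
module load compilers/gcc/5.5.0
gfortran --version
module list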
SLURM job scripts
To use SLURM, log into phys-vole as yourself. The available compute nodes are:
- ribbo1-3 - three compute nodes, 24 cores per node, 96-128GB per node
- ribbo4-7 - four compute nodes, 40 cores per node, 192GB per node
- ribbo8-9 - two compute nodes, 40 cores per node, 128GB per node
- ribbo10 - one compute node, 16 cores, 192GB
- ribbo11-13 - three compute nodes, 40 cores per node, 256+GB per node
- ribbo14-17 - four compute nodes, 20 cores per node, 64GB per node
- hippo2 - one compute node, 48 cores, 128GB
squeue
- shows the state of jobs in the queue: PD for pending, R for running
sinfo
- shows the state of the partitions (queues)
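Some useful variations (these are standard SLURM options):
squeue -u $USER
- shows only your own jobs
sinfo -p cnqo_intel
- shows the state of the cnqo_intel partition only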
To use the queue, you must ask for your username to be added to the queue user list.
To submit jobs to the queue, make a script. For example:
#!/bin/bash
#
#SBATCH --job-name=CNQO_test_1
#SBATCH --output=CNQO_test_1-%j.txt
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user_email_address
#
#SBATCH --ntasks=1
# A double hash (##) comments out an SBATCH directive, so the two below are inactive
##SBATCH --time=10:00
##SBATCH --mem-per-cpu=10
srun hostname
srun sleep 20
srun launches your program on the compute node; it's probably best to give the full path to the executable in your home folder.
To submit the job:
sbatch <name of script>
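For example, if the script above is saved as cnqo_test_1.sh (a name chosen here for illustration):
sbatch cnqo_test_1.sh
squeue -u $USER
A running or pending job can be cancelled with scancel <job id>.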
I've put an environment module on the system with GCC 5.5.0, so you can try loading that as well:
#!/bin/bash
#
#SBATCH --job-name=CNQO_test2
#SBATCH --output=CNQO_test_2-%j.txt
#
#SBATCH --ntasks=1
##SBATCH --time=10:00
##SBATCH --mem-per-cpu=10
module load compilers/gcc/5.5.0
srun gfortran --version
date
- Users can have up to 160 running jobs at a time; any more are queued until those have finished
- Currently there is no limit on job time, but if you specify a time limit in your script and the job doesn't finish within it, the job is cancelled
- Limit on memory - each of the CNQO compute nodes differs slightly in the memory available per core:
- ribbo1 - 24 cores, 4000MB/core
- ribbo2 - 24 cores, 3916MB/core
- ribbo3 - 24 cores, 5333MB/core
- ribbo4, 5, 6 and 7 - 40 cores, 4800MB/core
- ribbo8 and 9 - 40 cores, 3200MB/core
- ribbo10 - 16 cores, 11600MB/core
- ribbo11 and 12 - 40 cores, 9600MB/core
- ribbo13 - 40 cores, 6400MB/core
- ribbo14, 15, 16 and 17 - 20 cores, 3200MB/core
- hippo2 - 48 cores, 2666MB/core
The default MemPerCPU is set to 2666MB, the lowest per-core amount, while MaxMemPerCPU is 5333MB. If you want more than 2666MB per core, you need to request it in your batch script (e.g. #SBATCH --mem-per-cpu=4000) up to the maximum allowed. If a job exceeds its memory allocation, it will be stopped.
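As a minimal sketch, a single-task job asking for 4000MB per core would look like this (my_program stands in for your own executable):
#!/bin/bash
#SBATCH --job-name=CNQO_mem_test
#SBATCH --output=CNQO_mem_test-%j.txt
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4000
srun /home/users/username/my_program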
Parallel jobs with OpenMPI
#!/bin/bash
##
# Propagate environment variables to the compute node
#SBATCH --export=ALL
# Run in the cnqo_intel partition (queue)
#SBATCH --partition=cnqo_intel
# Distribute processes in round-robin fashion for load balancing
#SBATCH --distribution=cyclic
# No. of tasks required
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=00:10:00
# Job name
#SBATCH --job-name=CNQO_mpi_test_1
# Output file
#SBATCH --output=CNQO_mpi_test_1-%j.out
pwd; hostname; date
echo "Running MPI test program on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks, each with $SLURM_CPUS_PER_TASK cores."
module load libs/gcc/5.5.0/openmpi/3.1.0
mpirun -np $SLURM_NTASKS /home/users/username/mpi_hello
date
This should run 48 tasks of the mpi_hello application on one node.
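The mpi_hello executable above is a stand-in for your own MPI program; as a sketch, you could build one from an MPI hello-world source file with the wrapper compilers that come with the OpenMPI module:
module load libs/gcc/5.5.0/openmpi/3.1.0
mpicc mpi_hello.c -o /home/users/username/mpi_hello
(use mpif90 in place of mpicc for Fortran source)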
Options for your scripts include:
- Use the --exclusive flag if you require a whole node for your job
- Specify the nodes you want to use with --nodelist=server1,server2
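For example, to take over ribbo10 for a job:
#SBATCH --exclusive
#SBATCH --nodelist=ribbo10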
Next are three different ways of running multiple serial jobs.
JobFarm
A job farm runs a number of identical jobs, each taking roughly the same time, on a node.
#!/bin/bash
#################
#
# Use this script in SLURM by running 'sbatch cnqo_doc_slurm3.sh'
# Access the details of the running serial jobs by using 'squeue -j SLURM_JOB_ID -s'
#
#################
# Requesting the number of nodes needed, and asking for exclusive access to those nodes
#SBATCH -N 1
#SBATCH --tasks-per-node=24
#SBATCH --exclusive
#
# Job time, change for what your job farm requires - here it's 5 minutes
#SBATCH -t 00:05:00
#
# Job name and output file names
#SBATCH -J cnqo_test_jobFarm
#SBATCH -o CNQO_test_jobFarm-%j.out
#SBATCH -e CNQO_test_jobFarm-%j.out
# Set the number of jobs
export number_of_jobs=$SLURM_NTASKS
# Loop over the serial job number
for ((i=0; i<$number_of_jobs; i++))
do
# Run the script quietly and exclusively on one core of one node, passing in
# the serial job number; output goes to a file named with the SLURM job
# number and the serial job number
srun -Q --exclusive -n 1 -N 1 \
cnqo_test_jobFarm_task $i &> worker_${SLURM_JOB_ID}_${i} &
sleep 1
done
# Keep the wait statement, it is important!
wait
where cnqo_test_jobFarm_task is:
#!/bin/bash
# This script echoes some useful output so we can see what srun is doing
sleepsecs=$(( (RANDOM % 10) + 40 ))s
# We output the sleep time, hostname, and date for more info
echo sleep:$sleepsecs host:$(hostname) date:$(date)
# sleep a random amount of time
sleep $sleepsecs
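Note that the task script must be executable and reachable from the directory you submit from (or given by its full path):
chmod +x cnqo_test_jobFarm_task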
To use three nodes, make the following changes:
#################
# Requesting the number of nodes needed, and asking for exclusive access to those nodes
#SBATCH -N 3
#SBATCH --tasks-per-node=24
#SBATCH --exclusive
#
...
#
# Set the number of jobs
export number_of_jobs=72
Array
Whereas the job farm runs as one job in the scheduler, array jobs run as separate jobs in the queue, each with an array index to identify it. The array index, $SLURM_ARRAY_TASK_ID, lets you select a different input file or parameter set for each array task.
#!/bin/bash
##
# Job time, change to what your array job requires - here it's 5 minutes
#SBATCH -t 00:05:00
#
# Job name and output file names
#SBATCH -J cnqo_test_arrayJob
#SBATCH -o CNQO_test_arrayJob_%A_%a.out
#SBATCH -e CNQO_test_arrayJob_%A_%a.out
# Use the % separator to limit the concurrent jobs to 8 for an array of 30 jobs
#SBATCH --array=1-30%8
sleepsecs=$(( (RANDOM % 10) + 10 ))s
# We output the sleep time, hostname, and date for more info
echo sleep:$sleepsecs host:$(hostname) date:$(date)
# More usefully, run something like
#./myprogram < input_file_$SLURM_ARRAY_TASK_ID
# sleep a random amount of time
sleep $sleepsecs
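For example, you might generate the numbered input files before submitting; the template_input file and the PARAM token here are illustrative:
for i in $(seq 1 30); do
sed "s/PARAM/$i/" template_input > input_file_$i
done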
By adding
# Requesting the number of nodes needed, and asking for exclusive access to those nodes
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --tasks-per-node=24
you can reserve a whole node for each array task, allowing multi-threaded computation.
Multiprog
The --multi-prog option lets you run a different program, with different arguments, for each task in your job, as specified in a configuration file.
#!/bin/sh
#SBATCH -n 72
#SBATCH -t 00:10:00
#SBATCH -J cnqo_test_multi_prog
srun -l --multi-prog cnqo_multiprog.conf
where cnqo_multiprog.conf looks like:
0-10 hostname
11,12 echo task:%t
13 echo task:%t-%o
14 echo task:%o
15-71 hostname
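Each line of the conf file maps a range of task ranks to a command. Within the command, %t expands to the task number and %o to the task's offset within that line's range, so in the example above tasks 0-10 and 15-71 print their hostname, tasks 11 and 12 print task:11 and task:12, task 13 prints task:13-0, and task 14 prints task:0.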