System Configuration

Agave

There are currently three kinds of compute nodes available on the Agave cluster: 286 traditional x86-based nodes, 20 Xeon Phi-based nodes, and 2 GPU nodes, together providing over 300 TFLOPS of computational power.

The traditional x86 compute nodes contain two Intel Xeon E5-2680 v4 CPUs running at 2.40 GHz, providing 28 high-speed Broadwell-class CPU cores per node. These nodes are well suited to workloads that do not scale well over multiple CPU cores.

The Xeon Phi nodes contain a single Intel Xeon Phi 7210 processor running at 1.3 GHz, which provides 64 low-speed Atom-class x86 CPU cores per node. Each of these cores can handle up to 4 threads at once, so a single node can process 256 threads simultaneously, but at a much lower per-thread speed than the traditional compute nodes. The Phi nodes are well suited to workloads that do scale well over multiple CPU cores: when a workload can be spread over more cores without losing much computational efficiency, the results are computed more quickly, even though the performance of each individual core is relatively modest.

The GPU nodes are described in detail below.

Despite these differences, the node types have some things in common. Each node, regardless of type, has a full-speed Omni-Path connection to every other node in its partition, while the Omni-Path connections between partitions are more limited. MPI jobs are therefore restricted to a single partition: they can run on any partition, but they cannot span nodes across different partitions.
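
As an illustrative sketch only (the partition name is a placeholder, and mpirun/mpiexec may be used instead of srun depending on how your MPI library integrates with Slurm), a multi-node MPI job can be kept within a single partition with Slurm's -p/--partition option:

#SBATCH -N 4                  # request 4 nodes
#SBATCH -n 112                # 112 MPI tasks (28 cores per node x 4 nodes)
#SBATCH -p <partition>        # all nodes come from this single partition
srun ./my_mpi_program         # launch the MPI tasks across the allocated nodes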

CPU model       CPU sockets   CPU cores   GFLOPS
E5-2680v4       536           7504        270000
Xeon Phi 7210   20            512         52000
Totals          564           N/A         332000

Saguaro

The HPC capacity includes multi-processor commodity systems interconnected to an Ethernet fabric via 10/40GbE high-performance SFP+ optical connectors. The 285 converged x86 HPC nodes, powered by several families of Intel Xeon processors, yield over 92 TFLOPS of computational power; node memory ranges from 48GB to 2TB. A dedicated pool of 400TB of high-performance parallel Lustre fast-scratch storage is presented to the HPC cluster over 56Gb FDR InfiniBand cabling and switching. This computational environment is interlinked with the campus Science DMZ, Internet2, and a pool of high-transfer-volume ESnet Data Transfer Nodes (DTNs) by a 40GbE spine-leaf network.

CPU model       CPU sockets   CPU cores   GFLOPS
E5-2650         64            512         8192
E5-2665         22            176         3379.2
E5-2660         138           1104        19430.4
E5-2680         4             32          691.2
E5-2650v2       12            96          1996.8
E5-2683v3       96            1344        43008
E5-2690v3       6             72          2995.2
E7-4860         4             40          361.6
X5570           122           488         5719.36
X5670           92            552         6469.44
X7560           4             32          290.56
Totals          564           4448        92533.76

Data intensive compute

The Big Data Analytics Engine utilizes a Hadoop cluster with multiple dense DataNodes interconnected by high-bandwidth 10GbE Ethernet, with additional per-node memory to facilitate the in-memory computation required for advanced data analysis and machine learning with Apache Spark. The Data Intensive Ecosystem comprises 44 nodes with Intel Xeon E5-2640 processors (88 sockets / 528 cores / 6TB DDR3 / 2.5 GHz), providing a capacity approaching 1PB of raw storage, or 300TB of HDFS (Hadoop Distributed File System at a 3x replication factor). The Data Intensive Ecosystem is internally interlinked by 10GbE and connected by 40GbE SFP+ direct-attach cables to the spine network, the campus Science DMZ, Internet2, and ESnet Data Transfer Nodes (DTNs).

Internet2 and Science DMZ

All components in the Research Computing Ecosystem are directly connected to Internet2 via the Sun Corridor at 100GbE, offering both friction-free zones for downloads and safe zones supporting the high bandwidth required for research collaboration. All network components are part of a software-defined network compliant with OpenFlow 1.3 for advanced metering and monitoring, as well as perfSONAR statistics for inter- and intra-university collaboration. ASU partners with the OpenDaylight Foundation and utilizes OpenFlow in production for mechanisms such as firewall bypass and load-determinate path selection. Research Computing also holds a research associate role in the Open Networking Foundation (ONF).

Filesystems

A home directory with a 100GB limit (/home/<ASURITE>) and a scratch directory with no quota but purged every 30 days (/scratch/<ASURITE>) are provided to all users with accounts.
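
To see how much space your directories are currently using, the standard du utility works on both filesystems (a generic sketch; replace <ASURITE> with your own login):

du -sh /home/<ASURITE>      # total size of your home directory (100GB limit)
du -sh /scratch/<ASURITE>   # total size of your scratch directory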

Connect

To log in to Agave, you will need an SSH client application installed on your local system.

Agave uses ASURITE accounts. You will need to use your ASURITE login and password to connect to Agave.


SSH for Windows

The recommended SSH client application for Windows is PuTTY.

A tutorial on how to install and use PuTTY can be found here: Install PuTTY SSH (Secure Shell) Client in Windows 7 (YouTube)

Use PuTTY to connect to agave.asu.edu with your ASURITE login and password.


SSH on Mac or Linux

Mac OS X has an SSH client built in.  A simple tutorial on how to use this SSH client can be found here: How To Use SSH on Mac OS X

Virtually every Linux distribution includes an SSH client, and the process on Linux is very similar to that on a Mac: open a terminal window and launch the SSH client from it. However, the way to open a terminal window varies from one Linux distribution to the next and is beyond the scope of this documentation.

Once you have a terminal window open on your Mac or Linux system, enter the following command, replacing ASURITE with your own ASURITE login name:

ssh ASURITE@agave.asu.edu

To log in with X11 forwarding turned on, use the -Y option as follows:

ssh -Y ASURITE@agave.asu.edu

I can't log in! What do I do?

If your attempts to log in are not succeeding, and you're sure you are following the instructions above, then our support team will need the following information:

  • The cluster you are connecting to: Agave in this case
  • The operating system you are using (e.g. Windows, Mac, Linux)
  • The SSH client you are using (e.g. PuTTY, MobaXterm, native client, etc.)
  • The IP address you are connecting from (use www.whatsmyip.org to find it)
  • Connection type (Wi-Fi, campus Ethernet, home connection, etc.)

Accessing Compute Nodes

Job accounting

Upon login, a balance of CPU-hours billed is printed to the screen, where CPU-hours = (# CPUs) x (job duration in wall-clock hours). This balance can also be shown with the command "mybalance". While the system charges only for the resources actually used, before a job begins the requested CPU-hours are pre-deducted from the balance to determine whether the expected CPU-hours can be covered. This determines whether the job is launched as non-preemptable (positive pre-deducted balance) or preemptable (negative pre-deducted balance).
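
As a worked example with illustrative numbers: a job that requests 28 CPUs for 10 wall-clock hours is pre-deducted 28 x 10 = 280 CPU-hours; if it actually finishes after 6 hours, only 28 x 6 = 168 CPU-hours are charged. The current balance can be checked at any time:

mybalance    # print the current balance of CPU-hours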

Slurm scheduler

The Slurm Workload Manager supports user commands to submit, control, monitor and cancel jobs.

Sbatch scripts

Sbatch scripts are the normal way to submit a non-interactive job to the cluster.

Below is an example of an sbatch script, which should be saved as the file myscript.sh.

This script performs the simple task of generating a file of random numbers and then sorting it.

#!/bin/bash
 
#SBATCH -n 1                        # number of cores
#SBATCH -t 0-12:00                  # wall time (D-HH:MM)
##SBATCH -A drzuckerman             # Account hours will be pulled from (commented out with double # in front)
#SBATCH -o slurm.%j.out             # STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err             # STDERR (%j = JobId)
#SBATCH --mail-type=ALL             # Send a notification when the job starts, stops, or fails
#SBATCH --mail-user=myemail@asu.edu # send-to address

module load gcc/4.9.2

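# Generate 100,000 random numbers, one per line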
for i in {1..100000}; do
  echo $RANDOM >> SomeRandomNumbers.txt
done

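# Sort the numbers; the sorted output goes to the STDOUT file defined above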
sort SomeRandomNumbers.txt

This script uses the #SBATCH flag to specify a few key options:

  • The number of CPU cores the job should use:
    • #SBATCH -n 1
  • The runtime of the job in Days-Hours:Minutes:
    • #SBATCH -t 0-12:00
  • The account the job pulls hours from. It has been commented out with an extra # for this example since this account does not actually exist:
    • ##SBATCH -A drzuckerman
  • A file based on the jobid (%j) where the normal output of the program (STDOUT) should be saved:
    • #SBATCH -o slurm.%j.out
  • A file based on the jobid (%j) where the error output of the program (STDERR) should be saved:
    • #SBATCH -e slurm.%j.err
  • That email notifications should be sent when the job starts, ends, or fails:
    • #SBATCH --mail-type=ALL
  • The address where email should be sent:
    • #SBATCH --mail-user=myemail@asu.edu
  • The script also uses the Software Modules system (see below) to make gcc version 4.9.2 available:
    • module load gcc/4.9.2

 

Assuming that the above has been saved as the file myscript.sh, and you intend to run it on the conventional x86 compute nodes, you can submit this script with the following command:

sbatch myscript.sh

The job will be run on the first available conventional x86 cluster group partition.
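
When the job is accepted, sbatch prints the assigned job ID (the ID below is purely illustrative), which can be used with the monitoring commands described later:

Submitted batch job 123456    # example sbatch output
squeue -j 123456              # show the status of that job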

Interactive Sessions

Depending on whether your application is a text-based console program or a GUI program, you'll want to do one of the following:

For text-based console programs, log in with SSH.

For GUI-based programs, log in with NoMachine.

Once logged in, to launch an interactive compute session, simply run the following command:

interactive

This will launch an interactive compute session on one of the conventional x86 compute nodes.

Once the session launches, you can begin using the system.

There is no special switch or option needed for X11 forwarding to work; it is always enabled.

So if you need to run an X11-based program, just launch an interactive session and run the program from within it.

However, this will only work if you are logged in through NoMachine or through SSH with X11 forwarding turned on.


Interactive Session Options

The interactive command accepts many of the same options and switches as other Slurm job-launching commands, such as salloc or sbatch.

In particular, you can specify how many CPU cores you want your interactive session to use with the -n <number> option.

You can also specify how long you would like your session to run with the -t <days-hours:minutes> option.

If you want the session to run under a different account, you can specify it with the -A <accountname> option.

An example of using these options to launch an interactive session that uses 8 CPU cores on one node, runs for zero days and 4 hours, and uses the "drzuckerman" account can be seen below:

interactive -n 8 -N 1 -t 0-4:00 -A drzuckerman

This session will be launched on one of the conventional x86 compute nodes.
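
Once the session starts, your shell is running on the allocated compute node rather than on the login node; a quick sanity check using generic commands is:

hostname    # prints the name of the compute node the session is running on
exit        # ends the interactive session and returns you to the login node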

Monitoring Jobs and Queues

Below are some sample commands.  A more complete list can be found here.

Show all jobs in all queues

squeue

Show all jobs for a specific user

squeue -u <ASURITE>

Cancel a job

scancel <JOBID>

Give detailed information on a job

scontrol show job=<JOBID>
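
As a rough illustration of what squeue output looks like (the job IDs, partition, and node names below are invented for this example), each job appears on one line with its state and assigned nodes:

 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
123456   compute  myscrip  asurite  R   1:23      1 node0123
123457   compute  myscrip  asurite PD   0:00      4 (Resources)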

Using Software Modules

There are many software packages (list as of June 2018) installed on Agave that are available through the Software Modules system.

New packages are added regularly.


Listing available modules

The following command will list the software modules that are available on Agave:


module -l avail

Loading a module

The following command will load the module for gcc/4.9.2:


module load gcc/4.9.2

Listing Loaded Modules

The following command will list the modules that are currently loaded:


module list

Purging Loaded Modules

To clear out the modules that are currently loaded, perhaps because you want to load others, use the following command:

module purge

Using Modules in SBATCH scripts

Many sbatch scripts will include a module load command as part of the script.
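
For instance, a minimal script might clear any inherited modules and then load exactly what it needs before running its program (the program name below is a placeholder):

#!/bin/bash
#SBATCH -n 1                  # number of cores
#SBATCH -t 0-01:00            # wall time (D-HH:MM)

module purge                  # start from a clean module environment
module load gcc/4.9.2         # make gcc 4.9.2 available to the job

./my_program                  # placeholder for a program built with gcc 4.9.2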

Managing/Transferring Files

scp and rsync

To copy a file from your local machine to Agave:

scp file <ASURITE>@agave.asu.edu:/home/<ASURITE>/targetdirectory

To copy an entire directory:

scp -r directory <ASURITE>@agave.asu.edu:/home/<ASURITE>/targetdirectory

To transfer large files, or to update a small portion of a large dataset:

rsync -atr bigdir  <ASURITE>@agave.asu.edu:/home/<ASURITE>/targetdirectory
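
The same tools work in the other direction. For example, to copy results from Agave back to the current directory on your local machine (run these on your local machine; the file and directory names are placeholders):

scp <ASURITE>@agave.asu.edu:/home/<ASURITE>/results.txt .
rsync -a <ASURITE>@agave.asu.edu:/home/<ASURITE>/resultsdir .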