System Configuration

Agave

CPU Resource Overview:

  • 800 FP64 CPU Teraflops
  • 498 Compute Nodes
  • 372 CPU nodes with 2 x Intel Broadwell or newer CPUs per node
  • 4 CPU nodes with 2 x AMD EPYC CPUs per node
  • 20 CPU nodes with 2 x Intel Xeon Phi CPUs per node
  • 128 - 256 GB RAM on most CPU nodes
  • 1 TB+ RAM on 3 CPU fat nodes
  • Intel Omni-Path and Mellanox EDR InfiniBand interconnects

A more detailed description of our CPU resources can be seen in the table below.

GPU Resource Overview:

  • 889 FP64 GPU Teraflops
  • 297 GPUs
  • GPUs range from K40s to 32GB V100s
  • 12 - 32 GB RAM per GPU

A more detailed description of our GPU resources can be seen in the table below.

Storage Resource Overview:

  • 3.3 PB home directory storage on Qumulo NFS
  • 1.2 PB temporary scratch storage on BeeGFS

 

Agave CPU Resources:

CPU Model            | CPU Arch        | Nodes | Cores/Node | RAM/Node (GB) | CPU Cores | FP64 GFLOPS
Totals               | N/A             | 498   | N/A        | N/A           | 18,480    | 808,897
Xeon E5-2680 v4      | Broadwell       | 297   | 28         | 128 - 256     | 8,316     | 319,334
Xeon Phi 7210        | Knights Landing | 20    | 256        | 200           | 5,120     | 212,992
Xeon Gold 6230       | Cascade Lake    | 44    | 20         | 192           | 1,760     | 118,272
Xeon Gold 6252       | Cascade Lake    | 16    | 48         | 192           | 768       | 51,610
Xeon E5-2687W v4     | Broadwell       | 20    | 24         | 64 - 256      | 480       | 23,040
Xeon E5-2640 v0      | Sandy Bridge    | 30    | 12         | 128           | 360       | 7,200
Xeon Silver 4214     | Cascade Lake    | 8     | 24         | 96 - 192      | 192       | 6,700
AMD EPYC 7551        | Naples          | 4     | 64         | 256           | 256       | 4,096
Xeon Silver 4114     | Skylake         | 9     | 20         | 96 - 384      | 180       | 6,336
Xeon Silver 4110     | Skylake         | 19    | 8          | 96            | 152       | 5,107
Xeon E5-2650 v4      | Broadwell       | 6     | 24         | 64 - 256      | 144       | 5,069
Xeon Gold 6136       | Skylake         | 5     | 24         | 96            | 120       | 11,520
Xeon Gold 6132       | Skylake         | 1     | 112        | 1,500         | 112       | 9,318
Xeon Gold 5120       | Skylake         | 3     | 28         | 96            | 84        | 5,914
Xeon Gold 6248       | Cascade Lake    | 2     | 40         | 192           | 80        | 6,400
Xeon Gold 6148       | Skylake         | 2     | 40         | 384           | 80        | 6,144
Xeon E5-2650 v2      | Ivy Bridge      | 3     | 16         | 256           | 48        | 998
Xeon Platinum 8160   | Skylake         | 1     | 48         | 384           | 48        | 3,226
Xeon E7-4860         | Westmere        | 1     | 40         | 2,000         | 40        | 363
Xeon Gold 6240       | Cascade Lake    | 1     | 36         | 96            | 36        | 2,995
Xeon X7560           | Nehalem         | 1     | 32         | 1,000         | 32        | 291
Xeon Silver 4116     | Skylake         | 1     | 20         | 196           | 20        | 672
Xeon E5-2630 v4      | Broadwell       | 1     | 20         | 64            | 20        | 704
Xeon E5-2640 v2      | Ivy Bridge      | 2     | 8          | 16            | 16        | 256
Xeon E5-2660 v0      | Sandy Bridge    | 1     | 16         | 64            | 16        | 282

Agave GPU Resources:

GPU Model           | GPU Arch | GPU Count | GPU Cores | FP64 GFLOPS | FP32 GFLOPS | FP16 GFLOPS
Totals              | N/A      | 297       | 1,140,672 | 889,555     | 3,198,700   | 4,037,600
GeForce GTX 1080    | Pascal   | 12        | 30,720    | 3,324       | 106,476     | 1,668
GeForce GTX 1080 Ti | Pascal   | 76        | 272,384   | 26,904      | 806,284     | 13,452
GeForce RTX 2080    | Turing   | 20        | 58,880    | 5,580       | 178,400     | 356,800
GeForce RTX 2080 Ti | Turing   | 28        | 121,856   | 10,276      | 329,000     | 658,000
Tesla K20m          | Kepler   | 1         | 2,496     | 1,175       | 3,524       | N/A
Tesla K40           | Kepler   | 8         | 23,040    | 13,456      | 40,368      | N/A
Tesla K80           | Kepler   | 56        | 139,776   | 76,776      | 230,328     | N/A
Tesla V100 16GB     | Volta    | 74        | 378,880   | 579,716     | 1,159,580   | 2,318,420
Tesla V100 32GB     | Volta    | 22        | 112,640   | 172,348     | 344,740     | 689,260

Saguaro

The HPC capacity includes multi-processor commodity systems interconnected to an Ethernet fabric via 10/40GbE high-performance SFP+ optical connections. 285 converged x86 HPC nodes, powered by several families of Intel Xeon processors, deliver over 92 TFLOPS of computational power, with per-node memory ranging from 48 GB to 2 TB. A dedicated pool of 400 TB of high-performance parallel Lustre fast-scratch storage is presented to the HPC cluster over 56 Gb/s FDR InfiniBand cabling and switching. This computational environment is interlinked with the campus Science DMZ, Internet2, and a pool of high-transfer-volume ESnet Data Transfer Nodes (DTNs) by a 40GbE spine-leaf network.

CPU Model  | CPU Sockets | CPU Cores | GFLOPS
Totals     | 564         | 4,448     | 92,533.76
E5-2650    | 64          | 512       | 8,192
E5-2665    | 22          | 176       | 3,379.2
E5-2660    | 138         | 1,104     | 19,430.4
E5-2680    | 4           | 32        | 691.2
E5-2650v2  | 12          | 96        | 1,996.8
E5-2683v3  | 96          | 1,344     | 43,008
E5-2690v3  | 6           | 72        | 2,995.2
E7-4860    | 4           | 40        | 361.6
X5570      | 122         | 488       | 5,719.36
X5670      | 92          | 552       | 6,469.44
X7560      | 4           | 32        | 290.56

Data intensive compute

The Big Data Analytics Engine utilizes a Hadoop cluster of multiple dense data nodes interconnected with high-bandwidth 10GbE Ethernet, with additional per-node memory to facilitate the in-memory computation required for advanced data analysis and machine learning with Apache Spark. The Data Intensive Ecosystem comprises 44 nodes with Intel Xeon E5-2640 processors (88 sockets / 528 cores / 6 TB DDR3 @ 2.5 GHz) and a raw storage capacity approaching 1 PB, or roughly 300 TB of HDFS (Hadoop Distributed File System at a 3x replication factor). The Data Intensive Ecosystem is likewise interlinked by 10GbE and connected by 40GbE SFP+ direct-attach cables to the spine network, the campus Science DMZ, Internet2, and ESnet Data Transfer Nodes (DTNs).

Internet2 and Science DMZ

All components in the Research Computing Ecosystem are directly connected to Internet2 via the Sun Corridor at 100GbE, offering both friction-free zones for downloads and safe zones supporting the high bandwidth required for research collaboration. All network components are part of a software-defined network compliant with OpenFlow 1.3 for advanced metering and monitoring, as well as perfSONAR statistics for inter- and intra-university collaboration. ASU partners with the OpenDaylight Foundation and utilizes OpenFlow in production for mechanisms such as firewall bypass and load-determined path selection. Research Computing also holds a research associate role in the Open Networking Foundation (ONF).

Filesystems

A home directory with a 100GB limit (/home/<ASURITE>) and a scratch directory with no quota but purged every 30 days (/scratch/<ASURITE>) are provided to all users with accounts.
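
To see how much space your directories are currently using, a generic tool such as du works on both filesystems. This is only a sketch, and any site-specific quota reporting tool may differ; the commands assume your ASURITE login is the value of $USER on the cluster:

du -sh /home/$USER      # total size of your home directory
du -sh /scratch/$USER   # total size of your scratch directory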

Connect

To log in to Agave, you will need an SSH client application installed on your local system.

Agave uses ASURITE accounts. You will need to use your ASURITE login and password to connect to Agave.


SSH for Windows

The recommended SSH client application for Windows is PuTTY.

A tutorial on how to install and use PuTTY can be found here: Install PuTTY SSH (Secure Shell) Client in Windows 7 (YouTube)

Use PuTTY to connect to agave.asu.edu with your ASURITE login and password.


SSH on Mac or Linux

Mac OS X has an SSH client built in.  A simple tutorial on how to use this SSH client can be found here: How To Use SSH on Mac OS X

Virtually every Linux distribution includes an SSH client, and the process on Linux is very similar to that on a Mac: open a terminal window and launch the SSH client from it. How to open a terminal window varies from one Linux distribution to the next and is beyond the scope of this guide.

Once you have a terminal window open on your Mac or Linux system, enter the following command, replacing ASURITE with your own ASURITE login name:

ssh ASURITE@agave.asu.edu

To log in with X11 forwarding turned on, use the -Y option as follows:

ssh -Y ASURITE@agave.asu.edu
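
To confirm that X11 forwarding is active after connecting with -Y, check the DISPLAY variable or start a small test client such as xclock (assuming it is installed on the login node; any X11 program works the same way):

echo $DISPLAY   # should print a forwarded display such as localhost:10.0
xclock          # a clock window should appear on your local screen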

I can't log in! What do I do?

If your attempts to log in are not succeeding, and you're sure you are following the instructions above, then our support team will need the following information:

  • The cluster you are connecting to: Agave, in this case
  • The operating system you are using (e.g. Windows, Mac, Linux)
  • The SSH client you are using (e.g. PuTTY, MobaXterm, native OpenSSH, etc.)
  • The IP address you are connecting from (use www.whatsmyip.org to find it)
  • Connection type (WiFi, campus Ethernet, home connection, etc.)

Accessing Compute Nodes

Job accounting

Upon login, a balance of CPU-hours billed is printed to the screen. CPU-hours = (# of CPUs) x (job duration in wall-clock hours). This balance can also be shown with the command "mybalance". While the system charges only for the resources actually used, before a job begins the requested CPU-hours are pre-deducted from the balance to determine whether the expected CPU-hours can be covered by the balance. This determines whether the job will be launched as non-preemptable (positive pre-deducted balance) or preemptable (negative pre-deducted balance).
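
As a rough illustration of that formula, the shell arithmetic below estimates the CPU-hours a hypothetical request would pre-deduct; the actual accounting is performed by the scheduler:

# Hypothetical request: 16 cores for 12 wall-clock hours
CPUS=16
HOURS=12
echo "$(( CPUS * HOURS )) CPU-hours"   # prints: 192 CPU-hours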

Slurm scheduler

The Slurm Workload Manager supports user commands to submit, control, monitor and cancel jobs.

Sbatch scripts

Sbatch scripts are the normal way to submit a non-interactive job to the cluster.

Below is an example of an sbatch script, which should be saved as the file myscript.sh.

This script performs the simple task of generating a file of random numbers and then sorting it.

#!/bin/bash
 
#SBATCH -n 1                        # number of cores
#SBATCH -t 0-12:00                  # wall time (D-HH:MM)
##SBATCH -A drzuckerman             # Account hours will be pulled from (commented out with double # in front)
#SBATCH -o slurm.%j.out             # STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err             # STDERR (%j = JobId)
#SBATCH --mail-type=ALL             # Send a notification when the job starts, stops, or fails
#SBATCH --mail-user=myemail@asu.edu # send-to address

module load gcc/4.9.2

for i in {1..100000}; do
  echo $RANDOM >> SomeRandomNumbers.txt
done

sort SomeRandomNumbers.txt

This script uses the #SBATCH flag to specify a few key options:

  • The number of CPU cores the job should use:
    • #SBATCH -n 1
  • The runtime of the job in Days-Hours:Minutes:
    • #SBATCH -t 0-12:00
  • The account the job pulls hours from. It has been commented out with an extra # for this example since this account does not actually exist:
    • ##SBATCH -A drzuckerman
  • A file based on the jobid (%j) where the normal output of the program (STDOUT) should be saved:
    • #SBATCH -o slurm.%j.out
  • A file based on the jobid (%j) where the error output of the program (STDERR) should be saved:
    • #SBATCH -e slurm.%j.err
  • That email notifications should be sent when the job starts, ends, or fails:
    • #SBATCH --mail-type=ALL
  • The address where email should be sent:
    • #SBATCH --mail-user=myemail@asu.edu
  • The script also uses the Software Modules (see below) system to make gcc version 4.9.2 available:
    • module load gcc/4.9.2

 

Assuming that the above has been saved as the file myscript.sh, and you intend to run it on the conventional x86 compute nodes, you can submit this script with the following command:

sbatch myscript.sh

The job will be run on the first available conventional x86 cluster group partition.
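
sbatch prints the ID of the newly created job, which you can then use to monitor it. In the sketch below the job ID 123456 is only illustrative, and the output file name follows the -o slurm.%j.out directive in the script:

sbatch myscript.sh       # prints: Submitted batch job 123456
squeue -j 123456         # check the job's state while it is queued or running
cat slurm.123456.out     # view STDOUT once the job has finished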

Interactive Sessions

Depending on whether your application is a text-based console program or uses a GUI, you'll want to do one of the following:

  • For text-based console programs, log in with SSH.
  • For GUI-based programs, log in with NoMachine.

Once logged in, to launch an interactive compute session, simply run the following command:

interactive

This will launch an interactive compute session on one of the conventional x86 compute nodes.

Once the session launches, you can begin using the system.

There is no special switch or option needed for X11 forwarding to work; it is always enabled.

So if you need to run an X11-based program, just launch an interactive session and run the program from within it. However, this will only work if you are logged in through NoMachine or through SSH with X11 forwarding turned on.
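
For example, the following launches a one-hour interactive session and then starts an X11 program from inside it; xclock is just an illustrative test client, and any GUI application is run the same way:

interactive -t 0-1:00
xclock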


Interactive Session Options

The interactive command accepts many of the same options and switches as other Slurm job-launching commands, such as salloc or sbatch.

In particular, you can specify how many CPU cores you want your interactive session to use with the -n <number> option.

You can also specify how long you would like your session to run with the -t days-hours:minutes option.

If you want the session to run under a different account, you can specify it with the -A <accountname> option.

An example of using these options to launch an interactive session that uses 8 CPU cores on one node, runs for zero days and 4 hours, and uses the "drzuckerman" account can be seen below:

interactive -n 8 -N 1 -t 0-4:00 -A drzuckerman

This session will be launched on one of the conventional x86 compute nodes.

Monitoring Jobs and Queues

Below are some sample commands. A more complete list can be found in the Slurm documentation.

Show all jobs in all queues

squeue

Show all jobs for a specific user

squeue -u <ASURITE>

Cancel a job

scancel <JOBID>

Give detailed information on a job

scontrol show job=<JOBID>
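
Putting these together, a typical check on a running job might look like the following; the job ID 123456 is only illustrative:

squeue -u $USER            # list your jobs and note the JOBID of interest
scontrol show job=123456   # inspect that job's requested resources and state
scancel 123456             # cancel it if it is no longer needed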

Using Software Modules

There are many software packages (list as of June 2018) installed on Agave that are available through the Software Modules system.

New packages are added regularly.


Listing available modules

The following command will list the software modules that are available on Agave:

module -l avail

Loading a module

The following command will load the module for gcc/4.9.2:

module load gcc/4.9.2
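
After loading a module, you can confirm that the software it provides is now on your PATH; for gcc this is a quick sanity check, not an official verification step:

which gcc         # should resolve to the module-provided installation rather than the system compiler
gcc --version     # should report 4.9.2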

Listing Loaded Modules

The following command will list the modules that are currently loaded:

module list

Purging Loaded Modules

To clear out the modules that are currently loaded, perhaps because you want to load others, use the following command:

module purge

Using Modules in SBATCH scripts

Many sbatch scripts will include a module load command as part of the script.
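
For example, here is a minimal sketch of how a module load line typically fits into an sbatch script; the source file hello.c and the program name hello are hypothetical placeholders used only for illustration:

#!/bin/bash

#SBATCH -n 1             # number of cores
#SBATCH -t 0-0:10        # wall time (D-HH:MM)
#SBATCH -o slurm.%j.out  # STDOUT (%j = JobId)

# Load the compiler before building or running anything that depends on it
module load gcc/4.9.2

# Build and run a small program with the compiler provided by the module
gcc -O2 -o hello hello.c
./hello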

Managing/Transferring Files

scp and rsync

To copy a file from your local machine:

scp file <ASURITE>@agave.asu.edu:/home/<ASURITE>/targetdirectory

To copy an entire directory:

scp -r directory <ASURITE>@agave.asu.edu:/home/<ASURITE>/targetdirectory

To transfer large files, or to update a small portion of a large dataset:

rsync -atr bigdir <ASURITE>@agave.asu.edu:/home/<ASURITE>/targetdirectory
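
The same tools work in the other direction. For example, to copy a file from your Agave home directory back to the current directory on your local machine, where file stands in for the name of the file you want to retrieve, run the following from your local machine rather than on the cluster:

scp <ASURITE>@agave.asu.edu:/home/<ASURITE>/file .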