Competition within the High-Performance Computing GPGPU market has emerged, with GPGPUs from Advanced Micro Devices (AMD) and Intel targeting future exascale-class systems.
The AMD Radeon Instinct MI50 hints at the capabilities of AMD’s future GPUs.

This study takes a first look at MI50 performance on characteristic scientific and machine learning applications.

The Poster

ISC-HPC 20 Poster

We evaluated four reference systems representing a spread of modern consumer- and professional-grade hardware:

  1. a GTX 1080Ti system with a PCIe gen 3×16 GPU-GPU interconnect,
  2. an RTX 2080Ti system with no GPU-GPU communication,
  3. a Tesla V100 system with NVLink, and
  4. an AMD Radeon Instinct MI50 system with two xGMI hives.

All four use the same dual-root, dual-socket SuperMicro SYS-4029GP-TRT2 system.
System details, diagrams, and benchmarking results can be found in the poster.

We ran two case studies that represent common workloads we run in our lab:
(a) a GPU-optimized rotating detonation engine simulation, and
(b) a compute-heavy deep learning training task.
Implementation and methodology details are described below.

Running the Machine Learning Benchmarks

Our machine learning benchmarks were run using TensorFlow 1.15 and the TensorFlow 1.x CNN benchmarks.
While the TensorFlow benchmarks are no longer updated for TensorFlow 2.x, they have been optimized for TensorFlow 1.15, making this a useful and replicable task for comparing GPGPU performance.

Because our tests run on a single node, we use the default TensorFlow Distributed MirroredStrategy with the NCCL/RCCL all-reduce algorithm.

The benchmark task is training ResNet50-v1 on a synthetic ImageNet dataset using a momentum optimizer.
This compute-heavy task is characteristic of many other deep computer vision tasks, with its dense image inputs and a deep, feed-forward, mostly convolutional architecture that translates well to GPGPUs.

Singularity Containers

We used the HPC-oriented container platform Singularity (v3.5.2) to manage our environment and dependencies for this study.
Singularity ≥3.5 is required for ROCm support.

All reported results were collected using official TensorFlow and ROCm images available on Docker Hub. Singularity images can be pulled with:

$ singularity pull docker://$IMAGE_WITH_TAG

You can start a shell or run a script with:

# start a shell in the container environment w/ NVIDIA GPU access
$ singularity shell --nv $PATH_TO_SIMG

# run a python script in the container environment w/ ROCm GPU access
$ singularity exec --rocm $PATH_TO_SIMG python3 run.py

The following containers were used to collect the reported results:

  • CUDA: tensorflow/tensorflow:1.15.2-gpu-py3
  • ROCm: rocm/tensorflow:rocm3.1-tf1.15-dev

Training Throughput

We measured the computational performance of each system in training images per second.

An iteration includes both the forward and backward passes through the network.
We used the largest power-of-2 batch size that would fit in GPU memory: 64 images/device for the GTX and RTX systems (11 GB) and 256 images/device for the V100 and MI50 systems (32 GB).
We ran enough warm-up iterations for the training speed to appear stable (5 steps for the NVIDIA hardware and 100 steps for the AMD hardware).
The final training throughput is the median of three runs of 500 steps each.
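
As a small illustration of that last step, the median-of-runs calculation could be sketched in Python as follows. The function name and the sample throughputs are placeholders for illustration only; they are not measured results from this study.

```python
from statistics import median


def final_throughput(runs_images_per_sec):
    """Return the median images/sec across repeated benchmark runs."""
    if not runs_images_per_sec:
        raise ValueError("need at least one run")
    return median(runs_images_per_sec)


# Hypothetical throughputs from three 500-step runs (illustrative values only).
example_runs = [1012.4, 998.7, 1005.1]
print(final_throughput(example_runs))
```

Using the median rather than the mean keeps a single unusually slow or fast run from skewing the reported number.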

The following script will run the ResNet50 training benchmarks on 1-8 GPUs.
Fill out the variables at the top (container_path, gpu_flag, and batch_size) according to your particular system.

container_path=...
gpu_flag=...  # --nv or --rocm
batch_size=... # 64 for gtx or rtx (11gb), 256 for mi50 or v100 (32gb)

# run benchmarks on 1-8 GPUs
for n in {1..8}; do
    singularity exec $gpu_flag $container_path \
        python tf_cnn_benchmarks.py --num_gpus $n --batch_size $batch_size \
            --variable_update replicated --all_reduce_spec nccl \
            --model resnet50 --data_name imagenet --optimizer momentum \
            --nodistortions --gradient_repacking 1 --ml_perf
done

Some other useful environment variables:

  • CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES control which GPUs are visible to TensorFlow.
  • NCCL_DEBUG=INFO will print out the GPU-GPU interconnects and NCCL ring topology used for the all-reduce operations, which is useful for verification purposes.
  • NCCL_P2P_LEVEL controls when to use direct GPU-to-GPU transport by setting the max allowable distance.
    A value of 0 (or LOC) disables all P2P communications.
  • TF_XLA_FLAGS=--tf_xla_auto_jit=2 will force XLA compilation, optimizing the graph for your given hardware.
    This is particularly effective in mixed-precision mode when using GPUs with Tensor Cores.
  • Other NCCL flags

Some other useful benchmark options:

  • --trace_file=trace.json will save a tfprof trace of your training process, averaged over the first 10 steps.
    The results can be viewed at chrome://tracing in the Chrome browser.
    This is useful for debugging distributed performance issues.
  • --use_fp16 will run the training in mixed-precision mode.
    This can use NVIDIA Tensor Cores on supported hardware.
  • Full benchmark options can be listed with python tf_cnn_benchmarks.py --helpfull.
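
For runs that combine several of these options, it can be convenient to assemble the command line and environment programmatically. The following is a minimal Python sketch, not the script we used: the helper names build_benchmark_command and build_environment and the path container.sif are hypothetical, and it assumes the host environment is passed into the container (Singularity's default behavior unless --cleanenv is given).

```python
import os


def build_benchmark_command(container_path, gpu_flag, num_gpus, batch_size,
                            use_fp16=False, trace_file=None):
    """Assemble the tf_cnn_benchmarks command line as an argument list."""
    cmd = [
        "singularity", "exec", gpu_flag, container_path,
        "python", "tf_cnn_benchmarks.py",
        "--num_gpus", str(num_gpus),
        "--batch_size", str(batch_size),
        "--variable_update", "replicated",
        "--all_reduce_spec", "nccl",
        "--model", "resnet50",
        "--data_name", "imagenet",
        "--optimizer", "momentum",
        "--nodistortions",
        "--gradient_repacking", "1",
        "--ml_perf",
    ]
    if use_fp16:
        cmd.append("--use_fp16")  # mixed-precision training
    if trace_file is not None:
        cmd.append(f"--trace_file={trace_file}")  # save a tfprof trace
    return cmd


def build_environment(visible_devices=None, nccl_debug=False, base_env=None):
    """Copy the base environment and add the GPU/NCCL variables listed above."""
    env = dict(os.environ if base_env is None else base_env)
    if visible_devices is not None:
        ids = ",".join(str(i) for i in visible_devices)
        # Set both variables so the same helper works for CUDA and ROCm.
        env["CUDA_VISIBLE_DEVICES"] = ids
        env["HIP_VISIBLE_DEVICES"] = ids
    if nccl_debug:
        env["NCCL_DEBUG"] = "INFO"
    return env


# Example (hypothetical container path); launch with
# subprocess.run(cmd, env=env, check=True) on a machine with GPUs.
cmd = build_benchmark_command("container.sif", "--nv", 4, 64, use_fp16=True)
env = build_environment(visible_devices=[0, 1, 2, 3], nccl_debug=True)
print(" ".join(cmd))
```

Building the command as a list avoids shell quoting problems and makes it easy to log exactly which flags each run used.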

Power Efficiency

Performance per Watt is an especially important metric when evaluating HPC systems.
This is usually reported in FLOPS/W (Floating Point Operations per Second per Watt) using a benchmark such as LINPACK.
For this study, we use a machine learning analog: training images per second per Watt.

We approximate total power consumption from two separately measured components:

  • Non-GPU: Running Average Power Limit, or RAPL, is an Intel processor feature that provides information on the energy and power consumption of different physical domains.
    Average power draw was collected using the powercap interface: we queried energy_uj once per second over a 1-minute interval of a given workload, calculating the average power over each pair of consecutive timesteps.
    Power data was collected over package-0 (core), package-1 (uncore), and the DRAM power plane.
    This excludes GPU power draw, which was recorded separately.

    We collected our data using code modified from the powerstat tool.
    Other utilities for accessing RAPL metrics include perf, turbostat, and powertop.

  • GPU: Average GPU power draw was collected using the nvidia-smi and rocm-smi utilities.

    # For NVIDIA
    timeout 60 nvidia-smi --query-gpu=timestamp,name,index,power.draw --format=csv --loop=1 -f $LOGFILE

    # For ROCm
    for i in {1..60}; do rocm-smi -P --json >> $LOGFILE; sleep 1; done
    

    The power consumption measurements can be retrieved and averaged from these files.
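
To make the RAPL calculation concrete, the per-timestep averaging described above can be sketched in Python as follows. This is an illustrative reconstruction, not our modified powerstat code: the function name is hypothetical, the samples are assumed to be (timestamp in seconds, energy_uj) pairs read from a powercap file such as /sys/class/powercap/intel-rapl:0/energy_uj, and the optional wraparound handling assumes the counter resets after reaching the value reported in max_energy_range_uj.

```python
def average_power_watts(samples, max_energy_uj=None):
    """Average power over consecutive (timestamp_s, energy_uj) sample pairs.

    energy_uj is a cumulative energy counter in microjoules, so the power for
    one interval is the energy difference divided by the elapsed time.
    """
    powers = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        delta_uj = e1 - e0
        if delta_uj < 0:
            # Counter wrapped around between the two reads (assumed behavior).
            if max_energy_uj is None:
                continue  # cannot correct without the range; skip this pair
            delta_uj += max_energy_uj
        dt = t1 - t0
        if dt <= 0:
            continue  # ignore duplicate or out-of-order timestamps
        powers.append(delta_uj / 1e6 / dt)  # microjoules -> joules -> watts
    if not powers:
        raise ValueError("need at least two usable samples")
    return sum(powers) / len(powers)


# Synthetic samples: 50 J used in the first second, 60 J in the next.
samples = [(0.0, 0), (1.0, 50_000_000), (2.0, 110_000_000)]
print(average_power_watts(samples))
```

Running this once per RAPL domain and summing the results gives the non-GPU power estimate for a workload.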

The utility scripts that we used can be found here.
To collect power information for the different modes,
we started a training run as described in Training Throughput,
waited until iteration 10 of training,
then manually started the power consumption monitoring tools.
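
Once the logs are collected, the remaining arithmetic is simple. The Python sketch below averages an nvidia-smi CSV log per GPU and combines the result with a non-GPU power figure to give training images per second per Watt. It is an illustration rather than our utility script: the function names are hypothetical, the parser assumes the log has a header row containing "index" and a "power.draw" column whose values carry a unit suffix (for example "250.12 W"), and summing the GPU and non-GPU power is our reading of the approximation described above.

```python
import csv
import io
from collections import defaultdict


def mean_gpu_power_watts(csv_text):
    """Return {gpu_index: mean power in watts} from an nvidia-smi CSV log."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = [h.strip() for h in next(reader)]
    idx_col = header.index("index")
    pwr_col = next(i for i, h in enumerate(header) if h.startswith("power.draw"))
    readings = defaultdict(list)
    for row in reader:
        if len(row) <= max(idx_col, pwr_col):
            continue  # skip blank or truncated lines
        try:
            watts = float(row[pwr_col].split()[0])  # drop the unit suffix
            readings[int(row[idx_col])].append(watts)
        except (ValueError, IndexError):
            continue  # skip unparsable readings
    return {gpu: sum(vals) / len(vals) for gpu, vals in readings.items()}


def images_per_sec_per_watt(images_per_sec, gpu_watts, non_gpu_watts):
    """Training throughput divided by total (GPU + non-GPU) average power."""
    total_watts = sum(gpu_watts) + non_gpu_watts
    if total_watts <= 0:
        raise ValueError("total power must be positive")
    return images_per_sec / total_watts


# Synthetic two-GPU log (illustrative values only, not measurements).
log = (
    "timestamp, name, index, power.draw [W]\n"
    "2020/06/01 12:00:00.000, Tesla V100, 0, 100.00 W\n"
    "2020/06/01 12:00:00.000, Tesla V100, 1, 200.00 W\n"
    "2020/06/01 12:00:01.000, Tesla V100, 0, 120.00 W\n"
    "2020/06/01 12:00:01.000, Tesla V100, 1, 220.00 W\n"
)
per_gpu = mean_gpu_power_watts(log)
print(images_per_sec_per_watt(800.0, per_gpu.values(), 80.0))
```

The rocm-smi JSON log would need its own parser, which is omitted here because its field names depend on the ROCm version.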

Citation

This poster was presented at ISC High Performance in June 2020.
To cite our findings, please use:

@misc{ obenschain20,
       author = "Keith Obenschain and Douglas Schwer and Alisha Sharma",
       title = "Preliminary overview of the AMD MI50 GPGPUs for scientific and machine learning applications",
       year = "2020",
       howpublished = "Research poster presented at ISC High Performance 2020" }

Contact

For any questions, please contact the authors at emerging_architectures@nrl.navy.mil.