ssh glogin.ibex.kaust.edu · 2018-09-26 · module avail gpu software: modules dev core apps pgi...
TRANSCRIPT
● ssh glogin.ibex.kaust.edu.sa
● First login auto-generates keys & ssh config
– .ssh/config
● Host glogin #GPU login nodes
    Hostname glogin.ibex.kaust.edu.sa
    User $USER
    IdentityFile ~/.ssh/ksl-internal
    StrictHostKeyChecking no
    ForwardX11 yes
    ForwardX11Trusted yes
Getting Started: GPU Login
https://www.hpc.kaust.edu.sa/ibex/new_user https://www.hpc.kaust.edu.sa/ibex/faq
● Modules
– Customized to login node (GPU, Intel, AMD)
● glogin: /sw/csg/modulefiles/*
– Improved GPU App Stack is here
– Prefer default modules (/sw/csg/modulefiles/*)
● /cbrc/modules/* will be deprecated
● Make requests: [email protected]
● Stay connected: https://kaust-ibex.slack.com/
– #announce, #general, #gpu
GPU Software: Modules
● Modules
module avail
module load module/version
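A typical module session might look like the sketch below. `module avail`, `module load` and `module list` are standard Environment Modules commands, and the `cuda` module name comes from the software list in this deck. The `module_session` wrapper and its guard are assumptions added only so the snippet also runs on a machine without modules.

```shell
# Sketch of a module session. Assumes the Environment Modules `module`
# function is defined in the current shell (normally set up at login).
module_session() {
  if ! command -v module >/dev/null 2>&1; then
    echo "module command not available in this shell"
    return 0
  fi
  module avail cuda 2>&1   # list matching modules (often printed on stderr)
  module load cuda         # load the default version
  module list 2>&1         # confirm what is loaded
}

module_session
```

To pin a specific version, use the `module load module/version` form shown above.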
GPU Software: Modules
https://www.hpc.kaust.edu.sa/ibex/app
Nvidia / show all
module avail
GPU Software: Modules
DEV: pgi (OpenACC), gcc, intel, cmake, git, java, maven
CORE: NVIDIA (OpenGL / EGL), cuda, cudnn, nccl, openmpi
anaconda3, machine_learning (tensorflow, keras, torch, caffe*, caffe2, theano*, scipy, numpy, scikit-learn, etc.)
APPS: paraview, bclfastq2, cp2k, gromacs, lammps, mapd, namd, pyspark, relion, seismic_unix, sphire
* NVIDIA EGL supported; X11+GL support is missing...
● sinfo --partition=batch --format="%n %f" | fgrep -v nogpu
    dgpu501-22-r cpu_intel_e5_2670,gpu,...,tesla_k40m
    dgpu502-01-l cpu_intel_e5_2670,gpu,...,tesla_k20m
    dgpu702-16 cpu_intel_e5_2699_v3,gpu,...,gtx1080ti
    dgpu703-01 cpu_intel_e5_2699_v3,gpu,...,p100
    dgpu703-25 cpu_intel_e5_2699_v3,gpu,...,p6000
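To see which GPU types exist without scanning node by node, the feature list can be summarized. The sketch below assumes, as in the sample output above, that the GPU model is the last comma-separated feature. The `gpu_types` helper name is made up for illustration, and the sample lines are copied from the slide with the elided features left as `...`.

```shell
# Hypothetical helper: count nodes per GPU model.
# Assumes input lines look like "<node> <feat1>,<feat2>,...,<gpu_model>".
gpu_types() {
  awk '{ n = split($2, f, ","); print f[n] }' \
    | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Sample lines from the slide. On Ibex, pipe the real command in instead:
#   sinfo --partition=batch --format="%n %f" | fgrep -v nogpu | gpu_types
sample='dgpu501-22-r cpu_intel_e5_2670,gpu,...,tesla_k40m
dgpu502-01-l cpu_intel_e5_2670,gpu,...,tesla_k20m
dgpu702-16 cpu_intel_e5_2699_v3,gpu,...,gtx1080ti
dgpu703-01 cpu_intel_e5_2699_v3,gpu,...,p100
dgpu703-25 cpu_intel_e5_2699_v3,gpu,...,p6000'

printf '%s\n' "$sample" | gpu_types
```

Each output line is a GPU model followed by the number of nodes that advertise it.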
GPU Jobs + Constraints
https://www.hpc.kaust.edu.sa/ibex/job
● srun --time=30:00 --mem=64GB --gres=gpu:p100:1 --pty bash -l
● sbatch --time=60:00 --mem=128GB --gres=gpu:2 --constraint="[p100|p6000]" runjob.sbat
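Once the interactive shell from `srun ... --pty bash -l` starts, it can help to confirm what was actually allocated. This is a minimal sketch: `SLURM_JOB_ID` is set by Slurm inside a job, and `CUDA_VISIBLE_DEVICES` is assumed to be set by the cluster's GPU GRES configuration. The `show_allocation` name is hypothetical.

```shell
# Hypothetical helper: print the job ID and the GPUs visible to this shell.
show_allocation() {
  echo "job: ${SLURM_JOB_ID:-none}"
  echo "visible GPUs: ${CUDA_VISIBLE_DEVICES:-unset}"
  if command -v nvidia-smi >/dev/null 2>&1; then
    # -L lists one line per GPU the driver can see
    nvidia-smi -L || echo "nvidia-smi could not query a GPU"
  else
    echo "nvidia-smi not found on this host"
  fi
}

show_allocation
```

Outside a job, the first two lines report `none` and `unset`, which is a quick way to tell a login node from a compute node.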
GPU Jobs + Constraints
● sbatch --time=60:00 runjob.sbat
● runjob.sbat
#SBATCH --job-name=gpujob
#SBATCH --gres=gpu:gtx1080ti:4
#SBATCH --constraint="[local_500G]"
#SBATCH --mem=128GB
#SBATCH --nodes=2 --ntasks-per-node=2
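The directives above can be assembled into a complete script along the following lines. This is a sketch rather than the deck's actual `runjob.sbat`: the `#SBATCH` values are taken from these slides, while the `./example` program and the temporary directory are assumptions for illustration. The snippet only writes and syntax-checks the script; it does not submit it.

```shell
jobdir=$(mktemp -d)

# Write a minimal batch script. Option values come from the slides;
# ./example is a hypothetical GPU program.
cat > "$jobdir/runjob.sbat" <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpujob
#SBATCH --time=60:00
#SBATCH --gres=gpu:1
#SBATCH --constraint="[p100|p6000]"
#SBATCH --mem=64GB

module load cuda
nvidia-smi -L      # record which GPU the job received
srun ./example     # hypothetical program
EOF

# Check shell syntax without running anything.
sh -n "$jobdir/runjob.sbat" && echo "syntax OK: $jobdir/runjob.sbat"
echo "submit with: sbatch $jobdir/runjob.sbat"
```

Options given on the `sbatch` command line override the matching `#SBATCH` lines, so the same script can be reused with a different `--time` or `--gres`.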
GPU Jobs + Constraints
● CMake
– module load cmake
● C++
– System default: GCC v4.8.5
– module load gcc/6.4.0
– module load pgi/17.10
GPU Software: Modules & Compilers
● CUDA
– module load cuda
– nvcc -std=c++11 -o example example.cu
● cuDNN
– module load cudnn
– nvcc -std=c++11 -o example example.cu
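The slides do not include `example.cu`, so the sketch below writes a hypothetical minimal one and compiles it with the `nvcc` command shown above. The kernel, the temporary directory and the fallback message are all assumptions for illustration. A program that calls cuDNN would normally also need to link the library (for example with `-lcudnn`), which this toy kernel does not.

```shell
cudadir=$(mktemp -d)

# Hypothetical minimal CUDA program: each thread prints its index.
cat > "$cudadir/example.cu" <<'EOF'
#include <cstdio>

__global__ void hello() {
    printf("hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();          // one block of four threads
    cudaDeviceSynchronize();    // wait for the kernel and its printf output
    return 0;
}
EOF

if command -v nvcc >/dev/null 2>&1; then
  nvcc -std=c++11 -o "$cudadir/example" "$cudadir/example.cu" \
    && "$cudadir/example" \
    || echo "compile or run failed"
else
  echo "nvcc not found; run 'module load cuda' on a GPU node first"
fi
```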
GPU Software: Modules & Compilers
GPU Visualization Analytic Apps
● ParaView (HPC visualization / analytics)
– module load paraview
– https://wiki.vis.kaust.edu.sa/training/2017-18/advancedparaviewworkshop
● MapD (GPU Database)
– Available for early-user testing...
https://wiki.vis.kaust.edu.sa/training
GPU Python Environments
● anaconda3
– module load anaconda3
– conda list
– ipython
● Custom Python environments:
– conda --help
– https://conda.io/docs/_downloads/conda-cheatsheet.pdf
– https://conda.io/docs/
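Creating a custom environment could look like the sketch below. `conda create` and `conda env list` are standard conda commands; the environment name `gpu-test`, the package choice and the dry-run wrapper are assumptions for illustration. By default the snippet only prints the commands, so it is safe to try anywhere.

```shell
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 on the cluster to actually run them.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

ENV_NAME=gpu-test   # hypothetical environment name

run module load anaconda3
run conda create --yes --name "$ENV_NAME" python=3.6 numpy
run conda env list
```

After creation, activate the environment with `source activate gpu-test` (or `conda activate gpu-test` on conda 4.4 and later) before running `python` or `ipython`.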
GPU Machine Learning Apps
● machine_learning
– module av machine_learning
● <year>.<num>-cudnn<ver>-cuda<ver>-py<ver>
– module load machine_learning
– conda list
– Contains:
● TensorFlow, Keras, Caffe2, Torch, etc. + numpy, scipy, scikit-learn, pandas, matplotlib, etc.
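The version naming scheme `<year>.<num>-cudnn<ver>-cuda<ver>-py<ver>` can be split into its parts when a script needs to pick a module by CUDA or Python version. This is a minimal sketch: the `parse_ml_version` helper and the version string `2018.01-cudnn7.0-cuda9.0-py3.6` are hypothetical, not an actual module on the system. Check `module av machine_learning` for the real names.

```shell
# Hypothetical helper: split a machine_learning version string into fields.
# Prints nothing if the string does not follow the naming scheme.
parse_ml_version() {
  printf '%s\n' "$1" | sed -n -E \
    's/^([0-9]{4})\.([0-9]+)-cudnn([^-]+)-cuda([^-]+)-py([^-]+)$/year=\1 num=\2 cudnn=\3 cuda=\4 python=\5/p'
}

# Made-up version string, for illustration only.
parse_ml_version "2018.01-cudnn7.0-cuda9.0-py3.6"
```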
GPU Machine Learning Apps
● tensorflow
– module load tensorflow
– ipython
>>> import tensorflow as tf
– python <model.py>
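Before launching a long training run, a quick check that TensorFlow can see a GPU may save time. The sketch below assumes a TensorFlow build that provides `tf.test.is_gpu_available()`; newer releases prefer `tf.config.list_physical_devices('GPU')`. The `check_tf_gpu` wrapper is a hypothetical name, and it reports rather than fails when Python or TensorFlow is missing.

```shell
# Hypothetical helper: report the TensorFlow version and GPU visibility.
check_tf_gpu() {
  py=$(command -v python || command -v python3) || {
    echo "python not found"
    return 0
  }
  "$py" - <<'EOF'
try:
    import tensorflow as tf
except ImportError:
    print("tensorflow not importable; try 'module load tensorflow'")
else:
    print("tensorflow %s" % tf.__version__)
    print("GPU available: %s" % tf.test.is_gpu_available())
EOF
}

check_tf_gpu
```

Run it inside a GPU job; on a node without a GPU the check is expected to report `False`.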
GPU Performance Tools
● General Information (not scalable)
– nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 0000:0D:00.0     Off |                  N/A |
| 37%   56C    P2   153W / 189W |    135MiB /  6081MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  On   | 0000:0E:00.0     Off |                  N/A |
| 31%   47C    P8    34W / 189W |      2MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     72633      C   ../../build.cudnntraining.teneen/trainlenet  133MiB |
+-----------------------------------------------------------------------------+
KSL provides profiling training...
GPU Performance Monitoring
● Modify Batch Script:
# SBATCH ...
# After SBATCH section, but before running main program
# Pipe nvidia-smi logging into *.log file.
# Must run nvidia-smi in background
nvidia-smi dmon >> gpu-dmon.log &
# Run primary GPU application here...
# Don't run primary application in background
# After primary GPU application
# kill nvidia-smi monitor to allow batch job to terminate early.
pkill nvidia-smi
● View / Truncate logs
tail -f gpu-dmon.log
truncate --size=0 gpu-dmon.log
For Testing ONLY, NOT for Production
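After a run, the `gpu-dmon.log` file can be reduced to an average utilization per GPU. The sketch below assumes the usual `nvidia-smi dmon` layout, with a `# gpu ...` header line naming the columns and one row per GPU per sample. The numbers in the sample log are made up for illustration, and `avg_sm` is a hypothetical helper name.

```shell
# Illustrative sample in the layout `nvidia-smi dmon` prints (values made up).
dmonlog=$(mktemp)
cat > "$dmonlog" <<'EOF'
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0   153    56    86    40     0     0  3004  1164
    1    34    47     0     0     0     0   405   135
    0   150    57    90    42     0     0  3004  1164
    1    34    47     0     0     0     0   405   135
EOF

# Hypothetical helper: average the "sm" (GPU utilization %) column per GPU.
# The column is located from the header, so extra columns do not break it.
avg_sm() {
  awk '
    $1 == "#" && $2 == "gpu" {
      for (i = 2; i <= NF; i++) if ($i == "sm") col = i - 1
      next
    }
    $1 == "#" { next }                      # units line and repeated headers
    col { sum[$1] += $col; n[$1]++ }
    END { for (g in n) printf "gpu %s avg_sm %.1f\n", g, sum[g] / n[g] }
  ' "$1" | LC_ALL=C sort
}

avg_sm "$dmonlog"
```

On the cluster, point `avg_sm` at the real `gpu-dmon.log` written by the batch script.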