What is driving heterogeneity in HPC?
Wen-mei Hwu, Professor and Sanders-AMD Chair, ECE, NCSA, University of Illinois at Urbana-Champaign
with Simon Garcia and Carl Pearson
Agenda
• Revolutionary paradigm shift in applications
• Post-Dennard technology pivot: heterogeneity
• An example of positive application-technology spiral
• Lessons learned
A major paradigm shift
§ In the 20th Century, we were able to understand, design, and manufacture what we can measure
• Physical instruments and computing systems allowed us to see farther, capture more, communicate better, understand natural processes, control artificial processes…
§ In the 21st Century, we are able to understand, design, and create what we can compute
• Computational models are allowing us to see even farther, going back and forth in time, learn better, test hypotheses that cannot be verified any other way, create safe artificial processes…
Examples of Paradigm Shift
20th Century
§ Small mask patterns
§ Electron microscope and Crystallography with computational image processing
§ Anatomic imaging with computational image processing
§ Teleconference
§ GPS
21st Century
§ Optical proximity correction
§ Computational microscope with initial conditions from Crystallography
§ Metabolic imaging sees disease before visible anatomic change
§ Tele-immersion
§ Self-driving cars
Diving deeper into computational microscope
• Large clusters (scale out) allow simulation of biological systems of realistic space dimensions
  • 0.5 Å (0.05 nm) lattice spacing needed for accuracy
  • Interesting biological systems have dimensions of mm or larger
  • Thousands of nodes are required to hold and update all the grid points.
• Fast nodes (scale up) allow simulation at realistic time scales
  • Simulation time steps at femtosecond (10^-15 second) level needed for accuracy
  • Biological processes take milliseconds or longer
  • Current molecular dynamics simulations progress at about one day for each 10-100 microseconds of the simulated process.
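The figures on this slide can be turned into rough arithmetic. The sketch below is only a back-of-the-envelope check: the 1 mm cube and the 1 ms process are assumed sizes chosen to match "mm or larger" and "milliseconds or longer", and the function names are hypothetical.

```python
import math

ANGSTROM_M = 1e-10  # one angstrom in meters


def grid_points(system_size_m: float, lattice_spacing_m: float) -> float:
    """Grid points in a cubic volume with the given edge length and spacing."""
    points_per_side = system_size_m / lattice_spacing_m
    return points_per_side ** 3


def time_steps(process_s: float, timestep_s: float) -> float:
    """Number of simulation time steps needed to cover the process."""
    return process_s / timestep_s


def wall_clock_days(process_s: float, simulated_s_per_day: float) -> float:
    """Days of wall-clock time at a given simulated-time-per-day rate."""
    return process_s / simulated_s_per_day


# Assumed example: a 1 mm cube at 0.5 angstrom spacing, simulated for 1 ms.
points = grid_points(1e-3, 0.5 * ANGSTROM_M)   # about 8e21 grid points
steps = time_steps(1e-3, 1e-15)                # about 1e12 femtosecond steps
days_slow = wall_clock_days(1e-3, 10e-6)       # at 10 microseconds per day
days_fast = wall_clock_days(1e-3, 100e-6)      # at 100 microseconds per day

print(f"grid points: {points:.1e}")
print(f"time steps:  {steps:.1e}")
print(f"wall clock:  {days_fast:.0f} to {days_slow:.0f} days")
```

Under these assumptions the grid count motivates scale-out (no single node holds that many points), while the step count and days-per-millisecond figure motivate scale-up.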
Blue Waters Science Breakthrough Example
§ Determination of the structure of the HIV capsid at atomic level
§ Collaborative effort of experimental groups at the U. of Pittsburgh and Vanderbilt U., and the Schulten computational team at the U. of Illinois.
§ 64-million-atom HIV capsid simulation of the process through which the capsid disassembles, releasing its genetic material
§ A critical step in understanding HIV infection and finding a target for antiviral drugs.
Dennard Scaling of MOS Devices
§ In this ideal scaling, as L → α*L
• V_DD → α*V_DD, C → α*C, i → α*i
• Delay = C*V_DD/i scales by α, so f → (1/α)*f
• Power for each transistor is C*V_DD^2*f and scales by α^2
• Transistors per unit area scale by 1/α^2, keeping total power constant for same chip area
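The scaling rules above can be checked numerically. This is a minimal sketch, assuming the ideal relations listed on the slide; the function name `dennard_scale` and the choice α = 0.7 are illustrative, not from the talk.

```python
import math


def dennard_scale(alpha: float) -> dict:
    """Factors by which each quantity changes when L -> alpha * L (alpha < 1)."""
    voltage = alpha        # V_DD -> alpha * V_DD
    capacitance = alpha    # C -> alpha * C
    current = alpha        # i -> alpha * i
    delay = capacitance * voltage / current          # C*V_DD/i scales by alpha
    frequency = 1.0 / delay                          # f scales by 1/alpha
    power_per_transistor = capacitance * voltage ** 2 * frequency  # alpha^2
    transistors_per_area = 1.0 / alpha ** 2          # each device occupies alpha^2 of the area
    power_per_area = power_per_transistor * transistors_per_area   # stays at 1
    return {
        "delay": delay,
        "frequency": frequency,
        "power_per_transistor": power_per_transistor,
        "transistors_per_area": transistors_per_area,
        "power_per_area": power_per_area,
    }


factors = dennard_scale(0.7)  # assumed shrink factor for one generation
for name, value in factors.items():
    print(f"{name}: {value:.3f}")
```

The point of the exercise is the last entry: per-transistor power falls by exactly the factor that density rises, so power per unit area is unchanged as long as voltage scales with dimension.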
JSSC, Oct. 1974, page 256
Frequency Scaled Too Fast 1993-2003
[Figure: clock frequency (MHz), log scale from 10 to 10,000, versus year, 1985-2005]
Total Processor Power Increased (super-scaling of frequency and chip size)
[Figure: total processor power, log scale from 1 to 100, versus year, 1985-2003]
Post-Dennard Pivoting
§ Multiple cores with more moderate clock frequencies
§ Heavy use of vector execution
§ Employ both latency-oriented and throughput-oriented cores
§ 3D packaging for more memory bandwidth
Blue Waters Computing System
Operational at Illinois since 3/2013
Sonexion: 26 PBs
>1 TB/sec
100 GB/sec
10/40/100 Gb Ethernet Switch
Spectra Logic: 300 PBs
120+ Gb/sec
WAN
IB Switch
12.5 PF, 1.6 PB DRAM
$250M
49,504 CPUs, 4,224 GPUs
CPUs: Latency Oriented Design
§ High clock frequency
§ Large caches
• Convert long latency memory accesses to short latency cache accesses
§ Sophisticated control
• Branch prediction for reduced branch latency
• Data forwarding for reduced data latency
§ Powerful ALU
• Reduced operation latency
[Diagram: CPU block with Control, ALUs, and Cache, attached to DRAM]
GPUs: Throughput Oriented Design
§ Moderate clock frequency
§ Small caches
• To boost memory throughput
§ Simple control
• No branch prediction
• No data forwarding
§ Energy efficient ALUs
• Many, long latency but heavily pipelined for high throughput
§ Require massive number of threads to tolerate latencies
[Diagram: GPU attached to DRAM]
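One way to see why a throughput-oriented design needs a massive number of threads is Little's Law: the work in flight must equal latency times issue rate to keep the pipeline full. The sketch below is an illustration only; the function name and the numbers (400-cycle latency, one request per cycle) are hypothetical and do not describe any particular GPU.

```python
import math


def threads_needed(latency_cycles: int,
                   requests_per_cycle: float,
                   requests_per_thread: int = 1) -> int:
    """Threads required to hide a latency, by Little's Law.

    In-flight requests = latency * issue rate. Each thread is assumed to
    keep `requests_per_thread` independent requests outstanding.
    """
    in_flight = latency_cycles * requests_per_cycle
    return math.ceil(in_flight / requests_per_thread)


# Hypothetical: 400-cycle memory latency, one request issued per cycle,
# one outstanding request per thread.
print(threads_needed(400, 1.0))
# Same latency, two requests per cycle, four outstanding per thread.
print(threads_needed(400, 2.0, requests_per_thread=4))
```

A latency-oriented CPU attacks the first factor (shorter latency through caches and forwarding); a GPU accepts the latency and supplies enough threads to cover it.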
Applications Benefit from Both CPU and GPU
§ CPUs for sequential parts where latency matters
• CPUs can be 10+X faster than GPUs for sequential code
§ GPUs for parallel parts where throughput wins
• GPUs can be 10+X faster than CPUs for parallel code
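The trade-off can be sketched with an Amdahl-style estimate. This is a simplified model under stated assumptions: run time is normalized to 1 on a CPU alone, the parallel fraction (0.6 here) is a hypothetical value, and the 10X factors are taken from the bullets above as round numbers. The function names are illustrative.

```python
import math


def cpu_plus_gpu_speedup(parallel_fraction: float, gpu_parallel_speedup: float) -> float:
    """Speedup over CPU-only when only the parallel part moves to the GPU."""
    sequential_fraction = 1.0 - parallel_fraction
    new_time = sequential_fraction + parallel_fraction / gpu_parallel_speedup
    return 1.0 / new_time


def gpu_only_speedup(parallel_fraction: float,
                     gpu_parallel_speedup: float,
                     gpu_sequential_slowdown: float) -> float:
    """Speedup over CPU-only when the whole application runs on the GPU."""
    sequential_fraction = 1.0 - parallel_fraction
    new_time = (sequential_fraction * gpu_sequential_slowdown
                + parallel_fraction / gpu_parallel_speedup)
    return 1.0 / new_time


# Hypothetical application: 60% of CPU run time is parallel work.
both = cpu_plus_gpu_speedup(0.6, 10.0)
gpu_alone = gpu_only_speedup(0.6, 10.0, 10.0)
print(f"CPU for sequential, GPU for parallel: {both:.2f}x")
print(f"everything on the GPU:               {gpu_alone:.2f}x")
```

In this model, using each processor for the part it suits gives a speedup above 1, while moving the sequential part to the GPU makes the whole application slower than the CPU alone.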
Initial Production Use Results
Application | Description | Application Speedup
NAMD | 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included | 1.8
Chroma | Lattice QCD parameters: grid size of 48^3 x 512 running at the physical values of the quark masses | 2.4
QMCPACK | Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC | 2.7
ChaNGa | Collisionless N-body stellar dynamics with multipole expansion and hydrodynamics | 2.1
AWP | Anelastic wave propagation with staggered-grid finite-difference and realistic plastic yielding | 1.2
DEEP LEARNING IN COMPUTER VISION
Deep Learning Object Detection: DNN + Data + HPC
Traditional Computer Vision: Experts + Time
Deep Learning Achieves “Superhuman” Results
[Figure: ImageNet results, 0% to 100%, by year 2009-2016, Traditional CV versus Deep Learning]
Slide courtesy of Steve Oberlin, NVIDIA
DIFFERENT MODALITIES OF REAL-WORLD DATA
Images/video: Image → Vision features → Detection
Audio: Audio → Audio features → Speaker ID
Text: Text → Text features → Text classification, machine translation, information retrieval, ....
Slide courtesy of Andrew Ng, Stanford University
A long way to go towards cognitive computing
Image Recognition
Text Extraction
Human Instructions
Speech Recognition
Natural Language Processing
Diagram Understanding
IR
Knowledge Indexing
Knowledge Inferencing
Programming Framework
Hardware Platform
More Heterogeneity Is Coming
§ Beyond traditional CPUs and GPUs
• FPGAs (e.g., Microsoft FPGA cloud)
• ASICs (e.g., Google’s TPU)
§ Beyond traditional DRAM
• Stacked DRAM for more memory bandwidth
• Non-volatile RAM for memory capacity
• Near/in memory computing for reduced power used in data movement
Summary and Outlook
• Throughput computing using GPUs can result in 2-3X end-to-end application-level performance improvement
• GPUs, big data and deep learning have formed a positive spiral for the industry
• This is an exceptional time to be a graduate student
  • Paradigm shift, partly thanks to the generation of super-Dennard scaling
  • But you have to work much harder, also thanks to the generation of super-Dennard scaling