What is driving heterogeneity in HPC?

Wen-mei Hwu, Professor and Sanders-AMD Chair, ECE, NCSA, University of Illinois at Urbana-Champaign
with Simon Garcia and Carl Pearson

Post on 24-Mar-2022


Agenda

• Revolutionary paradigm shift in applications
• Post-Dennard technology pivot - heterogeneity
• An example of positive application-technology spiral
• Lessons learned

A major paradigm shift

§ In the 20th Century, we were able to understand, design, and manufacture what we can measure
  • Physical instruments and computing systems allowed us to see farther, capture more, communicate better, understand natural processes, control artificial processes…

§ In the 21st Century, we are able to understand, design, and create what we can compute
  • Computational models are allowing us to see even farther, go back and forth in time, learn better, test hypotheses that cannot be verified any other way, create safe artificial processes…

Examples of Paradigm Shift

20th Century
§ Small mask patterns
§ Electron microscope and crystallography with computational image processing
§ Anatomic imaging with computational image processing
§ Teleconference
§ GPS

21st Century
§ Optical proximity correction
§ Computational microscope with initial conditions from crystallography
§ Metabolic imaging sees disease before visible anatomic change
§ Tele-immersion
§ Self-driving cars

Diving deeper into the computational microscope

§ Large clusters (scale out) allow simulation of biological systems of realistic space dimensions
  • 0.5 Å (0.05 nm) lattice spacing needed for accuracy
  • Interesting biological systems have dimensions of mm or larger
  • Thousands of nodes are required to hold and update all the grid points.

§ Fast nodes (scale up) allow simulation at realistic time scales
  • Simulation time steps at femtosecond (10^-15 second) level needed for accuracy
  • Biological processes take milliseconds or longer
  • Current molecular dynamics simulations progress at about one day for each 10-100 microseconds of the simulated process.
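The scale implied by these figures can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the talk: the 1 mm cubic domain and the 1 ms process are illustrative assumptions that take the slide's lower bounds literally, and all function names are hypothetical.

```python
# Back-of-the-envelope arithmetic for the numbers quoted on the slide.
# Assumptions (not from the talk): a cubic domain 1 mm on a side and a
# biological process lasting exactly 1 ms.

ANGSTROM_M = 1e-10      # one angstrom in metres
FEMTOSECOND_S = 1e-15   # one femtosecond in seconds


def grid_points(side_m: float, spacing_m: float) -> float:
    """Lattice points in a cube of the given side at the given spacing."""
    return (side_m / spacing_m) ** 3


def time_steps(process_s: float, step_s: float) -> float:
    """Time steps needed to cover a process of the given duration."""
    return process_s / step_s


def wall_clock_days(process_s: float, simulated_s_per_day: float) -> float:
    """Days of computing at a given rate of simulated seconds per day."""
    return process_s / simulated_s_per_day


points = grid_points(1e-3, 0.5 * ANGSTROM_M)   # roughly 8e21 grid points
steps = time_steps(1e-3, FEMTOSECOND_S)        # roughly 1e12 time steps
slow_days = wall_clock_days(1e-3, 10e-6)       # roughly 100 days at 10 us/day
fast_days = wall_clock_days(1e-3, 100e-6)      # roughly 10 days at 100 us/day

print(f"grid points: {points:.1e}")
print(f"time steps:  {steps:.1e}")
print(f"wall clock:  {fast_days:.0f} to {slow_days:.0f} days per simulated ms")
```

Under these assumptions the spatial requirement argues for scale out, while the trillion sequential time steps argue for scale up, since time steps cannot be spread across nodes the way grid points can.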


Blue Waters Science Breakthrough Example

§ Determination of the structure of the HIV capsid at the atomic level
§ Collaborative effort of experimental groups at the U. of Pittsburgh and Vanderbilt U., and Schulten’s computational team at the U. of Illinois.
§ 64-million-atom HIV capsid simulation of the process through which the capsid disassembles, releasing its genetic material
§ A critical step in understanding HIV infection and finding a target for antiviral drugs.

Post-Dennard technology pivot - heterogeneity


Dennard Scaling of MOS Devices

§ In this ideal scaling, as L → α*L
  • VDD → α*VDD, C → α*C, i → α*i
  • Delay = C*VDD/i scales by α, so f → f/α
  • Power for each transistor is C*VDD^2*f and scales by α^2
  • Transistor count per unit area scales by 1/α^2, keeping total power constant for same chip area

JSSC, Oct. 1974, page 256
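The scaling rules above can be verified numerically. The following is a minimal Python sketch under stated assumptions: the `Device` class, its field names, the unit baseline values, and the choice α = 0.7 are all hypothetical, and device area is assumed to be proportional to L^2.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """A toy MOS device in arbitrary units (illustrative only)."""
    length: float       # L
    vdd: float          # VDD
    capacitance: float  # C
    current: float      # i

    def delay(self) -> float:
        return self.capacitance * self.vdd / self.current

    def frequency(self) -> float:
        return 1.0 / self.delay()

    def power(self) -> float:
        return self.capacitance * self.vdd ** 2 * self.frequency()

    def area(self) -> float:
        # Assumption: area is proportional to L^2.
        return self.length ** 2


def dennard_scale(d: Device, alpha: float) -> Device:
    """Scale L, VDD, C and i by the same factor alpha."""
    return Device(d.length * alpha, d.vdd * alpha,
                  d.capacitance * alpha, d.current * alpha)


alpha = 0.7
base = Device(length=1.0, vdd=1.0, capacitance=1.0, current=1.0)
scaled = dennard_scale(base, alpha)

freq_ratio = scaled.frequency() / base.frequency()   # 1/alpha
power_ratio = scaled.power() / base.power()          # alpha^2
density_ratio = (scaled.power() / scaled.area()) / (base.power() / base.area())

print(f"frequency x{freq_ratio:.3f}, power per device x{power_ratio:.3f}, "
      f"power density x{density_ratio:.3f}")
```

With α = 0.7 the frequency rises by about 1.43x and power per device falls to about 0.49x, so the power per unit area stays unchanged.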


Frequency Scaled Too Fast 1993-2003

[Figure: clock frequency (MHz, log scale from 10 to 10,000) by year, 1985 to 2005]

Total Processor Power Increased (super-scaling of frequency and chip size)

[Figure: total processor power (log scale from 1 to 100) by year, 1985 to 2003]

Post-Dennard Pivoting

§ Multiple cores with more moderate clock frequencies
§ Heavy use of vector execution
§ Employ both latency-oriented and throughput-oriented cores
§ 3D packaging for more memory bandwidth

Blue Waters Computing System
Operational at Illinois since 3/2013

[Figure: system diagram with the following labels]
§ Compute: 49,504 CPUs -- 4,224 GPUs, 12.5 PF, 1.6 PB DRAM, $250M
§ Sonexion: 26 PBs, >1 TB/sec
§ Spectra Logic: 300 PBs, 100 GB/sec
§ IB Switch
§ 10/40/100 Gb Ethernet Switch
§ WAN: 120+ Gb/sec

CPUs: Latency Oriented Design

§ High clock frequency
§ Large caches
  • Convert long latency memory accesses to short latency cache accesses
§ Sophisticated control
  • Branch prediction for reduced branch latency
  • Data forwarding for reduced data latency
§ Powerful ALU
  • Reduced operation latency

[Figure: CPU block diagram showing Control, ALUs, Cache, and DRAM]

GPUs: Throughput Oriented Design

§ Moderate clock frequency
§ Small caches
  • To boost memory throughput
§ Simple control
  • No branch prediction
  • No data forwarding
§ Energy efficient ALUs
  • Many, long latency but heavily pipelined for high throughput
§ Require massive number of threads to tolerate latencies

[Figure: GPU block diagram with DRAM]
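The need for a massive number of threads follows from Little's law: the work that must be in flight equals latency times throughput. The following is a minimal Python sketch; the function name and the example figures (400-cycle latency, 2 operations issued per cycle) are illustrative assumptions, not measurements of any particular GPU.

```python
import math


def threads_to_hide_latency(latency_cycles: int,
                            issues_per_cycle: int,
                            independent_ops_per_thread: int = 1) -> int:
    """Threads needed to keep the pipeline busy, by Little's law.

    Operations in flight = latency x issue rate. Each thread contributes
    `independent_ops_per_thread` operations that can overlap.
    """
    in_flight = latency_cycles * issues_per_cycle
    return math.ceil(in_flight / independent_ops_per_thread)


# Hypothetical example: 400-cycle memory latency, 2 operations issued per cycle.
one_op_per_thread = threads_to_hide_latency(400, 2)       # 800 threads
four_ops_per_thread = threads_to_hide_latency(400, 2, 4)  # 200 threads

print(one_op_per_thread, four_ops_per_thread)
```

A latency-oriented CPU shortens the latency term with caches and prediction instead; a throughput-oriented GPU accepts the latency and covers it with concurrency.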

Applications Benefit from Both CPU and GPU

§ CPUs for sequential parts where latency matters
  • CPUs can be 10+X faster than GPUs for sequential code
§ GPUs for parallel parts where throughput wins
  • GPUs can be 10+X faster than CPUs for parallel code
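A simple Amdahl-style model shows why using both kinds of core beats using either alone. The following is a minimal Python sketch; the flat 10X factors in both directions and the 90% parallel fraction are illustrative assumptions, and data transfer costs between CPU and GPU are ignored.

```python
def run_times(parallel_fraction: float,
              gpu_parallel_speedup: float = 10.0,
              cpu_sequential_speedup: float = 10.0) -> dict:
    """Normalized run times; CPU-only execution takes 1.0 by definition.

    Assumes the GPU runs parallel code `gpu_parallel_speedup` times faster
    than the CPU, and the CPU runs sequential code `cpu_sequential_speedup`
    times faster than the GPU.
    """
    seq = 1.0 - parallel_fraction
    par = parallel_fraction
    return {
        "cpu_only": seq + par,
        "gpu_only": seq * cpu_sequential_speedup + par / gpu_parallel_speedup,
        "heterogeneous": seq + par / gpu_parallel_speedup,
    }


times = run_times(parallel_fraction=0.9)
hetero_speedup = times["cpu_only"] / times["heterogeneous"]

for name, t in times.items():
    print(f"{name:>13}: {t:.2f}")
print(f"heterogeneous speedup over CPU only: {hetero_speedup:.2f}x")
```

In this toy model a GPU-only run is slightly slower than the CPU-only run (1.09 versus 1.00) because the 10% sequential part dominates, while the heterogeneous run finishes in 0.19 of the CPU-only time.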

Initial Production Use Results

Application | Description | Application Speedup
NAMD | 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included | 1.8
Chroma | Lattice QCD parameters: grid size of 48^3 x 512 running at the physical values of the quark masses | 2.4
QMCPACK | Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC | 2.7
ChaNGa | Collisionless N-body stellar dynamics with multipole expansion and hydrodynamics | 2.1
AWP | Anelastic wave propagation with staggered-grid finite-difference and realistic plastic yielding | 1.2
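The reported speedups can be summarized in a few lines. The following is a minimal Python sketch that uses only the five values from the table; the choice of geometric mean and median as summary statistics is an assumption made here, not something stated in the talk.

```python
from statistics import geometric_mean, median

# End-to-end application speedups as reported in the table above.
speedups = {
    "NAMD": 1.8,
    "Chroma": 2.4,
    "QMCPACK": 2.7,
    "ChaNGa": 2.1,
    "AWP": 1.2,
}

gm = geometric_mean(speedups.values())
med = median(speedups.values())
lowest = min(speedups, key=speedups.get)
highest = max(speedups, key=speedups.get)

print(f"geometric mean: {gm:.2f}x, median: {med:.1f}x")
print(f"range: {lowest} ({speedups[lowest]}x) to {highest} ({speedups[highest]}x)")
```

The geometric mean is used because speedups are ratios; it comes out just under 2x for these five applications.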


An example of positive application-technology spiral

DEEP LEARNING IN COMPUTER VISION

§ Traditional Computer Vision: Experts + Time
§ Deep Learning Object Detection: DNN + Data + HPC
§ Deep Learning Achieves “Superhuman” Results

[Figure: ImageNet results (0% to 100%) by year, 2009 to 2016, Traditional CV versus Deep Learning]

Slide courtesy of Steve Oberlin, NVIDIA

DIFFERENT MODALITIES OF REAL-WORLD DATA

§ Images/video: Image → Vision features → Detection
§ Audio: Audio → Audio features → Speaker ID
§ Text: Text → Text features → Text classification, machine translation, information retrieval, ...

Slide courtesy of Andrew Ng, Stanford University

A long way to go towards cognitive computing

[Figure: diagram with the following labeled components]
§ Image Recognition
§ Text Extraction
§ Human Instructions
§ Speech Recognition
§ Natural Language Processing
§ Diagram Understanding
§ IR
§ Knowledge Indexing
§ Knowledge Inferencing
§ Programming Framework
§ Hardware Platform

More Heterogeneity Is Coming

§ Beyond traditional CPUs and GPUs
  • FPGAs (e.g., Microsoft FPGA cloud)
  • ASICs (e.g., Google’s TPU)
§ Beyond traditional DRAM
  • Stacked DRAM for more memory bandwidth
  • Non-volatile RAM for memory capacity
  • Near/in memory computing for reduced power used in data movement

Summary and Outlook

§ Throughput computing using GPUs can result in 2-3X end-to-end application-level performance improvement
§ GPUs, big data and deep learning have formed a positive spiral for the industry
§ This is an exceptional time to be a graduate student
  • Paradigm shift, partly thanks to the generation of super-Dennard scaling
  • But you have to work much harder, also thanks to the generation of super-Dennard scaling

Please attend the Tutorial by Simon and Carl this afternoon at 3:30pm