Toward a Sustainable Architecture at Extreme Scale
Zhimin Tang, CTO, [email protected]
Sustainable (Cost Effective) HPC
  Counter-examples in the history
Current and Future Challenges
  New computing forms from sensor to cloud
  Silicon-based IC process approaching its physical limit
Strategy
  Abandon HPC-only acceleration features
  Design sustainable architecture for HPC and other applications
Outline
Application (Algorithm) Requirements
  High performance
Technology Constraints
  CMOS vs. bipolar, Moore’s Law
  Commercial MPU vs. custom ASIP
Economic Feasibility
  Good ecosystem
  Mass production
  Low energy consumption
Considerations of Cost Effectiveness or Sustainability
Vector Supercomputers
  CMOS dominated, SIMD weakness
HPCs in the History
SIMD PE Array
  Optimal only for some algorithms
  Custom chips, tiny processor
Connection Machine
Chip-Level Integration (SoC)
  nCube/2, KSR-1 (COMA), …
  High NRE cost due to custom design without mass production
  Low node processor performance
MIMD with Custom CPUs
HPC Is a Small Market
Architectures Designed Only for HPC
  Lower volume, higher cost (NRE)
  Not enough resources to implement a top-level (w.r.t. performance) solution
  Longer time-to-market, falling behind Moore’s Law
Result: COTS Solutions in the Last 20 Years
  COTS: commercial off-the-shelf
Co-design with the IT Ecosystem
  From cloud computers to sensors
Why No Cost Effectiveness
High Performance and Low Cost
  Low cost remains a must
  New cost factors: energy/power, large NRE
Performance Is No Longer the Bottleneck for Most Applications
  Like cars, trains, and airplanes in transportation
New Forms of Performance
  Computing: MIPS/MFLOPS
  Transaction processing: TPM
  Cloud applications: requests serviced per unit time
Ecosystem Requirements
Two Ends of the Computing System
  Cloud: large-scale power dissipation
  Terminal: limited battery life
Energy: compute < memory < communication
  For each FLOP in Linpack: FPU spends 10 pJ, memory access 475 pJ
Wireless Sensor Network
  RF radio consumes most of the power
What Do We Need Besides Locality?
Energy Efficiency
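The Linpack figures above show how lopsided the per-FLOP energy budget is. A minimal Python sketch of that arithmetic, assuming the two quoted figures simply add up for each FLOP; the function name is illustrative, not from the talk:

```python
# Per-FLOP energy figures quoted on the slide, in picojoules.
FPU_PJ = 10.0
MEMORY_PJ = 475.0


def energy_breakdown(fpu_pj, memory_pj):
    """Return (total pJ, fraction spent in the FPU, memory-to-FPU ratio).

    Assumption: FPU and memory energy are the only two terms and add linearly.
    """
    total = fpu_pj + memory_pj
    return total, fpu_pj / total, memory_pj / fpu_pj


total, fpu_share, ratio = energy_breakdown(FPU_PJ, MEMORY_PJ)
print(f"total per FLOP: {total:.0f} pJ")
print(f"FPU share of energy: {fpu_share:.1%}")
print(f"memory / FPU energy: {ratio:.1f}x")
```

Under this assumption the arithmetic unit accounts for only a few percent of the energy, which is why the slide ranks compute below memory and communication.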
Architecture Consuming Less Energy
  Many-core, custom designed for applications
  Flattened software stack
Architecture for New Performance Metrics
  High-volume throughput computers
New Algorithms and Methodology
  Complexity of computation
  Complexity of memory access and communication
Needs New Architecture
Existing Software Ecosystem
  Standard or de facto interfaces, e.g., ISA (Instruction Set Architecture)
  Pro: software compatibility
  Con: obstacle to innovation, legacy
Huge Development Expenses
  A new architecture needs new processors
  NRE (Non-Recurring Engineering) cost of chip development is increasing rapidly as the CMOS process approaches its limit
Constraints to Innovation
Approaching the Limit, and No Replacement!
  Moore’s law: 7 nm @ 2024, ~30 atoms
Different from the Transition in the 1990s
  Bipolar (ECL/TTL) is faster, but consumes a lot of power
  CMOS had been developed for 20 years: not too slow, low cost, and low power
  But now, liquid cooling for CMOS
In the foreseeable future, still CMOS
CMOS Technology
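The "~30 atoms" figure can be checked with a back-of-the-envelope calculation. A small Python sketch, assuming the count is based on the Si-Si nearest-neighbour distance of roughly 0.235 nm (the slide does not say which spacing it used); the function name is illustrative:

```python
FEATURE_NM = 7.0     # feature size quoted on the slide
SI_BOND_NM = 0.235   # approximate Si-Si nearest-neighbour distance (assumption)


def atoms_across(feature_nm, spacing_nm=SI_BOND_NM):
    """Rough number of atomic spacings that fit across one feature."""
    return feature_nm / spacing_nm


print(f"{FEATURE_NM} nm is about {atoms_across(FEATURE_NM):.0f} atomic spacings")
```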
More and More than Moore
2011 ITRS Exec. Summary Fig. 4
Dark Silicon
ISCA’11, IEEE Micro’12, CACM’13
At 8 nm, more than half of the transistors must be turned off
Speedup of 4x to 8x over 5 process generations
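The speedup range above can be put on a per-generation footing. A rough Python sketch, assuming (this baseline is not stated on the slide) that ideal scaling would double performance every process generation; the names are illustrative:

```python
GENERATIONS = 5


def per_generation(total_speedup, generations=GENERATIONS):
    """Geometric-mean speedup per generation for a given cumulative speedup."""
    return total_speedup ** (1.0 / generations)


# Assumed ideal baseline: performance doubles with each generation.
ideal_total = 2 ** GENERATIONS

for total in (4, 8):
    print(
        f"cumulative {total}x -> {per_generation(total):.2f}x per generation, "
        f"{ideal_total / total:.0f}x short of the assumed ideal {ideal_total}x"
    )
```

Against that assumed baseline, even the optimistic end of the range falls well short of one doubling per generation.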
Moore’s Law Provides More Transistors
  But switching speed is no longer increasing
  Process development at the nanometer scale increases NRE tremendously
Mass Production Is Essential
  Otherwise, the chip business is not sustainable
  Advantage of general-purpose processors
How about Many-core Processors?
  GPU, Tilera, MIC, …
Economic Feasibility
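Why mass production is essential follows from simple amortization: the one-time NRE is spread over every chip sold. A minimal Python sketch; the dollar amounts and volumes below are hypothetical placeholders, not figures from the talk:

```python
def unit_cost(nre, volume, marginal_cost):
    """Cost per chip when a one-time NRE is amortized over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre / volume + marginal_cost


# Hypothetical numbers, for illustration only.
NRE = 100e6        # one-time design and mask cost
MARGINAL = 50.0    # manufacturing cost per chip

for volume in (10_000, 1_000_000, 100_000_000):
    print(f"{volume:>11,} units -> {unit_cost(NRE, volume, MARGINAL):>10,.2f} per chip")
```

At HPC-only volumes the NRE term dominates the price; at commodity volumes it nearly vanishes.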
Most Advanced Process, Mass Production
  Stable, reliable, low cost
  Mature ecosystem and solutions
Not Optimal for Many Applications
  Aim: not too bad for most applications
  Over-allocation of resources
  Waste of resources, higher energy consumption
Pros and Cons of MPU
High L1-I Cache Miss Rate
  Processor idle (instruction starvation)
Small ILP and MLP
  Wide issue not effective
Low Efficiency of Memory Access
  Large L3 takes ½ of the chip area but does not help performance
Useless High On-chip Bandwidth
  Little data sharing among cores
MPU Not Good for Cloud
Only 1/3 are frequently used
Low Utilization of Resources
[Figure: diagram with blocks labeled GPU, L3 Cache, L2 Cache (x4), and OOO/FPU (x4)]
Optimally Designed for Some Applications
  High efficiency, low resource usage, low power
But There Is No Free Lunch
  Much design/verification work
  Stability/reliability?
  May affect the time to market
  How to amortize the huge NRE?
  Small market means high cost
Pros and Cons of ASIP
GPU
  Pro: mass production
  Con: PCIe overhead, small memory size
MIC (Xeon Phi)
  Mass production possible?
FPGA
  Resource utilization
  Ease of programming
  MPU interface, e.g., QPI or PCIe
MPU + Accelerator
Crossing the Gap between General and Special
Many Simple Cores
  Reduce power consumption
Multiple Hardware Threads in Each Core
  Massive threads on chip
  Exploit concurrency, tolerate latency
Dynamic Scheduling of On-chip Threads
  Improve performance for general applications
Design of New Processors
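How multiple hardware threads tolerate latency can be illustrated with a common textbook approximation (not a model taken from the talk): each thread runs for some cycles, then stalls on memory, and the core switches to another ready thread at no cost. The function names and cycle counts below are illustrative assumptions:

```python
def core_utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles a core stays busy under the simple model:
    each thread computes for `work_cycles`, then waits `latency_cycles`,
    and thread switching is free (an idealizing assumption)."""
    if threads < 1 or work_cycles <= 0 or latency_cycles < 0:
        raise ValueError("invalid parameters")
    return min(1.0, threads * work_cycles / (work_cycles + latency_cycles))


def threads_to_hide_latency(work_cycles, latency_cycles):
    """Smallest thread count that keeps the core fully busy in this model."""
    total = work_cycles + latency_cycles
    return -(-total // work_cycles)  # integer ceiling division


# Hypothetical workload: 10 cycles of work per 90-cycle memory stall.
for n in (1, 4, 10, 16):
    print(f"{n:>2} threads -> utilization {core_utilization(n, 10, 90):.0%}")
print("threads needed:", threads_to_hide_latency(10, 90))
```

In this idealized model utilization grows linearly with the thread count until the stall is fully covered, after which extra threads add nothing for a single core.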
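[Figure: pipelined vector processing engine, with blocks labeled PC (several), instruction cache (I$), instruction register (IR), instruction decode (ID), register file (RF), vector registers, ALU, FPU, LSU, and data cache/SPM (D$/SPM)]
Combining Multithreading and Vector Pipelining
Switch to single thread
Deep scalar pipeline
Switch to vector pipeline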
[Figure: processing-engine diagrams with blocks labeled PC (several), instruction cache (I$), instruction register (IR), instruction decode (ID), register file (RF), vector register file, ALU, FPU, LSU, and data cache/SPM (D$/SPM)]
Thread Parallelism and Data Parallelism in Two Dimensions
Deep thread parallelism and data parallelism
Wide data parallelism
Wide thread parallelism
A Universal Architecture
  Scalable and reconfigurable processor array
  Supports thread-level and data-level parallelism
Fulfills All Requirements from Terminal to Cloud Data Center
  High-performance computers
  Cloud computing servers
  Equipment in the core network
  Terminals for cloud and mobile Internet
In Conclusion
Thanks!