Toward a Sustainable Architecture at Extreme Scale

Zhimin Tang

Sustainable (Cost-Effective) HPC
  Counterexamples in history
Current and Future Challenges
  New computing forms from sensor to cloud
  Silicon-based IC process approaching its physical limit
Strategy
  Abandon HPC-only acceleration features
  Design a sustainable architecture for HPC and other applications

Outline

Application (Algorithm) Requirements
  High performance
Technology Constraints
  CMOS vs. bipolar, Moore’s Law
  Commercial MPU vs. custom ASIP
Economic Feasibility
  Good ecosystem
  Mass production
  Low energy consumption

Considerations of Cost Effectiveness or Sustainability

Vector Supercomputers
  CMOS dominated, SIMD weakness

HPCs in History

SIMD PE Array
  Optimal only for some algorithms
  Custom chips, tiny processors

Connection Machine

Chip-Level Integration (SoC)
  nCube/2, KSR-1 (COMA), …
  High NRE cost due to custom design without mass production
  Low node processor performance

MIMD with Custom CPUs

HPC Is a Small Market
Architectures Designed Only for HPC
  Lower volume, higher cost (NRE)
  Not enough resources to implement a top-level (w.r.t. performance) solution
  Longer time-to-market, falling behind Moore’s Law
Result: COTS Solutions in the Last 20 Years
  Commercial off-the-shelf
Co-design with the IT Ecosystem
  From cloud computers to sensors

Why No Cost Effectiveness
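The volume argument on this slide can be made concrete with a back-of-the-envelope amortization. The sketch below assumes a simple model in which the cost of one chip is the NRE divided by the production volume plus a per-unit manufacturing cost. The function name `cost_per_chip` and every dollar figure and volume are hypothetical illustrations, not numbers from the talk.

```python
def cost_per_chip(nre_dollars: float, unit_cost_dollars: float, volume: int) -> float:
    """Amortized cost of one chip: NRE spread over the volume, plus unit cost."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_dollars / volume + unit_cost_dollars


# Hypothetical inputs, chosen only to show the trend.
NRE = 100e6        # assumed one-time design and mask cost, in dollars
UNIT_COST = 50.0   # assumed manufacturing cost per chip, in dollars

for label, volume in [("HPC-only part", 10_000), ("mass-market part", 10_000_000)]:
    cost = cost_per_chip(NRE, UNIT_COST, volume)
    print(f"{label:>16}: volume={volume:>10,d}  cost/chip=${cost:,.2f}")
```

With these assumed inputs the low-volume part is dominated by its NRE share, while for the high-volume part the NRE share nearly disappears into the unit cost.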

High Performance and Low Cost
  Low cost continues to be a must
  New cost factors: energy/power, large NRE
Performance Is No Longer the Bottleneck for Most Applications
  Like cars, trains, and airplanes in transportation
New Forms of Performance
  Computing: MIPS/MFLOPS
  Transaction processing: TPM
  Cloud applications: requests serviced per unit time

Ecosystem Requirements

Two Ends of the Computing System
  Cloud: large-scale power dissipation
  Terminal: limited battery life
Energy: compute < memory < communication
  For each FLOP in Linpack
  FPU spends 10 pJ, memory access 475 pJ
Wireless Sensor Network
  RF radio consumes most of the power

What We Need Besides Locality?

Energy Efficiency
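The two Linpack figures on this slide (10 pJ per FPU operation, 475 pJ per memory access) are enough for a rough estimate of how much data reuse is needed before compute, rather than memory, dominates the energy budget. The sketch below assumes a simple additive energy model. The function names and the reuse levels in the loop are illustrative assumptions, not part of the talk.

```python
FPU_PJ = 10.0      # energy per floating-point operation, from the slide (pJ)
MEM_PJ = 475.0     # energy per memory access, from the slide (pJ)


def energy_pj(flops: int, mem_accesses: int) -> float:
    """Total energy in pJ under a simple additive model (an assumption)."""
    return flops * FPU_PJ + mem_accesses * MEM_PJ


def memory_share(flops: int, mem_accesses: int) -> float:
    """Fraction of the total energy spent on memory accesses."""
    return mem_accesses * MEM_PJ / energy_pj(flops, mem_accesses)


# Illustrative reuse levels: FLOPs performed per memory access.
for flops_per_access in (1, 10, 100):
    share = memory_share(flops_per_access, 1)
    print(f"{flops_per_access:>3} FLOPs per access: memory share = {share:.1%}")
```

Under this model, memory and compute energy break even at 475 / 10 = 47.5 FLOPs per memory access, which is one way to read the slide's question about what is needed besides locality.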

Architecture Consuming Less Energy
  Many-core, custom designed for applications
  Flattened software stack
Architecture for New Performance Metrics
  High-volume throughput computers
New Algorithms and Methodology
  Complexity of computation
  Complexity of memory access and communication

Needs New Architecture

Existing Software Ecosystem
  Standard or de facto interfaces, e.g., ISA (Instruction Set Architecture)
  Pro: software compatibility
  Con: obstacle to innovation, legacy
Huge Development Expenses
  A new architecture needs new processors
  NRE of chip development is increasing rapidly as the CMOS process approaches its limit
  NRE: Non-Recurring Engineering

Constraints to Innovation

Approaching the Limit, and No Replacement!
  Moore’s Law: 7 nm in 2024 (~30 atoms)
Different from the Transition in the 1990s
  Bipolar (ECL/TTL) is faster, but consumes a lot of power
  CMOS had been developed for 20 years: not too slow, low cost, and low power
But Now, Liquid Cooling for CMOS

In the foreseeable future, still CMOS

CMOS Technology

More and More than Moore

2011 ITRS Exec. Summary Fig. 4

Dark Silicon

ISCA’11, IEEE Micro’12, CACM’13

At 8 nm, more than half of the transistors must be turned off

Speedup of 4-8x over 5 process generations
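To put the 4-8x figure in perspective, it can be compared against a naive baseline. The sketch below assumes, purely for comparison, that performance would otherwise double with every process generation; that doubling baseline and the helper `per_generation_gain` are assumptions introduced here, not claims from the slide or the cited papers.

```python
GENERATIONS = 5                    # from the slide
PROJECTED_SPEEDUP = (4.0, 8.0)     # range quoted on the slide

# Assumption for comparison only: performance doubles every generation.
ideal_speedup = 2.0 ** GENERATIONS


def per_generation_gain(total_speedup: float, generations: int) -> float:
    """Geometric-mean speedup per generation for a given total speedup."""
    return total_speedup ** (1.0 / generations)


print(f"ideal (assumed doubling): {ideal_speedup:.0f}x")
for s in PROJECTED_SPEEDUP:
    gain = per_generation_gain(s, GENERATIONS)
    print(f"projected {s:.0f}x total -> {gain:.2f}x per generation, "
          f"{ideal_speedup / s:.0f}x short of the assumed ideal")
```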

Moore’s Law Provides More Transistors
  But switching speed is no longer increasing
  Process development at the nanometer scale increases NRE tremendously
Mass Production Is Essential
  Otherwise, the chip business is not sustainable
  Advantage of general-purpose processors
How about Many-core Processors?
  GPU, Tilera, MIC, …

Economic Feasibility

Most Advanced Process, Mass Production
  Stable, reliable, low cost
  Mature ecosystem and solutions
Not Optimal for Many Applications
  Aim: not too bad for most applications
  Over-allocation of resources
  Waste of resources, more energy consumption

Pros and Cons of MPU

High L1-I Cache Miss Rate
  Processor idle (instruction starvation)
Low ILP and MLP
  Wide issue not effective
Low Efficiency of Memory Access
  Large L3 takes ½ of the chip area, but does not help improve performance
Useless High On-chip Bandwidth
  Little data sharing among cores

MPU not good for Cloud

Only 1/3 are frequently used

Low Utilization of Resources

Figure: chip block diagram labeled GPU, L3 Cache, and four sets of L2 Cache, OOO, and FPU blocks

Optimally Designed for Some Applications
  High efficiency, low resource usage, low power
But No Lunches Are Free
  Much design/verification work
  Stability/reliability?
  May affect the time to market
  How to amortize the huge NRE?
  Small market means high cost

Pros and Cons of ASIP

GPU
  Pro: mass production
  Con: PCIe overhead, small memory size
MIC (Xeon Phi)
  Mass production possible?
FPGA
  Resource utilization
  Ease of programming
  MPU interface, e.g., QPI or PCIe

MPU + Accelerator

Crossing the Gap between General and Special

Many Simple Cores
  Reduce power consumption
Multiple Hardware Threads in Each Core
  Massive threads on chip
  Exploit concurrency, tolerate latency
Dynamic Scheduling of On-chip Threads
  Improve performance for general applications

Design of New Processors
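The claim that multiple hardware threads per core tolerate latency can be illustrated with a textbook utilization model. The sketch below assumes each thread alternates between a fixed number of compute cycles and a fixed memory stall, and that switching between hardware threads costs nothing. The function names and the 10-cycle and 90-cycle workload figures are hypothetical and do not describe the processor proposed in the talk.

```python
import math


def core_utilization(threads: int, run_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the pipeline is busy, assuming zero-cost thread switches."""
    if threads < 1 or run_cycles < 1 or stall_cycles < 0:
        raise ValueError("invalid parameters")
    return min(1.0, threads * run_cycles / (run_cycles + stall_cycles))


def threads_to_hide_latency(run_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the pipeline fully busy in this model."""
    return math.ceil((run_cycles + stall_cycles) / run_cycles)


# Hypothetical workload: 10 cycles of work between 90-cycle memory stalls.
RUN, STALL = 10, 90
for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} threads -> utilization {core_utilization(n, RUN, STALL):.0%}")
print("threads needed:", threads_to_hide_latency(RUN, STALL))
```

In this model utilization grows linearly with the thread count until the stall is fully covered, which is the motivation for keeping many threads resident on chip.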

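Figure: pipelined vector processing engine, with multiple program counters (PC), instruction cache, instruction register, instruction decode, register file, ALU, FPU, LSU, and data cache/SPM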

Combining Multithreading and Vector Pipelining

Switch to single thread
Deep scalar pipeline
Switch to vector pipeline

Figure: pipeline diagram labeled I$, IR, ID, RF, Vector Registers, and D$/SPM

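Figure: multithreaded core with multiple program counters (PC), instruction cache, instruction register, instruction decode, register file, ALU, FPU, LSU, and data cache/SPM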

Thread Parallelism and Data Parallelism in Two Dimensions

Deep thread parallelism and data parallelism
Wide data parallelism

Figure: pipeline diagram labeled I$, IR, ID, RF, and D$/SPM

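Figure: multithreaded core with multiple program counters (PC), instruction cache, instruction register, instruction decode, register file, ALU, FPU, LSU, and data cache/SPM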

Wide thread parallelism

Figure: pipeline diagram labeled I$, IR, ID, RF, D$/SPM, and Vector Register File

A Universal Architecture
  Scalable and reconfigurable processor array
  Supports thread-level and data-level parallelism
Fulfills All Requirements from Terminal to Cloud Data Center
  High-performance computers
  Cloud computing servers
  Equipment in the core network
  Terminals for cloud and mobile Internet

In Conclusion

Thanks!
