TRANSCRIPT
Computer Architecture & Arithmetic Group 1 Stanford University
Architecture & Technology: some thoughts on the road ahead
M. J. Flynn
University of Texas at Austin
March 17, 2003
Outline
• The road ahead
• System, not processor, emphasis
• Time: performance and its limits
• Power
• Area
• Beyond Time × Power × Area: computational integrity, design time and wireless interconnections
• The shape of the “new” system
The present and the future of the technology

Projections from the Semiconductor Industry Association
SIA Roadmap - 1: Total Transistors per Chip

[Chart: number of transistors (in millions), 0-5000, vs. technology / year of first shipment: 130nm/2002, 90nm/2004, 65nm/2007, 32nm/2013]
SIA Roadmap - 2: On-Chip Clocks, global and local clock rates

[Chart: frequency (in MHz), 0-25,000, vs. technology / year of first shipment: 130nm/2003, 90nm/2004, 65nm/2007, 32nm/2013]
Semiconductor Industry Roadmap

Year                             2001   2004   2007    2013
Technology generation (nm)        130     90     65      32
Wafer size (mm)                   300    300    300     450
Defect density (per m²)          1356   1356   1356    1116
µP die size (mm²)                 310    310    310     310
Chip frequency (MHz)             1767   4000   6700   19350
MTx per chip (microprocessor)     276    552   1204    4424
Max power (W), high performance   130    160    190     251
Semiconductor Technology Roadmap (2001)
Limits on Scaling (at 10-30 nm)

Any effect such as:
• Thermal noise effects at low voltages (VDD)
• Dielectric material breakdown
• Quantum effects
… may limit scaling. Beyond scalable planar technology lie 3D technologies.
What this means to Computer Architecture

Tradeoffs in time (performance), power and area (cost): the TPA product.
Message: System not Processor Architecture

• Power, not clock rate
• System, not processor, integration
• Integration of what?
  – Wireless and optical interconnections
  – Voice and sensor I/O
  – Crypto, video, audio, etc.
  – Multi-user displays, keyboards, sensors, etc.
Performance: Time, Area and Power Tradeoffs
Performance lesson: wires, not gates, determine delay.
[Chart: delay (ps), 0-400, vs. process generation (µm): 0.7, 0.5, 0.35, 0.25, 0.18, 0.13; two series: gate delay, and interconnect delay for a standard functional unit]

Wires and coupling capacitance
High Speed Clocking
Fast clocks are not primarily the result of technology scaling, but rather of circuit/logic techniques:
– Dynamic logic, wave-pipelining techniques
• Modern microprocessors are increasing clock speed more rapidly than anyone predicted… local clocks (SIA) or hyperclocking.
• But fast clocks do not necessarily increase system performance.
CMOS

[Chart: frequency, 1-10,000 on a log scale, vs. year, '67-'03, for CMOS]
Bipolar and CMOS
[Chart: frequency, 1-10,000 on a log scale, vs. year, '67-'03; series: Bipolar, CMOS and a limit line; bipolar frequency flattens at the bipolar power limit]
SIA global clock rates
[Chart: frequency, 100-100,000 on a log scale, vs. year, '98-'13; series: global '01 and SIA '97]
SIA local & global clock rates
[Chart: frequency, 100-100,000 on a log scale, vs. year, '98-'13; series: global and local clock rates, with the local rates labeled "hyper frequency"]
Lower clock overhead enables more pipe segments & higher freq.
• Let the total instruction execution time, without pipelining and associated clock overhead, be T.
• Let S be the number of segments; then S − k is the number of cycles lost due to a pipeline break (S > k).
• Let b = probability of a break per instruction, and C = clock overhead, including skew.

[Diagram: unpipelined execution time T; pipelined segment time T/S + C, for a total of T + CS; with probability b, a break costs S − k cycles]
Performance, issue rate and frequency

Performance is largely determined by m and b·f·(S − k):

Perf = m · [1 / (T/S + C + m·o)] · [1 / (1 + m·b·f·(S − k))]

Setting d Perf/d S = 0 and d Perf/d m = 0: the optimum S (and max frequency) is largely determined by C, with Sopt = f(k)·√(T/C). Similarly, the optimum m is largely determined by o, b and C.
Architecturally, what’s happened to T, b and C?
• T has increased, probably 25-50%, due to added processor complexity.
• b has decreased significantly, probably from 0.25 to closer to 0.05, due to better compilers, prediction and cache management.
• C has at least halved due to better clocking.
• So Sopt has increased by more than 2×, decreasing the number of FO4 delays per cycle by ½.
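The claim that Sopt more than doubles can be checked numerically with a minimal sketch of the pipelining model, assuming the single-issue form Perf(S) = 1/((T/S + C)(1 + b(S − k))); the specific values of T, C and k below are illustrative, not from the talk:

```python
# Minimal sketch of the pipelining model, single-issue case (m = 1):
#   Perf(S) = 1 / ((T/S + C) * (1 + b * (S - k)))
# The values of T, C and k below are illustrative assumptions.

def perf(s, T, C, b, k):
    """Relative performance with s pipeline segments."""
    return 1.0 / ((T / s + C) * (1 + b * max(s - k, 0)))

def s_opt(T, C, b, k, s_max=100):
    """Exhaustively find the segment count that maximizes Perf."""
    return max(range(1, s_max + 1), key=lambda s: perf(s, T, C, b, k))

# "Old" vs. "new" parameters per the slide: T up ~40%, b: 0.25 -> 0.05, C halved.
old = s_opt(T=100, C=4, b=0.25, k=1)
new = s_opt(T=140, C=2, b=0.05, k=1)
print(old, new)  # the new optimum depth is more than 2x the old one
```

With these numbers the optimum depth roughly quadruples, consistent with the slide's "more than 2×" once parameter uncertainty is allowed for.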
Optimum Pipelining and ILP
Optimum pipe segment size is determined by clock overhead, C
Optimum issue rate, m, is determined by issue overhead, o
Performance vs ILP
Performance and the memory “wall”
• BUT regardless of clock rate, performance is limited by the predictability and supply of instructions and data operands from memory and cache.
• The access time to memory depends on wire delay, which remains constant with scaling.
• Even with significant hardware support (area), unpredicted events (missed branches, etc.) force the processor to access memory, at some point limiting performance.
Power and the curse of frequency

While Vdd and C decrease, frequency increases at an accelerating rate, thereby increasing power density.

As Vdd decreases, so does Vth; this increases leakage current and static power.

Net: while increasing frequency may not increase performance by much, it certainly does increase power.
Power: the new frontier
• Cooled high power: >100 W/die
• High power: 30-100 W/die … plug-in supply
• Low power: 0.1-2 W/die … rechargeable battery
• Very low power: 1-100 mW/die … AA-size batteries
• Extremely low power: 1-100 µW/die and below (nanowatts) … button batteries
Battery Energy & use
Type          Energy capacity   Power        Time
Button        40 mAh            1 µW         5 years (always on)
2×AA          4,000 mAh         1-10 mW      ½ year (10-20% duty)
Rechargeable  10,000 mAh        400 mW-4 W   50 hours (10-20% duty)
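The table's arithmetic can be sketched as below; the nominal cell voltage and the assumption of no converter or self-discharge losses are mine, not the talk's:

```python
# Rough battery-life arithmetic for the battery table above.
# Assumptions (not in the original): nominal cell voltage, no converter
# losses, and no self-discharge.

def battery_life_years(capacity_mah, voltage, power_w, duty=1.0):
    energy_wh = capacity_mah / 1000 * voltage  # mAh -> Wh at nominal voltage
    hours = energy_wh / (power_w * duty)       # hours at the average draw
    return hours / (24 * 365)

# 40 mAh button cell at ~1.5 V driving a 1 uW load, always on:
print(round(battery_life_years(40, 1.5, 1e-6), 1))
```

This gives roughly 7 years, the same order as the table's 5 years once real-cell losses are included.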
Power is important!

By Vdd and device scaling:

P2/P1 = (freq2/freq1)³
• By scaling alone a 4x slower implementation may need only 1/64 as much power.
• Gating power to functional units and other techniques should enable 100MHz processors to operate at O(10-3) watts.
• Goal: O(10-6) watts.
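A one-line check of the cubic relation above (the function name is mine):

```python
# P2/P1 = (f2/f1)^3 when supply voltage scales down with frequency.
def relative_power(freq_ratio):
    return freq_ratio ** 3

# A 4x slower implementation:
print(relative_power(1 / 4))  # 0.015625 = 1/64, matching the slide
```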
Extremely Low Power

System: suppose low power were the performance metric, rather than cycle time. Very low power (µW) can be achieved:
– Perhaps sub-threshold circuits
– Clock rate may be limited to 10 MHz (?)
– Achieve 1- to 5-year battery life
– Lots of area, BUT how to get performance?
• Very efficient architecture and segmented powering
• Very efficient software & operating system
• Very efficient arithmetic & signal processing
• Very effective parallel processing
Area: scaling silicon

Wafer size + die size = cost
Feature size + die size = performance
The basics of wafer and die size

• Wafers, 30 cm or so in diameter, produce (πd²/4)/A dice, where the die area A is 0.5-2 cm²: about 700 1-cm² dice per wafer.
• Yield (% of good dice) is e^(−ρA), where ρ is the defect density. At ρA = 1, yield is 37%; at ρA = 0.2, yield is 80%.
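A minimal sketch of the two formulas above (the function names are mine):

```python
import math

# Gross dice per wafer, (pi * d^2 / 4) / A, and Poisson yield e^(-rho * A).
def dice_per_wafer(diameter_cm, die_area_cm2):
    return (math.pi * diameter_cm ** 2 / 4) / die_area_cm2

def yield_fraction(defect_density, die_area):
    return math.exp(-defect_density * die_area)

print(round(dice_per_wafer(30, 1)))        # ~707, the slide's "about 700"
print(round(yield_fraction(1.0, 1.0), 2))  # 0.37
print(round(yield_fraction(0.2, 1.0), 2))  # 0.82, the slide's ~80%
```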
The basics of wafer fab

• A 30 cm, state-of-the-art (ρ = 0.2-0.5) wafer fab facility might cost $3B and require $5B per annum to be profitable… that's at least 5M wafer starts and almost 5B 1-cm² dice per year. At O($1000) per wafer, that's about $1/die, where a die (f = 0.1 µm) has 100-500 Mtx/cm².
• So how to use them?
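The per-die arithmetic can be sketched as follows (assuming, as above, ~$1000 per wafer and ~700 1-cm² dice per wafer; the function name is mine):

```python
# Fab economics sketch: per-wafer revenue divided across its dice.
def cost_per_die(revenue_per_year, wafer_starts_per_year, dice_per_wafer):
    wafer_cost = revenue_per_year / wafer_starts_per_year
    return wafer_cost / dice_per_wafer

# $5B/year over 5M wafer starts, ~700 dice per wafer:
print(round(cost_per_die(5e9, 5e6, 700), 2))  # ~1.43 dollars/die, i.e. O($1)
```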
Area and Cost

Is efficient use of die area important? Is processor cost important?
NO − server processors (cost dominated by memory, power/cooling, etc.)
YES − "clients," everything in the middle
NO − small, embedded dice, which are package limited (10-100 Mtx/die)
Area

Indeed, looking ahead, how to use a die with 1-5 billion transistors/cm²?
– With performance limited by power, memory and program behavior, servers use multiple processors per die, limited by memory.
– For client systems, either "system on a chip" or hybrid systems are attractive.
– Low-cost embedded systems need imaginative packaging and enhanced function.
Hybrid System

• Core, very-low-power units comprise limited storage, limited processing, crypto and wireless: a "watch"-type package.
• Multi-user displays, keyboards, voice entry and network access can be implemented with higher power.
• Wireless, secure interconnects complete the system, but with major power-management implications.
Beyond T × A × P … other dimensions

• Computational integrity and RAS
• Design time and FPL
• Configurability / system connections and wireless technology
Computational Integrity

Design for:
– Reliability
– Testability
– Serviceability
– Process recoverability
– Fail-safe computation
CAD productivity limitations favor FPL

[Chart: logic transistors per chip (K) and design productivity (transistors per staff-month), both on log scales, vs. year, 1981-2009, with process generations 2.5 µ to .10 µ marked. Complexity grows at a 58%/yr compound rate; productivity grows at a 21%/yr compound rate. Source: S. Malik, orig. SEMATECH]
Wireless technology

• Essential to realizing efficient client and package-limited systems
• Can be power-consumptive; requires very efficient channel management:
– Focused, adaptive radiation patterns
– Digital management of RF
– Probably better suited to MHz rather than GHz RF… in conflict with antenna goals
– Many issues: security, multi-signal formats, etc.
The "New" Server System

Server processor:
– Cost insensitive
– High power, fast cycle time
– Multiple (order of 10s to 100) processors on a single die, together with a portion of the memory space
– MIMD extensions via network
– Base processor using ILP with SIMD extensions
– Very high integrity
Client System

• Invisible core… integrated into:
– Appliances
– Communications devices
– Notebooks
– "Watches," wearable cores
• Always secure and available
• Very low power
• Integrated wireless/networked
The "New" Client System

Client:
– "System on a chip": integrated, hybrid or embedded
– Many frequency-power design points, from 100 µW with a cycle time of 2 ns to 1 µW with a cycle of 200 ns
– Small die, low cost
– Several base processors
  • Core and signal processors
– High integrity
– Signal processors: VLIW/SIMD
System Design: The CAD Challenge

• IPs, not ICs: amalgamate existing designs from multiple sources
– Validate, test, scale, implement
• Buses and wires
– Placement and coupling, wire length, delay modules… some role for optics?
• Effective floorplanning
System Design: Other Challenges

• Really good power management
• Low-power wireless interconnections
• Integrity and security
• Scene/pattern recognition
• Software efficiency: operating systems, compilers
• Software productivity: rules-based, analysis, heuristic, etc.
Summary

• Processor design with deep-submicron (f < 90 nm) technology offers major advantages: 10× speed, or 10⁻⁶ power, and 100× circuit density. Indeed, SIA projections have consistently underestimated the future.
• But the real challenge is in system, not processor, design: new system products and concepts, IP management and design tools.