multi-core processor technology: maximizing cpu performance in a power-constrained world paul teich...
Post on 23-Dec-2015
219 Views
Preview:
TRANSCRIPT
Multi-Core Processor Technology:Maximizing CPU Performance in aPower-Constrained World
Paul TeichBusiness StrategyCPG Server/Workstation
AMDpaul.teich @ amd.com
The IssuesThe Issues
Silicon designers can choose a variety of methods to increase processor performance
Commercial end-customers are demandingMore capable systems with more capable processors
That new systems stay within their existing power/thermal infrastructure
Processor frequency and power consumption seem to be scaling in lockstep
How can the industry-standard PC and Server industries stay on our historic performance curve without burning a hole in our motherboards?
This session is not about process technology
Session OutlineSession Outline
Definition: What is a processor?
Core Design
System Architecture
Manufacturing, Power, and Thermals
Multi-Core Processor Architecture
Performance Impacts
What is a Processor? What is a Processor?
A single chip package that fits in a socket
≥1 core (not much point in <1 core…)Cores can have functional units, cache, etc.associated with them, just as today
Cores can be fast or slow, just as today
Shared resourcesMore cache
Other integration: Northbridge, memory controllers, high-speed serial links, etc.
One system interface no matter how many coresNumber of signal pins doesn’t scale with number of cores
A Representative Multi-Core ProcessorA Representative Multi-Core Processor
Dual-core AMD Opteron™ processor is 199mm2 in 90nm
Single-core AMD Opteron processor is 193mm2 in 130nm
Multi-Core Processor ArchitectureMulti-Core Processor Architecture
Core DesignCore Design
FrequencyIs only as good as the rest of the core architecture
AGUAGU
Int Decode & Rename
FADD FMISCFMUL
44-entryLoad/Store
Queue
36-entry FP scheduler
FP Decode & Rename
ALU
AGU
ALU
MULT
ALU
Res Res Res
L1Icache64KB
L1Dcache64KB
FetchBranch
Prediction
Instruction Control Unit (72 entries)
Fastpath Microcode EngineScan/Align/Decode
µops
AMD Opteron processor core architecture
Core DesignCore Design
Functional unitsSuperscalar is known territory
Diminishing returns for adding more functional blocks
Alternatives like VLIW have been considered and rejected by the market
Single-threaded architectural performance is pegged
Data pathsIncreasing bandwidth between functional units in a core makes a difference
Such as comprehensive 64-bit design, but then where to?
Core DesignCore Design
PipelineDeeper pipeline buys frequency at expense of increased cache miss penalty and lower instructions per clock
Shallow pipeline gives better instructions per clock at the expense of frequency scaling
Max frequency per core requires deeper pipelines
Industry converging on middle ground…9 to 11 stagesSuccessful RISC CPUs are in the same range
CacheCache size buys performance at expense of die size, it’s a direct hit to manufacturing cost
Deep pipeline cache miss penalties are reduced by larger caches
Not always the best match for shallow pipeline cores, as cache misses penalties are not as steep
ManufacturingManufacturing
Moore’s Law isn’t dead, more transistors for everyone!
But…it doesn’t really mention scaling transistor power
Chemistry and physics at nano-scaleStretching materials science
Voltage doesn’t scale yet
Transistor leakage current is increasing
As manufacturing economies and frequency increase, power consumption is increasing disproportionately
There are no process or architectural quick-fixes
Transistors Are Not FreeTransistors Are Not Free
The number of transistors in a core determines basic power consumption
Architectural efficiency matters a lot when designing new cores
More functional units means more transistors
Deeper pipelines mean more transistors
Larger caches mean more transistors
Static Current vs. FrequencyStatic Current vs. Frequency
Frequency
Sta
tic
Cu
rren
t
Embedded Embedded PartsParts
Very High LeakageVery High Leakageand Powerand Power
Fast, High Fast, High PowerPower
Fast, Low Fast, Low PowerPower
1.0 1.5
15
0
Non-linear as processors approach max frequency
Power vs. FrequencyPower vs. Frequency
In AMD’s process, for 200MHz frequency steps, two steps back on frequency cuts power consumption by ~40% from maximum frequency
(Gross relative numbers summarized from a mountain of real data)
0.5
1.0
1.5
2.0
n-5 n-4 n-3 n-2 n-1 n
Frequency
Po
wer
C
on
sum
pti
on
Thermal Density DecreasesThermal Density Decreases
Hot spotsTwice as many as in single-core
Farther apart than in single-core
With freq delta, cooler than in single-core
ΘCA same for single-core at n and dual-core at n-2
Larger die spreads heat more evenly in package
Use identical heat sink, slightly better cooling with dual-core
Works for this processor generation and next, ΘCA
changes over major generations
Thermal diode accuracy becomes an issue with dual-core
Total Effect on Dual-Core FrequenciesTotal Effect on Dual-Core Frequencies
Substantially lower power with lower frequency
Thermals easier to handle at any frequency
Result is dual-core running at n-2 in same thermal envelope as single-core running at top speed
Multi-Core Processor ArchitectureMulti-Core Processor Architecture
Why integrate?Most functions are really small compared to the cores and cacheAll integrated logic runs at core frequency regardless of I/O speeds
What to integrate?Northbridge crossbar switch is key
Look for innovation and differentiation in how cores areconnected on-chipMust integrate Northbridge to integrate anything else…
Memory controller to reduce memory latency and further reduce the need for cacheHigh-speed serial links for system I/O
What not to integrate?Most Southbridge functionsGraphics
AMD Opteron Processor Integrated NorthbridgeAMD Opteron Processor Integrated Northbridge
SystemRequestInterface
(SRI)
AdvancedProgrammable
InterruptController
(APIC)
CPU 0Data
CPU 1Data
CPU 0Probes
CPU 1Probes
CPU 0Requests
CPU 1Requests
CPU 0Int
CPU 1Int
MemoryController
(MCT)
DRAMController
(DCT)
DRAM DataRAS/CAS/Cntl
Crossbar(XBAR)
HyperTransport™ Link 0
64-bit Data
64-bit Command/Address
16-bit Data/Command/Address
HyperTransport Link 2
HyperTransport Link 1
Multi-Core: Where Processor and System CollideMulti-Core: Where Processor and System Collide
Scales performanceDedicated resources for two simultaneous threadsMultiple cores will contend for memory and I/O bandwidth
Northbridge is the bottleneckIntegrating Northbridge eliminates much of bottleneckNorthbridge architecture has significant impact on performance
Cores, cache and Northbridge must be balanced for optimal performance
More aggregate performance for: Multi-threaded apps Transactions: many instances of same app Multi-tasking
Thread scheduling handled by OSBIOS notifies Windows of thread execution resources
Early Benchmark EstimatesEarly Benchmark Estimates
Decoder2P/2C – 2 proc. single-core
4P/4C – 4 proc. single-core
2P/4C – 2 proc. dual-core
4P/8C – 4 proc. dual-core
FrequenciesSingle-core = 2.4GHz
Dual-core = 2.0GHz
Identical system configsMemory, disks, network, etc.
Early dual-core validation system used, different motherboards
SPECint_rate2000 Peak Win SP1
100163
184308
2P/2C2P/4C4P/4C4P/8C
Pla
tfo
rms
% Comparison to Base 2P/2C
SPEC and the benchmark name SPECint are registered trademarks of the Standard Performance Evaluation Corporation. SPEC scores for AMD Opteron Model 270 and 870 based systems are estimated
OLTP Workload
100
148
165
244
2P/2C
2P/4C
4P/4C
4P/8C
Pla
tfo
rms
% Comparison to Base 2P/2C
Call to ActionCall to Action
Most application software doesn’t need to do anything to benefit from dual-coreBe aware that, for a processor within a given power envelope
Fewer cores will clock faster than more coresSingle-threaded performance-sensitive applications
More cores will out-perform fewer cores forMulti-threaded applicationsMulti-tasking response timesTransaction processing
Processor architecture impacts multi-core performance
Process technology is only the anteIntegration enables a balanced high-performance architecture
Community ResourcesCommunity Resources
Windows Hardware & Driver Central (WHDC)www.microsoft.com/whdc/default.mspx
Technical Communitieswww.microsoft.com/communities/products/default.mspx
Non-Microsoft Community Siteswww.microsoft.com/communities/related/default.mspx
Microsoft Public Newsgroupswww.microsoft.com/communities/newsgroups
Technical Chats and Webcastswww.microsoft.com/communities/chats/default.mspx
www.microsoft.com/webcasts
Microsoft Blogswww.microsoft.com/communities/blogs
Additional ResourcesAdditional Resources
Email: paul.teich @ amd.com
WinHEC Presentations“x86 Everywhere,” Chris Herring, AMD
“Maximizing Desktop Application Performance onDual-Core PC Platforms,” Rich Brunner, AMD
Web ResourcesAMD http://www.amd.com/
AMD Multi-Core http://www.amd.com/multicore/
AMD Opteron™ Processor http://www.amd.com/opteron/
AMD Multi-Core White Paper http://enterprise.amd.com/downloadables/33211A_Multi-Core_WP.pdf
HyperTransport™ Consortium http://www.hypertransport.org/
top related