winter 2004 class representation for advanced vlsi course instructor : dr s.m.fakhraie presented by...
TRANSCRIPT
Winter 2004Winter 2004
Class Representation For Class Representation For Advanced VLSI Course Advanced VLSI Course
Instructor Dr SMFakhraieInstructor Dr SMFakhraiePresented by Naser SedaghatiPresented by Naser Sedaghati
Major Reference Major Reference
Design and Implementation of theDesign and Implementation of thePOWER5POWER5TMTM Microprocessor Microprocessor
J Clabes1 J Friedrich1 M Sweet1 J DiLullo1 S Chu1 D Plass2 JJ Clabes1 J Friedrich1 M Sweet1 J DiLullo1 S Chu1 D Plass2 JDawson2 P Muench2 L Powell1 M Floyd1 B Sinharoy2 M Lee1 MDawson2 P Muench2 L Powell1 M Floyd1 B Sinharoy2 M Lee1 M
Goulet1 J Wagoner1 N Schwarz1 S Runyon1 G Gorman1 P Restle3Goulet1 J Wagoner1 N Schwarz1 S Runyon1 G Gorman1 P Restle3Kalla1 J McGill1 S Dodson1Kalla1 J McGill1 S Dodson1
1IBM System Group Austin TX1IBM System Group Austin TX2IBM System Group Poughkeepsie NY2IBM System Group Poughkeepsie NY3IBM Research Yorktown Heights NY3IBM Research Yorktown Heights NY
IEEE International Solid-State Circuits Conference 2004IEEE International Solid-State Circuits Conference 2004
OutlineOutline
Motivation
Background
Threading Fundamentals
Enhancement SMT Implementation in POWER5
Memory Subsystem Enhancements
Power Efficiency
Additional SMT Considerations
Summary
Microprocessor Design Optimization Focus Areas
Memory latency Increased processor speeds make memory appear further away Longer stalls possible
Branch processing Mispredict more costly as pipeline depth increases resulting in stalls and
wasted power Predication drives increased power and larger chip area
Execution Unit Utilization Currently 20-25 execution unit utilization common
Simultaneous multi-threading (SMT) and POWER architecture address these areas
bull Motivation hellip
POWER4 --- Shipped in Systems December 2001Technology 180nm lithography Cu SOI
POWER4+ shipping in 130nm today 267mm2 185M transistors
Dual processor core8-way superscalar
Out of Order execution Load Store units 2 Fixed Point units 2 Floating Point units Logical operations on Condition
Register Branch Execution unit
gt 200 instructions in flightHardware instruction and data prefetch
bull Background hellip
POWER5 --- The Next StepTechnology 130nm lithography Cu SOI
389mm2 276M Transistors
Dual processor core
8-way superscalar
Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real
processor Natural extension to POWER4
design
bull Background hellip
System-level view of POWER5
bull Background hellip
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
OutlineOutline
Motivation
Background
Threading Fundamentals
Enhancement SMT Implementation in POWER5
Memory Subsystem Enhancements
Power Efficiency
Additional SMT Considerations
Summary
Microprocessor Design Optimization Focus Areas
Memory latency Increased processor speeds make memory appear further away Longer stalls possible
Branch processing Mispredict more costly as pipeline depth increases resulting in stalls and
wasted power Predication drives increased power and larger chip area
Execution Unit Utilization Currently 20-25 execution unit utilization common
Simultaneous multi-threading (SMT) and POWER architecture address these areas
bull Motivation hellip
POWER4 --- Shipped in Systems December 2001Technology 180nm lithography Cu SOI
POWER4+ shipping in 130nm today 267mm2 185M transistors
Dual processor core8-way superscalar
Out of Order execution Load Store units 2 Fixed Point units 2 Floating Point units Logical operations on Condition
Register Branch Execution unit
gt 200 instructions in flightHardware instruction and data prefetch
bull Background hellip
POWER5 --- The Next StepTechnology 130nm lithography Cu SOI
389mm2 276M Transistors
Dual processor core
8-way superscalar
Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real
processor Natural extension to POWER4
design
bull Background hellip
System-level view of POWER5
bull Background hellip
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Microprocessor Design Optimization Focus Areas
Memory latency Increased processor speeds make memory appear further away Longer stalls possible
Branch processing Mispredict more costly as pipeline depth increases resulting in stalls and
wasted power Predication drives increased power and larger chip area
Execution Unit Utilization Currently 20-25 execution unit utilization common
Simultaneous multi-threading (SMT) and POWER architecture address these areas
bull Motivation hellip
POWER4 --- Shipped in Systems December 2001Technology 180nm lithography Cu SOI
POWER4+ shipping in 130nm today 267mm2 185M transistors
Dual processor core8-way superscalar
Out of Order execution Load Store units 2 Fixed Point units 2 Floating Point units Logical operations on Condition
Register Branch Execution unit
gt 200 instructions in flightHardware instruction and data prefetch
bull Background hellip
POWER5 --- The Next StepTechnology 130nm lithography Cu SOI
389mm2 276M Transistors
Dual processor core
8-way superscalar
Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real
processor Natural extension to POWER4
design
bull Background hellip
System-level view of POWER5
bull Background hellip
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
POWER4 --- Shipped in Systems December 2001Technology 180nm lithography Cu SOI
POWER4+ shipping in 130nm today 267mm2 185M transistors
Dual processor core8-way superscalar
Out of Order execution Load Store units 2 Fixed Point units 2 Floating Point units Logical operations on Condition
Register Branch Execution unit
gt 200 instructions in flightHardware instruction and data prefetch
bull Background hellip
POWER5 --- The Next StepTechnology 130nm lithography Cu SOI
389mm2 276M Transistors
Dual processor core
8-way superscalar
Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real
processor Natural extension to POWER4
design
bull Background hellip
System-level view of POWER5
bull Background hellip
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
POWER5 --- The Next StepTechnology 130nm lithography Cu SOI
389mm2 276M Transistors
Dual processor core
8-way superscalar
Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real
processor Natural extension to POWER4
design
bull Background hellip
System-level view of POWER5
bull Background hellip
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
System-level view of POWER5
bull Background hellip
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Multi-threading Evolution
bull Threading hellip
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Changes Going From ST to SMT Core
SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order
address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses
bull Enhancement hellip
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
POWER5 Resources Size Enhancements
Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU
D-cache 32 KB 4-way set associative LRU
First level Data Translation 128 entries fully associative LRU
L2 Cache 192 MB 10-way set associative LRU
Larger resource pools Rename registers GPRs FPRs increased to 120 each
L2 cache coherency engines increased by 100
Enhanced data stream prefetching
Memory controller moved on chip
bull Enhancement hellip
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Thread Priority
Instances when unbalanced execution desirable
No work for opposite thread Thread waiting on lock Software determined non uniform
balance Power management hellip
Solution Control instruction decode rate Softwarehardware controls 8
priority levels for each thread
bull Enhancement hellip
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Modifications to POWER4 System Structurebull Memory hellip
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Power Efficient Design Implementation
DC power mitigation 1048599Leverage triple Vt technology
Decrease low Vt usage by 90
Increase high Vt usage by 30
1048599Leverage triple Tox technology
Thick Tox usage for decoupling capacitors
1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating
bull Power hellip
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Thermal control logic and sample thermal responsebull Power hellip
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
16-way Building Blockbull Additional hellip
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
POWER5 Multi-Chip Module
95mm 95mm
Four POWER5 chips
Four cache chips
4491 signal IOs
89 layers of metal
bull Additional hellip
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
64-way SMP Interconnection
bull Additional hellip
Interconnection exploits enhanced distributed switch
bull All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
POWER4 and POWER5 Storage Hierarchybull Additional hellip
POWER4 POWER5
L2 Cache L2 Cache Capacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
144 MB 128 B Line 144 MB 128 B Line 8-way LRU8-way LRU
192 MB 128 B Line192 MB 128 B Line10 ndash way LRU10 ndash way LRU
Off-Chip L3 CacheOff-Chip L3 CacheCapacity Line Size Capacity Line Size Associativity ReplacementAssociativity Replacement
32 MB 512 B Line32 MB 512 B Line8-way LRU8-way LRU
36 MB 256 B Line 36 MB 256 B Line 12 ndash way LRU12 ndash way LRU
Chip interconnectChip interconnectType Type Intra-MCM data busesIntra-MCM data busesInter-MCM data busesInter-MCM data buses
Distributed SwitchDistributed Switchfrac12 processor speed frac12 processor speed frac12 processor speedfrac12 processor speed
Enhanced Distributed SwitchEnhanced Distributed SwitchProcessor speedProcessor speedProcessor speedProcessor speed
MemoryMemory 512 GB maximum512 GB maximum 1024 GB (1 TB) Maximum1024 GB (1 TB) Maximum
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
POWER Server Roadmap
bull Additional hellip
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Summary
First dual core SMT microprocessor
Extended SMP to 64-way
Operating in laboratory
Power dynamically managed with no performance penalty
Implementation permits future technology scalability from
circuit and power perspective
Innovative approach leveraging technology with system
focus for high performance in a power efficient design
bull Summary hellip
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003
Other ReferencesOther References
[1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE [1] R Kalla B Sinharoy J Tendler ldquoIBM POWER5 CHIP A DUAL-CORE MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 MULTITHREADED PROCESSORrdquo IEE Computer Society MARCH-APRIL 2004 [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM [2] R Kalla IBM System Group ldquoIBMrsquos POWER5 Design and Methodologyrdquo IBM Corporation 2003Corporation 2003